### <center> Data Manipulation with R </center>

* **`dplyr`** for manipulating data
* **`tidyr`** for cleaning data
* **`%>%`** (Pipe Operator) creates a pipeline of sequential operations that manipulate data

In [33]:
library("dplyr")
library("tidyr")
library("nycflights13") # Get flight data

In [34]:
head(x=flights, n=3)

year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
2013,1,1,517,515,2,830,819,11,UA,1545,N14228,EWR,IAH,227,1400,5,15,2013-01-01 05:00:00
2013,1,1,533,529,4,850,830,20,UA,1714,N24211,LGA,IAH,227,1416,5,29,2013-01-01 05:00:00
2013,1,1,542,540,2,923,850,33,AA,1141,N619AA,JFK,MIA,160,1089,5,40,2013-01-01 05:00:00


### <center> `dplyr` </center>

| Function | Description | Example |
| ---- | ---- | ---- |
|**`filter()`**| Select the rows that satisfy the given conditions. Comma-separated conditions are combined with AND; you can also use logical operators **`(&, \|, !)`** inside. |**`filter(flights, month==11, day==3, carrier=="AA")`** |
|**`slice()`** | Select rows by position in a data frame | **`slice(flights, 1:3)`** |
|**`arrange()`** | Sort a data frame based on a set of column names | **`arrange(flights, year, month, day, air_time)`** |
|**`desc()`** | Used inside `arrange()` to sort a column in descending order | **`arrange(flights, desc(dep_delay))`** |
|**`select()`** | Select specific columns of a data frame | **`select(flights, carrier)`** |
|**`rename()`** | Rename columns in a data frame.  **This is not in-place!** | **`rename(flights, airline_carrier = carrier)`** |
|**`distinct()`** | Returns the distinct rows of a data frame. Often used with **`select()`** to get the unique values of specific columns | **`distinct(select(flights, carrier))`** |
|**`mutate()`** | Create new columns that are functions of existing columns | **`mutate(flights, total_delay = arr_delay+dep_delay)`** |
|**`transmute()`** | Like `mutate()`, but keeps only the new columns | **`transmute(flights, total_delay = arr_delay+dep_delay)`** |
|**`summarise()`** | Aggregate a data frame into a single row of summary values. Remember **`na.rm=TRUE`** to remove NA values. | **`summarise(flights, avg_air_time=mean(air_time, na.rm=TRUE))`** |
|**`sample_n()`** | Get a random sample of n rows from data frame | **`sample_n(flights, 3)`** | 
|**`sample_frac()`**| Get a random sample of a fraction of the rows of a data frame (expressed as a decimal, e.g. 0.1 for 10%) | **`sample_frac(flights, 0.000005)`** |

* Full documentation of **`dplyr`** can be found **[here](https://cran.r-project.org/web/packages/dplyr/dplyr.pdf)**.
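
A few of the notes in the table are easier to see on a small example. The sketch below uses a hypothetical toy data frame (`toy`, standing in for `flights`) to show how `filter()` treats `|` and missing values, that `rename()` returns a new data frame instead of modifying its input, and what `na.rm=TRUE` does in `summarise()`.

```r
library(dplyr)

# Hypothetical toy data standing in for `flights`
toy <- data.frame(
  carrier   = c("AA", "UA", "AA", "DL"),
  dep_delay = c(5, NA, -2, 30),
  arr_delay = c(10, 4, NA, 25)
)

# filter() with a logical OR: AA flights, or any flight delayed by more than 20.
# Rows where the condition evaluates to NA (the UA row here) are dropped.
late_or_aa <- filter(toy, carrier == "AA" | dep_delay > 20)

# rename() is not in-place: `toy` keeps its column names,
# so the result has to be assigned to be kept
renamed <- rename(toy, airline_carrier = carrier)
names(toy)
names(renamed)

# Without na.rm=TRUE the mean would be NA because dep_delay contains an NA
avg <- summarise(toy, avg_dep_delay = mean(dep_delay, na.rm = TRUE))
avg
```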

### <center> Pipe Operator `%>%` </center> 
**The pipe operator `%>%` allows you to chain together multiple functions on a dataset (i.e. create a pipeline), to avoid long nested operations or doing multiple assignments.**

In [31]:
f1 <- filter(flights, month==11, carrier=="AA")
f2 <- sample_n(f1, size=3) 
f3 <- arrange(f2, desc(hour))
f3

year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
2013,11,7,1455,1459,-4,1750,1814,-24,AA,2297,N3FKAA,LGA,MIA,156,1096,14,59,2013-11-07 14:00:00
2013,11,7,954,959,-5,1144,1144,0,AA,1030,N4XDAA,LGA,STL,145,888,9,59,2013-11-07 09:00:00
2013,11,7,931,940,-9,1055,1120,-25,AA,317,N4YEAA,LGA,ORD,115,733,9,40,2013-11-07 09:00:00


In [35]:
flights %>% filter(month==11, carrier=="AA") %>% sample_n(size=3) %>% arrange(desc(hour)) # Each step receives the previous result as its first argument, so the data name is not repeated

year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
2013,11,19,1239,1250,-11,1346,1400,-14,AA,178,N3JVAA,JFK,BOS,44,187,12,50,2013-11-19 12:00:00
2013,11,21,1012,1015,-3,1208,1150,18,AA,325,N455AA,LGA,ORD,120,733,10,15,2013-11-21 10:00:00
2013,11,6,931,940,-9,1209,1120,49,AA,317,N426AA,LGA,ORD,126,733,9,40,2013-11-06 09:00:00
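
The cells above compare step-by-step assignment with a pipeline. For completeness, here is a minimal sketch of the nested form that the pipe also replaces. It uses a hypothetical toy data frame (`toy`) and leaves out `sample_n()` so that both versions give the same result on every run.

```r
library(dplyr)

# Hypothetical toy data standing in for `flights`
toy <- data.frame(
  carrier = c("AA", "UA", "AA", "AA"),
  month   = c(11, 11, 11, 3),
  hour    = c(9, 14, 17, 8)
)

# Nested form: has to be read from the inside out
nested <- arrange(filter(toy, month == 11, carrier == "AA"), desc(hour))

# Piped form: reads left to right; each result is passed
# as the first argument of the next function
piped <- toy %>%
  filter(month == 11, carrier == "AA") %>%
  arrange(desc(hour))

# Both forms produce the same data frame
identical(nested, piped)
```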


### <center> `tidyr` </center>

|Function | Description | Example |
| ---- | ---- | ---- |
| **`gather(data, key, value)`** | Unpivot - Collapse multiple columns into key-value pairs. | **`gather(data=df, key=Quarter, value=Revenue, Qrt1:Qrt4)`** |
| **`spread(data, key, value)`** | Pivot - Spread a key-value pair across multiple columns. | **`spread(data=df, key=stock, value=price)`** |
| **`separate(data, col, into)`** | Split - Split one column into multiple ones | **`separate(data=df, col=x, into=c("x1", "x2"))`** |
|**`unite(data, col, sep)`**| Merge - Unite multiple columns into one| **`unite(data=df, col="X12", c("x1", "x2"), sep=".")`** |
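
The four functions come in inverse pairs, which a round trip makes easier to see. The sketch below assumes two small hypothetical data frames (`wide` and `dates`); the column names are made up for illustration.

```r
library(tidyr)

# Hypothetical wide table: one revenue column per quarter
wide <- data.frame(
  Product = c("A", "B"),
  Qrt1    = c(10, 20),
  Qrt2    = c(11, 21),
  stringsAsFactors = FALSE
)

# gather(): the Qrt1 and Qrt2 columns become (Quarter, Revenue) key-value pairs
long <- gather(data = wide, key = Quarter, value = Revenue, Qrt1:Qrt2)

# spread(): the inverse, one column per distinct value of Quarter
wide_again <- spread(data = long, key = Quarter, value = Revenue)

# Hypothetical column holding two pieces of information in one string
dates <- data.frame(x = c("2013-11", "2013-12"), stringsAsFactors = FALSE)

# separate(): split x into two character columns at the "-"
parts <- separate(data = dates, col = x, into = c("year", "month"), sep = "-")

# unite(): the inverse, paste the two columns back together
joined <- unite(data = parts, col = "x", year, month, sep = "-")
```

In tidyr 1.0.0 and later, `gather()` and `spread()` are superseded by `pivot_longer()` and `pivot_wider()`, although the older functions remain available.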