## Section 3 - The five verbs and select in more detail

### Five functions provided in dplyr

* select: Returns a subset of the columns of a data set
* filter: Returns a subset of the rows
* arrange: Reorders rows of the data set
* mutate: Uses the data to build new columns of values
* summarize: Calculates summary statistics
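
As a rough illustration of the five verbs side by side, here is a minimal sketch. The `flights` data frame and its values are made up for this example (only the column names mimic hflights), and `AirTime` is assumed to be in minutes.

```r
library(dplyr)

# Made-up miniature flights table; values are illustrative only
flights <- data.frame(
  Carrier  = c("AA", "AA", "UA", "UA"),
  Distance = c(224, 224, 925, 925),
  AirTime  = c(40, 45, 130, 120),
  ArrDelay = c(-10, 44, 5, -3)
)

# select: keep only the listed columns
sel <- select(flights, Carrier, ArrDelay)

# filter: keep only the rows that satisfy the condition
late <- filter(flights, ArrDelay > 0)

# arrange: reorder the rows by increasing arrival delay
ordered <- arrange(flights, ArrDelay)

# mutate: add a column computed from existing columns
with_speed <- mutate(flights, AverageSpeed = Distance / AirTime * 60)

# summarize: collapse the table into summary statistics
stats <- summarize(flights, mean_delay = mean(ArrDelay), n_flights = n())
```

Each call returns a new data frame and leaves `flights` itself untouched.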

### select & mutate

* Manipulate the variables in the dataset

### filter & arrange

* Manipulate the observations

### summarize

* Manipulates groups of observations
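
A small sketch of what working on groups can look like, assuming the groups are defined with dplyr's `group_by()` before calling `summarize()`. The `flights` data frame is made up for this example.

```r
library(dplyr)

# Made-up data: two carriers with two flights each
flights <- data.frame(
  Carrier  = c("AA", "AA", "UA", "UA"),
  ArrDelay = c(-10, 44, 5, -3)
)

# Define one group per carrier, then compute one summary row per group
by_carrier <- group_by(flights, Carrier)
delay_summary <- summarize(by_carrier,
                           n_flights  = n(),
                           mean_delay = mean(ArrDelay))

print(delay_summary)
```

The result has one row per carrier rather than one row per flight.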

### tidy data - tidyr can help you tidy up your dataset

* A data set which has variables in columns and observations in rows
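
To make the definition concrete, here is a hedged sketch of tidying a wide table. It assumes tidyr 1.0.0 or later (which provides `pivot_longer()`); the `wide` data frame and the column names `Month` and `Flights` are invented for illustration.

```r
library(tidyr)

# Made-up wide table: the variable "month" is hidden in the column names
wide <- data.frame(
  Carrier = c("AA", "UA"),
  Jan     = c(10, 7),
  Feb     = c(12, 9)
)

# Move the month columns into rows: one row per carrier and month
tidy <- pivot_longer(wide, cols = c(Jan, Feb),
                     names_to = "Month", values_to = "Flights")

print(tidy)
```

After reshaping, every variable (Carrier, Month, Flights) has its own column and every observation has its own row.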

### Choosing is not losing! The select verb

In [3]:
library(dplyr)
library(hflights)

# hflights is pre-loaded as a tbl, together with the necessary libraries.

# Print out a tbl with the four columns of hflights related to delay
select(hflights, ActualElapsedTime, AirTime, ArrDelay, DepDelay)

# Print out the columns Origin up to Cancelled of hflights
select(hflights, Origin:Cancelled)

# Most concise answer: select Year through DayOfWeek and ArrDelay through Diverted by column position
select(hflights, 1:4, 12:21)

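| | ActualElapsedTime | AirTime | ArrDelay | DepDelay |
|---|---|---|---|---|
| 5424 | 60 | 40 | -10 | 0 |
| 5425 | 60 | 45 | -9 | 1 |
| 5426 | 70 | 48 | -8 | -8 |
| 5427 | 70 | 39 | 3 | 3 |
| 5428 | 62 | 44 | -3 | 5 |
| 5429 | 64 | 45 | -7 | -1 |
| 5430 | 70 | 43 | -1 | -1 |
| 5431 | 59 | 40 | -16 | -5 |
| 5432 | 71 | 41 | 44 | 43 |
| 5433 | 70 | 45 | 43 | 43 |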


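| | Origin | Dest | Distance | TaxiIn | TaxiOut | Cancelled |
|---|---|---|---|---|---|---|
| 5424 | IAH | DFW | 224 | 7 | 13 | 0 |
| 5425 | IAH | DFW | 224 | 6 | 9 | 0 |
| 5426 | IAH | DFW | 224 | 5 | 17 | 0 |
| 5427 | IAH | DFW | 224 | 9 | 22 | 0 |
| 5428 | IAH | DFW | 224 | 9 | 9 | 0 |
| 5429 | IAH | DFW | 224 | 6 | 13 | 0 |
| 5430 | IAH | DFW | 224 | 12 | 15 | 0 |
| 5431 | IAH | DFW | 224 | 7 | 12 | 0 |
| 5432 | IAH | DFW | 224 | 8 | 22 | 0 |
| 5433 | IAH | DFW | 224 | 6 | 19 | 0 |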


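| | Year | Month | DayofMonth | DayOfWeek | ArrDelay | DepDelay | Origin | Dest | Distance | TaxiIn | TaxiOut | Cancelled | CancellationCode | Diverted |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5424 | 2011 | 1 | 1 | 6 | -10 | 0 | IAH | DFW | 224 | 7 | 13 | 0 | | 0 |
| 5425 | 2011 | 1 | 2 | 7 | -9 | 1 | IAH | DFW | 224 | 6 | 9 | 0 | | 0 |
| 5426 | 2011 | 1 | 3 | 1 | -8 | -8 | IAH | DFW | 224 | 5 | 17 | 0 | | 0 |
| 5427 | 2011 | 1 | 4 | 2 | 3 | 3 | IAH | DFW | 224 | 9 | 22 | 0 | | 0 |
| 5428 | 2011 | 1 | 5 | 3 | -3 | 5 | IAH | DFW | 224 | 9 | 9 | 0 | | 0 |
| 5429 | 2011 | 1 | 6 | 4 | -7 | -1 | IAH | DFW | 224 | 6 | 13 | 0 | | 0 |
| 5430 | 2011 | 1 | 7 | 5 | -1 | -1 | IAH | DFW | 224 | 12 | 15 | 0 | | 0 |
| 5431 | 2011 | 1 | 8 | 6 | -16 | -5 | IAH | DFW | 224 | 7 | 12 | 0 | | 0 |
| 5432 | 2011 | 1 | 9 | 7 | 44 | 43 | IAH | DFW | 224 | 8 | 22 | 0 | | 0 |
| 5433 | 2011 | 1 | 10 | 1 | 43 | 43 | IAH | DFW | 224 | 6 | 19 | 0 | | 0 |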
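
The calls above pick columns by name, by name range, and by position. The sketch below shows that these styles can describe the same selection, and adds selection by exclusion with a minus sign. The `flights` data frame is a made-up five-column stand-in, not the real hflights.

```r
library(dplyr)

# Made-up stand-in with five columns (scalars are recycled to two rows)
flights <- data.frame(
  Year = 2011, Month = 1, DepTime = c(1400, 1401),
  ArrDelay = c(-10, -9), DepDelay = c(0, 1)
)

by_name     <- select(flights, ArrDelay, DepDelay)  # explicit names
by_range    <- select(flights, ArrDelay:DepDelay)   # range of adjacent columns
by_position <- select(flights, 4:5)                 # column positions
by_dropping <- select(flights, -(Year:DepTime))     # everything except a range

# All four results contain the same two columns; flights still has all five
names(by_dropping)
ncol(flights)
```

Selecting by position is the shortest to type but breaks silently if the column order changes, so names or ranges are usually the safer choice.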


### Helper functions for variable selection

* starts_with("X"): every name that starts with "X",
* ends_with("X"): every name that ends with "X",
* contains("X"): every name that contains "X",
* matches("X"): every name that matches "X", where "X" can be a regular expression,
* num_range("x", 1:5): the variables named x1, x2, x3, x4 and x5 (add width = 2 to match x01 to x05 instead),
* one_of(x): every name that appears in x, which should be a character vector.
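
A quick sketch of each helper on a made-up one-row data frame (the column names are chosen for illustration). Note that recent dplyr versions mark `one_of()` as superseded by `any_of()` and `all_of()`, although it still works.

```r
library(dplyr)

# Made-up one-row data frame with hflights-like and numbered column names
df <- data.frame(
  DepTime = 1400, ArrTime = 1500, AirTime = 40,
  ArrDelay = -10, DepDelay = 0,
  x1 = 1, x2 = 2, x3 = 3
)

s_starts  <- names(select(df, starts_with("Arr")))           # ArrTime, ArrDelay
s_ends    <- names(select(df, ends_with("Delay")))           # ArrDelay, DepDelay
s_contain <- names(select(df, contains("Time")))             # DepTime, ArrTime, AirTime
s_match   <- names(select(df, matches("^(Dep|Arr)Time$")))   # DepTime, ArrTime
s_range   <- names(select(df, num_range("x", 1:2)))          # x1, x2
s_oneof   <- names(select(df, one_of(c("AirTime", "x3"))))   # AirTime, x3
```

The helpers only work inside `select()` (and other selecting functions), and the string-matching ones ignore case by default.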

In [4]:
# As usual, hflights is pre-loaded as a tbl, together with the necessary libraries.

# Print out a tbl containing just ArrDelay and DepDelay
select(hflights, contains("Delay"))

# Print out a tbl with UniqueCarrier, FlightNum, TailNum, Cancelled and CancellationCode, using both helper functions and variable names
select(hflights, UniqueCarrier, contains("Num"), contains("Cancel"))

# Print out a tbl with the Time and Delay columns, using only helper functions.
select(hflights, contains("Time"), contains("Delay"))

| | ArrDelay | DepDelay |
|---|---|---|
| 5424 | -10 | 0 |
| 5425 | -9 | 1 |
| 5426 | -8 | -8 |
| 5427 | 3 | 3 |
| 5428 | -3 | 5 |
| 5429 | -7 | -1 |
| 5430 | -1 | -1 |
| 5431 | -16 | -5 |
| 5432 | 44 | 43 |
| 5433 | 43 | 43 |


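| | UniqueCarrier | FlightNum | TailNum | Cancelled | CancellationCode |
|---|---|---|---|---|---|
| 5424 | AA | 428 | N576AA | 0 | |
| 5425 | AA | 428 | N557AA | 0 | |
| 5426 | AA | 428 | N541AA | 0 | |
| 5427 | AA | 428 | N403AA | 0 | |
| 5428 | AA | 428 | N492AA | 0 | |
| 5429 | AA | 428 | N262AA | 0 | |
| 5430 | AA | 428 | N493AA | 0 | |
| 5431 | AA | 428 | N477AA | 0 | |
| 5432 | AA | 428 | N476AA | 0 | |
| 5433 | AA | 428 | N504AA | 0 | |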


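| | DepTime | ArrTime | ActualElapsedTime | AirTime | ArrDelay | DepDelay |
|---|---|---|---|---|---|---|
| 5424 | 1400 | 1500 | 60 | 40 | -10 | 0 |
| 5425 | 1401 | 1501 | 60 | 45 | -9 | 1 |
| 5426 | 1352 | 1502 | 70 | 48 | -8 | -8 |
| 5427 | 1403 | 1513 | 70 | 39 | 3 | 3 |
| 5428 | 1405 | 1507 | 62 | 44 | -3 | 5 |
| 5429 | 1359 | 1503 | 64 | 45 | -7 | -1 |
| 5430 | 1359 | 1509 | 70 | 43 | -1 | -1 |
| 5431 | 1355 | 1454 | 59 | 40 | -16 | -5 |
| 5432 | 1443 | 1554 | 71 | 41 | 44 | 43 |
| 5433 | 1443 | 1553 | 70 | 45 | 43 | 43 |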


### Comparison to base R

In [5]:
# both hflights and dplyr are available

# Finish select call so that ex1d matches ex1r
ex1r <- hflights[c("TaxiIn", "TaxiOut", "Distance")]
ex1d <- select(hflights, contains("Taxi"), Distance)

# Finish select call so that ex2d matches ex2r
ex2r <- hflights[c("Year", "Month", "DayOfWeek", "DepTime", "ArrTime")]
# contains("Time") would also pull in ActualElapsedTime and AirTime, so use a range and drop DayofMonth
ex2d <- select(hflights, Year:ArrTime, -DayofMonth)

# Finish select call so that ex3d matches ex3r
ex3r <- hflights[c("TailNum", "TaxiIn", "TaxiOut")]
ex3d <- select(hflights, TailNum, contains("Taxi"))
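
One way to check that a dplyr selection really matches its base R counterpart is to compare the results programmatically. This is a sketch on a made-up four-column data frame; the names `ex_base`, `ex_dplyr`, `same_columns` and `same_content` are hypothetical.

```r
library(dplyr)

# Made-up stand-in for hflights with just four columns
flights <- data.frame(
  TailNum  = c("N576AA", "N557AA"),
  Distance = c(224, 224),
  TaxiIn   = c(7, 6),
  TaxiOut  = c(13, 9)
)

# Base R: index by a character vector of column names
ex_base <- flights[c("TaxiIn", "TaxiOut", "Distance")]

# dplyr: a helper function plus a bare column name
ex_dplyr <- select(flights, contains("Taxi"), Distance)

# Same columns in the same order?
same_columns <- identical(names(ex_base), names(ex_dplyr))

# all.equal() additionally compares the values and attributes
same_content <- isTRUE(all.equal(ex_base, ex_dplyr))
```

Comparing the column names catches the most common mistake, a helper that matches more columns than intended.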

## Section 4 - The second of five verbs: mutate

### mutate

* mutate(tbl, new_column_name = expression): returns a copy of tbl with the new column added
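
A minimal sketch of that call pattern on a made-up two-row data frame, showing that the input is left unchanged and the new column only exists in the returned copy:

```r
library(dplyr)

# Made-up data with two of the hflights timing columns
flights <- data.frame(
  ActualElapsedTime = c(60, 70),
  AirTime           = c(40, 48)
)

# New column name on the left of "=", expression over existing columns on the right
with_ground <- mutate(flights, ActualGroundTime = ActualElapsedTime - AirTime)

names(flights)      # still only the two original columns
names(with_ground)  # original columns plus ActualGroundTime
```

To keep the new column, the result has to be assigned to a name, as done here with `with_ground`.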

### Mutating is creating

In [6]:
# hflights and dplyr are loaded and ready to serve you.

# Add the new variable ActualGroundTime to a copy of hflights and save the result as g1.
g1 <- mutate(hflights, ActualGroundTime = ActualElapsedTime - AirTime)

# Add the new variable GroundTime to g1. Save the result as g2.
g2 <- mutate(g1, GroundTime = TaxiIn + TaxiOut)

# Add the new variable AverageSpeed to g2. Save the result as g3.
g3 <- mutate(g2, AverageSpeed = Distance / AirTime * 60)

# Print out g3
print(g3)

       Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
1      2011     1          1         6    1400    1500            AA       428
2      2011     1          2         7    1401    1501            AA       428
3      2011     1          3         1    1352    1502            AA       428
4      2011     1          4         2    1403    1513            AA       428
5      2011     1          5         3    1405    1507            AA       428
6      2011     1          6         4    1359    1503            AA       428
7      2011     1          7         5    1359    1509            AA       428
8      2011     1          8         6    1355    1454            AA       428
9      2011     1          9         7    1443    1554            AA       428
10     2011     1         10         1    1443    1553            AA       428
11     2011     1         11         2    1429    1539            AA       428
...

### Add multiple variables using mutate

In [7]:
# hflights and dplyr are ready, are you?

# Add a second variable loss_ratio to the dataset: m1
m1 <- mutate(hflights, loss = ArrDelay - DepDelay,
             loss_ratio = loss / DepDelay)

# Add three variables in one call (TotalTaxi, ActualGroundTime and their difference Diff): m2
m2 <- mutate(hflights, TotalTaxi = TaxiIn + TaxiOut,
             ActualGroundTime = ActualElapsedTime - AirTime,
             Diff = TotalTaxi - ActualGroundTime)
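
Both calls rely on `mutate()` evaluating its arguments in order, so a later expression can use a column created earlier in the same call. The sketch below replays the `loss` and `loss_ratio` computation on three delay pairs taken from the output shown earlier (rows 5424, 5425 and 5432), and shows what happens when `DepDelay` is zero.

```r
library(dplyr)

# ArrDelay/DepDelay pairs copied from three rows of the earlier output
flights <- data.frame(
  ArrDelay = c(-10, -9, 44),
  DepDelay = c(0, 1, 43)
)

# loss is created first and immediately reused to compute loss_ratio
m <- mutate(flights,
            loss       = ArrDelay - DepDelay,
            loss_ratio = loss / DepDelay)

print(m)
```

The first row has `DepDelay` equal to 0, so its `loss_ratio` is `-Inf` rather than a finite number. Flights that left exactly on time may need to be filtered out or handled separately before the ratio is summarized.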