In [None]:
library(dplyr)
library(ggplot2)
library(statsr)

### Seven verbs

The `dplyr` package offers seven verbs (functions) for basic data 
manipulation:

- `filter()`
- `arrange()`
- `select()` 
- `distinct()`
- `mutate()`
- `summarise()`
- `sample_n()`

In [None]:
data(nycflights)

In [None]:
names(nycflights)

In [None]:
str(nycflights)

In [None]:
ggplot(data = nycflights, aes(x = dep_delay)) + geom_histogram()

### Filter

**Logical operators:** Filtering for certain observations (e.g. flights from a 
particular airport) is often of interest in data frames where we might want to 
examine observations with certain characteristics separately from the rest of 
the data. To do so we use the `filter` function and a series of 
**logical operators**. The most commonly used logical operators for data 
analysis are as follows:

- `==` means "equal to"
- `!=` means "not equal to"
- `>` or `<` means "greater than" or "less than"
- `>=` or `<=` means "greater than or equal to" or "less than or equal to"
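
As a minimal sketch of how these operators behave inside `filter`, the chunk below uses a small made-up data frame. The name `toy_flights` and its values are hypothetical and are not taken from `nycflights`.

```r
library(dplyr)

# Hypothetical toy data: three flights with made-up delays
toy_flights <- data.frame(
  dest      = c("RDU", "SFO", "RDU"),
  dep_delay = c(-2, 10, 30)
)

# `==` keeps rows whose destination is exactly "RDU"
to_rdu <- toy_flights %>% filter(dest == "RDU")

# `!=` keeps every row whose destination is not "RDU"
not_rdu <- toy_flights %>% filter(dest != "RDU")

# `>=` keeps rows with a delay of at least 10 minutes
late <- toy_flights %>% filter(dep_delay >= 10)
```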


If we want to focus on departure delays of flights headed to RDU only, we first
need to `filter` the data for flights headed to RDU (`dest == "RDU"`) and then make
a histogram of departure delays for only those flights.

In [None]:
rdu_flights <- nycflights %>%
  filter(dest == "RDU")
ggplot(data = rdu_flights, aes(x = dep_delay)) + geom_histogram()

In [None]:
rdu_flights %>%
  summarise(mean_dd = mean(dep_delay), sd_dd = sd(dep_delay), n = n())

We can also filter based on multiple criteria. Suppose we are interested in flights headed to San Francisco (SFO) in February.

We can separate the conditions with commas if we want flights that are both headed to SFO **and** in February. If we are interested in flights that are either
headed to SFO **or** in February, we can use `|` instead of the comma.
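
To make the difference between the comma (and) and `|` (or) concrete, here is a small sketch on made-up data. The data frame `toy_flights` is hypothetical and only serves to show how many rows each form keeps.

```r
library(dplyr)

# Hypothetical toy data: one flight for each destination/month combination
toy_flights <- data.frame(
  dest  = c("SFO", "SFO", "RDU", "RDU"),
  month = c(2, 7, 2, 7)
)

# Comma: both conditions must hold (same as using `&`)
both <- toy_flights %>% filter(dest == "SFO", month == 2)

# `|`: at least one condition must hold
either <- toy_flights %>% filter(dest == "SFO" | month == 2)

nrow(both)    # 1 row: SFO in February
nrow(either)  # 3 rows: every row except RDU in July
```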

In [None]:
sfo_feb_flights <- nycflights %>%
  filter(dest == "SFO", month == 2)

In [None]:
head(sfo_feb_flights)

### Summarise


Note that in the `summarise` function above we created a list of three elements. The 
names of these elements are user defined, like `mean_dd`, `sd_dd`, `n`, and 
you can customize these names as you like (just don't use spaces in your 
names). Calculating these summary statistics also requires that you know the 
function calls. Note that `n()` reports the sample size.


**Summary statistics:** Some useful function calls for summary statistics for a 
single numerical variable are as follows:

- `mean`
- `median`
- `sd`
- `var`
- `IQR`
- `range`
- `min`
- `max`
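
A quick sketch of these functions on a made-up vector of delays (the values in `delays` are hypothetical). Most of them return a single number, while `range` returns two (the minimum and the maximum), so `min` and `max` are often more convenient when you need one value per summary.

```r
# Hypothetical departure delays in minutes
delays <- c(-5, 0, 5, 10, 40)

mean(delays)    # arithmetic mean: 10
median(delays)  # middle value: 5
min(delays)     # smallest value: -5
max(delays)     # largest value: 40
range(delays)   # two values: -5 and 40
IQR(delays)     # third quartile minus first quartile: 10 - 0 = 10
sd(delays)      # sample standard deviation
var(delays)     # sample variance, equal to sd(delays)^2
```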

In [None]:
sfo_feb_flights %>%
  summarise(mean_dd = mean(dep_delay), sd_dd = sd(dep_delay), n = n(),
            IQR_dd = IQR(dep_delay))

Another useful feature is the ability to quickly calculate summary statistics for the groups in your data frame. For example, we can add the 
`group_by` function to the above command to get summary statistics for each origin airport, here for arrival delays:

In [None]:
sfo_feb_flights %>%
  group_by(origin) %>%
  summarise(mean_ad = mean(arr_delay), median_ad = median(arr_delay),
            sd_ad = sd(arr_delay), IQR_ad = IQR(arr_delay), n = n())

In [None]:
sfo_feb_flights %>%
  group_by(carrier) %>%
  summarise(mean_ad = mean(arr_delay), median_ad = median(arr_delay),
            sd_ad = sd(arr_delay), IQR_ad = IQR(arr_delay), n = n())

Which month would you expect to have the highest average delay departing 
from an NYC airport?

Let's think about how we would answer this question:

- First, calculate monthly averages for departure delays. With the new language
we are learning, we need to
    + `group_by` months, then
    + `summarise` mean departure delays.
- Then, we need to `arrange` these average delays in `desc`ending order.

In [None]:
nycflights %>%
  group_by(month) %>%
  summarise(mean_dd = mean(dep_delay)) %>%
  arrange(desc(mean_dd))

In [None]:
nycflights %>%
  group_by(month) %>%
  summarise(median_dd = median(dep_delay)) %>%
  arrange(desc(median_dd))

We can also visualize the distributions of departure delays across months using 
side-by-side box plots.

There is some new syntax here: we want departure delays on the y-axis and the
months on the x-axis to produce side-by-side box plots. Side-by-side box plots
require a categorical variable on the x-axis; however, in the data frame `month` is 
stored as a numerical variable (numbers 1 - 12). Therefore we force R to treat
this variable as a categorical variable, which R calls a **factor**, with 
`factor(month)`.
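
A minimal sketch of what `factor()` does, using a made-up vector of month numbers (`month_num` is a hypothetical stand-in for the `month` column):

```r
# Hypothetical month numbers stored as plain numbers
month_num <- c(1, 2, 2, 12)

# factor() turns the distinct values into category labels (levels)
month_fct <- factor(month_num)

is.numeric(month_num)  # TRUE: treated as a quantity
is.factor(month_fct)   # TRUE: treated as categories
levels(month_fct)      # "1" "2" "12"
```

Because the levels are built from the sorted numeric values, the months keep their calendar order on the x-axis.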

In [None]:
ggplot(nycflights, aes(x = factor(month), y = dep_delay)) +
  geom_boxplot()

### Mutate

Suppose you will be flying out of NYC and want to know which of the three major NYC airports has the best on time departure rate. Suppose also that for you a flight that is delayed for less than 5 minutes is basically "on time". You consider any flight delayed for 5 minutes or more to be "delayed".

In order to determine which airport has the best on time departure rate, 
we need to 

- first classify each flight as "on time" or "delayed",
- then group flights by origin airport,
- then calculate on time departure rates for each origin airport,
- and finally arrange the airports in descending order for on time departure
percentage.

Let's start with classifying each flight as "on time" or "delayed" by creating a new variable with the `mutate` function.

In [None]:
nycflights <- nycflights %>%
  mutate(dep_type = ifelse(dep_delay < 5, "on time", "delayed"))

The first argument in the `mutate` function is the name of the new variable we want to create, in this case `dep_type`. Then if `dep_delay < 5` we classify the flight as `"on time"`, and as `"delayed"` if not, i.e. if the flight is delayed for 5 or more minutes.

Note that we are also overwriting the `nycflights` data frame with the new version of this data frame that includes the new `dep_type` variable.
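
To check how `ifelse` treats the cutoff, here is a small sketch on a made-up vector of delays that includes a flight delayed by exactly 5 minutes (the values are hypothetical):

```r
# Hypothetical delays, including the boundary value of exactly 5 minutes
dep_delay <- c(-3, 4, 5, 20)

# ifelse(test, yes, no) checks the condition for each element in turn
dep_type <- ifelse(dep_delay < 5, "on time", "delayed")

dep_type  # "on time" "on time" "delayed" "delayed"
```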

We can handle all the remaining steps in one code chunk:

In [None]:
nycflights %>%
  group_by(origin) %>%
  summarise(ot_dep_rate = sum(dep_type == "on time") / n()) %>%
  arrange(desc(ot_dep_rate))
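
As a sketch of why `sum(dep_type == "on time") / n()` gives a rate, the chunk below repeats the calculation on made-up data and compares it with taking the `mean` of the same logical condition. The data frame `toy_flights` and the column names `rate_sum` and `rate_mean` are hypothetical.

```r
library(dplyr)

# Hypothetical toy data: 2 flights from JFK, 4 from LGA
toy_flights <- data.frame(
  origin   = c("JFK", "JFK", "LGA", "LGA", "LGA", "LGA"),
  dep_type = c("on time", "delayed", "on time", "on time", "on time", "delayed")
)

rates <- toy_flights %>%
  group_by(origin) %>%
  summarise(
    rate_sum  = sum(dep_type == "on time") / n(),  # count of TRUEs over group size
    rate_mean = mean(dep_type == "on time")        # TRUE counts as 1, FALSE as 0
  ) %>%
  arrange(desc(rate_sum))

rates  # LGA (0.75) is listed before JFK (0.5)
```

Both columns agree, so either form can be used; the `sum(...) / n()` version simply makes the numerator and denominator explicit.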

We can also visualize the distribution of the on time departure rate across the three airports using a segmented bar plot.

In [None]:
ggplot(data = nycflights, aes(x = origin, fill = dep_type)) + geom_bar()

Mutate the data frame so that it includes a new variable that contains the average speed, `avg_speed`, traveled by the plane for each flight (in mph). What is the tail number of the plane with the fastest `avg_speed`? **Hint:** Average speed can be calculated as distance divided by number of hours of travel, and note that `air_time` is given in minutes. If you just want to show the `avg_speed` and `tailnum` and none of the other variables, use the `select` function at the end of your pipe to select just these two variables with `select(avg_speed, tailnum)`. You can Google this tail number to find out more about the aircraft. 
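
The unit conversion in the hint can be checked by hand on a single made-up flight. The numbers below are hypothetical and are not taken from `nycflights`.

```r
# Hypothetical flight: 1,000 miles covered in 120 minutes of air time
distance <- 1000   # miles
air_time <- 120    # minutes

hours <- air_time / 60         # 120 minutes = 2 hours
avg_speed <- distance / hours  # 1000 / 2 = 500 mph
```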

In [None]:
nycflights <- nycflights %>%
  mutate(avg_speed = distance / (air_time / 60))

In [None]:
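# Sort flights from fastest to slowest and keep only the top row
maxflight <- nycflights %>%
  arrange(desc(avg_speed)) %>%
  select(avg_speed, tailnum) %>%
  head(1)

In [None]:
maxflight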