In this notebook, we will cover:

* [Grouped Summaries](#Grouped-Summaries)
* [Pipes](#Pipes)

Let us load up the `tidyverse` and `nycflights13` packages.

In [2]:
install.packages("nycflights13")
library(tidyverse)
library(nycflights13)


# Grouped Summaries

`summarize()` collapses an entire data frame into a single row of summary values.

In [None]:
summarize(flights, delay = mean(dep_delay))

Oops, we got `NA`, because most operations involving missing values yield missing values. Passing `na.rm = TRUE` drops the missing values before the mean is computed.

In [None]:
summarize(flights, delay = mean(dep_delay, na.rm = TRUE))
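To see why `na.rm = TRUE` is needed, here is a minimal sketch on a toy vector. The vector `delays` and its values are made up for illustration and are not taken from `flights`.

```r
# Toy vector with one missing value (illustrative data, not from flights)
delays <- c(10, NA, 20)

# A single NA makes the plain mean NA
naive_mean <- mean(delays)

# na.rm = TRUE drops the missing value before averaging
clean_mean <- mean(delays, na.rm = TRUE)

print(naive_mean)
print(clean_mean)
```

The same pattern applies to most of the summary functions used in this notebook, such as `sd()` and `median()`.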

`summarize()` is most useful in combination with `group_by()`, which makes it return one summary row per group instead of one row for the whole data frame.

In [None]:
by_month <- group_by(flights, month)
(monthly_delays <- summarize(by_month, delay = mean(dep_delay, na.rm = TRUE)))
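The grouping behaviour is easier to check by hand on a tiny table. The sketch below uses a hypothetical `toy_flights` tibble (invented for illustration) with the same column names as `flights`.

```r
library(dplyr)

# Hypothetical miniature version of the flights table
toy_flights <- tibble(
    month = c(1, 1, 2, 2, 2),
    dep_delay = c(10, 20, NA, 5, 15)
)

# Group by month, then compute one mean per group
toy_by_month <- group_by(toy_flights, month)
toy_monthly <- summarize(toy_by_month, delay = mean(dep_delay, na.rm = TRUE))

print(toy_monthly)
```

The result has one row per distinct value of `month`, and the `NA` in month 2 is ignored when averaging.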

Many summarization functions are available:

* Center: `mean()`, `median()`
* Spread: `sd()`, `IQR()`, `mad()`
* Range: `min()`, `max()`, `quantile()`
* Position: `first()`, `last()`, `nth()`
* Count: `n()`, `n_distinct()`
* Logical: `any()`, `all()`
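Several of these functions can be combined in a single `summarize()` call. The following is a small sketch on a made-up tibble; the name `toy` and all of its values are assumptions chosen so the results are easy to verify by hand.

```r
library(dplyr)

# Made-up data: five flights from two carriers
toy <- tibble(
    carrier = c("AA", "AA", "AA", "UA", "UA"),
    dest = c("ORD", "ORD", "LAX", "SFO", "SFO"),
    dep_delay = c(5, 30, -2, 0, 12)
)

toy_summary <- summarize(
    group_by(toy, carrier),
    n_flights = n(),                   # count of rows in the group
    n_dests = n_distinct(dest),        # count of distinct destinations
    median_delay = median(dep_delay),  # center
    worst_delay = max(dep_delay),      # range
    first_delay = first(dep_delay),    # position
    any_late = any(dep_delay > 15)     # logical
)

print(toy_summary)
```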

# Pipes

The grouped summary of monthly delays can be written without the intermediate variable `by_month` by using the pipe operator `%>%`.

In [None]:
group_by(flights, month) %>%
    summarize(delay = mean(dep_delay, na.rm = TRUE))

Pipes make it easy for both the author and the reader of the code to focus on which transformations are occurring.

In [None]:
# Without pipes
by_dest <- group_by(flights, dest)
dest_summary <- summarize(by_dest, count = n(), delay = mean(dep_delay, na.rm = TRUE))
(dest_summary_final <- arrange(dest_summary, desc(count)))

In [None]:
# With pipes
group_by(flights, dest) %>%
    summarize(count = n(), delay = mean(dep_delay, na.rm = TRUE)) %>%
    arrange(desc(count))

Under the hood, `x %>% f(y)` turns into `f(x, y)`, and `x %>% f(y) %>% g(z)` turns into `g(f(x, y), z)` and so on.
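This rewriting rule can be checked directly. The sketch below takes `f` to be `round()` and `g` to be `paste()`; the vector `values` is an arbitrary example, not data from the notebook.

```r
library(dplyr)  # makes the %>% operator available

values <- c(12.345, 7.891)

# Piped form: x %>% f(y) %>% g(z)
piped <- values %>% round(1) %>% paste("min")

# Nested form: g(f(x, y), z)
nested <- paste(round(values, 1), "min")

# Both forms compute the same thing
print(identical(piped, nested))
```

The piped form reads left to right in the order the steps are applied, while the nested form has to be read from the inside out.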

You can even plot the result by piping it into `ggplot()` at the end. Note that the plot layers are still combined with `+`, not `%>%`.

In [None]:
#options(repr.plot.width=6, repr.plot.height=4) # to ensure we do not get very large plots

group_by(flights, month) %>%
    summarize(delay = mean(dep_delay, na.rm = TRUE)) %>%
    ggplot() +
        geom_bar(mapping = aes(x=factor(month), y=delay), stat = "identity") +
        labs(x = "month", y = "average delay (in minutes)")

In [None]:
# can you fill in this code to get a bar plot of
# average arrival delay by destination airport
# for the top 10 airports by traffic volume?

group_by(flights, dest) %>%
    summarize(
        mean_delay = mean(arr_delay, na.rm = TRUE),
        count = n()
    ) %>%
    arrange(desc(count)) %>%
    slice(1:10) %>%
    ggplot() +
        geom_bar(mapping = aes(x = dest, y = mean_delay), stat = "identity") +
        xlab("destination airport") +
        ylab("average arrival delay in minutes")

In [None]:
# airports, total flights, mean distance, and standard deviation of distance
# sorted in descending order of mean distance
group_by(flights, dest) %>%
    summarize(count = n(), sd = sd(distance), mean_distance = mean(distance)) %>%
    arrange(desc(mean_distance))

In [None]:
# first attempt at a scatter plot of
# distance vs. arrival delay

ggplot(flights) +
    geom_point(mapping = aes(x = distance, y = arr_delay))

In [None]:
# can you fill in this code to get a scatter plot of
# airport distance vs. average arrival delay after
# grouping by destination airport?
# also superimpose on the scatter plot a smoothed plot

# change 1: skip Honolulu (HNL)
# change 2: use only airports less than 4000 miles away
# change 3: use only airports less than 1000 miles away

group_by(flights, dest) %>%
    summarize(
        mean_distance = mean(distance, na.rm = TRUE),
        mean_delay = mean(arr_delay, na.rm = TRUE)
    ) %>%
    ggplot(mapping = aes(x = mean_distance, y = mean_delay)) +
        geom_point() +
        geom_smooth() +
        xlab("distance (in miles)") +
        ylab("average arrival delay (in minutes)")
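One possible way to apply the three suggested changes is sketched below, assuming each change is a `filter()` on the per-destination summary. The names `dest_delays`, `no_hnl`, `under_4000`, `under_1000`, and `delay_plot` are introduced here for illustration only.

```r
library(tidyverse)
library(nycflights13)

# Per-destination summary, as in the cell above
dest_delays <- group_by(flights, dest) %>%
    summarize(
        mean_distance = mean(distance, na.rm = TRUE),
        mean_delay = mean(arr_delay, na.rm = TRUE)
    )

# change 1: skip Honolulu (HNL)
no_hnl <- filter(dest_delays, dest != "HNL")

# change 2: use only airports less than 4000 miles away
under_4000 <- filter(dest_delays, mean_distance < 4000)

# change 3: use only airports less than 1000 miles away
under_1000 <- filter(dest_delays, mean_distance < 1000)

# Plot one of the filtered summaries; swap in another to compare
delay_plot <- ggplot(under_1000, mapping = aes(x = mean_distance, y = mean_delay)) +
    geom_point() +
    geom_smooth() +
    xlab("distance (in miles)") +
    ylab("average arrival delay (in minutes)")

delay_plot
```

Filtering after `summarize()` keeps the grouping logic in one place, so each variant differs only in its `filter()` condition.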