# dplyr Demo

This demo walks through dplyr, one of the tidyverse packages, which we have already seen briefly. It is based on a [blog post by Gerald Belton](https://www.r-bloggers.com/more-tidyverse-using-dplyr-functions/), with some additions of my own.


In [None]:
if(!require(tidyverse)) {
    install.packages("tidyverse", repos = "http://cran.us.r-project.org")
    library(tidyverse)
}

if(!require(gapminder)) {
    install.packages("gapminder", repos = "http://cran.us.r-project.org")
    library(gapminder)
}

First we can perform some basic filters of the data to get an idea of some of the extremes.

In [None]:
filter(gapminder, lifeExp < 30)
filter(gapminder, lifeExp > 81.5)

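`filter` does not have a separate AND clause. Conditions passed as comma-separated arguments are combined with logical AND, which is equivalent to joining them with the `&` operator as in the following example.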

In [None]:
filter(gapminder, pop > 20000000 & gdpPercap > 36000)
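
The cell above joins the two conditions with `&`. As a minimal sketch of the comma form, assuming the `dplyr` and `gapminder` packages are installed, the following compares the two spellings. The names `with_commas` and `with_ampersand` are just illustrative.

```r
library(dplyr)
library(gapminder)

# Comma-separated conditions are combined with logical AND
with_commas <- filter(gapminder, pop > 20000000, gdpPercap > 36000)

# The same conditions joined explicitly with &
with_ampersand <- filter(gapminder, pop > 20000000 & gdpPercap > 36000)

# Both forms should select the same rows
all.equal(with_commas, with_ampersand)
```

Which form to use is mostly a matter of taste; commas tend to read more cleanly when each condition sits on its own line.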

Similarly, the single pipe (|) acts as the OR operator.

In [None]:
filter(gapminder, pop > 1200000000 | gdpPercap > 100000)

The `select` function allows you to select specific variables from the data frame.

In [None]:
select(gapminder, year, lifeExp)
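
Beyond listing columns by name, `select` also accepts a few other forms. The sketch below shows three that may be useful here, assuming the standard gapminder column layout (`country`, `continent`, `year`, `lifeExp`, `pop`, `gdpPercap`). The result names are illustrative only.

```r
library(dplyr)
library(gapminder)

# Drop a column by negating it
no_pop <- select(gapminder, -pop)

# Select a contiguous range of columns by name
range_cols <- select(gapminder, country:year)

# Rename a column while selecting it
renamed <- select(gapminder, year, life_expectancy = lifeExp)

names(no_pop)
names(range_cols)
names(renamed)
```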

The pipe operator `%>%` is one of the most important operators used with dplyr (it comes from the magrittr package, and dplyr re-exports it). It takes the output of one operation and makes it the first argument of the next operation.

In [None]:
gapminder %>%
  filter(pop > 1200000000) %>%
  select(country, year, lifeExp)
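
To see what the pipe buys us, here is a sketch of the same query written both ways. The nested version has to be read from the inside out, while the piped version reads top to bottom. The names `piped` and `nested` are illustrative.

```r
library(dplyr)
library(gapminder)

# Piped form: each step feeds the next
piped <- gapminder %>%
  filter(pop > 1200000000) %>%
  select(country, year, lifeExp)

# Equivalent nested form: the innermost call runs first
nested <- select(filter(gapminder, pop > 1200000000), country, year, lifeExp)

# Both forms should produce the same result
all.equal(piped, nested)
```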

The `mutate` function adds new columns to a data frame or modifies existing ones. In this example, we will add a new calculated variable.

In [None]:
new_gap <- gapminder %>%
  mutate(gdp = pop * gdpPercap)
head(new_gap)
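
A single `mutate` call can also create several columns at once, and later expressions can refer to columns created earlier in the same call. The sketch below illustrates this. `gap_scaled` and `gdp_billions` are hypothetical names introduced for the example.

```r
library(dplyr)
library(gapminder)

gap_scaled <- gapminder %>%
  mutate(
    gdp = pop * gdpPercap,      # total GDP, as in the cell above
    gdp_billions = gdp / 1e9    # refers to the gdp column created one line up
  )

head(gap_scaled)
```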

We will sort the data using `arrange`.  This function will sort ascending by default.

In [None]:
new_gap %>%
  arrange(gdp, year) %>%
  head(10)

If we want to sort in descending order, we can use the `desc()` function inside `arrange`.  The following example shows sorting by GDP descending and then continent ascending.

In [None]:
new_gap %>%
  arrange(desc(gdp), continent) %>%
  head(10)

Grouping and summarizing data are vital to analysis. The following code groups the data by continent and then collapses each group to a single row containing the count of observations for that continent.

In [None]:
new_gap %>%
  group_by(continent) %>%
  summarize(n = n())
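
Counting rows per group is common enough that dplyr provides `count()` as a shortcut for this group-then-summarize pattern. The sketch below uses `gapminder` directly rather than `new_gap` so that it runs on its own. The result names are illustrative.

```r
library(dplyr)
library(gapminder)

# Long form: group, then count rows per group
by_summarize <- gapminder %>%
  group_by(continent) %>%
  summarize(n = n())

# Shortcut: count() groups and tallies in one step
by_count <- gapminder %>%
  count(continent)

by_count
```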

We can use grouping and summarizing to look at data at a higher level of aggregation (that is, a coarser granularity) than the data itself. For example, the following code returns the average, by continent, of all life expectancies in the series.

In [None]:
new_gap %>%
  group_by(continent) %>%
  summarize(avg_lifeExp = mean(lifeExp))

We can see the value of the pipe operator even more clearly in the following examples as we filter down to a specific year first and then group and summarize the data to generate life expectancies by continent for a particular year.  First, 1952:

In [None]:
new_gap %>%
  filter(year == 1952) %>%
  group_by(continent) %>%
  summarize(avg_lifeExp = mean(lifeExp))

Now 2007:

In [None]:
new_gap %>%
  filter(year == 2007) %>%
  group_by(continent) %>%
  summarize(avg_lifeExp = mean(lifeExp))
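
Rather than running one pipeline per year, we could also keep both years and add `year` to the grouping. This is a sketch under the assumption that dplyr 1.0.0 or later is installed, since it uses the `.groups` argument of `summarize`. The name `both_years` is illustrative.

```r
library(dplyr)
library(gapminder)

both_years <- gapminder %>%
  filter(year %in% c(1952, 2007)) %>%      # keep only the two years of interest
  group_by(continent, year) %>%            # one group per continent-year pair
  summarize(avg_lifeExp = mean(lifeExp), .groups = "drop")

both_years
```

This yields one row per continent and year, which makes the two snapshots easier to compare side by side.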

Let's put it all together and look for the biggest drop in life expectancy in each continent from one observation to the next. Note that gapminder records each country every five years, so these are five-year changes. We will use the `lag()` function to look at the previous record in a particular window (i.e., continent-country pair, with rows in year order). This also allows us to see the `top_n()` function.

In [None]:
new_gap %>%
  select(country, year, continent, lifeExp) %>%
  group_by(continent, country) %>%
  ## within country, take (lifeExp at observation i) - (lifeExp at observation i - 1)
  ## positive means lifeExp went up, negative means it went down
  mutate(le_delta = lifeExp - lag(lifeExp)) %>% 
  ## within country, retain the worst lifeExp change = smallest or most negative
  summarize(worst_le_delta = min(le_delta, na.rm = TRUE))  %>% 
  ## within continent, retain the row with the lowest worst_le_delta
  top_n(1, wt = -1 * worst_le_delta) %>% 
  arrange(worst_le_delta)
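
Two pieces of the pipeline above are worth a closer look. First, `lag()` simply shifts a vector by one position and pads the front with `NA`, which is why the first observation for each country has no delta. Second, in dplyr 1.0.0 and later, `top_n()` is superseded by `slice_min()` and `slice_max()`, which avoid the `-1 *` trick. The following is a sketch of the same query under that version assumption. `worst_drops` is an illustrative name.

```r
library(dplyr)
library(gapminder)

# lag() shifts values down by one and fills the first slot with NA
lag(c(10, 12, 11))

worst_drops <- gapminder %>%
  group_by(continent, country) %>%
  # change since the previous observation for the same country
  mutate(le_delta = lifeExp - lag(lifeExp)) %>%
  # worst change per country; keep the continent grouping afterwards
  summarize(worst_le_delta = min(le_delta, na.rm = TRUE), .groups = "drop_last") %>%
  # within each continent, keep the row with the smallest value
  slice_min(worst_le_delta, n = 1) %>%
  arrange(worst_le_delta)

worst_drops
```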

Let's follow that up by looking at the largest gains in life expectancy between consecutive observations, by continent.

In [None]:
new_gap %>%
  select(country, year, continent, lifeExp) %>%
  group_by(continent, country) %>%
  mutate(le_delta = lifeExp - lag(lifeExp)) %>% 
  summarize(best_le_delta = max(le_delta, na.rm = TRUE)) %>% 
  top_n(1, wt = best_le_delta) %>% 
  arrange(best_le_delta)

To wrap this up, what are the biggest gains in GDP per capita between consecutive observations, by continent?

In [None]:
new_gap %>%
  select(country, year, continent, gdpPercap) %>%
  group_by(continent, country) %>%
  mutate(gdp_delta = gdpPercap - lag(gdpPercap)) %>% 
  summarize(best_gdp_delta = max(gdp_delta, na.rm = TRUE)) %>% 
  top_n(1, wt = best_gdp_delta) %>% 
  arrange(best_gdp_delta)
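
The cell above works on `gdpPercap`, so it measures gains in GDP per capita. If total GDP is of interest instead, a similar pipeline could use the `gdp` column computed earlier as `pop * gdpPercap`. The sketch below recomputes that column so it runs on its own, and it assumes dplyr 1.0.0 or later for `slice_max()` and `.groups`. `best_total_gdp` is an illustrative name.

```r
library(dplyr)
library(gapminder)

best_total_gdp <- gapminder %>%
  mutate(gdp = pop * gdpPercap) %>%                 # total GDP per row
  group_by(continent, country) %>%
  mutate(gdp_delta = gdp - lag(gdp)) %>%            # change since previous observation
  summarize(best_gdp_delta = max(gdp_delta, na.rm = TRUE), .groups = "drop_last") %>%
  slice_max(best_gdp_delta, n = 1) %>%              # largest gain within each continent
  arrange(best_gdp_delta)

best_total_gdp
```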