# dplyr Demo

This demo walks through dplyr, another of the tidyverse packages, which we have already seen a little of. It is based on a [blog post by Gerald Belton](https://www.r-bloggers.com/more-tidyverse-using-dplyr-functions/), with my own additions.


In [None]:
# These two packages had to be installed separately before tidyverse and gapminder would load.
install.packages('nlme', repos = "http://cran.us.r-project.org", dependencies = TRUE)
install.packages('foreign', repos = "http://cran.us.r-project.org", dependencies = TRUE)

install.packages('tidyverse', repos = "http://cran.us.r-project.org", dependencies = TRUE)
install.packages('gapminder', repos = "http://cran.us.r-project.org", dependencies = TRUE)

In [1]:
library(tidyverse)
library(gapminder)

"package 'tidyverse' was built under R version 3.3.3"
Loading tidyverse: ggplot2
Loading tidyverse: tibble
Loading tidyverse: tidyr
Loading tidyverse: readr
Loading tidyverse: purrr
Loading tidyverse: dplyr
"package 'dplyr' was built under R version 3.3.3"
Conflicts with tidy packages ---------------------------------------------------
filter(): dplyr, stats
lag():    dplyr, stats
"package 'gapminder' was built under R version 3.3.3"

First we can perform some basic filters of the data to get an idea of some of the extremes.

In [2]:
filter(gapminder, lifeExp < 30)
filter(gapminder, lifeExp > 81.5)

country,continent,year,lifeExp,pop,gdpPercap
Afghanistan,Asia,1952,28.801,8425333,779.4453
Rwanda,Africa,1992,23.599,7290203,737.0686


country,continent,year,lifeExp,pop,gdpPercap
"Hong Kong, China",Asia,2007,82.208,6980412,39724.98
Iceland,Europe,2007,81.757,301931,36180.79
Japan,Asia,2002,82.0,127065841,28604.59
Japan,Asia,2007,82.603,127467972,31656.07
Switzerland,Europe,2007,81.701,7554661,37506.42


`filter()` has no separate AND clause; instead, conditions separated by commas are combined with a logical AND, which gives the same result as joining them with `&`.

In [3]:
filter(gapminder, pop > 20000000, gdpPercap > 36000)

country,continent,year,lifeExp,pop,gdpPercap
Canada,Americas,2007,80.653,33390141,36319.24
United States,Americas,2002,77.31,287675526,39097.1
United States,Americas,2007,78.242,301139947,42951.65
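
As a quick check of the comma behavior, here is a minimal sketch that compares comma-separated conditions with an explicit `&`. The `gap_small` data frame is a hypothetical subset built by hand from rows shown in the outputs above, so the example runs without the gapminder package.

```r
library(dplyr)

# Hypothetical subset: a few gapminder rows copied from the outputs above
gap_small <- data.frame(
  country   = c("Canada", "United States", "United States", "Iceland", "Switzerland"),
  year      = c(2007, 2002, 2007, 2007, 2007),
  pop       = c(33390141, 287675526, 301139947, 301931, 7554661),
  gdpPercap = c(36319.24, 39097.1, 42951.65, 36180.79, 37506.42),
  stringsAsFactors = FALSE
)

# Comma-separated conditions are combined with AND
with_commas <- filter(gap_small, pop > 20000000, gdpPercap > 36000)

# Joining the same conditions with `&` keeps the same rows
with_amp <- filter(gap_small, pop > 20000000 & gdpPercap > 36000)

print(with_commas)
print(identical(with_commas$country, with_amp$country))
```

Both forms keep only the Canada and United States rows; Iceland and Switzerland pass the GDP test but fail the population test.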


Similarly, the vertical bar (`|`) acts as the logical OR operator: a row is kept if at least one of the conditions is true.

In [4]:
filter(gapminder, pop > 1200000000 | gdpPercap > 100000)

country,continent,year,lifeExp,pop,gdpPercap
China,Asia,1997,70.426,1230075000,2289.234
China,Asia,2002,72.028,1280400000,3119.281
China,Asia,2007,72.961,1318683096,4959.115
Kuwait,Asia,1952,55.565,160000,108382.353
Kuwait,Asia,1957,58.033,212846,113523.133
Kuwait,Asia,1972,67.712,841934,109347.867
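
OR and AND can be mixed in one `filter()` call. The sketch below is illustrative only: `gap_small` is a hypothetical hand-built subset of rows from the outputs above, and the `year >= 2000` condition is an added assumption used to show how a comma-separated condition applies on top of an OR expression.

```r
library(dplyr)

# Hypothetical subset: rows copied from the outputs above
gap_small <- data.frame(
  country   = c("China", "Kuwait", "Canada"),
  year      = c(2007, 1957, 2007),
  pop       = c(1318683096, 212846, 33390141),
  gdpPercap = c(4959.115, 113523.133, 36319.24),
  stringsAsFactors = FALSE
)

# `|` keeps a row when at least one condition is TRUE
either <- filter(gap_small, pop > 1200000000 | gdpPercap > 100000)

# A further comma-separated condition is ANDed with the whole OR expression
recent_either <- filter(gap_small, pop > 1200000000 | gdpPercap > 100000, year >= 2000)

print(either$country)
print(recent_either$country)
```

Here `either` keeps China and Kuwait, while `recent_either` keeps only China, because the Kuwait row is from 1957.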


The `select` function allows you to select specific variables from the data frame.

In [5]:
select(gapminder, year, lifeExp)

year,lifeExp
1952,28.801
1957,30.332
1962,31.997
1967,34.020
1972,36.088
1977,38.438
1982,39.854
1987,40.822
1992,41.674
1997,41.763
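
Beyond naming columns one by one, `select` accepts a few other forms. The following is a small sketch, using a hypothetical two-row `gap_small` data frame copied from earlier outputs, that shows dropping a column, selecting a range, using the `starts_with()` helper, and renaming while selecting. The name `life_expectancy` is made up for the example.

```r
library(dplyr)

# Hypothetical subset: two gapminder rows copied from the outputs above
gap_small <- data.frame(
  country   = c("Afghanistan", "Rwanda"),
  continent = c("Asia", "Africa"),
  year      = c(1952, 1992),
  lifeExp   = c(28.801, 23.599),
  pop       = c(8425333, 7290203),
  gdpPercap = c(779.4453, 737.0686),
  stringsAsFactors = FALSE
)

no_pop   <- select(gap_small, -pop)                 # everything except pop
id_cols  <- select(gap_small, country:year)         # a contiguous range of columns
gdp_cols <- select(gap_small, starts_with("gdp"))   # columns whose names start with "gdp"
renamed  <- select(gap_small, country, life_expectancy = lifeExp)  # rename while selecting

print(names(no_pop))
print(names(id_cols))
print(names(gdp_cols))
print(names(renamed))
```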


The pipe operator `%>%` is one of the most important operators you will use with dplyr (it originates in the magrittr package, and dplyr re-exports it). It takes the output of one operation and makes it the input, by default the first argument, of the next operation.

In [6]:
gapminder %>%
  filter(pop > 1200000000) %>%
  select(country, year, lifeExp)

country,year,lifeExp
China,1997,70.426
China,2002,72.028
China,2007,72.961
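
To make the pipe less mysterious, this sketch writes the same filter-then-select step twice, once piped and once as nested function calls, and checks that the results match. `gap_small` is a hypothetical hand-built subset of rows from the outputs above.

```r
library(dplyr)

# Hypothetical subset: rows copied from the outputs above
gap_small <- data.frame(
  country = c("China", "China", "China", "Japan", "Japan"),
  year    = c(1997, 2002, 2007, 2002, 2007),
  lifeExp = c(70.426, 72.028, 72.961, 82.0, 82.603),
  pop     = c(1230075000, 1280400000, 1318683096, 127065841, 127467972),
  stringsAsFactors = FALSE
)

# Piped: read top to bottom, each step feeding the next
piped <- gap_small %>%
  filter(pop > 1200000000) %>%
  select(country, year, lifeExp)

# Nested: the same steps, read from the inside out
nested <- select(filter(gap_small, pop > 1200000000), country, year, lifeExp)

print(identical(piped, nested))
```

The piped version scales better as steps are added, since each new step is one more line rather than one more layer of parentheses.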


The `mutate` function adds new columns to a data frame or changes existing ones. In this example, we add a new calculated variable, total GDP, computed as population times GDP per capita.

In [7]:
new_gap <- gapminder %>%
  mutate(gdp = pop * gdpPercap)
head(new_gap)

country,continent,year,lifeExp,pop,gdpPercap,gdp
Afghanistan,Asia,1952,28.801,8425333,779.4453,6567086330
Afghanistan,Asia,1957,30.332,9240934,820.853,7585448670
Afghanistan,Asia,1962,31.997,10267083,853.1007,8758855797
Afghanistan,Asia,1967,34.02,11537966,836.1971,9648014150
Afghanistan,Asia,1972,36.088,13079460,739.9811,9678553274
Afghanistan,Asia,1977,38.438,14880372,786.1134,11697659231
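
A single `mutate` call can do more than add one column. The sketch below, on a hypothetical two-row subset copied from the output above, adds a column, reuses it on the next line, and overwrites an existing column. The `gdp_billions` column and the rescaling of `pop` to millions are assumptions made for illustration.

```r
library(dplyr)

# Hypothetical subset: two rows copied from the output above
gap_small <- data.frame(
  country   = c("Afghanistan", "Afghanistan"),
  year      = c(1952, 1957),
  pop       = c(8425333, 9240934),
  gdpPercap = c(779.4453, 820.853),
  stringsAsFactors = FALSE
)

result <- gap_small %>%
  mutate(
    gdp          = pop * gdpPercap,  # new column from two existing ones
    gdp_billions = gdp / 1e9,        # can use the column created on the previous line
    pop          = pop / 1e6         # assigning to an existing name overwrites it
  )

print(result)
```

Expressions are evaluated in order, so `gdp` is computed from the original `pop` before `pop` is rescaled to millions.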


We will sort the data using `arrange`.  This function will sort ascending by default.

In [8]:
new_gap %>%
  arrange(gdp, year) %>%
  head(10)

country,continent,year,lifeExp,pop,gdpPercap,gdp
Sao Tome and Principe,Africa,1952,46.471,60011,879.5836,52784691
Sao Tome and Principe,Africa,1957,48.945,61325,860.7369,52784691
Sao Tome and Principe,Africa,1962,51.893,65345,1071.5511,70020508
Equatorial Guinea,Africa,1952,34.482,216964,375.6431,81501035
Sao Tome and Principe,Africa,1967,54.425,70787,1384.8406,98028711
Equatorial Guinea,Africa,1957,35.983,232922,426.0964,99247228
Sao Tome and Principe,Africa,1972,56.48,76595,1532.9853,117419006
Gambia,Africa,1952,30.0,284320,485.2307,137960781
Equatorial Guinea,Africa,1962,37.485,249220,582.842,145255876
Sao Tome and Principe,Africa,1977,58.55,86796,1737.5617,150813402


If we want to sort in descending order, we can use the `desc()` function inside `arrange`. The following example sorts by GDP descending and then, to break any ties, by continent ascending.

In [9]:
new_gap %>%
  arrange(desc(gdp), continent) %>%
  head(10)

country,continent,year,lifeExp,pop,gdpPercap,gdp
United States,Americas,2007,78.242,301139947,42951.653,12934460000000.0
United States,Americas,2002,77.31,287675526,39097.1,11247280000000.0
United States,Americas,1997,76.81,272911760,35767.433,9761353000000.0
United States,Americas,1992,76.09,256894189,32003.932,8221624000000.0
United States,Americas,1987,75.02,242803533,29884.35,7256026000000.0
China,Asia,2007,72.961,1318683096,4959.115,6539501000000.0
United States,Americas,1982,74.65,232187835,25009.559,5806915000000.0
United States,Americas,1977,73.38,220239000,24072.632,5301732000000.0
United States,Americas,1972,71.34,209896000,21806.036,4577000000000.0
Japan,Asia,2007,82.603,127467972,31656.068,4035135000000.0
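
The role of each sort key is easier to see when the first key has repeated values. This sketch, using a hypothetical hand-built subset of rows from earlier outputs, sorts by continent ascending and then by life expectancy descending within each continent.

```r
library(dplyr)

# Hypothetical subset: rows copied from earlier outputs
gap_small <- data.frame(
  country   = c("Hong Kong, China", "Iceland", "Japan", "Switzerland"),
  continent = c("Asia", "Europe", "Asia", "Europe"),
  year      = c(2007, 2007, 2007, 2007),
  lifeExp   = c(82.208, 81.757, 82.603, 81.701),
  stringsAsFactors = FALSE
)

# First key: continent (ascending); second key: lifeExp (descending) within each continent
sorted <- gap_small %>%
  arrange(continent, desc(lifeExp))

print(sorted)
```

The rows come out as Japan, Hong Kong, Iceland, Switzerland: Asia before Europe, and the higher life expectancy first within each continent.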


Grouping and summarizing data are vital to analysis. The following code groups the data by continent and then collapses each group to a single row holding the count of observations for that continent.

In [10]:
new_gap %>%
  group_by(continent) %>%
  summarize(n = n())

continent,n
Africa,624
Americas,300
Asia,396
Europe,360
Oceania,24
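
Counting rows per group is common enough that dplyr provides `count()` as a shortcut. The sketch below, on a hypothetical subset of rows from earlier outputs, checks that it gives the same counts as the `group_by` plus `summarize(n = n())` pattern.

```r
library(dplyr)

# Hypothetical subset: rows copied from earlier outputs
gap_small <- data.frame(
  country   = c("Afghanistan", "Rwanda", "Hong Kong, China", "Iceland",
                "Japan", "Japan", "Switzerland"),
  continent = c("Asia", "Africa", "Asia", "Europe", "Asia", "Asia", "Europe"),
  stringsAsFactors = FALSE
)

# The long form used above
by_hand <- gap_small %>%
  group_by(continent) %>%
  summarize(n = n())

# The shortcut: groups, counts, and ungroups in one step
shortcut <- gap_small %>%
  count(continent)

print(by_hand)
print(shortcut)
```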


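We can use grouping and summarizing to look at the data at a coarser level of granularity than the data itself (one row per continent rather than one row per country and year). For example, the following code returns the average, by continent, of all life expectancies in the series. Each country-year row counts equally; the mean is not weighted by population.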

In [11]:
new_gap %>%
  group_by(continent) %>%
  summarize(avg_lifeExp = mean(lifeExp))

continent,avg_lifeExp
Africa,48.86533
Americas,64.65874
Asia,60.0649
Europe,71.90369
Oceania,74.32621
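
`summarize` can compute several statistics per group in one call. A minimal sketch on a hypothetical subset of rows from earlier outputs follows; the column names `n_obs`, `lowest`, and `highest` are made up for the example.

```r
library(dplyr)

# Hypothetical subset: rows copied from earlier outputs
gap_small <- data.frame(
  country   = c("Afghanistan", "Rwanda", "Hong Kong, China", "Iceland",
                "Japan", "Japan", "Switzerland"),
  continent = c("Asia", "Africa", "Asia", "Europe", "Asia", "Asia", "Europe"),
  lifeExp   = c(28.801, 23.599, 82.208, 81.757, 82.0, 82.603, 81.701),
  stringsAsFactors = FALSE
)

# Each named argument becomes one column of the one-row-per-group result
stats <- gap_small %>%
  group_by(continent) %>%
  summarize(
    n_obs       = n(),
    avg_lifeExp = mean(lifeExp),
    lowest      = min(lifeExp),
    highest     = max(lifeExp)
  )

print(stats)
```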


We can see the value of the pipe operator even more clearly in the following examples as we filter down to a specific year first and then group and summarize the data to generate life expectancies by continent for a particular year.  First, 1952:

In [12]:
new_gap %>%
  filter(year == 1952) %>%
  group_by(continent) %>%
  summarize(avg_lifeExp = mean(lifeExp))

continent,avg_lifeExp
Africa,39.1355
Americas,53.27984
Asia,46.31439
Europe,64.4085
Oceania,69.255


Now 2007:

In [13]:
new_gap %>%
  filter(year == 2007) %>%
  group_by(continent) %>%
  summarize(avg_lifeExp = mean(lifeExp))

continent,avg_lifeExp
Africa,54.80604
Americas,73.60812
Asia,70.72848
Europe,77.6486
Oceania,80.7195


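Let's put it all together and look for the single biggest drop in life expectancy by continent from one observation to the next. Note that gapminder records one observation every five years, so each change spans five years rather than one. We will use the `lag()` function to look at the previous record in a particular window (i.e., continent-country pair, sorted by year). This also allows us to see the `top_n()` function, which here keeps the one row per continent with the most negative change.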

In [14]:
new_gap %>%
  select(country, year, continent, lifeExp) %>%
  group_by(continent, country) %>%
  ## within country, take (lifeExp at observation i) - (lifeExp at observation i - 1)
  ## positive means lifeExp went up, negative means it went down
  mutate(le_delta = lifeExp - lag(lifeExp)) %>% 
  ## within country, retain the worst lifeExp change = smallest or most negative
  summarize(worst_le_delta = min(le_delta, na.rm = TRUE))  %>% 
  ## within continent, retain the row with the lowest worst_le_delta
  top_n(1, wt = -1 * worst_le_delta) %>% 
  arrange(worst_le_delta)

continent,country,worst_le_delta
Africa,Rwanda,-20.421
Asia,Cambodia,-9.097
Americas,El Salvador,-1.511
Europe,Montenegro,-1.464
Oceania,Australia,0.17
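
The key step in the pipeline above is `lag()` inside a grouped `mutate`. This sketch isolates that step on a hypothetical five-row subset copied from earlier outputs. Because `lag()` depends on row order, the rows are sorted explicitly first; the original pipeline relies on gapminder already being sorted by country and year.

```r
library(dplyr)

# Hypothetical subset: rows copied from earlier outputs
gap_small <- data.frame(
  country = c("Japan", "China", "China", "Japan", "China"),
  year    = c(2007, 2002, 1997, 2002, 2007),
  lifeExp = c(82.603, 72.028, 70.426, 82.0, 72.961),
  stringsAsFactors = FALSE
)

deltas <- gap_small %>%
  arrange(country, year) %>%   # lag() looks at the previous ROW, so order matters
  group_by(country) %>%        # the lag restarts at the first row of each country
  mutate(le_delta = lifeExp - lag(lifeExp)) %>%
  ungroup()

print(deltas)
```

The first row of each country gets `NA`, since it has no previous observation; that is why the pipeline above passes `na.rm = TRUE` to `min()`.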


Let's follow that up by looking at the largest gains in life expectancy between consecutive observations (five years apart), by continent.

In [15]:
new_gap %>%
  select(country, year, continent, lifeExp) %>%
  group_by(continent, country) %>%
  mutate(le_delta = lifeExp - lag(lifeExp)) %>% 
  summarize(best_le_delta = max(le_delta, na.rm = TRUE)) %>% 
  top_n(1, wt = best_le_delta) %>% 
  arrange(best_le_delta)

continent,country,best_le_delta
Oceania,New Zealand,2.01
Americas,El Salvador,6.55
Europe,Bulgaria,7.01
Africa,Rwanda,12.488
Asia,Cambodia,19.737
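
In dplyr 1.0.0 and later, `top_n()` is marked as superseded in favor of `slice_max()` and `slice_min()`, which state the direction in the function name instead of through the sign of `wt`. The sketch below assumes a dplyr version of at least 1.0.0 and uses a hypothetical subset of rows from earlier outputs to compare the two styles.

```r
library(dplyr)  # assumes dplyr >= 1.0.0 for slice_max() / slice_min()

# Hypothetical subset: rows copied from earlier outputs
gap_small <- data.frame(
  country   = c("Hong Kong, China", "Iceland", "Japan", "Japan", "Switzerland"),
  continent = c("Asia", "Europe", "Asia", "Asia", "Europe"),
  year      = c(2007, 2007, 2002, 2007, 2007),
  lifeExp   = c(82.208, 81.757, 82.0, 82.603, 81.701),
  stringsAsFactors = FALSE
)

grouped <- gap_small %>% group_by(continent)

# Highest lifeExp per continent, written both ways
highest_top_n <- grouped %>% top_n(1, wt = lifeExp)
highest_slice <- grouped %>% slice_max(lifeExp, n = 1)

# Lowest lifeExp per continent: negated weight versus slice_min()
lowest_top_n <- grouped %>% top_n(1, wt = -1 * lifeExp)
lowest_slice <- grouped %>% slice_min(lifeExp, n = 1)

print(highest_slice)
print(lowest_slice)
```

Both styles pick the same rows, although the row order of the results can differ, so compare them after sorting.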


To wrap this up, what are the biggest gains in GDP per capita between consecutive observations, by continent?

In [16]:
new_gap %>%
  select(country, year, continent, gdpPercap) %>%
  group_by(continent, country) %>%
  mutate(gdp_delta = gdpPercap - lag(gdpPercap)) %>% 
  summarize(best_gdp_delta = max(gdp_delta, na.rm = TRUE)) %>% 
  top_n(1, wt = best_gdp_delta) %>% 
  arrange(best_gdp_delta)

continent,country,best_gdp_delta
Oceania,Australia,3747.613
Americas,Trinidad and Tobago,6547.909
Europe,Ireland,9555.102
Africa,Libya,12015.721
Asia,Kuwait,28452.984
