
# Lab 3: Data Wrangling in dplyr

## January 24th, 2022

In [1]:
library(tidyverse)

install.packages("gapminder") # install a package that doesn't come with the tidyverse (only needed once per environment)
library(gapminder) # this package contains the data set we'll be using

“running command 'timedatectl' had status 1”
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

✔ ggplot2 3.3.5     ✔ purrr   0.3.4
✔ tibble  3.1.6     ✔ dplyr   1.0.7
✔ tidyr   1.1.4     ✔ stringr 1.4.0
✔ readr   2.1.1     ✔ forcats 0.5.1

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)



# 1. Review 
### `filter`, `arrange` and piping syntax

Recall that `dplyr` has five fundamental functions for manipulating and cleaning ("wrangling") data: `filter`, `arrange`, `select`, `mutate` and `summarise`.

## 1.1 Piping

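Recall that the pipe operator `%>%` passes the object on its left (here, a tibble) as the first argument of the function on its right, so you can apply a sequence of functions one after another. Sample syntax: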

`myTibble %>% someFunction(secondArg, thirdArg, ...)`

is equivalent to 

`someFunction(myTibble, secondArg, thirdArg, ...)`.
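
As a quick check of this equivalence, here is a minimal sketch. It assumes a small hand-built tibble (`mini` is an illustrative name, and its three rows are copied from the gapminder output printed in this lab) and compares the piped call with the nested one.

```r
library(dplyr)

# Three rows copied from the gapminder output printed in this lab;
# `mini` is an illustrative name, not part of the lab's data.
mini <- tibble(
  country = c("Afghanistan", "Afghanistan", "Afghanistan"),
  year    = c(1952L, 1957L, 1962L),
  lifeExp = c(28.801, 30.332, 31.997)
)

piped  <- mini %>% filter(year > 1952)  # `mini` is passed as the first argument
nested <- filter(mini, year > 1952)     # the same call written without the pipe

identical(piped, nested)  # TRUE
```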

In [2]:
# head(gapminder, 10)

gapminder %>% head(10) # get the first 10 rows of the gapminder data set

country,continent,year,lifeExp,pop,gdpPercap
<fct>,<fct>,<int>,<dbl>,<int>,<dbl>
Afghanistan,Asia,1952,28.801,8425333,779.4453
Afghanistan,Asia,1957,30.332,9240934,820.853
Afghanistan,Asia,1962,31.997,10267083,853.1007
Afghanistan,Asia,1967,34.02,11537966,836.1971
Afghanistan,Asia,1972,36.088,13079460,739.9811
Afghanistan,Asia,1977,38.438,14880372,786.1134
Afghanistan,Asia,1982,39.854,12881816,978.0114
Afghanistan,Asia,1987,40.822,13867957,852.3959
Afghanistan,Asia,1992,41.674,16317921,649.3414
Afghanistan,Asia,1997,41.763,22227415,635.3414


## 1.2 Filter

Keep only the **rows** that satisfy a logical condition.

In [3]:
# what if I want data from the year 1982?
gapminder %>% filter(year == 1982) %>% head

country,continent,year,lifeExp,pop,gdpPercap
<fct>,<fct>,<int>,<dbl>,<int>,<dbl>
Afghanistan,Asia,1982,39.854,12881816,978.0114
Albania,Europe,1982,70.42,2780097,3630.8807
Algeria,Africa,1982,61.368,20033753,5745.1602
Angola,Africa,1982,39.942,7016384,2756.9537
Argentina,Americas,1982,69.942,29341374,8997.8974
Australia,Oceania,1982,74.74,15184200,19477.0093


Exercise 1: Return all rows corresponding to Asian countries with a life expectancy of at least 75 years in the year 2007.

In [4]:
gapminder %>% filter(continent == "Asia" & lifeExp >= 75 & year == 2007)

country,continent,year,lifeExp,pop,gdpPercap
<fct>,<fct>,<int>,<dbl>,<int>,<dbl>
Bahrain,Asia,2007,75.635,708573,29796.05
"Hong Kong, China",Asia,2007,82.208,6980412,39724.98
Israel,Asia,2007,80.745,6426679,25523.28
Japan,Asia,2007,82.603,127467972,31656.07
"Korea, Rep.",Asia,2007,78.623,49044790,23348.14
Kuwait,Asia,2007,77.588,2505559,47306.99
Oman,Asia,2007,75.64,3204897,22316.19
Singapore,Asia,2007,79.972,4553009,47143.18
Taiwan,Asia,2007,78.4,23174294,28718.28
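
Conditions passed to `filter` as separate, comma-separated arguments are combined with AND, so they behave like conditions joined with `&`. The sketch below illustrates this on a hand-built tibble (`rows_2007` is an illustrative name; the four rows are copied from 2007 outputs printed in this lab).

```r
library(dplyr)

# A few 2007 rows copied from outputs printed in this lab;
# `rows_2007` is an illustrative name.
rows_2007 <- tibble(
  country   = c("Afghanistan", "Albania", "Japan", "Oman"),
  continent = c("Asia", "Europe", "Asia", "Asia"),
  lifeExp   = c(43.828, 76.423, 82.603, 75.64)
)

# conditions joined with `&`
with_and <- rows_2007 %>% filter(continent == "Asia" & lifeExp >= 75)
# the same conditions given as separate arguments
with_comma <- rows_2007 %>% filter(continent == "Asia", lifeExp >= 75)

identical(with_and, with_comma)  # TRUE
with_comma$country               # "Japan" "Oman"
```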


## 1.3 Arrange

Sort a tibble's rows by one or more columns (ascending by default; wrap a column in `desc()` for descending order).

In [6]:
# get the rows with the largest gdp per capita
gapminder %>% arrange(desc(gdpPercap)) %>% head
# note that it's all the same country (Kuwait) over different years

country,continent,year,lifeExp,pop,gdpPercap
<fct>,<fct>,<int>,<dbl>,<int>,<dbl>
Kuwait,Asia,1957,58.033,212846,113523.13
Kuwait,Asia,1972,67.712,841934,109347.87
Kuwait,Asia,1952,55.565,160000,108382.35
Kuwait,Asia,1962,60.47,358266,95458.11
Kuwait,Asia,1967,64.624,575003,80894.88
Kuwait,Asia,1977,69.343,1140357,59265.48


In [8]:
# you can also arrange by an expression involving one or more columns
# gapminder %>% arrange(year, gdpPercap) %>% head
gapminder %>% arrange(desc(lifeExp / gdpPercap)) %>% head

country,continent,year,lifeExp,pop,gdpPercap
<fct>,<fct>,<int>,<dbl>,<int>,<dbl>
"Congo, Dem. Rep.",Africa,2002,44.966,55379852,241.1659
Myanmar,Asia,1992,59.32,40546538,347.0
"Congo, Dem. Rep.",Africa,2007,46.462,64606759,277.5519
Myanmar,Asia,1987,58.339,38028578,385.0
Myanmar,Asia,1977,56.059,31528087,371.0
Myanmar,Asia,1972,53.07,28466390,357.0


Exercise 2: For each year, starting from the most recent year (2007) going backwards, sort the data alphabetically by country name. 

*Hint: the first two rows should be "Afghanistan/2007" and "Albania/2007".*

In [10]:
# your code here
gapminder %>% arrange( desc(year), country ) %>% head

country,continent,year,lifeExp,pop,gdpPercap
<fct>,<fct>,<int>,<dbl>,<int>,<dbl>
Afghanistan,Asia,2007,43.828,31889923,974.5803
Albania,Europe,2007,76.423,3600523,5937.0295
Algeria,Africa,2007,72.301,33333216,6223.3675
Angola,Africa,2007,42.731,12420476,4797.2313
Argentina,Americas,2007,75.32,40301927,12779.3796
Australia,Oceania,2007,81.235,20434176,34435.3674


## 1.4 Chaining Dplyr Operations

Each of the five `dplyr` functions takes a tibble as its first argument and returns a tibble.

Therefore, we can chain together multiple `dplyr` operations. Think of fitting a bunch of pipe segments together--the data flows through the pipe and comes out transformed.

In [11]:
# sort countries by population (large to small) based on 1952 pop
# then get the top 10
gapminder %>% filter(year == 1952) %>% arrange(desc(pop)) %>% head(10)

# USSR would have been number 3
# Germany was two countries (DDR/BRD)
# Bangladesh was part of Pakistan
# clearly, tracking countries over several decades is a little confusing...

country,continent,year,lifeExp,pop,gdpPercap
<fct>,<fct>,<int>,<dbl>,<int>,<dbl>
China,Asia,1952,44.0,556263527,400.4486
India,Asia,1952,37.373,372000000,546.5657
United States,Americas,1952,68.44,157553000,13990.4821
Japan,Asia,1952,63.03,86459025,3216.9563
Indonesia,Asia,1952,37.468,82052000,749.6817
Germany,Europe,1952,67.5,69145952,7144.1144
Brazil,Americas,1952,50.917,56602560,2108.9444
United Kingdom,Europe,1952,69.18,50430000,9979.5085
Italy,Europe,1952,65.94,47666000,4931.4042
Bangladesh,Asia,1952,37.484,46886859,684.2442
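
To see what chaining buys us, the sketch below writes the same filter/arrange/head computation twice: once as a pipeline and once as nested function calls. It assumes a small hand-built tibble (`pops` is an illustrative name; the population figures are copied from outputs printed in this lab).

```r
library(dplyr)

# Rows copied from outputs printed in this lab; `pops` is an illustrative name.
pops <- tibble(
  country = c("India", "China", "Japan", "China"),
  year    = c(1952L, 1952L, 1952L, 2007L),
  pop     = c(372000000, 556263527, 86459025, 1318683096)
)

# piped: each step reads top to bottom
piped <- pops %>%
  filter(year == 1952) %>%
  arrange(desc(pop)) %>%
  head(2)

# nested: the same computation, read from the inside out
nested <- head(arrange(filter(pops, year == 1952), desc(pop)), 2)

identical(piped, nested)  # TRUE
piped$country             # "China" "India"
```

Both forms give the same tibble; the piped version simply lists the steps in the order they happen.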


Exercise 3: For the year 2007, return the 10 African countries with the highest life expectancy (sorted high to low).

In [12]:
# your code here
gapminder %>% filter(year == 2007, continent == "Africa") %>%
  arrange(desc(lifeExp)) %>% head(10)

country,continent,year,lifeExp,pop,gdpPercap
<fct>,<fct>,<int>,<dbl>,<int>,<dbl>
Reunion,Africa,2007,76.442,798094,7670.1226
Libya,Africa,2007,73.952,6036914,12057.4993
Tunisia,Africa,2007,73.923,10276158,7092.923
Mauritius,Africa,2007,72.801,1250882,10956.9911
Algeria,Africa,2007,72.301,33333216,6223.3675
Egypt,Africa,2007,71.338,80264543,5581.181
Morocco,Africa,2007,71.164,33757175,3820.1752
Sao Tome and Principe,Africa,2007,65.528,199579,1598.4351
Comoros,Africa,2007,65.152,710960,986.1479
Mauritania,Africa,2007,64.164,3270065,1803.1515


Exercise 4: For each available year from 1952-2007, sort countries in the Americas by GDP per capita from lowest to highest.

In [13]:
gapminder %>% filter(continent == "Americas") %>%
  arrange(year, gdpPercap)

country,continent,year,lifeExp,pop,gdpPercap
<fct>,<fct>,<int>,<dbl>,<int>,<dbl>
Dominican Republic,Americas,1952,45.928,2491346,1397.717
Haiti,Americas,1952,37.579,3201488,1840.367
Paraguay,Americas,1952,62.649,1555876,1952.309
Brazil,Americas,1952,50.917,56602560,2108.944
Colombia,Americas,1952,50.643,12350771,2144.115
Honduras,Americas,1952,41.912,1517453,2194.926
Guatemala,Americas,1952,42.023,3146381,2428.238
Panama,Americas,1952,55.191,940080,2480.380
Costa Rica,Americas,1952,57.206,926317,2627.009
Bolivia,Americas,1952,40.414,2883315,2677.326


# 2. Select

Return one or more columns of a tibble. (In comparison, `filter` returns rows that match some criterion.)

In [14]:
# simplest use case
gapminder %>% select(year) %>% head

year
<int>
1952
1957
1962
1967
1972
1977


In [19]:
# we can use first:last range notation to select a run of adjacent columns
gapminder %>% select(continent:pop) %>% head

# or just list them separated by commas
gapminder %>% select(pop, country, lifeExp) %>% head

# select columns by matching their names
# the helpers ends_with() and starts_with()
# can be useful for "wide" data sets
gapminder %>% select(starts_with("co")) %>% head

# select columns by a specific type
# is.factor, is.integer, is.double, and is.character
# can come in handy here!
gapminder %>% select(where(is.factor)) %>% head

continent,year,lifeExp,pop
<fct>,<int>,<dbl>,<int>
Asia,1952,28.801,8425333
Asia,1957,30.332,9240934
Asia,1962,31.997,10267083
Asia,1967,34.02,11537966
Asia,1972,36.088,13079460
Asia,1977,38.438,14880372


pop,country,lifeExp
<int>,<fct>,<dbl>
8425333,Afghanistan,28.801
9240934,Afghanistan,30.332
10267083,Afghanistan,31.997
11537966,Afghanistan,34.02
13079460,Afghanistan,36.088
14880372,Afghanistan,38.438


country,continent
<fct>,<fct>
Afghanistan,Asia
Afghanistan,Asia
Afghanistan,Asia
Afghanistan,Asia
Afghanistan,Asia
Afghanistan,Asia


country,continent
<fct>,<fct>
Afghanistan,Asia
Afghanistan,Asia
Afghanistan,Asia
Afghanistan,Asia
Afghanistan,Asia
Afghanistan,Asia


In [20]:
# select() can rename a column while selecting it
# (use rename() instead if you want to keep all the other columns)
# syntax: new_name = old_name
gapminder %>% select(life_expectancy = lifeExp) %>% head

life_expectancy
<dbl>
28.801
30.332
31.997
34.02
36.088
38.438
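
Renaming inside `select` keeps only the columns you list. If the goal is to rename a column and keep everything else, dplyr's `rename` uses the same `new_name = old_name` syntax. A minimal sketch, assuming a hand-built tibble (`mini` is an illustrative name; the rows are copied from outputs printed in this lab):

```r
library(dplyr)

# Two rows copied from the gapminder output printed in this lab;
# `mini` is an illustrative name.
mini <- tibble(
  country = c("Afghanistan", "Afghanistan"),
  year    = c(1952L, 1957L),
  lifeExp = c(28.801, 30.332)
)

# select() renames the column and drops everything not listed
selected <- mini %>% select(life_expectancy = lifeExp)
names(selected)  # "life_expectancy"

# rename() changes the name and leaves the other columns in place
renamed <- mini %>% rename(life_expectancy = lifeExp)
names(renamed)   # "country" "year" "life_expectancy"
```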


Exercise 5: Without explicitly naming any columns, select the columns that have type "double."

In [49]:
# your code here
gapminder %>% select(where(is.double)) %>% head

lifeExp,gdpPercap
<dbl>,<dbl>
28.801,779.4453
30.332,820.853
31.997,853.1007
34.02,836.1971
36.088,739.9811
38.438,786.1134


# 3. Mutate

Create a new column (or overwrite an existing one) from an expression involving existing columns.

In [26]:
# I want to read population in the millions
gapminder %>% arrange(desc(pop)) %>% mutate(pop_in_millions = pop / 1000000) %>% head

# convert years to French Republican calendar (lol)
gapminder %>% sample_n(7) %>% mutate(new_calendar_year = year - 1792) %>% 
  select(country, ends_with("year"))

country,continent,year,lifeExp,pop,gdpPercap,pop_in_millions
<fct>,<fct>,<int>,<dbl>,<int>,<dbl>,<dbl>
China,Asia,2007,72.961,1318683096,4959.115,1318.683
China,Asia,2002,72.028,1280400000,3119.281,1280.4
China,Asia,1997,70.426,1230075000,2289.234,1230.075
China,Asia,1992,68.69,1164970000,1655.784,1164.97
India,Asia,2007,64.698,1110396331,2452.21,1110.396
China,Asia,1987,67.274,1084035000,1378.904,1084.035


country,year,new_calendar_year
<fct>,<int>,<dbl>
Cameroon,1952,160
Slovenia,1967,175
Syria,1987,195
Puerto Rico,1982,190
Senegal,1987,195
Colombia,1997,205
Chad,1957,165


In [27]:
# you don't have to explicitly name the new column yourself
# (though it's generally a good idea to give meaningful names)
gapminder %>% mutate(lifeExp * 12) %>% head

country,continent,year,lifeExp,pop,gdpPercap,lifeExp * 12
<fct>,<fct>,<int>,<dbl>,<int>,<dbl>,<dbl>
Afghanistan,Asia,1952,28.801,8425333,779.4453,345.612
Afghanistan,Asia,1957,30.332,9240934,820.853,363.984
Afghanistan,Asia,1962,31.997,10267083,853.1007,383.964
Afghanistan,Asia,1967,34.02,11537966,836.1971,408.24
Afghanistan,Asia,1972,36.088,13079460,739.9811,433.056
Afghanistan,Asia,1977,38.438,14880372,786.1134,461.256
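
One practical reason to name new columns: an unnamed column takes the text of its expression as its name, so it has to be quoted with backticks whenever you refer to it later. The sketch below shows this on a hand-built tibble (`mini` and `life_exp_months` are illustrative names; the rows are copied from outputs printed in this lab).

```r
library(dplyr)

# Two rows copied from the gapminder output printed in this lab;
# `mini` and `life_exp_months` are illustrative names.
mini <- tibble(
  country = c("Afghanistan", "Afghanistan"),
  lifeExp = c(28.801, 30.332)
)

unnamed <- mini %>% mutate(lifeExp * 12)
named   <- mini %>% mutate(life_exp_months = lifeExp * 12)

names(unnamed)  # "country" "lifeExp" "lifeExp * 12"

# the auto-generated name contains spaces, so it needs backticks
unnamed$`lifeExp * 12`
# the explicit name can be used directly
named$life_exp_months
```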


Exercise 6: Create new columns corresponding to log-scaled versions of the decimal (type "double") columns and return only these columns.

Imaginary bonus points for compact notation.

In [48]:
# your code here

gapminder %>% mutate(log_life_exp = log(lifeExp), log_gdp_per_cap = log(gdpPercap)) %>% 
  select(starts_with("log")) %>% head

log_life_exp,log_gdp_per_cap
<dbl>,<dbl>
3.36041,6.658583
3.412203,6.710344
3.465642,6.748878
3.526949,6.728864
3.58596,6.606625
3.649047,6.667101


In [36]:
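# compact alternative: across() applies log to every double column in place
# (the columns keep their original names)
gapminder %>% mutate(across(where(is.double), log)) %>%
  select(where(is.double)) %>% head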

lifeExp,gdpPercap
<dbl>,<dbl>
3.36041,6.658583
3.412203,6.710344
3.465642,6.748878
3.526949,6.728864
3.58596,6.606625
3.649047,6.667101
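
The compact version overwrites `lifeExp` and `gdpPercap` rather than creating new columns. One possible way to keep the compact notation while still creating new, clearly named columns is the `.names` argument of `across`. This is a sketch under assumptions, not the lab's official solution: the `log_` prefix and the hand-built tibble `mini` (rows copied from outputs printed in this lab) are illustrative choices.

```r
library(dplyr)

# Two rows copied from the gapminder output printed in this lab;
# `mini` is an illustrative name.
mini <- tibble(
  country   = c("Afghanistan", "Afghanistan"),
  year      = c(1952L, 1957L),
  lifeExp   = c(28.801, 30.332),
  gdpPercap = c(779.4453, 820.853)
)

logged <- mini %>%
  # .names builds each new column name from the original one ({.col}),
  # so the original double columns are left untouched
  mutate(across(where(is.double), log, .names = "log_{.col}")) %>%
  select(starts_with("log_"))

names(logged)  # "log_lifeExp" "log_gdpPercap"
```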


In [37]:
# there's a flashy way of applying mutations across multiple columns
# this takes the square root of each of the numeric columns
gapminder %>% mutate(across(where(is.numeric), sqrt)) %>% head

country,continent,year,lifeExp,pop,gdpPercap
<fct>,<fct>,<dbl>,<dbl>,<dbl>,<dbl>
Afghanistan,Asia,44.18144,5.366656,2902.642,27.91855
Afghanistan,Asia,44.23799,5.50745,3039.89,28.65053
Afghanistan,Asia,44.29447,5.656589,3204.229,29.20789
Afghanistan,Asia,44.35087,5.832667,3396.758,28.91707
Afghanistan,Asia,44.40721,6.007329,3616.554,27.20259
Afghanistan,Asia,44.46347,6.199839,3857.509,28.03771


# 4. Summarise

Collapse a tibble into summary statistics, returning a single row.

When given grouped data (created with the `group_by` function), we get one row per group.

In [38]:
# this is not particularly interesting, but hey
gapminder %>% filter(continent == "Asia" & year == 1977) %>%
  summarise(median_life_exp = median(lifeExp))

median_life_exp
<dbl>
60.765


In [40]:
# grouping data makes things a bit more fun!
# calculates the median life expectancy for countries in each continent
gapminder %>% filter(year == 1977) %>% group_by(continent) %>%
  summarise(median_life_exp = median(lifeExp))

# can get multiple summary statistics for each group
gapminder %>% filter(year == 1977) %>% group_by(continent) %>%
  summarise(min = min(pop), med = median(pop), max = max(pop), mean_pop = mean(pop))

continent,median_life_exp
<fct>,<dbl>
Africa,49.2725
Americas,66.353
Asia,60.765
Europe,72.335
Oceania,72.855


continent,min,med,max,mean_pop
<fct>,<int>,<dbl>,<int>,<dbl>
Africa,86796,4522666,62209173,8328097
Americas,1039009,5302800,220239000,23122708
Asia,297410,13933198,943455000,72257987
Europe,221823,8741694,78160773,17238818
Oceania,3164900,8619500,14074100,8619500


In [41]:
# we can count how many countries are in each continent for a given year
gapminder %>% filter(year == 2002) %>% 
  group_by(continent) %>% summarise(n_countries = n())

# basically equivalent to the following,
# using base R
gapminder_2002 <- gapminder %>% filter(year == 2002)
table(gapminder_2002$continent)

continent,n_countries
<fct>,<int>
Africa,52
Americas,25
Asia,33
Europe,30
Oceania,2



  Africa Americas     Asia   Europe  Oceania 
      52       25       33       30        2 
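
dplyr also provides `count`, a shorthand for grouping and then counting rows. A minimal sketch comparing the two forms, assuming a hand-built tibble (`rows_2007` is an illustrative name; the country/continent pairs are copied from outputs printed in this lab):

```r
library(dplyr)

# Country/continent pairs copied from outputs printed in this lab;
# `rows_2007` is an illustrative name.
rows_2007 <- tibble(
  country   = c("Afghanistan", "Albania", "Japan", "Oman", "Algeria"),
  continent = c("Asia", "Europe", "Asia", "Asia", "Africa")
)

# long form: group, then count the rows in each group
long_form <- rows_2007 %>% group_by(continent) %>% summarise(n = n())

# short form: count() does the grouping and counting in one step
short_form <- rows_2007 %>% count(continent)

identical(long_form$continent, short_form$continent)  # TRUE
identical(long_form$n, short_form$n)                  # TRUE
```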

In [42]:
# group by multiple columns
gapminder %>% filter(year > 1990) %>%
  group_by(continent, year) %>% 
  summarise(med_life_exp = median(lifeExp))

`summarise()` has grouped output by 'continent'. You can override using the `.groups` argument.



continent,year,med_life_exp
<fct>,<int>,<dbl>
Africa,1992,52.429
Africa,1997,52.759
Africa,2002,51.2355
Africa,2007,52.9265
Americas,1992,69.862
Americas,1997,72.146
Americas,2002,72.047
Americas,2007,72.899
Asia,1992,68.69
Asia,1997,70.265
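
The message above appears because, after summarising over two grouping columns, `summarise` drops only the last one (`year`) and leaves the result grouped by `continent`. Passing `.groups = "drop"` returns an ungrouped tibble and silences the message. A minimal sketch, assuming a hand-built tibble (`mini` is an illustrative name; the rows are copied from outputs printed in this lab):

```r
library(dplyr)

# Rows copied from outputs printed in this lab; `mini` is an illustrative name.
mini <- tibble(
  country   = c("Afghanistan", "Afghanistan", "Japan",
                "Congo, Dem. Rep.", "Congo, Dem. Rep.", "Algeria"),
  continent = c("Asia", "Asia", "Asia", "Africa", "Africa", "Africa"),
  year      = c(1997L, 2007L, 2007L, 1997L, 2007L, 2007L),
  lifeExp   = c(41.763, 43.828, 82.603, 42.587, 46.462, 72.301)
)

# default: the result is still grouped by continent (and a message is printed)
by_default <- mini %>%
  group_by(continent, year) %>%
  summarise(med_life_exp = median(lifeExp))

# explicit: drop all grouping after summarising
dropped <- mini %>%
  group_by(continent, year) %>%
  summarise(med_life_exp = median(lifeExp), .groups = "drop")

group_vars(by_default)  # "continent"
group_vars(dropped)     # character(0)
```

The summary values are the same either way; only the grouping carried into later steps differs.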


In [43]:
# example of advanced dplyr
# (probably not going to be doing stuff like this)
# Which countries had the largest gain in GDP per capita between 1952 and 2007?
gapminder %>% group_by(country) %>%
  arrange(year) %>%
  # gdp per cap in 2007 - gdp per cap in 1952, for *each* country
  mutate(gdp_gain = last(gdpPercap) - first(gdpPercap)) %>%
  # gdp_gain is repeated on every row, so keep one row (one year) per country
  arrange(desc(gdp_gain)) %>% filter(year == 2007) %>%
  select(country, gdp_gain) %>% head(10)

country,gdp_gain
<fct>,<dbl>
Singapore,44828.04
Norway,39261.77
"Hong Kong, China",36670.56
Ireland,35465.72
Austria,29989.42
United States,28961.17
Iceland,28913.1
Japan,28439.11
Netherlands,27856.36
Taiwan,27511.33


Exercise 7: Find the interquartile range in life expectancy for each continent by year.

In [47]:
# your code here
gapminder %>% group_by(continent, year) %>%
  # summarise(iqr_range = quantile(lifeExp, 0.75) - quantile(lifeExp, 0.25))
  summarise(iqr_range = IQR(lifeExp))

`summarise()` has grouped output by 'continent'. You can override using the `.groups` argument.



continent,year,iqr_range
<fct>,<int>,<dbl>
Africa,1952,6.306
Africa,1957,7.416
Africa,1962,8.27825
Africa,1967,8.158
Africa,1972,8.248
Africa,1977,9.358
Africa,1982,10.97025
Africa,1987,12.618
Africa,1992,11.9035
Africa,1997,11.92825
