
# Lab 3: Data Wrangling in dplyr

## January 24th, 2022

In [1]:
library(tidyverse)

install.packages("gapminder") # this is how you install a package that isn't included by default
library(gapminder) # this package contains the data set we'll be using

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

✔ ggplot2 3.3.5     ✔ purrr   0.3.4
✔ tibble  3.1.6     ✔ dplyr   1.0.7
✔ tidyr   1.1.4     ✔ stringr 1.4.0
✔ readr   2.1.1     ✔ forcats 0.5.1

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()



# 1. Review 
### `filter`, `arrange` and piping syntax

Recall that `dplyr` has five fundamental functions for manipulating and cleaning ("wrangling") data: `filter`, `arrange`, `select`, `mutate` and `summarise`.

## 1.1 Piping

Recall that the pipe operator `%>%` allows you to apply one or more functions to a tibble. Sample syntax:

`myTibble %>% someFunction(secondArg, thirdArg, ...)`

is equivalent to 

`someFunction(myTibble, secondArg, thirdArg, ...)`.
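
As a quick check of this equivalence, here is a minimal sketch on a small made-up tibble (`scores` and its values are invented for illustration and are not part of the lab data):

```r
library(dplyr)

# Small made-up tibble, used only for illustration
scores <- tibble(
  name  = c("a", "b", "c", "d"),
  score = c(10, 40, 20, 30)
)

# Piped form: `scores` becomes the first argument of head()
piped <- scores %>% head(2)

# Ordinary function-call form
direct <- head(scores, 2)

identical(piped, direct) # TRUE
```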

In [7]:
gapminder %>% head(10) # get the first 10 rows of the gapminder data set

country,continent,year,lifeExp,pop,gdpPercap
<fct>,<fct>,<int>,<dbl>,<int>,<dbl>
Afghanistan,Asia,1952,28.801,8425333,779.4453
Afghanistan,Asia,1957,30.332,9240934,820.853
Afghanistan,Asia,1962,31.997,10267083,853.1007
Afghanistan,Asia,1967,34.02,11537966,836.1971
Afghanistan,Asia,1972,36.088,13079460,739.9811
Afghanistan,Asia,1977,38.438,14880372,786.1134
Afghanistan,Asia,1982,39.854,12881816,978.0114
Afghanistan,Asia,1987,40.822,13867957,852.3959
Afghanistan,Asia,1992,41.674,16317921,649.3414
Afghanistan,Asia,1997,41.763,22227415,635.3414


## 1.2 Filter

Obtain **rows** that satisfy some logical condition

In [9]:
# what if I want data from the year 1982?
gapminder %>% filter(year == 1982) %>% head

country,continent,year,lifeExp,pop,gdpPercap
<fct>,<fct>,<int>,<dbl>,<int>,<dbl>
Afghanistan,Asia,1982,39.854,12881816,978.0114
Albania,Europe,1982,70.42,2780097,3630.8807
Algeria,Africa,1982,61.368,20033753,5745.1602
Angola,Africa,1982,39.942,7016384,2756.9537
Argentina,Americas,1982,69.942,29341374,8997.8974
Australia,Oceania,1982,74.74,15184200,19477.0093
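
`filter` also accepts several conditions at once. The sketch below uses a made-up tibble (`pets` is hypothetical, not lab data) to show how conditions can be combined:

```r
library(dplyr)

# Made-up data for illustration only
pets <- tibble(
  name    = c("Rex", "Tom", "Ada", "Bo"),
  species = c("dog", "cat", "cat", "dog"),
  age     = c(3, 7, 2, 9)
)

# Conditions separated by commas must all hold (same as using &)
old_dogs <- pets %>% filter(species == "dog", age > 5)

# Use | for "or", and %in% to match any of several values
young_or_cat  <- pets %>% filter(age < 3 | species == "cat")
dogs_and_cats <- pets %>% filter(species %in% c("dog", "cat"))
```

Here `old_dogs` keeps only Bo, `young_or_cat` keeps Tom and Ada, and `dogs_and_cats` keeps all four rows.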


Exercise 1: Return all rows corresponding to Asian countries with a life expectancy of at least 75 years in the year 2007.

In [12]:
# your code here

## 1.3 Arrange

Sort a tibble's rows (ascending order by default) based on one or more columns

In [15]:
# get the rows with the largest gdp per capita
gapminder %>% arrange(desc(gdpPercap)) %>% head
# note that it's all the same country (Kuwait) over different years

country,continent,year,lifeExp,pop,gdpPercap
<fct>,<fct>,<int>,<dbl>,<int>,<dbl>
Kuwait,Asia,1957,58.033,212846,113523.13
Kuwait,Asia,1972,67.712,841934,109347.87
Kuwait,Asia,1952,55.565,160000,108382.35
Kuwait,Asia,1962,60.47,358266,95458.11
Kuwait,Asia,1967,64.624,575003,80894.88
Kuwait,Asia,1977,69.343,1140357,59265.48


In [40]:
# you can also arrange based on a function of one or more columns
gapminder %>% arrange(desc(lifeExp / gdpPercap)) %>% head

country,continent,year,lifeExp,pop,gdpPercap
<fct>,<fct>,<int>,<dbl>,<int>,<dbl>
"Congo, Dem. Rep.",Africa,2002,44.966,55379852,241.1659
Myanmar,Asia,1992,59.32,40546538,347.0
"Congo, Dem. Rep.",Africa,2007,46.462,64606759,277.5519
Myanmar,Asia,1987,58.339,38028578,385.0
Myanmar,Asia,1977,56.059,31528087,371.0
Myanmar,Asia,1972,53.07,28466390,357.0
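
When `arrange` is given more than one column, later columns only break ties in the earlier ones. A minimal sketch on made-up data (`races` is hypothetical):

```r
library(dplyr)

# Made-up data for illustration only
races <- tibble(
  runner = c("Ann", "Ben", "Cy", "Dee"),
  heat   = c(2, 1, 2, 1),
  time   = c(12.1, 11.8, 11.5, 12.4)
)

# Sort by heat first (ascending); within each heat, largest time first
sorted <- races %>% arrange(heat, desc(time))

sorted$runner # "Dee" "Ben" "Ann" "Cy"
```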


Exercise 2: For each year, starting from the most recent year (2007) going backwards, sort the data alphabetically by country name. 

*Hint: the first two rows should be "Afghanistan/2007" and "Albania/2007".*

In [19]:
# your code here

## 1.4 Chaining Dplyr Operations

Each of the five `dplyr` functions takes a tibble as its first argument and returns a tibble.

Therefore, we can chain together multiple `dplyr` operations. Think of fitting a bunch of pipe segments together--the data flows through the pipe and comes out transformed.
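
To see why chaining helps readability, compare nested calls with the equivalent pipeline. This is only a sketch on made-up data (`sales` is hypothetical):

```r
library(dplyr)

# Made-up data for illustration only
sales <- tibble(
  region = c("N", "S", "N", "S", "N"),
  amount = c(5, 9, 7, 3, 1)
)

# Nested calls have to be read from the inside out
nested <- head(arrange(filter(sales, region == "N"), desc(amount)), 2)

# The same steps as a pipeline read from top to bottom
piped <- sales %>%
  filter(region == "N") %>%
  arrange(desc(amount)) %>%
  head(2)

identical(nested, piped) # TRUE
```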

In [23]:
# sort countries by population (large to small) based on 1952 pop
# then get the top 10
gapminder %>% filter(year == 1952) %>% arrange(desc(pop)) %>% head(10)

# USSR would have been number 3
# Germany was two countries (DDR/BRD)
# Bangladesh was part of Pakistan
# clearly, tracking countries over several decades is a little confusing...

country,continent,year,lifeExp,pop,gdpPercap
<fct>,<fct>,<int>,<dbl>,<int>,<dbl>
China,Asia,1952,44.0,556263527,400.4486
India,Asia,1952,37.373,372000000,546.5657
United States,Americas,1952,68.44,157553000,13990.4821
Japan,Asia,1952,63.03,86459025,3216.9563
Indonesia,Asia,1952,37.468,82052000,749.6817
Germany,Europe,1952,67.5,69145952,7144.1144
Brazil,Americas,1952,50.917,56602560,2108.9444
United Kingdom,Europe,1952,69.18,50430000,9979.5085
Italy,Europe,1952,65.94,47666000,4931.4042
Bangladesh,Asia,1952,37.484,46886859,684.2442


Exercise 3: For the year 2007, return the 10 African countries with the highest life expectancy (sorted high to low)

In [24]:
# your code here

Exercise 4: For each available year from 1952-2007, sort countries in the Americas by gdp per capita from lowest to highest

In [27]:
# your code here

# 2. Select

Return one or more columns of a tibble. (In comparison, `filter` returns rows that match some criterion.)

In [31]:
# simplest use case
gapminder %>% select(year) %>% head

year
<int>
1952
1957
1962
1967
1972
1977


In [62]:
# we can use range notation (first:last) to select adjacent columns
gapminder %>% select(continent:pop) %>% head

# or just list them separated by commas
gapminder %>% select(country, pop) %>% head

# select columns whose names match a pattern
# ends_with and starts_with
# can be useful for "wide" data sets
gapminder %>% select(ends_with("p")) %>% head

# select columns by type
# is.numeric, is.factor, is.integer, is.double, and is.character
# can come in handy here!
gapminder %>% select(where(is.numeric)) %>% head

continent,year,lifeExp,pop
<fct>,<int>,<dbl>,<int>
Asia,1952,28.801,8425333
Asia,1957,30.332,9240934
Asia,1962,31.997,10267083
Asia,1967,34.02,11537966
Asia,1972,36.088,13079460
Asia,1977,38.438,14880372


country,pop
<fct>,<int>
Afghanistan,8425333
Afghanistan,9240934
Afghanistan,10267083
Afghanistan,11537966
Afghanistan,13079460
Afghanistan,14880372


lifeExp,pop,gdpPercap
<dbl>,<int>,<dbl>
28.801,8425333,779.4453
30.332,9240934,820.853
31.997,10267083,853.1007
34.02,11537966,836.1971
36.088,13079460,739.9811
38.438,14880372,786.1134


year,lifeExp,pop,gdpPercap
<int>,<dbl>,<int>,<dbl>
1952,28.801,8425333,779.4453
1957,30.332,9240934,820.853
1962,31.997,10267083,853.1007
1967,34.02,11537966,836.1971
1972,36.088,13079460,739.9811
1977,38.438,14880372,786.1134


In [43]:
# rename columns if you wish
# syntax: new_name = old_name
gapminder %>% select(life_expectancy = lifeExp) %>% head

life_expectancy
<dbl>
28.801
30.332
31.997
34.02
36.088
38.438
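
Note that renaming inside `select` keeps only the columns you list. The sketch below, on a made-up tibble (`df` is hypothetical), contrasts this with `rename` and shows two other common `select` idioms:

```r
library(dplyr)

# Made-up data for illustration only
df <- tibble(id = 1:3, lifeExp = c(70.1, 65.4, 80.2), pop = c(10L, 20L, 30L))

# select() with new_name = old_name renames and drops every other column
only_renamed <- df %>% select(life_expectancy = lifeExp)

# rename() changes the name but keeps all columns
all_cols <- df %>% rename(life_expectancy = lifeExp)

# a minus sign drops a column; everything() stands for all remaining columns
no_pop    <- df %>% select(-pop)
pop_first <- df %>% select(pop, everything())

names(all_cols)  # "id" "life_expectancy" "pop"
names(pop_first) # "pop" "id" "lifeExp"
```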


Exercise 5: Without explicitly naming any columns, select the columns that have type "factor."

In [67]:
# your code here

# 3. Mutate

Apply a function to a column to create a new column.

In [51]:
# I want to read population in the millions
gapminder %>% arrange(desc(pop)) %>% mutate(pop_mil = pop / 1000000) %>% head

# convert years to French Republican calendar (lol)
gapminder %>% sample_n(7) %>% mutate(new_calendar_year = year - 1792) %>% 
  select(country, ends_with("year"))

country,continent,year,lifeExp,pop,gdpPercap,pop_mil
<fct>,<fct>,<int>,<dbl>,<int>,<dbl>,<dbl>
China,Asia,2007,72.961,1318683096,4959.115,1318.683
China,Asia,2002,72.028,1280400000,3119.281,1280.4
China,Asia,1997,70.426,1230075000,2289.234,1230.075
China,Asia,1992,68.69,1164970000,1655.784,1164.97
India,Asia,2007,64.698,1110396331,2452.21,1110.396
China,Asia,1987,67.274,1084035000,1378.904,1084.035


country,year,new_calendar_year
<fct>,<int>,<dbl>
Ireland,1977,185
Burkina Faso,1992,200
Djibouti,1977,185
India,1962,170
Cameroon,1957,165
Thailand,2007,215
United Kingdom,1967,175


In [58]:
# you don't have to explicitly name the new column yourself
# (though it's generally a good idea to give meaningful names)
gapminder %>% mutate(lifeExp * 12) %>% head

country,continent,year,lifeExp,pop,gdpPercap,lifeExp * 12
<fct>,<fct>,<int>,<dbl>,<int>,<dbl>,<dbl>
Afghanistan,Asia,1952,28.801,8425333,779.4453,345.612
Afghanistan,Asia,1957,30.332,9240934,820.853,363.984
Afghanistan,Asia,1962,31.997,10267083,853.1007,383.964
Afghanistan,Asia,1967,34.02,11537966,836.1971,408.24
Afghanistan,Asia,1972,36.088,13079460,739.9811,433.056
Afghanistan,Asia,1977,38.438,14880372,786.1134,461.256
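
A single `mutate` call can create several columns, and later ones may use columns created earlier in the same call. A minimal sketch on made-up data (`rooms` and its column names are hypothetical):

```r
library(dplyr)

# Made-up data for illustration only
rooms <- tibble(width_m = c(3, 4), length_m = c(5, 6))

rooms2 <- rooms %>%
  mutate(
    area_m2  = width_m * length_m, # new column
    area_ft2 = area_m2 * 10.7639   # uses area_m2, created on the line above
  )

rooms2$area_m2 # 15 24
```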


Exercise 6: Create new columns corresponding to log-scaled versions of the double (`<dbl>`) columns and return only these columns.

Imaginary bonus points for compact notation.

In [54]:
# your code here

In [60]:
# there's a flashy way of applying mutations across multiple columns
# this takes the square root of each of the numeric columns
gapminder %>% mutate(across(where(is.numeric), sqrt)) %>% head

country,continent,year,lifeExp,pop,gdpPercap
<fct>,<fct>,<dbl>,<dbl>,<dbl>,<dbl>
Afghanistan,Asia,44.18144,5.366656,2902.642,27.91855
Afghanistan,Asia,44.23799,5.50745,3039.89,28.65053
Afghanistan,Asia,44.29447,5.656589,3204.229,29.20789
Afghanistan,Asia,44.35087,5.832667,3396.758,28.91707
Afghanistan,Asia,44.40721,6.007329,3616.554,27.20259
Afghanistan,Asia,44.46347,6.199839,3857.509,28.03771


# 4. Summarise

Calculate one or more summary statistics of a tibble, returning a single row.

When given grouped data (using the `group_by` function) we get one row per group.

In [68]:
# this is not particularly interesting, but hey
gapminder %>% filter(continent == "Asia" & year == 1977) %>%
  summarise(median_life_exp = median(lifeExp))

median_life_exp
<dbl>
60.765


In [71]:
# grouping data makes things a bit more fun!
# calculates the median life expectancy for countries in each continent
gapminder %>% filter(year == 1977) %>% group_by(continent) %>%
  summarise(median_life_exp = median(lifeExp))

# can get multiple summary statistics for each group
gapminder %>% filter(year == 1977) %>% group_by(continent) %>%
  summarise(min = min(pop), med = median(pop), max = max(pop))

continent,median_life_exp
<fct>,<dbl>
Africa,49.2725
Americas,66.353
Asia,60.765
Europe,72.335
Oceania,72.855


continent,min,med,max
<fct>,<int>,<dbl>,<int>
Africa,86796,4522666,62209173
Americas,1039009,5302800,220239000
Asia,297410,13933198,943455000
Europe,221823,8741694,78160773
Oceania,3164900,8619500,14074100


In [81]:
# we can count how many countries are in each continent for a given year
gapminder %>% filter(year == 2002) %>% 
  group_by(continent) %>% summarise(n_countries = n())

# basically equivalent to the following:
gapminder_2002 <- gapminder %>% filter(year == 2002)
table(gapminder_2002$continent)

continent,n_countries
<fct>,<int>
Africa,52
Americas,25
Asia,33
Europe,30
Oceania,2



  Africa Americas     Asia   Europe  Oceania 
      52       25       33       30        2 
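
`dplyr` also provides `count()` as a shortcut for the `group_by` plus `summarise(n = n())` pattern. A small sketch on made-up data (`animals` is hypothetical):

```r
library(dplyr)

# Made-up data for illustration only
animals <- tibble(species = c("cat", "dog", "cat", "bird", "dog", "cat"))

# Counting rows per group by hand
by_hand <- animals %>% group_by(species) %>% summarise(n = n())

# count() does the grouping and counting in one step
shortcut <- animals %>% count(species)

shortcut$species # "bird" "cat" "dog"
shortcut$n       # 1 3 2
```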

In [86]:
# group by multiple columns
gapminder %>% filter(year > 1990) %>%
  group_by(continent, year) %>% 
  summarise(med_life_exp = median(lifeExp))

`summarise()` has grouped output by 'continent'. You can override using the `.groups` argument.



continent,year,med_life_exp
<fct>,<int>,<dbl>
Africa,1992,52.429
Africa,1997,52.759
Africa,2002,51.2355
Africa,2007,52.9265
Americas,1992,69.862
Americas,1997,72.146
Americas,2002,72.047
Americas,2007,72.899
Asia,1992,68.69
Asia,1997,70.265
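
The message printed above means the result is still grouped by `continent`: by default `summarise` removes only the last grouping column. The sketch below, on made-up data (`visits` is hypothetical), shows how the `.groups` argument changes that:

```r
library(dplyr)

# Made-up data for illustration only
visits <- tibble(
  clinic = c("A", "A", "A", "B", "B"),
  year   = c(2001, 2001, 2002, 2001, 2002),
  wait   = c(10, 20, 30, 40, 50)
)

# Default: the result stays grouped by clinic (and the message is printed)
still_grouped <- visits %>%
  group_by(clinic, year) %>%
  summarise(med_wait = median(wait))

# .groups = "drop" returns an ungrouped tibble and silences the message
ungrouped <- visits %>%
  group_by(clinic, year) %>%
  summarise(med_wait = median(wait), .groups = "drop")

group_vars(still_grouped) # "clinic"
group_vars(ungrouped)     # character(0)
```

Dropping the groups is often the safer choice when more `dplyr` steps follow, since later verbs would otherwise keep operating within each group.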



In [91]:
# example of advanced dplyr
# (probably not going to be doing stuff like this)
# Which countries have had the most gdp per capita gain since 1952?
gapminder %>% group_by(country) %>%
  arrange(year) %>%
  mutate(gdp_gain = last(gdpPercap) - first(gdpPercap)) %>%
  arrange(desc(gdp_gain)) %>% filter(year == 2002) %>%
  select(country, gdp_gain) %>% head(10)

country,gdp_gain
<fct>,<dbl>
Singapore,44828.04
Norway,39261.77
"Hong Kong, China",36670.56
Ireland,35465.72
Austria,29989.42
United States,28961.17
Iceland,28913.1
Japan,28439.11
Netherlands,27856.36
Taiwan,27511.33
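
The same "last minus first" idea can also be written with `summarise`, which returns one row per group directly. This is only a sketch of one possible alternative, shown on made-up data (`prices` is hypothetical), not a replacement for the cell above:

```r
library(dplyr)

# Made-up data for illustration only (rows deliberately out of time order)
prices <- tibble(
  item  = c("x", "x", "x", "y", "y", "y"),
  year  = c(2002, 2000, 2001, 2000, 2001, 2002),
  price = c(13, 10, 11, 20, 18, 25)
)

gains <- prices %>%
  arrange(year) %>%        # so that first()/last() follow time order
  group_by(item) %>%
  summarise(gain = last(price) - first(price)) %>%
  arrange(desc(gain))      # largest gain first

gains$item # "y" "x"
gains$gain # 5 3
```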


Exercise 7: Find the interquartile range in life expectancy for each continent by year.

In [None]:
# your code here