
# Lab 3: Data Wrangling in dplyr

## January 24th, 2022

In [None]:
library(tidyverse)

install.packages("gapminder") # this is how you install a non-default package (library() then loads it)
library(gapminder) # this package contains the data set we'll be using

# 1. Review 
### `filter`, `arrange` and piping syntax

Recall that `dplyr` has five fundamental functions for manipulating and cleaning ("wrangling") data: `filter`, `arrange`, `select`, `mutate` and `summarise`.

## 1.1 Piping

Recall that the pipe operator `%>%` passes the object on its left as the first argument to the function on its right, which lets you apply one or more functions to a tibble in sequence. Sample syntax:

`myTibble %>% someFunction(secondArg, thirdArg, ...)`

is equivalent to 

`someFunction(myTibble, secondArg, thirdArg, ...)`.
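As a quick check of this equivalence, here is a minimal sketch on a made-up tibble (`toy` and its columns are hypothetical, not part of the lab's data). It compares the piped call with the ordinary function call using `identical()`:

```r
library(dplyr)

# hypothetical toy data, only for illustration
toy <- tibble(x = c(3, 1, 2), y = c("a", "b", "c"))

piped  <- toy %>% head(2)  # pipe form: toy becomes the first argument of head()
nested <- head(toy, 2)     # ordinary function-call form

# compare the two results
identical(piped, nested)
```

Both forms call `head` with the same arguments, so the two results should match.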

In [None]:
gapminder %>% head(10) # get the first 10 rows of the gapminder data set

## 1.2 Filter

Obtain **rows** that satisfy some logical condition

In [None]:
# what if I want data from the year 1982?
gapminder %>% filter(year == 1982) %>% head

Exercise 1: Return all rows corresponding to Asian countries with a life expectancy of at least 75 years in the year 2007.

In [None]:
# your code here

## 1.3 Arrange

Sort a tibble's rows (ascending order by default) based on one or more columns
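To see how sorting on more than one column behaves, here is a small sketch on invented data (the `scores` tibble is hypothetical). Later columns only break ties left by earlier ones, and `desc()` reverses the order for a single column:

```r
library(dplyr)

# hypothetical toy data, not from gapminder
scores <- tibble(
  team  = c("b", "a", "b", "a"),
  round = c(1, 2, 2, 1),
  pts   = c(10, 7, 3, 9)
)

# sort by team (ascending), then break ties by round (descending)
sorted <- scores %>% arrange(team, desc(round))
sorted
```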

In [None]:
# get the rows with the largest gdp per capita
gapminder %>% arrange(desc(gdpPercap)) %>% head
# note that it's all the same country (Kuwait) over different years

In [None]:
# you can also arrange based on a function of one or more columns
gapminder %>% arrange(desc(lifeExp / gdpPercap)) %>% head

Exercise 2: For each year, starting from the most recent year (2007) going backwards, sort the data alphabetically by country name. 

*Hint: the first two rows should be "Afghanistan/2007" and "Albania/2007"*

In [None]:
# your code here

## 1.4 Chaining dplyr Operations

Each of the five `dplyr` functions takes a tibble as its first argument and returns a tibble.

Therefore, we can chain together multiple `dplyr` operations. Think of fitting a bunch of pipe segments together--the data flows through the pipe and comes out transformed.
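One way to read a chain is as a set of nested function calls written inside out. The sketch below, on a hypothetical `toy` tibble, builds the same result both ways so you can compare how they read:

```r
library(dplyr)

# hypothetical toy data, only for illustration
toy <- tibble(
  name = c("p", "q", "r", "s"),
  grp  = c("x", "y", "x", "x"),
  val  = c(5, 9, 2, 7)
)

# chained: read left to right, one step at a time
chained <- toy %>% filter(grp == "x") %>% arrange(desc(val)) %>% head(2)

# nested: the same steps, read from the innermost call outwards
nested <- head(arrange(filter(toy, grp == "x"), desc(val)), 2)

# compare the two results
identical(chained, nested)
```

The chained version lists the steps in the order they happen, which is usually easier to follow once there are more than two of them.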

In [None]:
# sort countries by population (large to small) based on 1952 pop
# then get the top 10
gapminder %>% filter(year == 1952) %>% arrange(desc(pop)) %>% head(10)

# USSR would have been number 3
# Germany was two countries (DDR/BRD)
# Bangladesh was part of Pakistan
# clearly, tracking countries over several decades is a little confusing...

Exercise 3: For the year 2007, return the 10 African countries with the highest life expectancy (sorted high to low)

In [None]:
# your code here

Exercise 4: For each available year from 1952-2007, sort countries in the Americas by gdp per capita from lowest to highest

In [None]:
# your code here

# 2. Select

Return one or more columns of a tibble. (In comparison, `filter` returns rows that match some criterion.)
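The contrast with `filter` shows up in the dimensions of the result. In this sketch on a made-up tibble (`toy` is hypothetical), `select` reduces the number of columns while `filter` reduces the number of rows:

```r
library(dplyr)

# hypothetical toy data, only for illustration
toy <- tibble(
  id  = 1:4,
  grp = c("x", "y", "x", "y"),
  val = c(2.5, 1.0, 4.0, 3.5)
)

cols_only <- toy %>% select(id, val)  # keeps every row, but only two columns
rows_only <- toy %>% filter(val > 2)  # keeps every column, but only matching rows

dim(cols_only)  # rows, then columns
dim(rows_only)
```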

In [None]:
# simplest use case
gapminder %>% select(year) %>% head

In [None]:
# we can use vector-like notation to select multiple columns
gapminder %>% select(continent:pop) %>% head

# or just list them separated by commas
gapminder %>% select(country, pop) %>% head

# select columns whose names match a pattern
# the helpers ends_with and starts_with
# can be useful for "wide" data sets
gapminder %>% select(ends_with("p")) %>% head

# select columns by a specific type
# is.factor, is.integer, is.double, and is.character
# can come in handy here!
gapminder %>% select(where(is.factor)) %>% head

In [None]:
# rename columns as you select them
# syntax: new_name = old_name
# (only the selected columns are kept; use rename() to keep all the others)
gapminder %>% select(life_expectancy = lifeExp) %>% head

Exercise 5: Without explicitly naming any columns, select the columns that have type "factor."

In [None]:
# your code here

# 3. Mutate

Apply a function to a column to create a new column.

In [None]:
# I want to read population in the millions
gapminder %>% arrange(desc(pop)) %>% mutate(pop_mil = pop / 1000000) %>% head

# convert years to French Republican calendar (lol)
gapminder %>% sample_n(7) %>% mutate(new_calendar_year = year - 1792) %>% 
  select(country, ends_with("year"))

In [None]:
# you don't have to name the new column yourself;
# by default it is named after the expression (here, `lifeExp * 12`)
# (though it's generally a good idea to give meaningful names)
gapminder %>% mutate(lifeExp * 12) %>% head

Exercise 6: Create new columns corresponding to log-scaled versions of the decimal columns and return only these columns.

Imaginary bonus points for compact notation.

In [None]:
# your code here

In [None]:
# there's a flashy way of applying mutations across multiple columns
# this takes the square root of each of the numeric columns
gapminder %>% mutate(across(where(is.numeric), sqrt)) %>% head

# 4. Summarise

Calculate a summary statistic of a tibble, returning one observation.

When given grouped data (using the `group_by` function) we get one observation per group.
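The difference between the two cases can be seen by counting rows. A minimal sketch on invented data (`toy` is hypothetical): without grouping, `summarise` collapses the whole tibble to one row; after `group_by`, it returns one row per group.

```r
library(dplyr)

# hypothetical toy data, only for illustration
toy <- tibble(
  grp = c("x", "x", "y", "y", "y"),
  val = c(1, 3, 10, 20, 60)
)

# ungrouped: a single row summarising all five values
overall <- toy %>% summarise(med = median(val))

# grouped: one row for each distinct value of grp
by_grp <- toy %>% group_by(grp) %>% summarise(med = median(val))

nrow(overall)
nrow(by_grp)
```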

In [None]:
# this is not particularly interesting, but hey
gapminder %>% filter(continent == "Asia" & year == 1977) %>%
  summarise(median_life_exp = median(lifeExp))

In [None]:
# grouping data makes things a bit more fun!
# calculates the median life expectancy for countries in each continent
gapminder %>% filter(year == 1977) %>% group_by(continent) %>%
  summarise(median_life_exp = median(lifeExp))

# can get multiple summary statistics for each group
gapminder %>% filter(year == 1977) %>% group_by(continent) %>%
  summarise(min = min(pop), med = median(pop), max = max(pop))

In [None]:
# we can count how many countries are in each continent for a given year
gapminder %>% filter(year == 2002) %>% 
  group_by(continent) %>% summarise(n_countries = n())

# basically equivalent to the following:
gapminder_2002 <- gapminder %>% filter(year == 2002)
table(gapminder_2002$continent)

In [None]:
# group by multiple columns
gapminder %>% filter(year > 1990) %>%
  group_by(continent, year) %>% 
  summarise(med_life_exp = median(lifeExp))

In [None]:
# example of advanced dplyr
# (probably not going to be doing stuff like this)
# Which countries have had the largest gain in gdp per capita since 1952?
gapminder %>% group_by(country) %>%
  arrange(year) %>%
  # gain = latest value minus earliest value, repeated on every row of a country
  mutate(gdp_gain = last(gdpPercap) - first(gdpPercap)) %>%
  # keep a single row per country (gdp_gain is the same for every year)
  arrange(desc(gdp_gain)) %>% filter(year == 2007) %>%
  select(country, gdp_gain) %>% head(10)

Exercise 7: Find the interquartile range in life expectancy for each continent by year.

In [None]:
# your code here