<a href="https://colab.research.google.com/github/chathasphere/chathasphere.github.io/blob/main/teaching/306_materials/003_lab4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lab 4: Wrapping up dplyr Operations

## January 31st, 2022

In [None]:
library(tidyverse)
options(repr.plot.height=4)

# 1. dplyr Loose Ends
### review of `select`, `mutate`, `summarise`
### & introducing a few more simple commands

These are a family of "one table verbs" in dplyr. (Joins are an example of a "two table verb.")
For reference: https://dplyr.tidyverse.org/reference/index.html#section-one-table-verbs

In [None]:
str(midwest)


## 1.1 Select

`select` returns specific columns, chosen by position or name, or every column matched by a selection helper such as `starts_with`.

In [None]:
midwest %>% select(2:5) %>% head

In [None]:
midwest %>% select(starts_with("perc")) %>% head
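A few other ways of picking columns, as a minimal sketch. The helpers `contains` and `where` and negative selection with `-` are standard tidyselect features; the variable names (`pop_cols`, `chr_cols`, `no_perc`) are made up for illustration.

```r
library(dplyr)
library(ggplot2)  # provides the midwest dataset

# Columns whose names contain a substring
pop_cols <- midwest %>% select(contains("pop"))

# Columns chosen by type rather than by name
chr_cols <- midwest %>% select(where(is.character))

# Everything except a set of columns (negative selection)
no_perc <- midwest %>% select(-starts_with("perc"))

names(chr_cols)
```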

## 1.2 Rename

We can also use `select` to rename columns, though there is a designated operation `rename` that accomplishes this *without* losing the other columns.

In [None]:
# renaming with select keeps only the renamed column
midwest %>% select(percentage_white = percwhite) %>% head()

In [None]:
# renaming with rename keeps everything
midwest %>% rename(percentage_white = percwhite) %>% select(starts_with("perc")) %>%
  head

The function `rename_with` applies a (string) function to column names: all of them by default, or a specific subset if you supply a selection.

In [None]:
# uppercase all column names
midwest %>% rename_with(toupper) %>% str

In [None]:
# uppercase only the names of columns of chr (string) type
midwest %>% rename_with(toupper, where(is.character)) %>% str

## 1.3 Mutate

Apply a function to an existing column to create a new column, or to overwrite the old one if you reuse its name.

In [None]:
midwest %>% mutate(county = tolower(county)) %>% select(county, state) %>% head

In [None]:
# select the new column by name rather than by position (it is column 29)
midwest %>% mutate(logpop = log(poptotal)) %>% select(1:5, logpop) %>% head

We didn't go over this last time, but it's possible to apply the same mutation to multiple columns. This gets syntactically a bit confusing...

In [None]:
midwest %>% mutate(across(starts_with("perc"), ~ . / 100)) %>% select(starts_with("perc")) %>% head

Breaking down the syntax above...
* `across`: specifies the set of columns to mutate
* `~`: starts a formula, which is shorthand for an anonymous function applied to each selected column
* `.`: dummy variable standing in for the current column
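To make the formula shorthand less mysterious, here is a small sketch showing three equivalent ways of writing the same mutation. The tibble `toy` and its column names are hypothetical, not part of the lab data.

```r
library(dplyr)

# Hypothetical toy data, not part of the lab
toy <- tibble(perc_a = c(10, 50), perc_b = c(25, 100), other = c(1, 2))

# Formula shorthand with `.` as the placeholder
v1 <- toy %>% mutate(across(starts_with("perc"), ~ . / 100))

# Same thing with the more explicit `.x` placeholder
v2 <- toy %>% mutate(across(starts_with("perc"), ~ .x / 100))

# Same thing with an ordinary anonymous function
v3 <- toy %>% mutate(across(starts_with("perc"), function(col) col / 100))

# Columns that do not match starts_with("perc") are left untouched
v1
```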

How about renaming the percentage columns (since they're no longer percentages)?

In [None]:
midwest_scaled <- midwest %>%
   mutate(across(starts_with("perc"), ~ . / 100))
midwest_scaled %>% rename_with(~ str_replace(., pattern="perc", replacement = "prop_")) %>%
  select(starts_with("prop_")) %>% head

Annoyingly, they decided to name the column `percollege` instead of `perccollege`, so the replacement above turns it into `prop_ollege`.
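One possible workaround for the `percollege` spelling, sketched under the assumption that you are happy to rename that column first. The name `midwest_props` is made up for this example.

```r
library(dplyr)
library(stringr)
library(ggplot2)  # provides the midwest dataset

midwest_props <- midwest %>%
  mutate(across(starts_with("perc"), ~ . / 100)) %>%
  # give the odd column the same spelling as the others
  rename(perccollege = percollege) %>%
  # "^perc" anchors the match at the start of the column name
  rename_with(~ str_replace(., "^perc", "prop_"), starts_with("perc"))

midwest_props %>% select(starts_with("prop_")) %>% names()
```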

I personally find the documentation for `mutate` unhelpful. Google searches and Stack Overflow are your friends here (and of course you can ask the GSI team for help!)

## 1.4 Summarise

Not very useful on its own, `summarise` shines when we are grouping data.
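A quick contrast on hypothetical toy data (the tibble `toy` is not part of the lab): without grouping, `summarise` collapses the whole table to a single row; with `group_by`, it returns one row per group.

```r
library(dplyr)

# Hypothetical toy data, not part of the lab
toy <- tibble(g = c("a", "a", "b"), x = c(1, 2, 10))

# Ungrouped: the whole table collapses to one row
overall <- toy %>% summarise(total = sum(x))

# Grouped: one row per value of g
by_group <- toy %>% group_by(g) %>% summarise(total = sum(x))

overall
by_group
```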

In [None]:
midwest %>% group_by(state) %>% summarise(pop = sum(poptotal)) %>% 
  mutate(pop_millions = pop / 1000000) %>% arrange(desc(pop))

In [None]:
midwest %>% group_by(state) %>% summarise(area = sum(area), pop = sum(poptotal)) %>%
  mutate(pop_density = (pop / 1000) / area) %>% arrange(pop_density)

In [None]:
head(midwest)

In [None]:
midwest %>% group_by(state) %>% summarise(n_counties = n())

In [None]:
midwest %>% group_by(state) %>% summarise(n_categories = n_distinct(category))

In [None]:
midwest %>% group_by(category) %>% summarise(med_density = median(popdensity)) %>%
  arrange(med_density)

It's somewhat mysterious what these categories mean, since quoting the docs, "The original descriptions were not documented and the current descriptions here are based on speculation." Cool cool cool.

## 1.5 Count

An alternative way of getting a quick count of rows per group.

In [None]:
midwest %>% count(state)

# equivalent to 
# midwest %>% group_by(state) %>% summarise(n = n())
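`count` also takes a few optional arguments that save typing. A brief sketch: `sort`, `wt`, and `name` are real `count` arguments, while `pop_by_state` is just a name chosen for this example.

```r
library(dplyr)
library(ggplot2)  # provides the midwest dataset

# sort = TRUE orders the groups from most rows to fewest
midwest %>% count(state, sort = TRUE)

# wt = sums a column within each group instead of counting rows;
# name = sets the name of the output column
pop_by_state <- midwest %>% count(state, wt = poptotal, name = "pop")

# equivalent to
# midwest %>% group_by(state) %>% summarise(pop = sum(poptotal))
pop_by_state
```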

## 1.6 Slice

Similar to `filter`, except you return rows by position (specific indices) rather than by logical criteria.

In [None]:
midwest %>% slice(100:110)
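`slice` has a few relatives that pick rows by position or by the value of a column. A minimal sketch, assuming dplyr 1.0.0 or later; the variable names are made up for illustration.

```r
library(dplyr)
library(ggplot2)  # provides the midwest dataset

# First three rows
first3 <- midwest %>% slice_head(n = 3)

# Three rows with the largest total population
# (with_ties = FALSE guarantees exactly n rows)
largest3 <- midwest %>% slice_max(poptotal, n = 3, with_ties = FALSE)

# Works per group too: the most populous county in each state
top_per_state <- midwest %>%
  group_by(state) %>%
  slice_max(poptotal, n = 1, with_ties = FALSE)

largest3 %>% select(county, state, poptotal)
```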

# 2. Visualization / Exercises

Exercise 1: Create a bar plot of population per state, with each bar stacked by category.

In [None]:
# your code here

Exercise 2: Plot the relationship between adult poverty and percentage of college grads on the log10 scale, coloring by state.

In [None]:
# your code here

Exercise 3a: Create a new factor column called `poverty_level` with levels:
* "critical" if poverty rate is above 25%
* "severe" if poverty rate is between 15% and 25% 
* "normal" otherwise.

Hint: use either the `cut` function with appropriate breaks and labels, or else write your own binning function and apply it with `mutate`.
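If you have not used `cut` before, here is how it behaves on a hypothetical vector of test scores (deliberately not the poverty data, and with made-up breaks and labels). Note that by default each interval includes its right endpoint.

```r
# Hypothetical toy data, unrelated to the exercise
scores <- c(55, 72, 91, 80)

# Three bins: (-Inf, 60], (60, 80], (80, Inf)
grade <- cut(scores,
             breaks = c(-Inf, 60, 80, Inf),
             labels = c("low", "mid", "high"))

grade          # a factor with levels low, mid, high
table(grade)   # how many scores fall in each bin
```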

In [None]:
# your code here

Exercise 3b: Visualize a racial breakdown of counties in each of these levels. For instance, you could do a proportional barplot based on percentage of each race, or a `geom_col` plot as in exercise 1 that is stacked by race.

In [None]:
# your code here

Exercise 4: Create two tibbles: one with the 40 counties of highest population density and one with the 40 counties of lowest population density. Choose two variables for a scatter plot (such as adult poverty and percent college grads) and visually compare the relationship for the most dense and least dense counties.

In [None]:
# your code here

Exercise 5: What do the categories mean? Try creating some grouped plots (e.g. stacked barplots, colored scatter plots) that shed some light on how county categories differ from one another.

*This exercise is open ended/ambiguous, but it's the sort of Data Science question that people work on in real life.*

In [None]:
# your code here