<a href="https://colab.research.google.com/github/5harad/API-201-2023/blob/main/lecture/grammar-graphics-exercises.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **API-201: The grammar of graphics**

In this class session, we'll see how we can accomplish sophisticated data visualization using the *grammar of graphics*. In the process, we'll recreate many of the plots that we saw in our last class.

But a word of caution: there is a conceptual elegance to how visualization works in the `tidyverse`, but the details can get messy. It is possible to create almost any plot you'd like, but it's not necessarily straightforward. Trial and error (and Google!) are your friends!

**Getting started**

Before you start, create a copy of this Jupyter notebook in your own Google Drive by clicking `Copy to Drive` in the menu bar. If you do not do this, your work will not be saved!

We recommend completing this problem set in Google Chrome.

We begin by loading the various libraries that we'll use.

In [None]:
library(tidyverse)
library(scales)

## Part 1: Intergenerational mobility

This illustration is inspired by: https://www.youtube.com/watch?v=fSgEeI2Xpdc. Original data is from the paper Alesina, A, S Hohmann, S Michalopoulos and E Papaioannou (2019a), “Intergenerational mobility in Africa”, CEPR Discussion Paper 13497.

Run the cell below to load the data.

In [None]:
# We start by loading the data
# from https://onlinelibrary.wiley.com/doi/epdf/10.3982/ECTA17018
mobility <-
    tribble(
        ~country, ~avg, ~districts, ~median,   ~sd,  ~min,  ~max,
        "South Africa",  0.79,        216,     0.8, 0.075, 0.555, 0.896,
        "Botswana",   0.7,         23,   0.714, 0.079, 0.554, 0.909,
        "Zimbabwe",  0.63,         88,   0.726, 0.148,   0.4,     1,
        "Nigeria",  0.64,         37,   0.765, 0.201,  0.33, 0.963,
        "Egypt",  0.65,        236,   0.694, 0.105, 0.418, 0.914,
        "Tanzania",   0.6,        113,   0.611, 0.094, 0.408, 0.854,
        "Ghana",  0.58,        110,    0.65, 0.158, 0.181,  0.82,
        "Cameroon",  0.52,        230,   0.588, 0.203, 0.088, 0.896,
        "Kenya",  0.45,        173,   0.514, 0.187, 0.048, 0.873,
        "Zambia",  0.49,         72,   0.467, 0.127, 0.284, 0.785,
        "Morocco",  0.43,         59,   0.424, 0.145,  0.16, 0.723,
        "Lesotho",  0.44,         10,   0.437,  0.06, 0.317, 0.492,
        "Uganda",  0.37,        161,   0.382, 0.128, 0.015, 0.696,
        "Benin",  0.41,         77,   0.381, 0.132, 0.111, 0.649,
        "Rwanda",  0.29,         30,    0.28, 0.063,  0.22, 0.469,
        "Senegal",  0.29,         34,   0.209, 0.164, 0.078, 0.616,
        "Sierra Leone",  0.26,        107,   0.185, 0.149, 0.032, 0.694,
        "Ethiopia",  0.13,         97,   0.119, 0.235,     0, 0.865,
        "Malawi",  0.16,        227,   0.163, 0.115, 0.052, 0.643,
        "Liberia",  0.22,         47,    0.18,  0.08, 0.034, 0.345,
        "Guinea",  0.23,         34,   0.179, 0.085, 0.063, 0.491,
        "Sudan",  0.11,        129,   0.097, 0.144, 0.001, 0.551,
        "Mali",  0.21,        242,   0.128, 0.096, 0.013, 0.578,
        "Burkina Faso",  0.17,         45,   0.123,  0.08, 0.029, 0.526,
        "Mozambique",   0.1,        144,   0.064, 0.086, 0.015, 0.707,
        "South Sudan",  0.04,         72,   0.024, 0.056,     0, 0.319
    )

We'll start by making a simple dot plot showing intergenerational mobility by country.

Notice that we use `aes` to specify how to map the data to plotting parameters.

In [None]:
ggplot(mobility, aes(x = avg, y = country)) +
  geom_point()

### Exercise

To create a bar plot instead of a dot plot, replace `geom_point` with `geom_col`.



In [None]:
# Your answer here!



### Exercise

By default, categorical data (e.g., country names) are ordered alphabetically. We can reorder the data with `fct_reorder`:
> fct_reorder(vector, val)
specifies that the levels of `vector` are sorted in increasing order of `val`.
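
To see `fct_reorder` in isolation, here is a minimal sketch on a small made-up table. The `fruit` data and its `name` and `price` columns are hypothetical and not part of the class data.

```r
library(tidyverse)

# Hypothetical toy data (not one of the class datasets)
fruit <- tibble(
  name  = c("apple", "banana", "cherry"),
  price = c(3, 1, 2)
)

# Without reordering, factor levels are alphabetical
alphabetical <- factor(fruit$name)
levels(alphabetical)

# fct_reorder sorts the levels of `name` in increasing order of `price`
reordered <- fct_reorder(fruit$name, fruit$price)
levels(reordered)
```

Here the cheapest fruit (banana) becomes the first level and the most expensive (apple) the last, so a plot that uses `reordered` on an axis would list the fruits by price rather than by name.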

Modify the plot above so that countries are appropriately ordered.

Hint: replace `y = country` with `y = fct_reorder(...)`, selecting the right parameters for `fct_reorder`.

In [None]:
# Your answer here!



### Exercise

We can additionally color the bars such that their color indicates intergenerational mobility. (Though that's not necessarily a good idea!)

To do so, we add `fill = variable` to the aesthetic mapping, where `variable` is the name of the variable that `fill` (i.e., the color of the bars) should correspond to. Modify your code above to make this change.

In [None]:
# Your answer here!



By default, the axes (and legend) take their names from the variables in the data frame. We can change the default names with the family of `scale` commands -- the scale commands also let us change other aspects of the "scales", like their range, but we'll focus on their names now.

There are a variety of scale commands, like
- `scale_x_continuous`
- `scale_y_continuous`
- `scale_fill_continuous`
- `scale_x_discrete`
- `scale_y_discrete`
- `scale_fill_discrete`

The first part of the name (e.g., `x`, `y`, and `fill`) references the dimension of the data that we're interested in, and the second part (e.g., `continuous` and `discrete`) specifies what type of data we're talking about.

In our case, we can use `scale_x_continuous(name = 'Intergenerational mobility')` to specify that the horizontal axis should be labeled "Intergenerational mobility".
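
As a rough illustration of how the two parts of a scale name combine, the sketch below names a continuous horizontal axis and a discrete vertical axis. The `heights` table is made up for this example.

```r
library(tidyverse)

# Hypothetical toy data (not one of the class datasets)
heights <- tibble(
  person = c("Ana", "Ben", "Cy"),
  height = c(160, 175, 182)
)

# `height` is numeric, so the x scale is continuous;
# `person` is categorical, so the y scale is discrete
p_named <- ggplot(heights, aes(x = height, y = person)) +
  geom_point() +
  scale_x_continuous(name = "Height (cm)") +
  scale_y_discrete(name = "Person")

p_named
```

Using a scale whose type does not match the data (for example, `scale_y_continuous` with `person`) typically produces an error, which is a useful hint that the wrong scale was chosen.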

### Exercise

Update the plot above so that the horizontal axis and legend are labelled "Intergenerational mobility" and the vertical axis has no name. You can set `name = NULL` to remove the name.

In [None]:
# Your answer here!



### Exercise

We can set the general appearance of a plot with the family of `theme` commands. For example, `theme_bw()` is a common choice that gives a white background. Try it yourself!

In [None]:
# Your answer here!



## Part 2: Global surface temperatures

Now on to global temperatures over time. Run the cell below to get started.

In [None]:
fname <- "https://5harad.com/datasets/API201/globaltemps.csv"
globaltemps <- read_csv(url(fname))

head(globaltemps)

### Exercise

Create a dot plot of global temperatures over time. Be sure to appropriately label the axes.

In [None]:
# Your answer here!



### Exercise

Connect the points in the plot above by adding `geom_line()`. (If you just want the line, and no points, you can also remove `geom_point()`.)

In [None]:
# Your answer here!



### Exercise
Now rescale the vertical axis to make it look like there was little change over time. (You shouldn't do this in practice, as it can mislead readers, as we discussed in class!)

You can do this by passing in the argument `limits = c(lower, upper)` to a `scale` command to specify the range of an axis, where `lower` and `upper` are numbers that you specify. For example, `limits = c(-5, 1)` sets the range to go from -5 to 1.
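
The following sketch shows the `limits` argument on a made-up series. The `growth` table and the chosen range of 0 to 100 are assumptions for illustration only.

```r
library(tidyverse)

# Hypothetical toy data (not one of the class datasets)
growth <- tibble(
  year  = 2001:2005,
  value = c(10, 11, 12, 13, 14)
)

# Default: the vertical axis covers roughly the range of the data
p_default <- ggplot(growth, aes(x = year, y = value)) +
  geom_line()

# With limits: the vertical axis is forced to run from 0 to 100,
# so the same upward trend looks nearly flat
p_wide <- p_default +
  scale_y_continuous(limits = c(0, 100))

p_wide
```

Note that data falling outside the limits set this way are dropped from the plot (with a warning), so the limits should contain all the values you want to show.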

In [None]:
# Your answer here!



## Part 3: Flight delays

Next we'll create a plot of airline delays. Run the following two cells to load and parse the data.

In [None]:
# load the data
install.packages('nycflights13')
library(nycflights13)

In [None]:
# prepare the data
national <- c('JetBlue', 'Southwest', 'United', 'American', 'Delta', 'US')
regional <- c('ExpressJet', 'Envoy', 'Endeavor')

flight_delays <- flights %>%
    mutate(delayed = (arr_delay > 15)) %>%
    group_by(carrier) %>%
    summarize(p_delayed = mean(delayed, na.rm=TRUE)) %>%
    left_join(airlines, by = 'carrier') %>%
    mutate(short_name = word(name, 1)) %>%
    filter(short_name %in% c(national, regional)) %>%
    mutate(carrier_type = if_else(short_name %in% national, 'National', 'Regional'))

flight_delays

### Exercise

Create a dot plot using the `flight_delays` data such that:
- The vertical axis lists airlines, with their delay percentages shown on the horizontal axis;
- The points are colored so as to indicate whether an airline is a national or regional carrier;
- The axes are appropriately named and ordered.

A couple of tips:
- To color points, we use `color` as opposed to `fill` (which we used for bars). What is the corresponding name of the `scale` to name the legend?
- You can pass in the argument `labels = percent` to a `scale` command to format numbers as percentages rather than decimals.
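
To make the percentage tip concrete, here is a small sketch of `percent` from the `scales` package, first on its own and then passed to a scale through the `labels` argument. The `rates` table is made up for this example.

```r
library(tidyverse)
library(scales)

# percent() turns proportions into percentage strings
formatted <- percent(c(0.1, 0.25, 0.5))
formatted

# Hypothetical toy data (not one of the class datasets)
rates <- tibble(
  team = c("Red", "Green", "Blue"),
  rate = c(0.1, 0.25, 0.5)
)

# Passing the function itself (no parentheses) formats the tick labels
p_percent <- ggplot(rates, aes(x = rate, y = team)) +
  geom_point() +
  scale_x_continuous(labels = percent)

p_percent
```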

In [None]:
# Your answer here!



## Part 4: Voter intent (on Xbox!)

Finally, we'll examine candidate preferences collected on the Xbox gaming platform during the 2012 presidential election.

(It might seem like a really bad idea to run such a survey on Xbox, but later in the course we'll see how you can adjust such survey data to be more representative of likely voters.)

Run the cell below to start.

In [None]:
load(url("https://5harad.com/datasets/API201/voter_intent.Rdata"))
head(voter_intent)

### Exercise

Create a time series (i.e., a line plot) that shows support for Obama over time. Ensure the axes are appropriately named, set the range of the vertical axis to go from 35% to 55%, and format the tick marks as percentages.

In this case, the `date` column is already stored in R's internal `Date` format. Since we're plotting `date` on the horizontal axis, its corresponding scale can be referenced with the `scale_x_date` command.
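
As a sketch of how `scale_x_date` is used, the example below plots a made-up daily series. The `walks` table is hypothetical, and `date_labels` is an optional argument (not required for the exercise) that controls how the dates on the tick marks are printed.

```r
library(tidyverse)

# Hypothetical toy data (not one of the class datasets)
walks <- tibble(
  date  = as.Date("2024-01-01") + 0:6,
  steps = c(4000, 5200, 6100, 3000, 7500, 8000, 6500)
)

# Because `date` is a Date column, the x scale is a date scale
p_dates <- ggplot(walks, aes(x = date, y = steps)) +
  geom_line() +
  scale_x_date(name = "Day", date_labels = "%b %d") +
  scale_y_continuous(name = "Steps")

p_dates
```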

In [None]:
# Your answer here!



In [None]:
demographic_dist

As a final exercise, we'll compare the demographic composition of the Xbox respondents to the American electorate (in the previous election, in 2008). Each row in the dataframe contains the number of Xbox respondents in various demographic groups (with `source` equal to `Xbox`), as well as the expected number of people in those groups in a large random sample of voters from the 2008 election (with `source` equal to `2008 Electorate`).

The demographic groups are organized into categories (`cat`), with percentages (`p`) computed within each category to sum to 100%. So, for example, the `p` column gives the percentage of respondents across race groups, and, separately, the percentage of respondents across education levels.

### Exercise

Create a bar plot with the demographic groups on the horizontal axis and the percentages of each group within its category on the vertical axis.

For each group, create two bars -- one for Xbox respondents and one for the 2008 electorate -- by appropriately setting the `fill` parameter in `aes`. To position the bars side-by-side (as opposed to stacked), set `position='dodge'` in `geom_col`.

Tip: You can angle the axis labels, so they don't overlap, with this command:
> theme(axis.text.x=element_text(angle=45, hjust=1, vjust=1))

But note that this command should come _after_ `theme_bw()`, otherwise it will be overridden by the general plot style.
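
The ordering issue can be seen in a small sketch on made-up data (the `counts` table is hypothetical). Both plots add the same two theme commands, only in a different order.

```r
library(tidyverse)

# Hypothetical toy data (not one of the class datasets)
counts <- tibble(
  group = c("A long label", "Another long label", "A third long label"),
  n     = c(3, 5, 2)
)

base_plot <- ggplot(counts, aes(x = group, y = n)) +
  geom_col()

angled <- theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1))

# Labels stay angled: the tweak is applied after the complete theme
p_angled <- base_plot + theme_bw() + angled

# Labels are horizontal again: theme_bw() replaces the earlier tweak
p_reset <- base_plot + angled + theme_bw()

p_angled
```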

In [None]:
# Your answer here!



### Exercise

As a last step, we'll separate the demographic groups by category. We can do this with `facet_grid`. The general syntax is `facet_grid(~variable)`, where `variable` is the name of the column to facet the plot by. To make the plot more readable, you can additionally add the parameters `scales="free_x"` and `space="free"` to `facet_grid`; these parameters ensure that the facets are appropriately sized, regardless of the number of groups in each category.
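
Here is a minimal sketch of `facet_grid` with these parameters on made-up data. The `survey` table and its `category`, `group`, and `share` columns are hypothetical stand-ins, not the class data.

```r
library(tidyverse)

# Hypothetical toy data (not one of the class datasets):
# three groups in one category, two in the other
survey <- tibble(
  category = c("Age", "Age", "Age", "Region", "Region"),
  group    = c("18-29", "30-64", "65+", "North", "South"),
  share    = c(0.2, 0.6, 0.2, 0.45, 0.55)
)

# One panel per category; scales = "free_x" shows only the groups
# that belong to each panel, and space = "free" makes panel widths
# proportional to the number of groups they contain
p_facets <- ggplot(survey, aes(x = group, y = share)) +
  geom_col() +
  facet_grid(~category, scales = "free_x", space = "free")

p_facets
```

Without `scales = "free_x"`, every panel would reserve space for all five groups, leaving empty slots where a group does not belong to that category.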

In [None]:
# Your answer here!

