# Worksheet 03A: Data Wrangling II and ggplot2 II
*Icíar Fernández Boyano*

## Instructions + Grading

+ To get full marks for each participation worksheet, you must successfully answer at least 40% of all autograded questions. In this worksheet, that is 4 questions.

+ Autograded questions are easily identifiable throughout the worksheet, labelled as **QUESTION**. Any other instructions that prompt the student to write code are activities, which are not graded and thus do not contribute to marks - but do contribute to the workflow of the worksheet!

+ Before you start, load the packages required for the autograder by running the code cells below.

If there are any packages which are not yet installed, you can use the code cell below to install them.

In [None]:
# Install packages here
# install.packages('palmerpenguins')

Use the following code cell to load any additional packages you want to use for this worksheet. You may not need to use this code cell at all.

In [None]:
# Load additional packages here
# library(palmerpenguins)

Run the code cell below to load the packages.

In [None]:
library(testthat)
library(digest)

## Attributions

The following resources were used as inspiration in the creation of this worksheet:

+ [R4DS Data Manipulation Chapter](https://r4ds.had.co.nz/transform.html)
+ [Rebecca Barter's post on across()](http://www.rebeccabarter.com/blog/2020-07-09-across/)
+ STAT545 materials from previous years 
+ [Palmer penguins dataset](https://allisonhorst.github.io/palmerpenguins/articles/examples.html)

## 0. Interacting with this Worksheet: Running code in jupyter

In Episode 01A of the [STAT 545 video series](https://www.youtube.com/channel/UCrB-uourf2vxGeBnGjQrA0w), RStudio was mentioned as being an IDE for R. You're probably viewing this worksheet in another IDE called **jupyter**. We're using jupyter for the STAT 545 worksheets because it works well with an autograder called nbgrader.

Try running the R code in the following _cell_: click on the cell, and either click "Run" in your toolbar, or press "Shift + Enter" or "Shift + Return".

In [None]:
1 + 1

The output appears below the cell.

Also notice that you can't change the above code. We've programmed the worksheet that way to preserve the worksheet structure -- another plus to jupyter over RStudio here. The only cells you can change are the ones where we prompt you for input.

## Class 6: The nuts and bolts of data wrangling 

This section of the worksheet is to be completed during **Class 6: Data Wrangling II.** By the end of today's worksheet, you will be able to: 
1. Use group_by(), and scoped variants of summarise() and mutate(), with across()
2. Apply your dplyr knowledge to exploratory data analysis of a dataset
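
As a preview of the first objective, here is a minimal sketch on a made-up tibble. The `scores` data and its column names are hypothetical and not part of the worksheet; the pattern is what matters: `group_by()` defines the groups, and `summarise()` with `across()` applies one function to several columns within each group.

```r
library(dplyr)

# Toy data, invented for illustration only
scores <- tibble::tibble(
  team  = c("a", "a", "b", "b"),
  speed = c(1, 3, 10, 20),
  power = c(2, 4, 6, 8)
)

# Mean of every numeric column per team, plus the number of rows per team
team_means <- scores %>%
  group_by(team) %>%
  summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)), n = n())

team_means
```

Placing `n = n()` after `across()` keeps the count column out of the set of numeric columns that `across()` summarises.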

### 1.0 Get started

Load the `gapminder`, `palmerpenguins`, and `tidyverse` packages.

In [7]:
library(gapminder)
library(palmerpenguins)
library(tidyverse)

### 1.1 Practicing dplyr verbs

In Data Wrangling II (Class 6), you learned to use:

+ group_by()
+ summarize()
+ across() 

*Questions 1.0 and 1.1 use the `gapminder` dataset. The remaining questions in this section use the `penguins` dataset.*

**QUESTION 1.0**

Answer the following in a single expression:
+ What is the minimum life expectancy for each continent and each year of the `gapminder` dataset?
+ Add the corresponding country to the tibble, too
+ Arrange by min life expectancy

In [None]:
# youranswer
gapminder %>% 
  group_by(FILL_THIS_IN) %>% 
  FILL_THIS_IN(min_life = min(lifeExp),
               country = country[lifeExp == FILL_THIS_IN]) %>%
  arrange(FILL_THIS_IN)

### BEGIN HERE ###
gapminder %>% 
  group_by(continent, year) %>% 
  summarise(min_life = min(lifeExp),
            country = country[lifeExp == min_life]) %>%
  arrange(min_life)
### END HERE ###

**QUESTION 1.1**

Calculate the growth in population since the first year on record _for each country_ by **rearranging the following lines**, and **filling in the `FILL_THIS_IN`**. Here's another convenience function for you: `dplyr::first()`.

In [None]:
# youranswer
mutate(rel_growth = FILL_THIS_IN) %>% 
arrange(FILL_THIS_IN) %>% 
gapminder %>% 
group_by(country) %>% 

### BEGIN SOLUTION ###
gapminder %>% 
  group_by(country) %>% 
  arrange(year) %>% 
  mutate(rel_growth = pop - first(pop))
### END SOLUTION ###
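
For reference, `dplyr::first()` returns the first element of a vector. A minimal sketch on a hypothetical vector (`pop_example` is made up for illustration and is not taken from `gapminder`):

```r
library(dplyr)

# Hypothetical population counts, already ordered by year
pop_example <- c(100, 120, 150)

first(pop_example)                # returns 100
pop_example - first(pop_example)  # returns 0 20 50: change since the first value
```

Inside a grouped `mutate()`, the same expression is evaluated separately within each group, which is why the rows need to be in year order before `first()` is called.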

**QUESTION 1.2**

Let's work with the `penguins` dataset from the `palmerpenguins` package. Which penguin species has the greatest mean body mass? Order the results in descending order of mean body mass.

In [None]:
# youranswer
penguins %>%
  group_by(FILL_THIS_IN) %>%
  summarise(body_mass = mean(FILL_THIS_IN, na.rm = TRUE)) %>%
  arrange(FILL_THIS_IN(FILL_THIS_IN))

### BEGIN SOLUTION ###
penguins %>%
  group_by(species) %>%
  summarise(body_mass = mean(body_mass_g, na.rm = TRUE)) %>%
  arrange(desc(body_mass))
### END SOLUTION ###

**QUESTION 1.3**

In a single expression, answer:
+ What is the mean value of each numeric variable in the `penguins` dataset on each island?
+ How many penguins are there on each island?

In [None]:
# youranswer
penguins %>% 
  group_by(FILL_THIS_IN) %>% 
  filter(n() > 1) %>% 
  summarise(across(where(FILL_THIS_IN), FILL_THIS_IN, na.rm = TRUE), n = n())

### BEGIN SOLUTION ### 
penguins %>% 
  group_by(island) %>% 
  filter(n() > 1) %>% 
  summarise(across(where(is.numeric), mean, na.rm = TRUE), n = n())
### END SOLUTION ###

**QUESTION 1.4**

Identify how many missing values there are in each column of the `penguins` dataset. Hint: use `summarise()`, `everything()` and `across()`.

In [None]:
# youranswer
FILL_THIS_IN %>%
  FILL_THIS_IN(FILL_THIS_IN(FILL_THIS_IN(),
                            ~sum(is.na(.))))

### BEGIN SOLUTION ###
penguins %>%
  summarise(across(everything(), 
                   ~sum(is.na(.))))
### END SOLUTION ###

`~` starts an anonymous function written in formula shorthand; inside it, the argument is referred to as `.x` or `.`. In `~sum(is.na(.))`, `.` stands for the current column: `is.na(.)` flags each missing value as `TRUE`, and `sum()` counts those `TRUE` values, giving the number of NA values in that column.
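
To make the shorthand concrete, the sketch below applies the same NA count twice: once with the `~` form and once with an explicit `function()`. The tibble `df_na` is invented for illustration.

```r
library(dplyr)

# Made-up tibble with a few missing values
df_na <- tibble::tibble(
  a = c(1, NA, 3),
  b = c("x", NA, NA)
)

# Formula shorthand: .x is the current column
with_formula <- df_na %>%
  summarise(across(everything(), ~ sum(is.na(.x))))

# Equivalent explicit anonymous function
with_function <- df_na %>%
  summarise(across(everything(), function(col) sum(is.na(col))))

with_formula
with_function
```

Both calls return one row with the NA count for each column.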

**QUESTION 1.5**

Replace the missing values of the numeric columns in the `penguins` dataset with the mean value of the relevant column. Hint: `FILL_THIS_IN_SAME` is the same expression in all three places where it appears. Another hint: look at Question 1.4 above!

In [None]:
# youranswer
penguins %>%
  FILL_THIS_IN(across(FILL_THIS_IN(is.numeric), ~if_else(is.na(FILL_THIS_IN_SAME), mean(FILL_THIS_IN_SAME, na.rm = TRUE), as.numeric(FILL_THIS_IN_SAME))))

### BEGIN SOLUTION ###
penguins %>%
  mutate(across(where(is.numeric), ~if_else(is.na(.), mean(., na.rm = TRUE), as.numeric(.))))
### END SOLUTION ###

### 1.2 Explore a dataset with dplyr and ggplot II

*This section of the worksheet is not autograded, but answers will be uploaded after the deadline for submission of this worksheet.* 

For each of the tasks below, produce: 

- a tibble, using `dplyr` as your data manipulation tool;
- an accompanying plot of data from the tibble, using `ggplot2` as your visualization tool; and
- some dialogue about what your tables/figures show (doesn't have to be much).

**Tip:** Treat this worksheet as a "cheat sheet" for future-you / for working on your mini data analysis project! Don't assume that you'll remember the lessons you learned while working on this worksheet. Write things down:

- Add notes on difficulties/oddities you encountered. For example, which figures are easy/hard to make, which data formats make better inputs for plotting functions vs. for human-friendly tables.
- Provide attribution whenever you take code or an idea from somewhere else, whether a blog post, a colleague, a vignette, etc. Putting those pointers in your "cheat sheet" will be useful for future-you -- and it's just good practice to indicate where you got things from.

### Task 1

Report the absolute and/or relative abundance of countries with low life expectancy over time by continent: Compute some measure of worldwide life expectancy – you decide – a mean or median or some other quantile or perhaps your current age. Then determine how many countries on each continent have a life expectancy less than this benchmark, for each year.

### Task 2

Get the maximum and minimum of GDP per capita for all continents.


## Class 7: Effective visualizations with ggplot2

This section of the worksheet is to be completed during **Class 7: ggplot II + effective visualizations.** By the end of today's worksheet, you will be able to:

1. Troubleshoot common coding errors when plotting data with ggplot2, using the `gapminder` package.
2. Create an effective visualization of the `penguins` data with ggplot2.

### 1.0 Fix the plots!

In this section, we'll be looking at some erroneous plots and fixing them. You might not have these two packages installed yet; if not, install them first:

In [None]:
install.packages("ggridges")
install.packages("scales")

Load all the packages you need to get started.

In [None]:
library(tidyverse)
library(gapminder)
library(ggridges)
library(scales)

**QUESTION 1.6**

Fix the errors in the following scatterplot. *Hint:* What is `select()` doing? `desc()` should be elsewhere...

In [None]:
# fix this plot
gapminder %>% 
  filter(country = "Canada") %>%
  select(desc(year)) %>%
  ggplot(aes(year, lifeExp)) +
  geom_point() 

### BEGIN HERE ###
gapminder %>% 
  filter(country == "Canada") %>% 
  ggplot(aes(desc(year), lifeExp)) +
  geom_point()
### END HERE ###

**QUESTION 1.7**

Instead of using alpha transparency, suppose you want to fix the overplotting issue by plotting small points. Why is this not working? Fix it.

In [None]:
# youranswer
ggplot(gapminder) +
  geom_point(aes(gdpPercap, lifeExp, size = 0.1)) +
  scale_x_log10(labels = scales::dollar_format())

### BEGIN HERE ###
ggplot(gapminder) +
  geom_point(aes(gdpPercap, lifeExp), size = 0.1) +
  scale_x_log10(labels = scales::dollar_format())
### END HERE ###
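
The fix in Question 1.7 rests on the difference between *mapping* an aesthetic inside `aes()` and *setting* it outside `aes()`. A minimal sketch with the built-in `mtcars` data, chosen here only for illustration:

```r
library(ggplot2)

# Mapped: size = 0.1 inside aes() is treated as data, so ggplot2
# builds a size scale and a legend for the single value 0.1
p_mapped <- ggplot(mtcars, aes(wt, mpg)) +
  geom_point(aes(size = 0.1))

# Set: size = 0.1 outside aes() is a fixed property of the layer,
# so every point is drawn at that size and no legend is created
p_set <- ggplot(mtcars, aes(wt, mpg)) +
  geom_point(size = 0.1)

p_mapped
p_set
```

As a rule of thumb, put an argument inside `aes()` when it should vary with a column of the data, and outside when it is a constant.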

**QUESTION 1.8**

Fix the plot so that the size of the dots is related to the body mass, and so that the dots are colored by species. 

In [None]:
# youranswer
ggplot(penguins, aes(x = body_mass_g, y = flipper_length_mm) +
        geom_point(shape = 21,
                   size = species,
                   fill = island))

### BEGIN HERE ###
ggplot(penguins, aes(x = body_mass_g, y = flipper_length_mm, size = body_mass_g, fill = species)) +
        geom_point(shape = 21)
### END HERE ###

**QUESTION 1.9**

- Change the x-axis text to be in "comma format" with `scales::comma_format()`.
- Separate each continent into sub-panels, in a single row of plots.

In [None]:
# youranswer
gapminder %>%
 ggplot(aes(gdpPercap, lifeExp)) +
  geom_point(alpha = 0.2) +
  scale_x_log10()

### BEGIN HERE ###
gapminder %>%
  ggplot(aes(gdpPercap, lifeExp)) +
  facet_wrap(~ continent, nrow = 1) +
  geom_point(alpha = 0.2) +
  scale_x_log10(labels = scales::comma_format()) 
### END HERE ###
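
The `labels` argument of `scale_x_log10()` accepts a function that turns break values into text. `scales::comma_format()` does not format numbers itself; it returns such a function. A small sketch, where `fmt` is just an illustrative name:

```r
library(scales)

# comma_format() returns a function; calling that function formats numbers
fmt <- comma_format()

fmt(c(1000, 250000, 1e6))  # returns "1,000" "250,000" "1,000,000"
```

This is why the parentheses in `labels = scales::comma_format()` matter: they create the formatting function that ggplot2 later applies to the axis breaks.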

### 1.1 Investigate the `penguins` dataset with plots

**QUESTION 1.10**

Plot the `penguins` body mass (on the y axis) vs. flipper length (on the x axis) using a **scatterplot**, with the following specifications:
+ Color by species
+ Set the size of the points to 3, and the alpha to 0.8
+ Set the theme to minimal

In [None]:
# youranswer
mass_flipper <- ggplot(data = FILL_THIS_IN,
                       FILL_THIS_IN(x = FILL_THIS_IN,
                                    y = FILL_THIS_IN)) +
  FILL_THIS_IN(FILL_THIS_IN(color = FILL_THIS_IN),
               size = FILL_THIS_IN,
               alpha = FILL_THIS_IN) +
  FILL_THIS_IN() 

### BEGIN SOLUTION ###
mass_flipper <- ggplot(data = penguins,
                       aes(x = flipper_length_mm,
                           y = body_mass_g)) +
  geom_point(aes(color = species),
             size = 3,
             alpha = 0.8) +
  theme_minimal() 
### END SOLUTION ###

**QUESTION 1.11**

Repeat the same graph as above, this time coloring by sex, and separating each species into subpanels.

In [None]:
# youranswer

### BEGIN SOLUTION ###
mass_flipper <- ggplot(data = penguins,
                       aes(x = flipper_length_mm,
                           y = body_mass_g)) +
  geom_point(aes(color = sex),
             size = 3,
             alpha = 0.8) +
  theme_minimal() +
  facet_wrap(~species)
### END SOLUTION ###

*Not autograded:* What could we add to improve the above graph? This time, try to replicate the graph above, using `labs()` to add:
+ A title
+ A subtitle
+ More readable names for the x and y axes (without `_`)
+ A legend title describing what the color indicates

Feeling inspired? You can also try to specify colors with `scale_color_manual()`!
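
As a syntax reminder only, here is a sketch of `labs()` and `scale_color_manual()` on the built-in `mtcars` data rather than `penguins`. The titles and color choices are arbitrary placeholders, not a suggested answer.

```r
library(ggplot2)

p_labelled <- ggplot(mtcars, aes(wt, mpg)) +
  geom_point(aes(color = factor(cyl)), size = 3, alpha = 0.8) +
  labs(
    title = "Fuel economy vs. weight",
    subtitle = "mtcars data, built into R",
    x = "Weight (1000 lbs)",
    y = "Miles per gallon",
    color = "Cylinders"  # legend title for the color aesthetic
  ) +
  # Named vector: one color per level of factor(cyl)
  scale_color_manual(values = c("4" = "darkorange", "6" = "purple", "8" = "cyan4")) +
  theme_minimal()

p_labelled
```

The argument names inside `labs()` match the aesthetics they label, so `color = ...` sets the title of the color legend.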

In [None]:
# your code here