# Worksheet 02b: Nesting, List Columns, and `purrr`
_**Leader:** Diana Lin **Reviewer:** Iciar Fernandez **ASDA Assist:** David Kepplinger_

_Version 1.0_

_Attributions_: Major thanks to Firas Moosvi, as some of these questions were taken from his assignment from previous years, and the STAT547 2018-2019 Guidebook, as well as Vincenzo Coia for his ideas and input.

This is the corresponding worksheet for Lecture 03b (November 3, 2020) & Lecture 04b (November 5, 2020) of STAT545B.

To achieve full marks for worksheets, you will have to answer at least 40% of the autograded questions. For this particular worksheet, you will have to answer **6** out of the **14** autograded questions (Q15 & Q16 are not autograded). Some questions rely on answers from previous questions in order to be completed.

Here are the groupings:

- **Group 1**: 1, 2
- **Group 2**: 3, 4, 5*
- **Group 3**: 10, 11, 12
- **Group 4**: 13, 14
- **Individuals**: 6, 7, 8, 9

\*The questions in Group 2 can actually be answered individually, but are ideally completed as a group as they build on each other conceptually.

Here you can install any packages you may need. Most likely you will need to install `devtools` in order to install the very last package, `distplyr`.

In [None]:
# install packages, e.g.
# install.packages('testthat')
# install.packages('digest')
# install.packages('tidyverse')
# install.packages('palmerpenguins')
# install.packages('glue')
# install.packages('gapminder')
# install.packages('broom')
# install.packages('devtools')
# devtools::install_github('vincenzocoia/distplyr')

Here you can load any packages you may need:

In [None]:
# load packages, e.g.
# library(devtools)

Here are the packages _we need_ for this worksheet:

In [None]:
library(testthat)
library(digest)
library(tidyverse)
library(palmerpenguins)
library(glue)
library(gapminder)
library(broom)
library(distplyr)

## Nesting and List Columns

_One_ of the ways a list-column can be made is by using `nest()`.
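As a minimal sketch of what `nest()` does, the example below uses a small made-up tibble (`toy`, with hypothetical columns `group`, `x`, and `y`; none of these come from the worksheet data). The name on the left of `=` inside `nest()` becomes the name of the new list-column, and the columns on the right are the ones bundled into it.

```r
library(tibble)
library(tidyr)

# Hypothetical toy data: two measurements for each of two groups
toy <- tibble(
  group = c("a", "a", "b", "b"),
  x     = c(1, 2, 3, 4),
  y     = c(10, 20, 30, 40)
)

# Bundle `x` and `y` into a list-column called `measurements`.
# The result has one row per distinct value of the remaining column, `group`.
nested <- nest(toy, measurements = c(x, y))

nested

# Each element of the list-column is itself a tibble
nested$measurements[[1]]
```

Each row of `nested` now carries a whole tibble in its `measurements` cell, which is what makes it possible to sample, model, or summarize "per group" later on.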

**QUESTION 1**: Create a tibble that bundles everything in `gapminder` except for `country` and `continent` into a list-column. Name your list column `other` (without using `rename()` or `mutate()`). Store your answer in `answer1`.

```r

(answer1 <- gapminder %>%
   nest(FILL_THIS_IN = FILL_THIS_IN))
```

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_that('Question 1', {
    expect_known_hash(enc2utf8(sapply(answer1$other, colnames)), 'ceba7fd58def34a537b5b13430a7ec2a')
    expect_known_hash(sapply(answer1$other, dim), '388a8eae98b3cb184d3fe8ed8dd46916')
    expect_known_hash(sapply(answer1$other, `[[`, 'year'), '0370844f5c0d097891d284949811883e')
})
cat('Success!')

Why would we use list-columns? Here is a use case.

**QUESTION 2**: _Reproducibly_ randomly sample 5 countries in the `gapminder` tibble. Store your answer in `answer2`.

```r
FILL_THIS_IN(123)
(answer2 <- gapminder %>%
    nest(FILL_THIS_IN = FILL_THIS_IN) %>% 
    sample_n(5) %>% 
    unnest(cols = FILL_THIS_IN))
```

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_that('Question 2', {
    expect_known_hash(sort(enc2utf8(as.character(answer2$country)), method = 'radix'), 'ca060e5983d51a09aeb24c2393462353')
    expect_known_hash(round(answer2$gdpPercap[order(enc2utf8(as.character(answer2$country)), method = 'radix')], 3), '75d94ea77a21140b54a374ed59f2253a')
})
cat('Success!')

## Exploring `purrr` Fundamentals

The `purrr` package is also part of the `tidyverse`.

Apply a function to each element in a list/vector with `map`.

General usage: `purrr::map(VECTOR_OR_LIST, YOUR_FUNCTION)`

Note:

- `map` always returns a list.
- `YOUR_FUNCTION` can return anything!

There are many variations of `map_*`, which you can find in this [cheatsheet](https://github.com/rstudio/cheatsheets/blob/master/purrr.pdf).
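To illustrate the difference between `map` and its typed variants, here is a small sketch on a made-up character vector (`words` is a hypothetical input, not part of the worksheet). The same kind of call returns a list with `map`, but an atomic vector of a fixed type with `map_int` or `map_chr`.

```r
library(purrr)

# Hypothetical example input
words <- c("apple", "fig", "banana")

# map() returns a list with one element per input
as_list <- map(words, nchar)

# map_int() returns an integer vector
# (it errors if any result is not a single integer)
as_int <- map_int(words, nchar)

# map_chr() returns a character vector
as_chr <- map_chr(words, toupper)

str(as_list)
as_int
as_chr
```

Choosing a typed variant documents the output type you expect, and it fails loudly when the function returns something else.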

For the next few tasks, you will convert for-loops into vectorized expressions that reproduce the same output (the numbers should match; the format can differ).

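**QUESTION 3**: Without relying on R's built-in vectorization (that is, apply the function to one element at a time with `map`), take the square root of each element of the following vector: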

In [None]:
x <- 1:10

Store your answer in `answer3`:

```r
(answer3 <- map(FILL_THIS_IN, FILL_THIS_IN))
```

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_that('Question 3', {
    expect_known_hash(mode(answer3), '086ebc4c59c08c43e75bae74f1e16897')
    expect_known_hash(round(unlist(answer3), 4), 'ad16817e39d61cdf2ce38234f61306de')
})
cat('Success!')

In Question 3, we used the generic `map` function, and got a list. Let's use a more specific `map_*` function this time.

**QUESTION 4**: Without relying on R's built-in vectorization (that is, apply the function to one element at a time), square each component of `x`. Store your answer in `answer4`:

```r
(answer4 <- map_dbl(FILL_THIS_IN, FILL_THIS_IN))
```

_Hint:_ The last `FILL_THIS_IN` corresponds to an anonymous function!
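For reference, an anonymous function can be written in full or with the `~` formula shorthand that `purrr` supports. The sketch below applies an unrelated conversion to a made-up vector (`temps_c` is a hypothetical name) so that the two spellings can be compared side by side.

```r
library(purrr)

# Hypothetical temperatures in degrees Celsius
temps_c <- c(0, 25, 100)

# Full anonymous function
f_full <- map_dbl(temps_c, function(temp) temp * 9 / 5 + 32)

# Equivalent `~` shorthand, where `.x` stands for the current element
f_short <- map_dbl(temps_c, ~ .x * 9 / 5 + 32)

f_full
identical(f_full, f_short)
```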

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_that('Question 4', {
    expect_known_hash(mode(answer4), '46606ee201b428a3fa6c8a0d3d9e671c')
    expect_known_hash(round(unlist(answer4), 4), '84a2193460cb35ff884e4c3144abf122')
})
cat('Success!')

We've now used both `map` and the more specific `map_dbl`, so you can see how they differ and why one can be better suited to a given purpose than the other. Now it's your turn to choose! Remember to use the [cheatsheet](https://github.com/rstudio/cheatsheets/blob/master/purrr.pdf) if you need it!

**QUESTION 5**: Below is sample code that computes the mean of every column in the `mtcars` dataset. Use the appropriate `purrr` function to vectorize this task.

In [None]:
answer5 <- vector("double", ncol(mtcars))
for (c in seq_along(mtcars)){
  answer5[[c]] <- mean(mtcars[[c]])
}
answer5

Store your answer in `answer5`. _Remember to choose the correct `purrr` function_!

```r
(answer5 <- FILL_THIS_IN(mtcars, FILL_THIS_IN))
```

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_that('Question 5', {
    expect_known_hash(mode(answer5), '46606ee201b428a3fa6c8a0d3d9e671c')
    expect_known_hash(round(answer5, 4), '3cc39a0fa3065dcf3186737c827b983a')
})
cat('Success!')

**QUESTION 6**: Below is sample code that divides the values in each column of the `mtcars` dataset by the maximum in that column. Underneath it is a vectorized method using `purrr`, returning a list, but we want a data frame instead.

In [None]:
for (c in seq_along(datasets::mtcars)){
  mtcars[[c]] <- datasets::mtcars[[c]] / max(datasets::mtcars[[c]], na.rm = TRUE)
}
head(mtcars)

In [None]:
map(datasets::mtcars, function(x) x / max(x))

Find a way to do this using a _`purrr`-style_ function, using only `dplyr` functions! Store your answer in `answer6`:

```r
(answer6 <- datasets::mtcars %>%
  mutate(FILL_THIS_IN(FILL_THIS_IN, FILL_THIS_IN)))
```

_Hint_: The last `FILL_THIS_IN` corresponds to an anonymous function!

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_that('Question 6', {
    expect_known_hash(class(answer6), '555434c8748e07b094500256087cdcc5')
    expect_known_hash(dimnames(answer6), 'e2f4b6e88adcdd56d67bba21719be092')
    expect_known_hash(round(answer6$mpg, 3), 'af82f570a0aa02d8abcbbd14386e98b0')
})
cat('Success!')

**QUESTION 7**: Below is sample code that creates a ggplot with fuel efficiency on the x-axis and horsepower on the y-axis for three cylinder levels (4, 6, 8). Use the appropriate `purrr` function to vectorize this task.

In [None]:
cylinders <- sort(unique(datasets::mtcars[['cyl']]))
answer7 <- vector("list",length(cylinders))
for (d in 1:length(cylinders)){
  answer7[[d]] <- datasets::mtcars %>% 
    filter(cyl == cylinders[d]) %>%
    ggplot() + 
    theme_bw() +
    geom_point(aes(x = mpg, y = hp) ) + 
    labs(x = 'Fuel efficiency (mpg)',
         y = 'Horsepower (hp)') + 
    ggtitle(glue("Horsepower and Fuel efficiency for {cylinders[d]} cylinders"))
}

The code above builds the plots but prints nothing. To print every plot in our list `answer7`, we can use `walk()`, another `purrr` function:

```r
walk(.x, .f, ...)
```

From the [documentation](https://purrr.tidyverse.org/reference/map.html):

|Argument|Description|
|--------|-----------|
|`.x`| A list or atomic vector |
|`.f`| A function, formula, or vector |


> `walk()` calls `.f` for its side-effect and returns the input `.x`

This allows us to 'iterate' through our list of plots and print them.

In [None]:
walk(answer7, print)
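To see what "returns the input `.x`" means in practice, here is a small sketch on a made-up character vector (`steps` is a hypothetical name). The function passed to `walk()` is called only for its side effect, and the value that comes back is the original input rather than a list of results.

```r
library(purrr)

# Hypothetical labels
steps <- c("load", "clean", "plot")

# walk() runs the function for its side effect (here, printing a message)...
result <- walk(steps, ~ cat("Running step:", .x, "\n"))

# ...and returns its input unchanged (invisibly), unlike map(), which would
# return a list of the function's return values
identical(result, steps)
```

Because the input is passed through, `walk()` can sit in the middle of a pipeline without interrupting it.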

_Sidenote:_ `glue()` works like `paste()` but is more readable, and it evaluates R expressions placed inside `{}` (which even allows creating variables on the fly). Read more about `glue()` [here](https://www.tidyverse.org/blog/2017/10/glue-1.2.0/).

Example:

In [None]:
# define our variables
tues <- 3
thurs <- 4

# using the above predefined variables
paste("Tuesday is lecture", tues, "and Thursday is lecture", thurs)
glue("Tuesday is lecture {tues} and Thursday is lecture {thurs}.")

# create a variable on the fly
glue("Tuesday is lecture {tues <- 3} and Thursday is lecture {tues + 1}.")

Now it's your turn. Use the appropriate `purrr` function to vectorize the for loop above. Store your answer in `answer7`, and then use `walk` to print your plots.

```r
answer7 <- FILL_THIS_IN(FILL_THIS_IN, FILL_THIS_IN datasets::mtcars %>% 
    filter(cyl == FILL_THIS_IN[FILL_THIS_IN]) %>%
    ggplot() + 
    theme_bw() +
    geom_point(aes(x = mpg, y = hp) ) + 
    labs(x = 'Fuel efficiency (mpg)',
         y = 'Horsepower (hp)') + 
    ggtitle(glue(FILL_THIS_IN)))
walk(answer7, FILL_THIS_IN)
```

_Hint:_ Use an anonymous function. The `~` shorthand can be used as well. The third `FILL_THIS_IN` corresponds to an anonymous function.

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_that('Question 7', {
    expect_known_hash(lapply(answer7, class), 'e4cc620580d23510a19096fa70dfdb54')
    expect_known_hash(lapply(answer7, function (a) dimnames(a$data)), '5cd534fa153cf3feacea3b30794c8b0a')
})
cat('Success!')

**QUESTION 8**: Below is sample code that computes the number of unique values in each column of `mtcars` as a named vector, using for-loops. Use the appropriate `purrr` function to vectorize this task.

In [None]:
answer8 <- vector("double", ncol(datasets::mtcars))
for (c in seq_along(datasets::mtcars)){
  answer8[[c]] <- length(unique(datasets::mtcars[[c]]))
}
names(answer8) <- names(datasets::mtcars)
answer8

Store your answer in `answer8`:

```r
(answer8 <- datasets::mtcars %>% 
    FILL_THIS_IN(FILL_THIS_IN) %>% 
    FILL_THIS_IN(FILL_THIS_IN))
```

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_that('Question 8', {
    expect_known_hash(mode(answer8), '46606ee201b428a3fa6c8a0d3d9e671c')
    expect_known_hash(as.integer(answer8), '1981b33e1151073e1c227fe95218c6f5')
})
cat('Success!')

**QUESTION 9**: Below is sample code that takes input from various columns in the `diamonds` dataset and outputs a string containing information from the input. Use the appropriate `purrr` function to vectorize this task.

In [None]:
dmonds <- diamonds %>% 
  slice(1:4)

answer9 <- character()
for (d in 1:nrow(dmonds)){
  answer9[d] <- glue("Diamond #", d , 
                      " sold for $", dmonds$price[d],
                      " and was ", dmonds$carat[d], " carats")
}
answer9

Store your answer into `answer9`: 

```r
list_of_things <- list(FILL_THIS_IN, FILL_THIS_IN, FILL_THIS_IN)
(answer9 <- FILL_THIS_IN(list_of_things,
                   FILL_THIS_IN glue("Diamond #", FILL_THIS_IN,
                                     " sold for $", FILL_THIS_IN,
                                     " and was ", FILL_THIS_IN, " carats")))
```

_Hint_: Use an anonymous function using the `~` formula shorthand.
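When a function needs more than one input per iteration, the `~` shorthand refers to each input by position. The sketch below shows the two-input case with `map2_chr` on made-up vectors (`fruit` and `price` are hypothetical, and `paste0()` is used here only to keep the example free of extra packages); the `pmap_*` family generalizes the same idea to a list of inputs.

```r
library(purrr)

# Hypothetical inputs of equal length
fruit <- c("apple", "kiwi")
price <- c(1.5, 0.75)

# `.x` is the current element of the first input, `.y` of the second
labels <- map2_chr(fruit, price, ~ paste0(.x, " costs $", .y))

labels
```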

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_that('Question 9', {
    expect_known_hash(mode(answer9), 'af31d61b8795057ce1ce2e040685107c')
    expect_known_hash(enc2utf8(answer9), '9e2cb5c733e819bd84b9f75a6832c7ae')
})
cat('Success!')

**QUESTION 10**: Let's use `purrr` to make probability distributions. The Generalized Pareto Distribution (GPD) is a three-parameter distribution, so if we want to make several of these distributions, we need a `purrr` function that plugs in all three parameters at once. To make each GPD, we can use a function called `dst_gpd()`. Use the appropriate `purrr` function to create all 5 GPDs from the parameters below:

In [None]:
(parameters <- tibble(loc   = c(105, 99, 120, 119, 111),
                      scale = c(12.2, 13.5, 18.5, 9.2, 15.5),
                      shape = c(0.4, 0.9, 0.5, 0.6, 0.4)))

Store your answer in `answer10`:

```r
(answer10 <- FILL_THIS_IN(FILL_THIS_IN, FILL_THIS_IN))
```

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_that('Question 10', {
    expect_known_hash(mode(answer10), '086ebc4c59c08c43e75bae74f1e16897')
    expect_known_hash(sapply(answer10, class), 'ee7cb82281be52ed29f760d8f17b5792')
    expect_known_hash(sapply(answer10, names), '2bd1810a814c7c80e0e8d74f70b42dc7')
    expect_known_hash(round(unlist(lapply(answer10, `[[`, 'parameters'), use.names = FALSE), 3), '10cf852189f399f692919183ba3379f4')
})
cat('Success!')

**QUESTION 11**: The following graph displays the probability density function of the first GPD. Modify the code using `purrr` so that all 5 density functions are displayed. Store your answer in `answer11`.

In [None]:
(answer11 <- tibble(x = c(50, 400)) %>%
  ggplot(aes(x)) +
  stat_function(fun = get_density(answer10[[1]]), alpha = 0.25) +
  theme_minimal() +
  labs(y = "Density"))

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_that('Question 11', {
    expect_known_hash(class(answer11), 'f2396055c330843a8e2c8c2054acfb2d')
    expect_known_hash(sapply(answer11$layers, class), '212c3363c21284bec0f3cbbf4f002473')
    expect_known_hash(unlist(lapply(answer11$layers, function (l) class(l$geom))), 'd5d7a529255a566066df2b8e0051a240')
})
cat('Success!')

### Introducing `do.call()`

Let's make a mixture distribution of the above 5 GPDs using the function `distplyr::mix()`. The straightforward way is to pass each distribution explicitly:

In [None]:
distplyr::mix(answer10[[1]], answer10[[2]], answer10[[3]], answer10[[4]], answer10[[5]])

How do we do this without explicitly giving all 5 GPDs? We can use `do.call()`:

```r
do.call(what, args)
```

From the [documentation](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/do.call):

|Argument|Description|
|--------|-----------|
|`what`|either a function or a non-empty character string naming the function to be called|
|`args`|a _list_ of arguments to the function call|
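As a quick illustration using only base R, the sketch below calls two ordinary functions through `do.call()`. The values in `pieces` are made up for the example; the point is that each element of the list becomes one argument of the call, and named elements become named arguments.

```r
# A list of arguments, e.g. built up programmatically (hypothetical values)
pieces <- list("2020", "11", "03")

# These two calls are equivalent:
direct   <- paste("2020", "11", "03", sep = "-")
via_call <- do.call(paste, c(pieces, list(sep = "-")))

# `what` can also be the function's name given as a string
total <- do.call("sum", list(1, 2, 3))

direct
via_call
total
```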


**QUESTION 12**: Use `do.call()` to call `distplyr::mix()` on the 5 GPDs. Store your answer in `answer12`.

```r
answer12 <- do.call(FILL_THIS_IN, FILL_THIS_IN)
```

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_that('Question 12', {
    expect_known_hash(class(answer12), 'deec76c2fe0b54d40912e147fb1aeaad')
    expect_known_hash(answer12$name, 'b9bd27c9624c17266d444af065f6e29c')
    expect_known_hash(unlist(lapply(answer12$components$distributions, class)), 'a442061b2621378ff95566da3e1eaa7c')
    expect_known_hash(unlist(lapply(answer12$components$distributions, `[[`, 'parameters'), use.names = FALSE), '10cf852189f399f692919183ba3379f4')
})
cat('Success!')

Now let's combine `purrr`, nesting, and linear modelling!

**QUESTION 13**: For each `gapminder` continent, fit a linear model of `lifeExp` from `log(gdpPercap)` and put this as a new column. Store your answer into `answer13`:

```r
(answer13 <- gapminder %>% 
  select(continent, gdpPercap, lifeExp) %>% 
  nest(data = c(FILL_THIS_IN, FILL_THIS_IN)) %>% 
  mutate(model = FILL_THIS_IN(data, ~ lm(FILL_THIS_IN ~ FILL_THIS_IN, data = FILL_THIS_IN))))
```

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_that('Question 13', {
    expect_known_hash(sapply(answer13$model, class), '2fe5bf6c6fb725f272c801e5f7560afe')
    expect_known_hash(round(unlist(lapply(answer13$model, coef)), 3), 'e536d3378586d3c54b920504b3238cde')
})
cat('Success!')

**QUESTION 14**: Using your model from Question 13, make predictions using `augment()` from the `broom` package, and then `unnest`. Store your answer in `answer14`:

```r
(answer14 <- answer13 %>% 
  transmute(continent, yhat = map(FILL_THIS_IN, FILL_THIS_IN)) %>% 
  unnest(FILL_THIS_IN))
```

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
test_that('Question 14', {
    expect_known_hash(dimnames(answer14), '5aef05c5c23749a0d5880dd697b47234')
    expect_known_hash(round(with(answer14, .fitted[order(lifeExp)]), 3), 'f8db1a712fe6b69f7577c88882248dc5')
    expect_known_hash(round(with(answer14, .sigma[order(lifeExp)]), 3), '83e927e2f02fd18c293c3a1ddb7a0ed2')
})
cat('Success!')

## Writing Tests

Let's try writing some tests for these `purrr` functions!

These questions are _not_ for marks, and do not have autograded tests.
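As a rough sketch of what such an assertion could look like, the example below checks the output of a `map_int()` call on a made-up named list (`inputs` and `sizes` are hypothetical names, not part of the questions below). It uses `testthat` expectations to check the type, length, names, and values of the result; this is one possible style, not the only valid one.

```r
library(purrr)
library(testthat)

# Hypothetical input: a named list of vectors with known lengths
inputs <- list(a = 1:3, b = letters, c = c(TRUE, FALSE))
sizes <- map_int(inputs, length)

test_that("map_int() returns one named integer per input", {
  expect_type(sizes, "integer")              # atomic integer vector, not a list
  expect_length(sizes, length(inputs))       # one result per input element
  expect_named(sizes, names(inputs))         # names are carried over
  expect_equal(unname(sizes), c(3L, 26L, 2L))
})
```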

**QUESTION 15:** Write an assertion for the following code, which computes the number of unique values in each column of `penguins` in a vectorized way. First, run the code cell below so that you can see what the output looks like.

In [None]:
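# stored as `n_unique` so that the base function `unique()` is not masked
# when this cell is re-run
n_unique <- map(penguins, unique) %>% 
  map_dbl(length)
n_unique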

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer

**QUESTION 16:** Write an assertion for the following code, which converts character columns to factors. First, run the code cell below so that you can see what the output looks like.

In [None]:
factors <- penguins %>%
  map_if(is.character, as.factor) 
factors %>% str() # check output

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer