# Worksheet: Cleaning and wrangling data

This worksheet covers the [Cleaning and wrangling data](https://datasciencebook.ca/wrangling.html) chapter of the online textbook, which also lists the learning objectives for this worksheet. You should read the textbook chapter before attempting this worksheet. 

In [None]:
### Run this cell before continuing. 
library(tidyverse)
library(repr)
source("cleanup.R")
options(repr.matrix.max.rows = 6)

**Question 0.0** Multiple Choice: 
<br> {points: 1}

Which statement below is incorrect about vectors and data frames in R?

A. the columns of data frames are vectors

B. data frames can have columns of different types (e.g., a column of numeric data, and a column of character data)

C. vectors can have elements of different types (e.g., element 1 can be numeric, and element 2 can be a character)

D. data frames are a special kind of list


*Assign your answer to an object called `answer0.0`. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. `"F"`).*

In [None]:
# Replace the fail() with your answer. 

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
library(digest)
stopifnot("type of answer0.0 is not character"= setequal(digest(paste(toString(class(answer0.0)), "687d1")), "b0606851f8ab1403affe70dae88277a1"))
stopifnot("length of answer0.0 is not correct"= setequal(digest(paste(toString(length(answer0.0)), "687d1")), "8be1aa3b66de219b03427ea3a762d5f1"))
stopifnot("value of answer0.0 is not correct"= setequal(digest(paste(toString(tolower(answer0.0)), "687d1")), "a287fdc997fab945a12b4dc687c87c00"))
stopifnot("letters in string value of answer0.0 are correct but case is not correct"= setequal(digest(paste(toString(answer0.0), "687d1")), "28a29618bfa432c827be44da36ccd9ea"))

print('Success!')

**Question 0.1** Multiple Choice: 
<br> {points: 1}

Which of the following does **_not_** characterize a tidy dataset?

A. each row is a single observation

B. each value should not be in a single cell

C. each column is a single variable

D. each value is a single cell


*Assign your answer to an object called `answer0.1`. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. `"F"`).*

In [None]:
# Replace the fail() with your answer. 

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
library(digest)
stopifnot("type of answer0.1 is not character"= setequal(digest(paste(toString(class(answer0.1)), "8c44c")), "2afe90074fad3ec3878a60a49c4f380e"))
stopifnot("length of answer0.1 is not correct"= setequal(digest(paste(toString(length(answer0.1)), "8c44c")), "5ef1b720a52eb9027ba23705d33bbf41"))
stopifnot("value of answer0.1 is not correct"= setequal(digest(paste(toString(tolower(answer0.1)), "8c44c")), "4f7db8f6066001518a42f5070c3861aa"))
stopifnot("letters in string value of answer0.1 are correct but case is not correct"= setequal(digest(paste(toString(answer0.1), "8c44c")), "9c01aabcc9cd69be9502aa601bf680e4"))

print('Success!')

**Question 0.2** Multiple Choice: 
<br> {points: 1}

For which scenario would using `group_by()` + `summarize()` be appropriate?

A. To apply the same function to every row. 

B. To apply the same function to every column.

C. To apply the same function to groups of rows. 

D. To apply the same function to groups of columns.

*Assign your answer to an object called `answer0.2`.  Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. `"F"`).*

In [None]:
# Replace the fail() with your answer. 

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
library(digest)
stopifnot("type of answer0.2 is not character"= setequal(digest(paste(toString(class(answer0.2)), "9bea8")), "3fdc4aab2072a9cfa24a7f38521b6a97"))
stopifnot("length of answer0.2 is not correct"= setequal(digest(paste(toString(length(answer0.2)), "9bea8")), "c13695fa7859703bd79c245a30e9062b"))
stopifnot("value of answer0.2 is not correct"= setequal(digest(paste(toString(tolower(answer0.2)), "9bea8")), "0404480a667292e5fc0a7fa025a8ec1b"))
stopifnot("letters in string value of answer0.2 are correct but case is not correct"= setequal(digest(paste(toString(answer0.2), "9bea8")), "53ddfcba7833c00fc9fcc981a95a8ee2"))

print('Success!')

**Question 0.3** Multiple Choice: 
<br> {points: 1}

For which scenario would using one of the `purrr` `map_*` functions be appropriate?

A. To apply the same function to groups of rows.

B. To apply the same function to every column.

C. To apply the same function to groups of columns. 

D. All of the above.

*Assign your answer to an object called `answer0.3`. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. `"F"`).*

In [None]:
# Replace the fail() with your answer. 

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
library(digest)
stopifnot("type of answer0.3 is not character"= setequal(digest(paste(toString(class(answer0.3)), "480a5")), "f0e6f49c23f73bde19aa6e7f578fd38c"))
stopifnot("length of answer0.3 is not correct"= setequal(digest(paste(toString(length(answer0.3)), "480a5")), "72e3ce2ec1c24d95cbf2a4141b940d76"))
stopifnot("value of answer0.3 is not correct"= setequal(digest(paste(toString(tolower(answer0.3)), "480a5")), "035781451c5e7f12406f34db17587d15"))
stopifnot("letters in string value of answer0.3 are correct but case is not correct"= setequal(digest(paste(toString(answer0.3), "480a5")), "e750fcc6efa6308b4e230d0bd21adf1d"))

print('Success!')

## 1. Assessing avocado prices to inform restaurant menu planning

It is well known that millennials LOVE avocado toast (joking...well mostly 😉), and so many restaurants offer menu items that centre around this delicious food! Like the prices of many food items, avocado prices fluctuate. So a restaurant that wants to maximize profits on avocado-containing dishes might ask whether there are times when avocados are less expensive to purchase. If such times exist, this is when the restaurant should put avocado-containing dishes on the menu to maximize its profits for those dishes.

<img align="left" src="https://www.averiecooks.com/wp-content/uploads/2017/07/egghole-2.jpg" width="150" />

*Source: https://www.averiecooks.com/egg-hole-avocado-toast/*

To answer this question we will analyze a data set of avocado sales from multiple US markets. The data were downloaded from the [Hass Avocado Board website](http://www.hassavocadoboard.com/) in May of 2018 and compiled into a single CSV file. Each row in the data set contains weekly sales data for a region. The data set spans the years 2015 to 2018.

Some relevant columns in the dataset:

- `Date` - The date in year-month-day format
- `average_price` - The average price of a single avocado
- `type` - conventional or organic
- `yr` - The year
- `region` - The city or region of the observation
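- `small_hass_volume` - Volume of small Hass avocados, in pounds (lbs)
- `large_hass_volume` - Volume of large Hass avocados, in pounds (lbs)
- `extra_l_hass_volume` - Volume of extra large Hass avocados, in pounds (lbs)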
- `wk` - integer number for the calendar week in the year (e.g., first week of January is 1, and last week of December is 52).

To answer our question of whether there are times in the year when avocados are typically less expensive (and thus we can make more profitable menu items with them at a restaurant) we will want to create a scatter plot of `average_price` (y-axis) versus `Date` (x-axis).

**Question 1.1** Multiple Choice:
<br> {points: 1}

Which of the following is not included in the `csv` file?

A. Average price of a single avocado.

B. The farming practice (production with/without the use of chemicals). 

C. Average price of a bag of avocados.

D. All options are included in the data set.

*Assign your answer to an object called `answer1.1`. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. `"F"`).*

In [None]:
# Replace the fail() with your answer. 

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
library(digest)
stopifnot("type of answer1.1 is not character"= setequal(digest(paste(toString(class(answer1.1)), "ec118")), "e018a26824269560efa6a7d3a78d3b40"))
stopifnot("length of answer1.1 is not correct"= setequal(digest(paste(toString(length(answer1.1)), "ec118")), "c80a712371da11395e42dde4098a6031"))
stopifnot("value of answer1.1 is not correct"= setequal(digest(paste(toString(tolower(answer1.1)), "ec118")), "f1f9122db747e20e2ef49d066edebb82"))
stopifnot("letters in string value of answer1.1 are correct but case is not correct"= setequal(digest(paste(toString(answer1.1), "ec118")), "34e1fef391dfe1382a9b86b23c212757"))

print('Success!')

**Question 1.2** Multiple Choice:
<br> {points: 1}

The rows in the data frame represent:

A. daily avocado sales data for a region

B. weekly avocado sales data for a region

C. bi-weekly avocado sales data for a region

D. yearly avocado sales data for a region

*Assign your answer to an object called `answer1.2`. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. `"F"`).*

In [None]:
# Replace the fail() with your answer. 

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
library(digest)
stopifnot("type of answer1.2 is not character"= setequal(digest(paste(toString(class(answer1.2)), "e8c45")), "a901b8d371f39b8597c80086db687147"))
stopifnot("length of answer1.2 is not correct"= setequal(digest(paste(toString(length(answer1.2)), "e8c45")), "2e2666b55f20d431a550e24128560f81"))
stopifnot("value of answer1.2 is not correct"= setequal(digest(paste(toString(tolower(answer1.2)), "e8c45")), "40618c348c94020d630f7622fe183b2e"))
stopifnot("letters in string value of answer1.2 are correct but case is not correct"= setequal(digest(paste(toString(answer1.2), "e8c45")), "35a28fb1ffe83d0917250a6215772d80"))

print('Success!')

**Question 1.3** 
<br> {points: 1}

The first step to plotting average price against date is to read the file `avocado_prices.csv` using the shortest relative path. The data file was given to you along with this worksheet, but you will have to look to see where it is in the `worksheet_03` directory to correctly load it. When you do this, you should also preview the file to help you choose an appropriate `read_*` function to read the data.

*Assign your answer to an object called `avocado`.* 
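
As a rough illustration of previewing a file before choosing a `read_*` function, here is a minimal sketch. It assumes a small, hypothetical comma-separated file written to a temporary location (`toy_path` and its contents are made up for this example and are not the worksheet data). Using `readLines()` is just one way to peek at the raw text.

```r
library(readr)

# Hypothetical toy file, created only for this illustration
toy_path <- tempfile(fileext = ".csv")
writeLines(c("week,price", "1,1.25", "2,1.40"), toy_path)

# Preview the first raw lines to check for a header row and the delimiter
readLines(toy_path, n = 2)

# The preview shows comma-separated values with a header, so read_csv() fits
toy_data <- read_csv(toy_path)
toy_data
```

If the preview had shown tabs or semicolons between values instead, a different `read_*` function would be the better choice.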

In [None]:
#... <- ...("...")

# your code here
fail() # No Answer - remove if you provide an answer
avocado 

In [None]:
library(digest)
stopifnot("avocado should be a data frame"= setequal(digest(paste(toString('data.frame' %in% class(avocado)), "6830e")), "10904dfb90d0e48e887a23973b27c218"))
stopifnot("dimensions of avocado are not correct"= setequal(digest(paste(toString(dim(avocado)), "6830e")), "8b68118bed3e559ffe023d29a19319e9"))
stopifnot("column names of avocado are not correct"= setequal(digest(paste(toString(sort(colnames(avocado))), "6830e")), "f1c12e24360e04ba0dc619df1a64e4cc"))
stopifnot("types of columns in avocado are not correct"= setequal(digest(paste(toString(sort(unlist(sapply(avocado, class)))), "6830e")), "94038b3a92901ab7951c8b9889f976e4"))
stopifnot("values in one or more numerical columns in avocado are not correct"= setequal(digest(paste(toString(if (any(sapply(avocado, is.numeric))) sort(round(sapply(avocado[, sapply(avocado, is.numeric)], sum, na.rm = TRUE), 2)) else 0), "6830e")), "f68caeeba0663c026f023a70c0a4dc91"))
stopifnot("values in one or more character columns in avocado are not correct"= setequal(digest(paste(toString(if (any(sapply(avocado, is.character))) sum(sapply(avocado[sapply(avocado, is.character)], function(x) length(unique(x)))) else 0), "6830e")), "91710d839850e7713918ac672cf3a720"))
stopifnot("values in one or more factor columns in avocado are not correct"= setequal(digest(paste(toString(if (any(sapply(avocado, is.factor))) sum(sapply(avocado[, sapply(avocado, is.factor)], function(col) length(unique(col)))) else 0), "6830e")), "472edccc2acc8a245fee5deb476f0a02"))

print('Success!')

**Question 1.4** Multiple Choice:
<br> {points: 1}

Why are the 2nd to 5th columns \<dbl\> instead of \<int\>?

A. They aren't "real" numbers. 

B. They contain decimals. 

C. They are numbers created using text/letters. 

D. They are integers. 

*Assign your answer to an object called `answer1.4`. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. `"F"`).*

In [None]:
# Make sure the correct answer is an uppercase letter. 
# Surround your answer with quotation marks.
# Replace the fail() with your answer. 

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
library(digest)
stopifnot("type of answer1.4 is not character"= setequal(digest(paste(toString(class(answer1.4)), "bd916")), "3d6d3d42a78d4222e94a9c4fb81c976b"))
stopifnot("length of answer1.4 is not correct"= setequal(digest(paste(toString(length(answer1.4)), "bd916")), "5ab72744d0c54caeaa725420cb625005"))
stopifnot("value of answer1.4 is not correct"= setequal(digest(paste(toString(tolower(answer1.4)), "bd916")), "9c629dce0f4a4d856a05e144ce47713c"))
stopifnot("letters in string value of answer1.4 are correct but case is not correct"= setequal(digest(paste(toString(answer1.4), "bd916")), "116e3361d6fd46ac17eb32353fc4b133"))

print('Success!')

Before we get started doing our analysis, let's learn about the pipe operator, `|>`, as it can be very helpful when doing data analysis in R!

### Pipe Operators: `|>`
The pipe operator allows you to chain together different functions: it takes the output of one statement and makes it the input of the next statement. Having a chain of processing functions is known as a *pipeline*.
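
To make the idea concrete, here is a small sketch using a made-up numeric vector (`toy_values` is hypothetical and unrelated to the avocado data). It shows that a nested function call and a pipeline give the same result. The native pipe `|>` requires R 4.1 or later.

```r
# Hypothetical toy vector, for illustration only
toy_values <- c(1, 4, 9)

# Nested call: read from the inside out
nested_result <- sum(sqrt(toy_values))

# Pipeline: read from left to right, each output feeds the next function
piped_result <- toy_values |>
    sqrt() |>
    sum()

nested_result
piped_result
```

Both versions take the square root of each element and then add the results, so they return the same number.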

If we wanted to subset the avocado data to obtain just the average prices for organic avocados, we would first need to use the `filter()` function to keep only the rows where the `type` is organic. Then we would need to use the `select()` function to get just the average price column.

Below we illustrate how to do this using the pipe operator, `|>`, instead of creating an intermediate object as we have in past worksheets: 

> *Note: the indentation on the second line of the pipeline is not required, but added for readability.*

In [None]:
# run this cell
filter(avocado, type == "organic") |> 
    select(average_price)

We can even start off a pipeline by passing the data frame into the first function. This is convenient and aids in readability. You will see this being used often in this course going forward. Below we show an example of this doing the same task we just completed above (subsetting the average price data for organic avocados).

In [None]:
avocado |> 
    filter(type == "organic") |> 
    select(average_price)

**Question 1.5**

{points: 1}

To answer our question, let's now create the scatter plot where we plot `average_price` on the y-axis versus `Date` on the x-axis. Fill in the ... in the cell below. Copy and paste your finished answer in place of `fail()`. Assign your answer to an object called `avocado_plot`. Don't forget to create proper English axis labels.

In [None]:
options(repr.plot.width = 14, repr.plot.height = 7) # Modifies the size of the plots
#... <- ... |>
#    ggplot(aes(x = ..., y = ...)) + 
#        geom_...() +
#        xlab("...") + 
#        ylab("...") + 
#        theme(text = element_text(size=20))


# your code here
fail() # No Answer - remove if you provide an answer
avocado_plot

In [None]:
library(digest)
stopifnot("type of plot is not correct (if you are using two types of geoms, try flipping the order of the geom objects!)"= setequal(digest(paste(toString(sapply(seq_len(length(avocado_plot$layers)), function(i) {c(class(avocado_plot$layers[[i]]$geom))[1]})), "3f169")), "63502eda337c6ee430374b9810a3f72d"))
stopifnot("variable x is not correct"= setequal(digest(paste(toString(unlist(lapply(sapply(seq_len(length(avocado_plot$layers)), function(i) {rlang::get_expr(c(avocado_plot$layers[[i]]$mapping, avocado_plot$mapping)$x)}), as.character))), "3f169")), "095a841ea3a52371f820bc35786e7d25"))
stopifnot("variable y is not correct"= setequal(digest(paste(toString(unlist(lapply(sapply(seq_len(length(avocado_plot$layers)), function(i) {rlang::get_expr(c(avocado_plot$layers[[i]]$mapping, avocado_plot$mapping)$y)}), as.character))), "3f169")), "8c6f1d1ab08740b7360f56c45e84c202"))
stopifnot("x-axis label is not descriptive, nicely formatted, or human readable"= setequal(digest(paste(toString(rlang::get_expr(c(avocado_plot$layers[[1]]$mapping, avocado_plot$mapping)$x)!= avocado_plot$labels$x), "3f169")), "f8d50264b5ac0ea4bd26c5c34b994db9"))
stopifnot("y-axis label is not descriptive, nicely formatted, or human readable"= setequal(digest(paste(toString(rlang::get_expr(c(avocado_plot$layers[[1]]$mapping, avocado_plot$mapping)$y)!= avocado_plot$labels$y), "3f169")), "f8d50264b5ac0ea4bd26c5c34b994db9"))
stopifnot("incorrect colour variable in avocado_plot, specify a correct one if required"= setequal(digest(paste(toString(rlang::get_expr(c(avocado_plot$layers[[1]]$mapping, avocado_plot$mapping)$colour)), "3f169")), "765218ac60271664869eeac40b99d5e8"))
stopifnot("incorrect shape variable in avocado_plot, specify a correct one if required"= setequal(digest(paste(toString(rlang::get_expr(c(avocado_plot$layers[[1]]$mapping, avocado_plot$mapping)$shape)), "3f169")), "765218ac60271664869eeac40b99d5e8"))
stopifnot("the colour label in avocado_plot is not descriptive, nicely formatted, or human readable"= setequal(digest(paste(toString(rlang::get_expr(c(avocado_plot$layers[[1]]$mapping, avocado_plot$mapping)$colour) != avocado_plot$labels$colour), "3f169")), "765218ac60271664869eeac40b99d5e8"))
stopifnot("the shape label in avocado_plot is not descriptive, nicely formatted, or human readable"= setequal(digest(paste(toString(rlang::get_expr(c(avocado_plot$layers[[1]]$mapping, avocado_plot$mapping)$colour) != avocado_plot$labels$shape), "3f169")), "765218ac60271664869eeac40b99d5e8"))
stopifnot("fill variable in avocado_plot is not correct"= setequal(digest(paste(toString(quo_name(avocado_plot$mapping$fill)), "3f169")), "dcbe9780f7d64b46966485ef79b1e40f"))
stopifnot("fill label in avocado_plot is not informative"= setequal(digest(paste(toString((quo_name(avocado_plot$mapping$fill) != avocado_plot$labels$fill)), "3f169")), "765218ac60271664869eeac40b99d5e8"))
stopifnot("position argument in avocado_plot is not correct"= setequal(digest(paste(toString(class(avocado_plot$layers[[1]]$position)[1]), "3f169")), "9625108ea200e212e3ffbb9077559a3f"))

stopifnot("avocado_plot$data should be a data frame"= setequal(digest(paste(toString('data.frame' %in% class(avocado_plot$data)), "3f16a")), "aae4585058d687186821ad56c5bdb320"))
stopifnot("dimensions of avocado_plot$data are not correct"= setequal(digest(paste(toString(dim(avocado_plot$data)), "3f16a")), "e0c47dadca093ae1a33220d69d5cfa2a"))
stopifnot("column names of avocado_plot$data are not correct"= setequal(digest(paste(toString(sort(colnames(avocado_plot$data))), "3f16a")), "45480502304cc43250a2fab7ae2d64d2"))
stopifnot("types of columns in avocado_plot$data are not correct"= setequal(digest(paste(toString(sort(unlist(sapply(avocado_plot$data, class)))), "3f16a")), "285bf41272651a46a9d44199cbb6651f"))
stopifnot("values in one or more numerical columns in avocado_plot$data are not correct"= setequal(digest(paste(toString(if (any(sapply(avocado_plot$data, is.numeric))) sort(round(sapply(avocado_plot$data[, sapply(avocado_plot$data, is.numeric)], sum, na.rm = TRUE), 2)) else 0), "3f16a")), "8f86868f24942f511939a7cd1249aeb3"))
stopifnot("values in one or more character columns in avocado_plot$data are not correct"= setequal(digest(paste(toString(if (any(sapply(avocado_plot$data, is.character))) sum(sapply(avocado_plot$data[sapply(avocado_plot$data, is.character)], function(x) length(unique(x)))) else 0), "3f16a")), "43abd3e12d4e0b0e1f27ab80af86d2a2"))
stopifnot("values in one or more factor columns in avocado_plot$data are not correct"= setequal(digest(paste(toString(if (any(sapply(avocado_plot$data, is.factor))) sum(sapply(avocado_plot$data[, sapply(avocado_plot$data, is.factor)], function(col) length(unique(col)))) else 0), "3f16a")), "63f9ef3db8477be85dc20a8ffa362e54"))

print('Success!')

We might be able to squint and start to see some pattern in the data above, but really what we see in the plot above is not very informative. Why? Because there is a lot of overplotting (data points sitting on top of other data points). What can we do? One solution is to reduce/aggregate the data in a meaningful way to help answer our question. Remember that we are interested in determining whether there are times when avocados are less expensive so that we can recommend when restaurants should put dishes on the menu that contain avocado to maximize their profits for those dishes.

In the data we plotted above, each row is the total sales of avocados for one region in one week. Let's use `group_by` + `summarize` to calculate the average price for each week across years and regions. We can then plot that aggregated price against the week and perhaps get a clearer picture.
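
The `group_by` + `summarize` pattern can be sketched on a tiny made-up data frame. All names here (`toy_sales`, `week`, `region`, `price`, `mean_price`) are hypothetical and chosen only for illustration.

```r
library(dplyr)

# Hypothetical toy data: two weeks, with one row per region in each week
toy_sales <- tibble(
    week = c(1, 1, 2, 2),
    region = c("North", "South", "North", "South"),
    price = c(1.00, 2.00, 3.00, 5.00)
)

# Collapse the rows within each week into a single average price
toy_weekly <- toy_sales |>
    group_by(week) |>
    summarize(mean_price = mean(price, na.rm = TRUE))
toy_weekly
```

The result has one row per group (here, one per week) instead of one row per original observation, which is what reduces overplotting.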

**Question 1.6**
<br> {points: 1}

Create a reduced/aggregated version of the `avocado` data set and name it `avocado_aggregate`. To do this you will want to `group_by` the `wk` column and then use `summarize` to calculate the average price (name that column `average_price`).

In [None]:
#... <- ... |> 
#    group_by(...) |> 
#    summarize(... = mean(average_price, na.rm = TRUE))

# your code here
fail() # No Answer - remove if you provide an answer
avocado_aggregate

In [None]:
library(digest)
stopifnot("avocado_aggregate should be a data frame"= setequal(digest(paste(toString('data.frame' %in% class(avocado_aggregate)), "10864")), "40843887145368275c53d796b1184635"))
stopifnot("dimensions of avocado_aggregate are not correct"= setequal(digest(paste(toString(dim(avocado_aggregate)), "10864")), "9e38c14a866fe19191a7f705f84d0854"))
stopifnot("column names of avocado_aggregate are not correct"= setequal(digest(paste(toString(sort(colnames(avocado_aggregate))), "10864")), "4d5aa6f9fc262590e9bce9e6e14e82be"))
stopifnot("types of columns in avocado_aggregate are not correct"= setequal(digest(paste(toString(sort(unlist(sapply(avocado_aggregate, class)))), "10864")), "df684857fbcf11c71c6c58852e84dee9"))
stopifnot("values in one or more numerical columns in avocado_aggregate are not correct"= setequal(digest(paste(toString(if (any(sapply(avocado_aggregate, is.numeric))) sort(round(sapply(avocado_aggregate[, sapply(avocado_aggregate, is.numeric)], sum, na.rm = TRUE), 2)) else 0), "10864")), "0a0bd3d722beaa3cd315f48503b74fa0"))
stopifnot("values in one or more character columns in avocado_aggregate are not correct"= setequal(digest(paste(toString(if (any(sapply(avocado_aggregate, is.character))) sum(sapply(avocado_aggregate[sapply(avocado_aggregate, is.character)], function(x) length(unique(x)))) else 0), "10864")), "4deb06145f92f85d0274d4e2d1e6f725"))
stopifnot("values in one or more factor columns in avocado_aggregate are not correct"= setequal(digest(paste(toString(if (any(sapply(avocado_aggregate, is.factor))) sum(sapply(avocado_aggregate[, sapply(avocado_aggregate, is.factor)], function(col) length(unique(col)))) else 0), "10864")), "4deb06145f92f85d0274d4e2d1e6f725"))

print('Success!')

**Question 1.7**
<br> {points: 1}

Now let's take the `avocado_aggregate` data frame and use it to create a scatter plot where we plot `average_price` on the y-axis versus `wk` on the x-axis. Assign your answer to an object called `avocado_aggregate_plot`. Don't forget to create proper English axis labels.

In [None]:
#... <- ... |>
#    ggplot(aes(x = ..., y = ...)) + 
#        ...() +
#        ...("...") + 
#        ...("...") +
#        theme(text = element_text(size=20))

# your code here
fail() # No Answer - remove if you provide an answer
avocado_aggregate_plot

In [None]:
library(digest)
stopifnot("type of plot is not correct (if you are using two types of geoms, try flipping the order of the geom objects!)"= setequal(digest(paste(toString(sapply(seq_len(length(avocado_aggregate_plot$layers)), function(i) {c(class(avocado_aggregate_plot$layers[[i]]$geom))[1]})), "eacfa")), "98c20803b95bb6638c17ece64ca391f2"))
stopifnot("variable x is not correct"= setequal(digest(paste(toString(unlist(lapply(sapply(seq_len(length(avocado_aggregate_plot$layers)), function(i) {rlang::get_expr(c(avocado_aggregate_plot$layers[[i]]$mapping, avocado_aggregate_plot$mapping)$x)}), as.character))), "eacfa")), "0d6e69246268a4c27362d05a82ce335d"))
stopifnot("variable y is not correct"= setequal(digest(paste(toString(unlist(lapply(sapply(seq_len(length(avocado_aggregate_plot$layers)), function(i) {rlang::get_expr(c(avocado_aggregate_plot$layers[[i]]$mapping, avocado_aggregate_plot$mapping)$y)}), as.character))), "eacfa")), "8f998325cca3895d22e1d88c92427e5e"))
stopifnot("x-axis label is not descriptive, nicely formatted, or human readable"= setequal(digest(paste(toString(rlang::get_expr(c(avocado_aggregate_plot$layers[[1]]$mapping, avocado_aggregate_plot$mapping)$x)!= avocado_aggregate_plot$labels$x), "eacfa")), "82225908d4c39e7903f3b76cf7c02278"))
stopifnot("y-axis label is not descriptive, nicely formatted, or human readable"= setequal(digest(paste(toString(rlang::get_expr(c(avocado_aggregate_plot$layers[[1]]$mapping, avocado_aggregate_plot$mapping)$y)!= avocado_aggregate_plot$labels$y), "eacfa")), "82225908d4c39e7903f3b76cf7c02278"))
stopifnot("incorrect colour variable in avocado_aggregate_plot, specify a correct one if required"= setequal(digest(paste(toString(rlang::get_expr(c(avocado_aggregate_plot$layers[[1]]$mapping, avocado_aggregate_plot$mapping)$colour)), "eacfa")), "95c9e32e41d3e53852380b33b11db958"))
stopifnot("incorrect shape variable in avocado_aggregate_plot, specify a correct one if required"= setequal(digest(paste(toString(rlang::get_expr(c(avocado_aggregate_plot$layers[[1]]$mapping, avocado_aggregate_plot$mapping)$shape)), "eacfa")), "95c9e32e41d3e53852380b33b11db958"))
stopifnot("the colour label in avocado_aggregate_plot is not descriptive, nicely formatted, or human readable"= setequal(digest(paste(toString(rlang::get_expr(c(avocado_aggregate_plot$layers[[1]]$mapping, avocado_aggregate_plot$mapping)$colour) != avocado_aggregate_plot$labels$colour), "eacfa")), "95c9e32e41d3e53852380b33b11db958"))
stopifnot("the shape label in avocado_aggregate_plot is not descriptive, nicely formatted, or human readable"= setequal(digest(paste(toString(rlang::get_expr(c(avocado_aggregate_plot$layers[[1]]$mapping, avocado_aggregate_plot$mapping)$colour) != avocado_aggregate_plot$labels$shape), "eacfa")), "95c9e32e41d3e53852380b33b11db958"))
stopifnot("fill variable in avocado_aggregate_plot is not correct"= setequal(digest(paste(toString(quo_name(avocado_aggregate_plot$mapping$fill)), "eacfa")), "9b735ed4674fd4759e74c6cb82a8a723"))
stopifnot("fill label in avocado_aggregate_plot is not informative"= setequal(digest(paste(toString((quo_name(avocado_aggregate_plot$mapping$fill) != avocado_aggregate_plot$labels$fill)), "eacfa")), "95c9e32e41d3e53852380b33b11db958"))
stopifnot("position argument in avocado_aggregate_plot is not correct"= setequal(digest(paste(toString(class(avocado_aggregate_plot$layers[[1]]$position)[1]), "eacfa")), "8db37a2b20fe926608f3e4cf2a5f690e"))

stopifnot("avocado_aggregate_plot$data should be a data frame"= setequal(digest(paste(toString('data.frame' %in% class(avocado_aggregate_plot$data)), "eacfb")), "417212fc7582e5ee7e889a65f71225dc"))
stopifnot("dimensions of avocado_aggregate_plot$data are not correct"= setequal(digest(paste(toString(dim(avocado_aggregate_plot$data)), "eacfb")), "feff2f62a6a438c3ffe5b4e23722b0c5"))
stopifnot("column names of avocado_aggregate_plot$data are not correct"= setequal(digest(paste(toString(sort(colnames(avocado_aggregate_plot$data))), "eacfb")), "6cae89340ef60c31afc4e607d2087b0a"))
stopifnot("types of columns in avocado_aggregate_plot$data are not correct"= setequal(digest(paste(toString(sort(unlist(sapply(avocado_aggregate_plot$data, class)))), "eacfb")), "b3e53facaf990b6122c9e3edf39d81f8"))
stopifnot("values in one or more numerical columns in avocado_aggregate_plot$data are not correct"= setequal(digest(paste(toString(if (any(sapply(avocado_aggregate_plot$data, is.numeric))) sort(round(sapply(avocado_aggregate_plot$data[, sapply(avocado_aggregate_plot$data, is.numeric)], sum, na.rm = TRUE), 2)) else 0), "eacfb")), "bac027e24cf84b9afa97b2ddcedd12d5"))
stopifnot("values in one or more character columns in avocado_aggregate_plot$data are not correct"= setequal(digest(paste(toString(if (any(sapply(avocado_aggregate_plot$data, is.character))) sum(sapply(avocado_aggregate_plot$data[sapply(avocado_aggregate_plot$data, is.character)], function(x) length(unique(x)))) else 0), "eacfb")), "e85b2b1eaaed2964c115baeb64c2d46a"))
stopifnot("values in one or more factor columns in avocado_aggregate_plot$data are not correct"= setequal(digest(paste(toString(if (any(sapply(avocado_aggregate_plot$data, is.factor))) sum(sapply(avocado_aggregate_plot$data[, sapply(avocado_aggregate_plot$data, is.factor)], function(col) length(unique(col)))) else 0), "eacfb")), "e85b2b1eaaed2964c115baeb64c2d46a"))

print('Success!')

We can now see that the price of avocados does indeed fluctuate throughout the year. We could use this information to recommend to restaurants that, if they want to maximize profit from menu items that contain avocados, they should only offer them on the menu roughly between December and May.

Why might this happen? Perhaps price has something to do with supply? We can also use this data set to get some insight into that question by plotting total avocado volume (y-axis) versus week. To do this, we will first have to create a column called `total_volume` whose value is the sum of the small, large and extra large-sized avocado volumes. To do this we will have to go back to the original `avocado` data frame we loaded.
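
As a minimal sketch of building a new column from the sum of existing ones with `mutate`, consider the made-up data frame below. The names `toy_fruit`, `apples`, `pears`, `plums` and `total_fruit` are hypothetical and not part of the avocado data.

```r
library(dplyr)

# Hypothetical toy data with three numeric columns to add together
toy_fruit <- tibble(
    apples = c(2, 5),
    pears = c(1, 0),
    plums = c(4, 3)
)

# mutate() keeps the existing columns and appends the new one
toy_fruit <- toy_fruit |>
    mutate(total_fruit = apples + pears + plums)
toy_fruit
```

The addition is done row by row, so each row's new value is the sum of that row's three columns.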

**Question 1.8**
<br> {points: 1}

Our next step toward plotting `total_volume` against week is to use `mutate` to create a new column in the `avocado` data frame called `total_volume`, which is equal to the sum of all three volume columns.

Fill in the `...` in the cell below. Copy and paste your finished answer and replace the `fail()`. 

In [None]:
#... <- ... |>
#     mutate(... = ... + ... + ...) 

# your code here
fail() # No Answer - remove if you provide an answer
avocado

In [None]:
library(digest)
stopifnot("avocado should be a data frame"= setequal(digest(paste(toString('data.frame' %in% class(avocado)), "1ec2d")), "678ee891f0e31acaaaf9d0b9a5a6b697"))
stopifnot("dimensions of avocado are not correct"= setequal(digest(paste(toString(dim(avocado)), "1ec2d")), "666bbce72d1bd500d82f997cf3248392"))
stopifnot("column names of avocado are not correct"= setequal(digest(paste(toString(sort(colnames(avocado))), "1ec2d")), "f133c69492eb8d86856586599d28dab1"))
stopifnot("types of columns in avocado are not correct"= setequal(digest(paste(toString(sort(unlist(sapply(avocado, class)))), "1ec2d")), "c760ff33d2f4b5639472afc614473501"))
stopifnot("values in one or more numerical columns in avocado are not correct"= setequal(digest(paste(toString(if (any(sapply(avocado, is.numeric))) sort(round(sapply(avocado[, sapply(avocado, is.numeric)], sum, na.rm = TRUE), 2)) else 0), "1ec2d")), "ddc44ba4024eea7bd093e68307fb8323"))
stopifnot("values in one or more character columns in avocado are not correct"= setequal(digest(paste(toString(if (any(sapply(avocado, is.character))) sum(sapply(avocado[sapply(avocado, is.character)], function(x) length(unique(x)))) else 0), "1ec2d")), "09752fd11d903a40febef092927ec090"))
stopifnot("values in one or more factor columns in avocado are not correct"= setequal(digest(paste(toString(if (any(sapply(avocado, is.factor))) sum(sapply(avocado[, sapply(avocado, is.factor)], function(col) length(unique(col)))) else 0), "1ec2d")), "aa055fccf1ddaa2f4f08134f898d4759"))

print('Success!')

**Question 1.9** 
<br> {points: 1}

Now, create another reduced/aggregated version of the `avocado` data frame and name it `avocado_aggregate_2`. To do this you will want to `group_by` the `wk` column and then use `summarize` to calculate the average total volume (name that column `total_volume`).
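
If the `group_by` and `summarize` pattern feels unfamiliar, the sketch below shows it on a small made-up data frame. The `readings` data and the column names `day`, `value` and `mean_value` are assumptions chosen only for illustration.

```r
library(tidyverse)

# Hypothetical toy data: two readings per day
readings <- tibble(
  day = c("Mon", "Mon", "Tue", "Tue"),
  value = c(2, 4, 10, 20)
)

# group_by() marks the groups; summarize() then collapses
# each group to a single row holding the group's mean
daily_mean <- readings |>
  group_by(day) |>
  summarize(mean_value = mean(value, na.rm = TRUE))

daily_mean
```

The result has one row per group, so four rows of readings become two rows of daily means.
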

In [None]:
#... <- ... |> 
#    group_by(...) |> 
#    summarize(...)

# your code here
fail() # No Answer - remove if you provide an answer
avocado_aggregate_2

In [None]:
library(digest)
stopifnot("avocado_aggregate_2 should be a data frame"= setequal(digest(paste(toString('data.frame' %in% class(avocado_aggregate_2)), "602d")), "0b458e0fd292c3b120547d3ccf133fbc"))
stopifnot("dimensions of avocado_aggregate_2 are not correct"= setequal(digest(paste(toString(dim(avocado_aggregate_2)), "602d")), "97115f28e8b8183acb846b6abdc2d65f"))
stopifnot("column names of avocado_aggregate_2 are not correct"= setequal(digest(paste(toString(sort(colnames(avocado_aggregate_2))), "602d")), "24ce9a175e180b8658c637bb419eeaf3"))
stopifnot("types of columns in avocado_aggregate_2 are not correct"= setequal(digest(paste(toString(sort(unlist(sapply(avocado_aggregate_2, class)))), "602d")), "8df22773ee7720393cd1053c24224db0"))
stopifnot("values in one or more numerical columns in avocado_aggregate_2 are not correct"= setequal(digest(paste(toString(if (any(sapply(avocado_aggregate_2, is.numeric))) sort(round(sapply(avocado_aggregate_2[, sapply(avocado_aggregate_2, is.numeric)], sum, na.rm = TRUE), 2)) else 0), "602d")), "b2e453fb7f1fd5264bf0980016ae952d"))
stopifnot("values in one or more character columns in avocado_aggregate_2 are not correct"= setequal(digest(paste(toString(if (any(sapply(avocado_aggregate_2, is.character))) sum(sapply(avocado_aggregate_2[sapply(avocado_aggregate_2, is.character)], function(x) length(unique(x)))) else 0), "602d")), "ef309d693a97aa5e2d41f671a385cec0"))
stopifnot("values in one or more factor columns in avocado_aggregate_2 are not correct"= setequal(digest(paste(toString(if (any(sapply(avocado_aggregate_2, is.factor))) sum(sapply(avocado_aggregate_2[, sapply(avocado_aggregate_2, is.factor)], function(col) length(unique(col)))) else 0), "602d")), "ef309d693a97aa5e2d41f671a385cec0"))

print('Success!')

**Question 1.10** 
<br> {points: 1}

Now let's take the `avocado_aggregate_2` data frame and use it to create a scatter plot where we plot average `total_volume` (in pounds, lbs) on the y-axis versus `wk` on the x-axis. Assign your answer to an object called `avocado_aggregate_plot_2`. Don't forget to create proper English axis labels.

> Hint: don't forget to include the units for volume in your data visualization.
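
For reference, here is a minimal sketch of a scatter plot with human-readable axis labels that include units. The `heights` data frame and its columns are made up for this example.

```r
library(tidyverse)

# Hypothetical toy data: height of a child at different ages
heights <- tibble(
  age_years = c(2, 4, 6, 8),
  height_cm = c(86, 102, 115, 128)
)

# The axis labels spell out the variable and give the units in parentheses
height_plot <- heights |>
  ggplot(aes(x = age_years, y = height_cm)) +
  geom_point() +
  xlab("Age (years)") +
  ylab("Height (cm)") +
  theme(text = element_text(size = 20))

height_plot
```

Note that the labels differ from the raw column names, which is what makes them readable to someone who has not seen the data frame.
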

In [None]:
#... <- ... |>
#    ggplot(aes(x = ..., y = ...)) + 
#        ...() +
#        ...("...") + 
#        ...("...") +
#        theme(text = element_text(size=20))


# your code here
fail() # No Answer - remove if you provide an answer
avocado_aggregate_plot_2

In [None]:
library(digest)
stopifnot("type of plot is not correct (if you are using two types of geoms, try flipping the order of the geom objects!)"= setequal(digest(paste(toString(sapply(seq_len(length(avocado_aggregate_plot_2$layers)), function(i) {c(class(avocado_aggregate_plot_2$layers[[i]]$geom))[1]})), "9091c")), "c160aa75cefeb2a8504ad2cf86364c8c"))
stopifnot("variable x is not correct"= setequal(digest(paste(toString(unlist(lapply(sapply(seq_len(length(avocado_aggregate_plot_2$layers)), function(i) {rlang::get_expr(c(avocado_aggregate_plot_2$layers[[i]]$mapping, avocado_aggregate_plot_2$mapping)$x)}), as.character))), "9091c")), "939e8cc5b23d4a59dc0a2b65342d364a"))
stopifnot("variable y is not correct"= setequal(digest(paste(toString(unlist(lapply(sapply(seq_len(length(avocado_aggregate_plot_2$layers)), function(i) {rlang::get_expr(c(avocado_aggregate_plot_2$layers[[i]]$mapping, avocado_aggregate_plot_2$mapping)$y)}), as.character))), "9091c")), "5410fb7a099d7b1849144a2e17713d4a"))
stopifnot("x-axis label is not descriptive, nicely formatted, or human readable"= setequal(digest(paste(toString(rlang::get_expr(c(avocado_aggregate_plot_2$layers[[1]]$mapping, avocado_aggregate_plot_2$mapping)$x)!= avocado_aggregate_plot_2$labels$x), "9091c")), "85134a7506f5de5c2b90ad73f98914ef"))
stopifnot("y-axis label is not descriptive, nicely formatted, or human readable"= setequal(digest(paste(toString(rlang::get_expr(c(avocado_aggregate_plot_2$layers[[1]]$mapping, avocado_aggregate_plot_2$mapping)$y)!= avocado_aggregate_plot_2$labels$y), "9091c")), "85134a7506f5de5c2b90ad73f98914ef"))
stopifnot("incorrect colour variable in avocado_aggregate_plot_2, specify a correct one if required"= setequal(digest(paste(toString(rlang::get_expr(c(avocado_aggregate_plot_2$layers[[1]]$mapping, avocado_aggregate_plot_2$mapping)$colour)), "9091c")), "9774f228e322f6688582bbe1bd70da8c"))
stopifnot("incorrect shape variable in avocado_aggregate_plot_2, specify a correct one if required"= setequal(digest(paste(toString(rlang::get_expr(c(avocado_aggregate_plot_2$layers[[1]]$mapping, avocado_aggregate_plot_2$mapping)$shape)), "9091c")), "9774f228e322f6688582bbe1bd70da8c"))
stopifnot("the colour label in avocado_aggregate_plot_2 is not descriptive, nicely formatted, or human readable"= setequal(digest(paste(toString(rlang::get_expr(c(avocado_aggregate_plot_2$layers[[1]]$mapping, avocado_aggregate_plot_2$mapping)$colour) != avocado_aggregate_plot_2$labels$colour), "9091c")), "9774f228e322f6688582bbe1bd70da8c"))
stopifnot("the shape label in avocado_aggregate_plot_2 is not descriptive, nicely formatted, or human readable"= setequal(digest(paste(toString(rlang::get_expr(c(avocado_aggregate_plot_2$layers[[1]]$mapping, avocado_aggregate_plot_2$mapping)$colour) != avocado_aggregate_plot_2$labels$shape), "9091c")), "9774f228e322f6688582bbe1bd70da8c"))
stopifnot("fill variable in avocado_aggregate_plot_2 is not correct"= setequal(digest(paste(toString(quo_name(avocado_aggregate_plot_2$mapping$fill)), "9091c")), "fa8252b69ce6e4309b3f3a83e4b0dc1c"))
stopifnot("fill label in avocado_aggregate_plot_2 is not informative"= setequal(digest(paste(toString((quo_name(avocado_aggregate_plot_2$mapping$fill) != avocado_aggregate_plot_2$labels$fill)), "9091c")), "9774f228e322f6688582bbe1bd70da8c"))
stopifnot("position argument in avocado_aggregate_plot_2 is not correct"= setequal(digest(paste(toString(class(avocado_aggregate_plot_2$layers[[1]]$position)[1]), "9091c")), "32607333a9a8da9c469a13f94703f0b8"))

stopifnot("avocado_aggregate_plot_2$data should be a data frame"= setequal(digest(paste(toString('data.frame' %in% class(avocado_aggregate_plot_2$data)), "9091d")), "acf428f3fabd6f03b4a82671efce6658"))
stopifnot("dimensions of avocado_aggregate_plot_2$data are not correct"= setequal(digest(paste(toString(dim(avocado_aggregate_plot_2$data)), "9091d")), "476f61c2401867d2416791c07c3abf08"))
stopifnot("column names of avocado_aggregate_plot_2$data are not correct"= setequal(digest(paste(toString(sort(colnames(avocado_aggregate_plot_2$data))), "9091d")), "43b17ecf98f33772e3f3982d5bc1f835"))
stopifnot("types of columns in avocado_aggregate_plot_2$data are not correct"= setequal(digest(paste(toString(sort(unlist(sapply(avocado_aggregate_plot_2$data, class)))), "9091d")), "f89685f672d4520fc6d52ee5a376b4c7"))
stopifnot("values in one or more numerical columns in avocado_aggregate_plot_2$data are not correct"= setequal(digest(paste(toString(if (any(sapply(avocado_aggregate_plot_2$data, is.numeric))) sort(round(sapply(avocado_aggregate_plot_2$data[, sapply(avocado_aggregate_plot_2$data, is.numeric)], sum, na.rm = TRUE), 2)) else 0), "9091d")), "657b302c9d51378dc05bbf443a3643c9"))
stopifnot("values in one or more character columns in avocado_aggregate_plot_2$data are not correct"= setequal(digest(paste(toString(if (any(sapply(avocado_aggregate_plot_2$data, is.character))) sum(sapply(avocado_aggregate_plot_2$data[sapply(avocado_aggregate_plot_2$data, is.character)], function(x) length(unique(x)))) else 0), "9091d")), "79fda2175a353a7af0f5b2b191e71d37"))
stopifnot("values in one or more factor columns in avocado_aggregate_plot_2$data are not correct"= setequal(digest(paste(toString(if (any(sapply(avocado_aggregate_plot_2$data, is.factor))) sum(sapply(avocado_aggregate_plot_2$data[, sapply(avocado_aggregate_plot_2$data, is.factor)], function(col) length(unique(col)))) else 0), "9091d")), "79fda2175a353a7af0f5b2b191e71d37"))

print('Success!')

We can see from the above plot of the average total volume versus the week that more avocados are sold (and perhaps this reflects what is available for sale) roughly between January and May. This period of increased volume corresponds with the lower avocado prices. We can *hypothesize* (but not conclude, of course) that the lower prices may be due to an increased availability of avocados during this time period.

## 2. Sea Surface Temperatures in Departure Bay
The next data set that we will be looking at contains environmental data from 1914 to 2018. The data was collected by the DFO (Canada's Department of Fisheries and Oceans) at the Pacific Biological Station (Departure Bay). Daily sea surface temperature (in degrees Celsius) and salinity (in practical salinity units, PSU) observations have been carried out at several locations on the coast of British Columbia. The number of stations reporting at any given time has varied as sampling has been discontinued at some stations, and started or resumed at others.

The program, presently termed the British Columbia Shore Station Oceanographic Program (BCSOP), has 12 participating stations; most of these are staffed by Fisheries and Oceans Canada. You can look at data from other stations at http://www.pac.dfo-mpo.gc.ca/science/oceans/data-donnees/lightstations-phares/index-eng.html 

Further information from the Government of Canada's website indicates: 
>  Observations are made daily using seawater collected in a bucket lowered into the surface water at or near the daytime high tide. This sampling method was designed long ago by Dr. John P. Tully and has not been changed in the interests of a homogeneous data set. This means, for example, that if an observer starts sampling one day at 6 a.m., and continues to sample at the daytime high tide on the second day the sample will be taken at about 06:50 the next day, 07:40 the day after etc. When the daytime high-tide gets close to 6 p.m. the observer will then begin again to sample early in the morning, and the cycle continues. Since there is a day/night variation in the sea surface temperatures the daily time series will show a signal that varies with the 14-day tidal cycle. This artifact does not affect the monthly sea surface temperature data.

In this worksheet, we want to see if the sea surface temperature has been changing over time. 

**Question 2.1** True or False:
<br> {points: 1}

The sampling of surface water occurs at the same time each day. 

*Assign your answer to an object called `answer2.1`. Make sure your answer is lowercase "true" or lowercase "false".* 

In [None]:
# Replace the fail() with your answer. 

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
library(digest)
stopifnot("type of answer2.1 is not character"= setequal(digest(paste(toString(class(answer2.1)), "cf70a")), "f46034ad50041fe1f37e6fd4c1a87640"))
stopifnot("length of answer2.1 is not correct"= setequal(digest(paste(toString(length(answer2.1)), "cf70a")), "57c71b0b4a7b7bdde285898d5cb76b69"))
stopifnot("value of answer2.1 is not correct"= setequal(digest(paste(toString(tolower(answer2.1)), "cf70a")), "a33ccf911a9f05df7bea34417c88bd74"))
stopifnot("letters in string value of answer2.1 are correct but case is not correct"= setequal(digest(paste(toString(answer2.1), "cf70a")), "a33ccf911a9f05df7bea34417c88bd74"))

print('Success!')

**Question 2.2** Multiple Choice:
<br> {points: 1}

If high tide occurred at 9am today, what time would the scientist collect data tomorrow?

A. 11:10 am 

B. 9:50 am 

C. 10:00 pm 

D. Trick question... you skip days when collecting data. 

*Assign your answer to an object called `answer2.2`. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. `"F"`).* 

In [None]:
# Replace the fail() with your answer. 

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
library(digest)
stopifnot("type of answer2.2 is not character"= setequal(digest(paste(toString(class(answer2.2)), "61f6e")), "039f1d885a5c6edd012c23a6cb312f3f"))
stopifnot("length of answer2.2 is not correct"= setequal(digest(paste(toString(length(answer2.2)), "61f6e")), "df1d7aededad3ee47f11f9e6499fdf01"))
stopifnot("value of answer2.2 is not correct"= setequal(digest(paste(toString(tolower(answer2.2)), "61f6e")), "6bc85ade8c64f3af5efeb84cb3bd43b8"))
stopifnot("letters in string value of answer2.2 are correct but case is not correct"= setequal(digest(paste(toString(answer2.2), "61f6e")), "c1c86e0c3f56dcb8fda80d8ac19a5caa"))

print('Success!')

**Question 2.3**
<br> {points: 1}

To begin working with this data, read the file `departure_bay_temperature.csv` using a relative path. Note that this file (just like the avocado data set) is found within the `worksheet_03` directory. 
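
As a refresher on relative paths, the sketch below writes a tiny CSV file and reads it back. The folder `example_data`, the file `scores.csv` and its contents are all hypothetical and exist only so the example can run on its own.

```r
library(tidyverse)

# Hypothetical folder and file, created here only so the example can run
dir.create("example_data", showWarnings = FALSE)
write_csv(tibble(id = c(1, 2), score = c(90, 85)), "example_data/scores.csv")

# The path is relative: it is resolved from the current working directory,
# not from the root of the file system
scores <- read_csv("example_data/scores.csv")
scores

# Remove the example folder again
unlink("example_data", recursive = TRUE)
```

A relative path never starts with `/` or a drive letter, so the same code keeps working when the whole project folder is moved to another computer.
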

*Assign your answer to an object called `sea_surface`.* 

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer
sea_surface

In [None]:
library(digest)
stopifnot("sea_surface should be a data frame"= setequal(digest(paste(toString('data.frame' %in% class(sea_surface)), "cd219")), "03e981103d8a2c08d303865bb3f71247"))
stopifnot("dimensions of sea_surface are not correct"= setequal(digest(paste(toString(dim(sea_surface)), "cd219")), "09e1a20e444414ed12fd937cfca0b2e2"))
stopifnot("column names of sea_surface are not correct"= setequal(digest(paste(toString(sort(colnames(sea_surface))), "cd219")), "9c50ed2006a1bdb4fa9d550d34b3f38d"))
stopifnot("types of columns in sea_surface are not correct"= setequal(digest(paste(toString(sort(unlist(sapply(sea_surface, class)))), "cd219")), "3534f579763596286a1f0dc4962103ad"))
stopifnot("values in one or more numerical columns in sea_surface are not correct"= setequal(digest(paste(toString(if (any(sapply(sea_surface, is.numeric))) sort(round(sapply(sea_surface[, sapply(sea_surface, is.numeric)], sum, na.rm = TRUE), 2)) else 0), "cd219")), "45f839df43452ab0fd78dfd8b8626116"))
stopifnot("values in one or more character columns in sea_surface are not correct"= setequal(digest(paste(toString(if (any(sapply(sea_surface, is.character))) sum(sapply(sea_surface[sapply(sea_surface, is.character)], function(x) length(unique(x)))) else 0), "cd219")), "617c582feb22712e057522fd92197732"))
stopifnot("values in one or more factor columns in sea_surface are not correct"= setequal(digest(paste(toString(if (any(sapply(sea_surface, is.factor))) sum(sapply(sea_surface[, sapply(sea_surface, is.factor)], function(col) length(unique(col)))) else 0), "cd219")), "617c582feb22712e057522fd92197732"))

print('Success!')

**Question 2.3.1**
<br> {points: 1}

The data loaded in Question 2.3 is not tidy. Which of the reasons listed below explain why?

A. There are NA's in the data set

B. The variable temperature is split across more than one column

C. Values for the variable month are stored as column names

D. A and C

E. B and C

F. All of the above

*Assign your answer to an object called `answer2.3.1`. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. `"F"`).*

In [None]:
# Replace the fail() with your answer. 

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
library(digest)
stopifnot("type of answer2.3.1 is not character"= setequal(digest(paste(toString(class(answer2.3.1)), "4f92a")), "16117d269a8c4fd9b923c6a1cf27fbd8"))
stopifnot("length of answer2.3.1 is not correct"= setequal(digest(paste(toString(length(answer2.3.1)), "4f92a")), "2b2caae526c216582500f44885a4d5c8"))
stopifnot("value of answer2.3.1 is not correct"= setequal(digest(paste(toString(tolower(answer2.3.1)), "4f92a")), "e0cbe6117ab6085270c86cd9d29993c4"))
stopifnot("letters in string value of answer2.3.1 are correct but case is not correct"= setequal(digest(paste(toString(answer2.3.1), "4f92a")), "517232d9117a02ef244b399f891ebd45"))

print('Success!')

**Question 2.4**
<br> {points: 1}

Given that `ggplot` expects tidy data, we need to convert our data into that format. To do this we will use the `pivot_longer()` function. We would like our data to end up looking like this:

| Year | Month | Temperature |
|------|-------|-------------|
| 1914 | Jan   | 7.2         |
| 1914 | Feb   | NA          |
| 1914 | Mar   | NA          |
| ...  | ...   | ...         |
| 2018 | Oct   | NA          |
| 2018 | Nov   | NA          |
| 2018 | Dec   | NA          |
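
To see what this kind of reshaping does, here is a minimal sketch of `pivot_longer()` on a small made-up data frame. The `wide_sales` data and the names `store`, `quarter` and `sales` are assumptions for illustration only.

```r
library(tidyverse)

# Hypothetical wide data: one row per store, one column per quarter
wide_sales <- tibble(
  store = c("A", "B"),
  Q1 = c(100, 80),
  Q2 = c(120, NA),
  Q3 = c(90, 95)
)

# The old column names (Q1, Q2, Q3) become values in the "quarter" column,
# and the cell contents move into the "sales" column
long_sales <- wide_sales |>
  pivot_longer(cols = Q1:Q3,
               names_to = "quarter",
               values_to = "sales")

long_sales
```

Two rows and three quarter columns become six rows, and missing values are kept as `NA` rather than dropped.
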


Fill in the `...` in the cell below. Copy and paste your finished answer and replace the `fail()`.

*Assign your answer to an object called `tidy_temp`.*

In [None]:
#... <- sea_surface |>
#    ...(cols = Jan:Dec, 
#                 names_to = "...", 
#                 values_to = "Temperature")

# your code here
fail() # No Answer - remove if you provide an answer
tidy_temp

In [None]:
library(digest)
stopifnot("tidy_temp should be a data frame"= setequal(digest(paste(toString('data.frame' %in% class(tidy_temp)), "d4ee1")), "f7a54bb102d15b9dda5cf7ad8eb72200"))
stopifnot("dimensions of tidy_temp are not correct"= setequal(digest(paste(toString(dim(tidy_temp)), "d4ee1")), "ed048e82747c0238d038a3dd0a9be714"))
stopifnot("column names of tidy_temp are not correct"= setequal(digest(paste(toString(sort(colnames(tidy_temp))), "d4ee1")), "235c884206916b5a58e4ddb2af551723"))
stopifnot("types of columns in tidy_temp are not correct"= setequal(digest(paste(toString(sort(unlist(sapply(tidy_temp, class)))), "d4ee1")), "10dff466b894f72ba0a95bafd370413d"))
stopifnot("values in one or more numerical columns in tidy_temp are not correct"= setequal(digest(paste(toString(if (any(sapply(tidy_temp, is.numeric))) sort(round(sapply(tidy_temp[, sapply(tidy_temp, is.numeric)], sum, na.rm = TRUE), 2)) else 0), "d4ee1")), "9ba9c3a486e7d7730835165daae4d4f8"))
stopifnot("values in one or more character columns in tidy_temp are not correct"= setequal(digest(paste(toString(if (any(sapply(tidy_temp, is.character))) sum(sapply(tidy_temp[sapply(tidy_temp, is.character)], function(x) length(unique(x)))) else 0), "d4ee1")), "734fbddbbd300ac91297168d1178ba08"))
stopifnot("values in one or more factor columns in tidy_temp are not correct"= setequal(digest(paste(toString(if (any(sapply(tidy_temp, is.factor))) sum(sapply(tidy_temp[, sapply(tidy_temp, is.factor)], function(col) length(unique(col)))) else 0), "d4ee1")), "8c95576a29f0ee3e769bce9086fbfe8e"))

print('Success!')

**Question 2.5**
<br> {points: 1}

Now that we have our data in a tidy format, we can create our plot that compares the average monthly sea surface temperatures (in degrees Celsius) to the year they were recorded. To make our plots more informative, we should plot each month separately. We can use `filter` to do this before we pipe our data into the `ggplot` function. Let's start out by just plotting the data for the month of November. As usual, use proper English to label your axes :)

*Assign your answer to an object called `nov_temp_plot`.*

> Hint: don't forget to include the units for temperature in your data visualization.
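
The sketch below shows the general idea of filtering rows before piping into `ggplot`, using a made-up data frame. The `rainfall` data and its columns `year`, `season` and `rain_mm` are hypothetical.

```r
library(tidyverse)

# Hypothetical toy data: rainfall per year in two seasons
rainfall <- tibble(
  year = c(2000, 2001, 2002, 2000, 2001, 2002),
  season = c("wet", "wet", "wet", "dry", "dry", "dry"),
  rain_mm = c(310, 295, 330, 40, 55, 38)
)

# filter() keeps only the rows for one season, so the plot
# is drawn from that subset alone
wet_plot <- rainfall |>
  filter(season == "wet") |>
  ggplot(aes(x = year, y = rain_mm)) +
  geom_point() +
  xlab("Year") +
  ylab("Rainfall (mm)") +
  theme(text = element_text(size = 20))

wet_plot
```

Note the switch from `|>` to `+` once the data reaches `ggplot`: the pipe passes data between functions, while `+` adds layers to a plot.
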

In [None]:
options(repr.plot.width = 12, repr.plot.height = 7)
#... <- ... |> 
#    filter(... == ...) |> 
#    ggplot(aes(x = ..., y = ...)) + 
#    geom_point() + 
#    xlab(...) + 
#    ylab(...) +
#    theme(text = element_text(size=20))



# your code here
fail() # No Answer - remove if you provide an answer
nov_temp_plot

In [None]:
library(digest)
stopifnot("type of plot is not correct (if you are using two types of geoms, try flipping the order of the geom objects!)"= setequal(digest(paste(toString(sapply(seq_len(length(nov_temp_plot$layers)), function(i) {c(class(nov_temp_plot$layers[[i]]$geom))[1]})), "e8be1")), "3fcd5bcab77921722db0969fb9ccffc7"))
stopifnot("variable x is not correct"= setequal(digest(paste(toString(unlist(lapply(sapply(seq_len(length(nov_temp_plot$layers)), function(i) {rlang::get_expr(c(nov_temp_plot$layers[[i]]$mapping, nov_temp_plot$mapping)$x)}), as.character))), "e8be1")), "388b0bf8d2094c6c5163505ef0c4b0f5"))
stopifnot("variable y is not correct"= setequal(digest(paste(toString(unlist(lapply(sapply(seq_len(length(nov_temp_plot$layers)), function(i) {rlang::get_expr(c(nov_temp_plot$layers[[i]]$mapping, nov_temp_plot$mapping)$y)}), as.character))), "e8be1")), "b01f874eff0424ee420000cf1ccf02cc"))
stopifnot("x-axis label is not descriptive, nicely formatted, or human readable"= setequal(digest(paste(toString(rlang::get_expr(c(nov_temp_plot$layers[[1]]$mapping, nov_temp_plot$mapping)$x)!= nov_temp_plot$labels$x), "e8be1")), "60caf9b344899be9e331234b699226c9"))
stopifnot("y-axis label is not descriptive, nicely formatted, or human readable"= setequal(digest(paste(toString(rlang::get_expr(c(nov_temp_plot$layers[[1]]$mapping, nov_temp_plot$mapping)$y)!= nov_temp_plot$labels$y), "e8be1")), "60caf9b344899be9e331234b699226c9"))
stopifnot("incorrect colour variable in nov_temp_plot, specify a correct one if required"= setequal(digest(paste(toString(rlang::get_expr(c(nov_temp_plot$layers[[1]]$mapping, nov_temp_plot$mapping)$colour)), "e8be1")), "1123479006c1d39475c95329c9cdb64e"))
stopifnot("incorrect shape variable in nov_temp_plot, specify a correct one if required"= setequal(digest(paste(toString(rlang::get_expr(c(nov_temp_plot$layers[[1]]$mapping, nov_temp_plot$mapping)$shape)), "e8be1")), "1123479006c1d39475c95329c9cdb64e"))
stopifnot("the colour label in nov_temp_plot is not descriptive, nicely formatted, or human readable"= setequal(digest(paste(toString(rlang::get_expr(c(nov_temp_plot$layers[[1]]$mapping, nov_temp_plot$mapping)$colour) != nov_temp_plot$labels$colour), "e8be1")), "1123479006c1d39475c95329c9cdb64e"))
stopifnot("the shape label in nov_temp_plot is not descriptive, nicely formatted, or human readable"= setequal(digest(paste(toString(rlang::get_expr(c(nov_temp_plot$layers[[1]]$mapping, nov_temp_plot$mapping)$colour) != nov_temp_plot$labels$shape), "e8be1")), "1123479006c1d39475c95329c9cdb64e"))
stopifnot("fill variable in nov_temp_plot is not correct"= setequal(digest(paste(toString(quo_name(nov_temp_plot$mapping$fill)), "e8be1")), "79a1750ce0a99af94b0d31427600fe2e"))
stopifnot("fill label in nov_temp_plot is not informative"= setequal(digest(paste(toString((quo_name(nov_temp_plot$mapping$fill) != nov_temp_plot$labels$fill)), "e8be1")), "1123479006c1d39475c95329c9cdb64e"))
stopifnot("position argument in nov_temp_plot is not correct"= setequal(digest(paste(toString(class(nov_temp_plot$layers[[1]]$position)[1]), "e8be1")), "990ec0011638dbc946a7bf90c969906b"))

stopifnot("nov_temp_plot$data should be a data frame"= setequal(digest(paste(toString('data.frame' %in% class(nov_temp_plot$data)), "e8be2")), "07f94019294766d0eb87a3beadb5f973"))
stopifnot("dimensions of nov_temp_plot$data are not correct"= setequal(digest(paste(toString(dim(nov_temp_plot$data)), "e8be2")), "d7b05eef39b4f522c454ec48c68f8492"))
stopifnot("column names of nov_temp_plot$data are not correct"= setequal(digest(paste(toString(sort(colnames(nov_temp_plot$data))), "e8be2")), "3d03cf08af2d53c29768b267c1f89114"))
stopifnot("types of columns in nov_temp_plot$data are not correct"= setequal(digest(paste(toString(sort(unlist(sapply(nov_temp_plot$data, class)))), "e8be2")), "192bdd6368f61c7b4dda56babdc6d5a0"))
stopifnot("values in one or more numerical columns in nov_temp_plot$data are not correct"= setequal(digest(paste(toString(if (any(sapply(nov_temp_plot$data, is.numeric))) sort(round(sapply(nov_temp_plot$data[, sapply(nov_temp_plot$data, is.numeric)], sum, na.rm = TRUE), 2)) else 0), "e8be2")), "be005c4793c1443ed377c0625526f413"))
stopifnot("values in one or more character columns in nov_temp_plot$data are not correct"= setequal(digest(paste(toString(if (any(sapply(nov_temp_plot$data, is.character))) sum(sapply(nov_temp_plot$data[sapply(nov_temp_plot$data, is.character)], function(x) length(unique(x)))) else 0), "e8be2")), "194b18677347ea6bebcf3db3ff29ccd2"))
stopifnot("values in one or more factor columns in nov_temp_plot$data are not correct"= setequal(digest(paste(toString(if (any(sapply(nov_temp_plot$data, is.factor))) sum(sapply(nov_temp_plot$data[, sapply(nov_temp_plot$data, is.factor)], function(col) length(unique(col)))) else 0), "e8be2")), "f79f7f4a1456e465100f430454f861ee"))

print('Success!')

We can see that there may be a small decrease in colder temperatures in recent years, and/or the temperatures in recent years look less variable compared to years before 1975. What about other months? Let's plot them! 

Instead of repeating the code above for the 11 other months, we'll take advantage of a `ggplot2` function that we haven't met yet, `facet_wrap`. This function is used to create many plots side-by-side, wrapped around to new lines if there are too many plots. You tell `ggplot2` how to split up the plots by specifying the argument `facets = vars(...)`, where `...` represents the variable that is used to split the plots. We will learn more about this function next week; this week we will give you the code for it.
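
Here is a minimal sketch of `facet_wrap` on a made-up data frame, to show what the function does before you use it. The `rainfall` data and its column names are hypothetical.

```r
library(tidyverse)

# Hypothetical toy data: rainfall per year in two seasons
rainfall <- tibble(
  year = c(2000, 2001, 2002, 2000, 2001, 2002),
  season = c("wet", "wet", "wet", "dry", "dry", "dry"),
  rain_mm = c(310, 295, 330, 40, 55, 38)
)

# One panel is drawn per value of season; wrapping the variable in factor()
# with explicit levels sets the panel order (here "wet" first, then "dry")
# instead of the default alphabetical order
season_plot <- rainfall |>
  ggplot(aes(x = year, y = rain_mm)) +
  geom_point() +
  facet_wrap(facets = vars(factor(season, levels = c("wet", "dry")))) +
  xlab("Year") +
  ylab("Rainfall (mm)") +
  theme(text = element_text(size = 20))

season_plot
```

No `filter` step is needed here, because every row is assigned to the panel that matches its `season` value.
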

**Question 2.6**
<br> {points: 1}

Fill in the missing code below to plot the average monthly sea surface temperatures against the year they were recorded for all months. Assign your answer to an object called `all_temp_plot`.

> Hint: don't forget to include the units for temperature in your data visualization.

In [None]:
options(repr.plot.width = 14, repr.plot.height = 8)
#... <- ... |> 
#    ggplot(aes(x = ..., y = ...)) + 
#    geom_point() + 
#    facet_wrap(facets = vars(factor(Month, levels = c("Jan","Feb","Mar","Apr","May","Jun",
#                                          "Jul","Aug","Sep","Oct","Nov","Dec")))) +
#    xlab(...) + 
#    ylab(...) +
#    theme(text = element_text(size=20))


# your code here
fail() # No Answer - remove if you provide an answer
all_temp_plot

In [None]:
library(digest)
stopifnot("type of plot is not correct (if you are using two types of geoms, try flipping the order of the geom objects!)"= setequal(digest(paste(toString(sapply(seq_len(length(all_temp_plot$layers)), function(i) {c(class(all_temp_plot$layers[[i]]$geom))[1]})), "2e9fb")), "6801735ee7afdd227bff7ba91c296e79"))
stopifnot("variable x is not correct"= setequal(digest(paste(toString(unlist(lapply(sapply(seq_len(length(all_temp_plot$layers)), function(i) {rlang::get_expr(c(all_temp_plot$layers[[i]]$mapping, all_temp_plot$mapping)$x)}), as.character))), "2e9fb")), "f5e4909286ca29c56884ce1a14f0efa3"))
stopifnot("variable y is not correct"= setequal(digest(paste(toString(unlist(lapply(sapply(seq_len(length(all_temp_plot$layers)), function(i) {rlang::get_expr(c(all_temp_plot$layers[[i]]$mapping, all_temp_plot$mapping)$y)}), as.character))), "2e9fb")), "9fb2f1fcb76df00bc4a2dc71ec7c79d1"))
stopifnot("x-axis label is not descriptive, nicely formatted, or human readable"= setequal(digest(paste(toString(rlang::get_expr(c(all_temp_plot$layers[[1]]$mapping, all_temp_plot$mapping)$x)!= all_temp_plot$labels$x), "2e9fb")), "ee691d1c620095cabe3e137fffb33781"))
stopifnot("y-axis label is not descriptive, nicely formatted, or human readable"= setequal(digest(paste(toString(rlang::get_expr(c(all_temp_plot$layers[[1]]$mapping, all_temp_plot$mapping)$y)!= all_temp_plot$labels$y), "2e9fb")), "ee691d1c620095cabe3e137fffb33781"))
stopifnot("incorrect colour variable in all_temp_plot, specify a correct one if required"= setequal(digest(paste(toString(rlang::get_expr(c(all_temp_plot$layers[[1]]$mapping, all_temp_plot$mapping)$colour)), "2e9fb")), "68b934fe91ee10da2375ecb45145c82f"))
stopifnot("incorrect shape variable in all_temp_plot, specify a correct one if required"= setequal(digest(paste(toString(rlang::get_expr(c(all_temp_plot$layers[[1]]$mapping, all_temp_plot$mapping)$shape)), "2e9fb")), "68b934fe91ee10da2375ecb45145c82f"))
stopifnot("the colour label in all_temp_plot is not descriptive, nicely formatted, or human readable"= setequal(digest(paste(toString(rlang::get_expr(c(all_temp_plot$layers[[1]]$mapping, all_temp_plot$mapping)$colour) != all_temp_plot$labels$colour), "2e9fb")), "68b934fe91ee10da2375ecb45145c82f"))
stopifnot("the shape label in all_temp_plot is not descriptive, nicely formatted, or human readable"= setequal(digest(paste(toString(rlang::get_expr(c(all_temp_plot$layers[[1]]$mapping, all_temp_plot$mapping)$colour) != all_temp_plot$labels$shape), "2e9fb")), "68b934fe91ee10da2375ecb45145c82f"))
stopifnot("fill variable in all_temp_plot is not correct"= setequal(digest(paste(toString(quo_name(all_temp_plot$mapping$fill)), "2e9fb")), "f7f0fd8f97dbfbea712e50eaa48aa061"))
stopifnot("fill label in all_temp_plot is not informative"= setequal(digest(paste(toString((quo_name(all_temp_plot$mapping$fill) != all_temp_plot$labels$fill)), "2e9fb")), "68b934fe91ee10da2375ecb45145c82f"))
stopifnot("position argument in all_temp_plot is not correct"= setequal(digest(paste(toString(class(all_temp_plot$layers[[1]]$position)[1]), "2e9fb")), "a5c1caed42fca696443a433987ef6661"))

stopifnot("all_temp_plot$data should be a data frame"= setequal(digest(paste(toString('data.frame' %in% class(all_temp_plot$data)), "2e9fc")), "5f47a0f51a2060fbfbc0e54eb31a7704"))
stopifnot("dimensions of all_temp_plot$data are not correct"= setequal(digest(paste(toString(dim(all_temp_plot$data)), "2e9fc")), "b9c48172efd93d4784c80fe79e760aed"))
stopifnot("column names of all_temp_plot$data are not correct"= setequal(digest(paste(toString(sort(colnames(all_temp_plot$data))), "2e9fc")), "45de861665f60751801bec21ec3f9595"))
stopifnot("types of columns in all_temp_plot$data are not correct"= setequal(digest(paste(toString(sort(unlist(sapply(all_temp_plot$data, class)))), "2e9fc")), "fe1351e63c172db0ba2342acbad04993"))
stopifnot("values in one or more numerical columns in all_temp_plot$data are not correct"= setequal(digest(paste(toString(if (any(sapply(all_temp_plot$data, is.numeric))) sort(round(sapply(all_temp_plot$data[, sapply(all_temp_plot$data, is.numeric)], sum, na.rm = TRUE), 2)) else 0), "2e9fc")), "bef6bf7bbc31af22a516fd4c1c5c59e8"))
stopifnot("values in one or more character columns in all_temp_plot$data are not correct"= setequal(digest(paste(toString(if (any(sapply(all_temp_plot$data, is.character))) sum(sapply(all_temp_plot$data[sapply(all_temp_plot$data, is.character)], function(x) length(unique(x)))) else 0), "2e9fc")), "35a1ad0429502227f86f2c4cbb085302"))
stopifnot("values in one or more factor columns in all_temp_plot$data are not correct"= setequal(digest(paste(toString(if (any(sapply(all_temp_plot$data, is.factor))) sum(sapply(all_temp_plot$data[, sapply(all_temp_plot$data, is.factor)], function(col) length(unique(col)))) else 0), "2e9fc")), "822875a2d695bbae5a499e9410a409ea"))

stopifnot("type of class(all_temp_plot$facet)[1] is not character"= setequal(digest(paste(toString(class(class(all_temp_plot$facet)[1])), "2e9fd")), "5ef62a586966b3183cc11ab9b1dd3299"))
stopifnot("length of class(all_temp_plot$facet)[1] is not correct"= setequal(digest(paste(toString(length(class(all_temp_plot$facet)[1])), "2e9fd")), "e43bde651b75f08801511c2edaa85f1f"))
stopifnot("value of class(all_temp_plot$facet)[1] is not correct"= setequal(digest(paste(toString(tolower(class(all_temp_plot$facet)[1])), "2e9fd")), "5630da010e7efc18fdb9cec8c6d1799a"))
stopifnot("letters in string value of class(all_temp_plot$facet)[1] are correct but case is not correct"= setequal(digest(paste(toString(class(all_temp_plot$facet)[1]), "2e9fd")), "16b789b1704b7483b60d3e47c3cb8f08"))

stopifnot("type of as.character(rlang::get_expr(all_temp_plot$facet$params$facets[[1]])[2]) is not character"= setequal(digest(paste(toString(class(as.character(rlang::get_expr(all_temp_plot$facet$params$facets[[1]])[2]))), "2e9fe")), "7e7c65bacb079833e32c967d107c0b40"))
stopifnot("length of as.character(rlang::get_expr(all_temp_plot$facet$params$facets[[1]])[2]) is not correct"= setequal(digest(paste(toString(length(as.character(rlang::get_expr(all_temp_plot$facet$params$facets[[1]])[2]))), "2e9fe")), "300c75e69ecd62fdf67ec668447544a2"))
stopifnot("value of as.character(rlang::get_expr(all_temp_plot$facet$params$facets[[1]])[2]) is not correct"= setequal(digest(paste(toString(tolower(as.character(rlang::get_expr(all_temp_plot$facet$params$facets[[1]])[2]))), "2e9fe")), "211f11c5c87bc650a344890f38adf439"))
stopifnot("letters in string value of as.character(rlang::get_expr(all_temp_plot$facet$params$facets[[1]])[2]) are correct but case is not correct"= setequal(digest(paste(toString(as.character(rlang::get_expr(all_temp_plot$facet$params$facets[[1]])[2])), "2e9fe")), "fe707add52c903537d763b02432991e9"))

print('Success!')

We can see above that some months show a small but general increase in temperature, whereas others do not. Likewise, some months show a change in variability and others do not. This makes it clear that if we are trying to understand temperature changes over time, we should keep data from different months separate.

## 3. Pollution in Madrid
We're working with a data set from Kaggle once again! [This data](https://www.kaggle.com/decide-soluciones/air-quality-madrid) was collected under the instructions of Madrid's City Council and is publicly available on their website. In recent years, high levels of pollution during certain dry periods have forced the authorities to take measures against the use of cars and have served as grounds for proposing certain regulations. This data includes daily and hourly measurements of air quality from 2001 to 2008. Pollutants are categorized based on their chemical properties.

There are a number of stations set up around Madrid, and each station's data frame contains all particle measurements that the station registered from 01/2001 to 04/2008. Not every station has the same equipment, so each station can measure only a certain subset of particles. The complete list of possible measurements and their explanations, as given by the website, is:

- `SO_2`: sulphur dioxide level measured in μg/m³. High levels can produce irritation in the skin and membranes, and worsen asthma or heart diseases in sensitive groups.
- `CO`: carbon monoxide level measured in mg/m³. Carbon monoxide poisoning involves headaches, dizziness and confusion in short exposures and can result in loss of consciousness, arrhythmias, seizures or even death.
- `NO_2`: nitrogen dioxide level measured in μg/m³. Long-term exposure is a cause of chronic lung diseases, and nitrogen dioxide is also harmful to vegetation.
- `PM10`: particles smaller than 10 μm. Even though they cannot penetrate the alveolus, they can still penetrate through the lungs and affect other organs. Long term exposure can result in lung cancer and cardiovascular complications.
- `NOx`: nitrogen oxides level measured in μg/m³. These affect the human respiratory system, worsening asthma and other diseases, and are responsible for the yellowish-brown color of photochemical smog.
- `O_3`: ozone level measured in μg/m³. High levels can produce asthma, bronchitis or other chronic pulmonary diseases in sensitive groups or outdoor workers.
- `TOL`: toluene (methylbenzene) level measured in μg/m³. Long-term exposure to this substance (present in tobacco smoke as well) can result in kidney complications or permanent brain damage.
- `BEN`: benzene level measured in μg/m³. Benzene is an eye and skin irritant, and long exposures may result in several types of cancer, leukaemia and anaemias. Benzene is considered a group 1 carcinogen (carcinogenic to humans).
- `EBE`: ethylbenzene level measured in μg/m³. Long term exposure can cause hearing or kidney problems and the IARC has concluded that long-term exposure can produce cancer.
- `MXY`: m-xylene level measured in μg/m³. Xylenes can affect not only air but also water and soil, and a long exposure to high levels of xylenes can result in diseases affecting the liver, kidney and nervous system.
- `PXY`: p-xylene level measured in μg/m³. See MXY for xylene exposure effects on health.
- `OXY`: o-xylene level measured in μg/m³. See MXY for xylene exposure effects on health.
- `TCH`: total hydrocarbons level measured in mg/m³. This group of substances can be responsible for different blood, immune system, liver, spleen, kidney or lung diseases.
- `NMHC`: non-methane hydrocarbons (volatile organic compounds) level measured in mg/m³. Long exposure to some of these substances can result in damage to the liver, kidney, and central nervous system. Some of them are suspected to cause cancer in humans.

The goal of this assignment is to see whether pollutants are decreasing (that is, whether air quality is improving) and to find which pollutant decreased the most over the span of 5 years (2001 - 2006). 
1. First, plot one of the pollutants (EBE). 
2. Next, group it by month and year; calculate the maximum value and plot it (to see the trend through time). 
3. Now we will look at which pollutant decreased the most. Repeat the same thing for every column; to speed up the process, use one of `purrr`'s `map*` functions. First we will look at pollution in 2001 (get the maximum value for each of the pollutants), and then do the same for 2006. 

**Question 3.1** Multiple Choice: 
<br> {points: 1}

What big picture question are we trying to answer?

A. Did EBE decrease in Madrid between 2001 and 2006?

B. Of all the pollutants, which decreased the most between 2001 and 2006? 

C. Of all the pollutants, which decreased the least between 2001 and 2006?

D. Did EBE increase in Madrid between 2001 and 2006?

*Assign your answer to an object called `answer3.1`. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. `"F"`).*

In [None]:
# Replace the fail() with your answer. 

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
library(digest)
stopifnot("type of answer3.1 is not character"= setequal(digest(paste(toString(class(answer3.1)), "3cd19")), "2ff844da441ebf8b9897cd989aca2e07"))
stopifnot("length of answer3.1 is not correct"= setequal(digest(paste(toString(length(answer3.1)), "3cd19")), "57a33d49e142fb3400bf7b748b5643ce"))
stopifnot("value of answer3.1 is not correct"= setequal(digest(paste(toString(tolower(answer3.1)), "3cd19")), "a1fec650e52e5bea41449e12ff8b7110"))
stopifnot("letters in string value of answer3.1 are correct but case is not correct"= setequal(digest(paste(toString(answer3.1), "3cd19")), "194af98ce3c384a9c4d6828ef732513d"))

print('Success!')

**Question 3.2** 
<br> {points: 1}

To begin working with this data, read the file `madrid_pollution.csv`. Note that this file (just like the avocado and sea surface data sets) is found in the `worksheet_wrangling` directory. 
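
As a reminder of how reading a comma-separated file works, here is a minimal sketch on a made-up file. The temporary file, its columns, its values, and the name `toy_readings` are all hypothetical and have nothing to do with the real Madrid file.

```r
library(tidyverse)

# Hypothetical toy file, written to a temporary location for illustration only
toy_path <- tempfile(fileext = ".csv")
write_lines(
    c("date,EBE,CO", "2001-01-01,1.5,0.4", "2001-01-02,2.0,0.6"),
    toy_path
)

# read_csv() takes the path to the file and guesses each column's type
toy_readings <- read_csv(toy_path, show_col_types = FALSE)
toy_readings
```

The same call should work with a relative path when the file sits in the directory the notebook is running from.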

*Assign your answer to an object called `madrid`.* 

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer
madrid

In [None]:
library(digest)
stopifnot("madrid should be a data frame"= setequal(digest(paste(toString('data.frame' %in% class(madrid)), "eecc")), "bf08e39a99b754134a55b7d66af3c2ca"))
stopifnot("dimensions of madrid are not correct"= setequal(digest(paste(toString(dim(madrid)), "eecc")), "0c7bee8b5d0a9c8750998d4bf868a3a9"))
stopifnot("column names of madrid are not correct"= setequal(digest(paste(toString(sort(colnames(madrid))), "eecc")), "fc1b1098aa1102f6727e40a5f36fc14f"))
stopifnot("types of columns in madrid are not correct"= setequal(digest(paste(toString(sort(unlist(sapply(madrid, class)))), "eecc")), "7a5db15bbaf7667f5f7eca77b38fd00a"))
stopifnot("values in one or more numerical columns in madrid are not correct"= setequal(digest(paste(toString(if (any(sapply(madrid, is.numeric))) sort(round(sapply(madrid[, sapply(madrid, is.numeric)], sum, na.rm = TRUE), 2)) else 0), "eecc")), "a59bf42b4fa32c5ac731e213629947a3"))
stopifnot("values in one or more character columns in madrid are not correct"= setequal(digest(paste(toString(if (any(sapply(madrid, is.character))) sum(sapply(madrid[sapply(madrid, is.character)], function(x) length(unique(x)))) else 0), "eecc")), "4adb5d52dc071bf291e3dbcfa0b84a2a"))
stopifnot("values in one or more factor columns in madrid are not correct"= setequal(digest(paste(toString(if (any(sapply(madrid, is.factor))) sum(sapply(madrid[, sapply(madrid, is.factor)], function(col) length(unique(col)))) else 0), "eecc")), "31783b1799a49661c4c8f65a961503c6"))

print('Success!')

**Question 3.3**
<br> {points: 1}

Now that the data is loaded in R, create a scatter plot that compares ethylbenzene (`EBE`) values against the date they were recorded. This graph will showcase the concentration of ethylbenzene in Madrid over time. As usual, label your axes: 

- x = Date
- y = Ethylbenzene (μg/m³)

*Assign your answer to an object called `EBE_pollution`.*

In [None]:
options(repr.plot.width = 13, repr.plot.height = 7)


# your code here
fail() # No Answer - remove if you provide an answer
EBE_pollution

# Are levels increasing or decreasing?

In [None]:
library(digest)
stopifnot("type of plot is not correct (if you are using two types of geoms, try flipping the order of the geom objects!)"= setequal(digest(paste(toString(sapply(seq_len(length(EBE_pollution$layers)), function(i) {c(class(EBE_pollution$layers[[i]]$geom))[1]})), "6095f")), "8b4ae05720eb3841de8b419cc6bd0cf7"))
stopifnot("variable x is not correct"= setequal(digest(paste(toString(unlist(lapply(sapply(seq_len(length(EBE_pollution$layers)), function(i) {rlang::get_expr(c(EBE_pollution$layers[[i]]$mapping, EBE_pollution$mapping)$x)}), as.character))), "6095f")), "6296b4124e6cccb464b2dd0083599a51"))
stopifnot("variable y is not correct"= setequal(digest(paste(toString(unlist(lapply(sapply(seq_len(length(EBE_pollution$layers)), function(i) {rlang::get_expr(c(EBE_pollution$layers[[i]]$mapping, EBE_pollution$mapping)$y)}), as.character))), "6095f")), "fe764e35d04862de7e4fceb8b14565fa"))
stopifnot("x-axis label is not descriptive, nicely formatted, or human readable"= setequal(digest(paste(toString(rlang::get_expr(c(EBE_pollution$layers[[1]]$mapping, EBE_pollution$mapping)$x)!= EBE_pollution$labels$x), "6095f")), "c5e7ff1afd92d4fdbe4748e8d2efccb0"))
stopifnot("y-axis label is not descriptive, nicely formatted, or human readable"= setequal(digest(paste(toString(rlang::get_expr(c(EBE_pollution$layers[[1]]$mapping, EBE_pollution$mapping)$y)!= EBE_pollution$labels$y), "6095f")), "c5e7ff1afd92d4fdbe4748e8d2efccb0"))
stopifnot("incorrect colour variable in EBE_pollution, specify a correct one if required"= setequal(digest(paste(toString(rlang::get_expr(c(EBE_pollution$layers[[1]]$mapping, EBE_pollution$mapping)$colour)), "6095f")), "8e7ff49e1c16b820db4e982fea313576"))
stopifnot("incorrect shape variable in EBE_pollution, specify a correct one if required"= setequal(digest(paste(toString(rlang::get_expr(c(EBE_pollution$layers[[1]]$mapping, EBE_pollution$mapping)$shape)), "6095f")), "8e7ff49e1c16b820db4e982fea313576"))
stopifnot("the colour label in EBE_pollution is not descriptive, nicely formatted, or human readable"= setequal(digest(paste(toString(rlang::get_expr(c(EBE_pollution$layers[[1]]$mapping, EBE_pollution$mapping)$colour) != EBE_pollution$labels$colour), "6095f")), "8e7ff49e1c16b820db4e982fea313576"))
stopifnot("the shape label in EBE_pollution is not descriptive, nicely formatted, or human readable"= setequal(digest(paste(toString(rlang::get_expr(c(EBE_pollution$layers[[1]]$mapping, EBE_pollution$mapping)$colour) != EBE_pollution$labels$shape), "6095f")), "8e7ff49e1c16b820db4e982fea313576"))
stopifnot("fill variable in EBE_pollution is not correct"= setequal(digest(paste(toString(quo_name(EBE_pollution$mapping$fill)), "6095f")), "ce3ba9cd313e1cc06270a4543dfae9db"))
stopifnot("fill label in EBE_pollution is not informative"= setequal(digest(paste(toString((quo_name(EBE_pollution$mapping$fill) != EBE_pollution$labels$fill)), "6095f")), "8e7ff49e1c16b820db4e982fea313576"))
stopifnot("position argument in EBE_pollution is not correct"= setequal(digest(paste(toString(class(EBE_pollution$layers[[1]]$position)[1]), "6095f")), "252908bf9450a38343056e614718ec88"))

stopifnot("EBE_pollution$data should be a data frame"= setequal(digest(paste(toString('data.frame' %in% class(EBE_pollution$data)), "60960")), "8f9a6f15fd3e2747c89241ff124c6a93"))
stopifnot("dimensions of EBE_pollution$data are not correct"= setequal(digest(paste(toString(dim(EBE_pollution$data)), "60960")), "5d46f643936f9d59eb14d9d8b5d1db4e"))
stopifnot("column names of EBE_pollution$data are not correct"= setequal(digest(paste(toString(sort(colnames(EBE_pollution$data))), "60960")), "950a9da33f3438be4e106e9f274ad294"))
stopifnot("types of columns in EBE_pollution$data are not correct"= setequal(digest(paste(toString(sort(unlist(sapply(EBE_pollution$data, class)))), "60960")), "2a27928dedddc970677fbccfe06607d1"))
stopifnot("values in one or more numerical columns in EBE_pollution$data are not correct"= setequal(digest(paste(toString(if (any(sapply(EBE_pollution$data, is.numeric))) sort(round(sapply(EBE_pollution$data[, sapply(EBE_pollution$data, is.numeric)], sum, na.rm = TRUE), 2)) else 0), "60960")), "2688160e9a2c8ce0c5839b855d8ac4d6"))
stopifnot("values in one or more character columns in EBE_pollution$data are not correct"= setequal(digest(paste(toString(if (any(sapply(EBE_pollution$data, is.character))) sum(sapply(EBE_pollution$data[sapply(EBE_pollution$data, is.character)], function(x) length(unique(x)))) else 0), "60960")), "780c706bab38abad2633c4f5612a68f1"))
stopifnot("values in one or more factor columns in EBE_pollution$data are not correct"= setequal(digest(paste(toString(if (any(sapply(EBE_pollution$data, is.factor))) sum(sapply(EBE_pollution$data[, sapply(EBE_pollution$data, is.factor)], function(col) length(unique(col)))) else 0), "60960")), "85b7ddcb28a8e8a3b8f83f1ef8a15981"))

print('Success!')

We can see from this plot that over time, there are fewer and fewer high (> 25 μg/m³) EBE values.

**Question 3.4**
<br> {points: 1}

The question above asks you to write code that visualizes all EBE recordings, which are taken every single hour of every day. Consequently, the graph consists of many points and appears densely plotted. In this question, we are going to clean up the graph and focus on the maximum EBE reading from each month. To further investigate whether this trend is changing over time, we will use `group_by` and `summarize` to create a new data set.
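
Before filling in the template, it may help to see the `group_by` + `summarize` pattern on a small made-up table. This is only a sketch: the tibble `toy_sales` and its columns `store`, `week`, and `sold` are hypothetical and unrelated to the Madrid data.

```r
library(tidyverse)

# Hypothetical toy data: several records per store per week, one of them missing
toy_sales <- tibble(
    store = c("A", "A", "A", "B", "B", "B"),
    week  = c(1, 1, 2, 1, 2, 2),
    sold  = c(10, NA, 8, 7, 9, 12)
)

# One output row per (store, week) pair, holding the largest non-missing value
toy_max <- toy_sales |>
    group_by(store, week) |>
    summarize(max_sold = max(sold, na.rm = TRUE))
toy_max
```

With two grouping columns, `summarize` collapses each combination into a single row, so six records become four. The `na.rm = TRUE` argument makes `max` skip missing values instead of returning `NA`.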

Fill in the `...` in the cell below. Copy and paste your finished answer and replace the `fail()`.

*Assign your answer to an object called `madrid_pollution`.*

In [None]:
# ... <- ... |>
#     group_by(year, ...) |>
#     ...(max_ebe = max(EBE, na.rm = TRUE))

# your code here
fail() # No Answer - remove if you provide an answer
madrid_pollution

In [None]:
library(digest)
stopifnot("madrid_pollution should be a data frame"= setequal(digest(paste(toString('data.frame' %in% class(madrid_pollution)), "f24e2")), "4a5c864ce4a239eea5b81bae86194e4a"))
stopifnot("dimensions of madrid_pollution are not correct"= setequal(digest(paste(toString(dim(madrid_pollution)), "f24e2")), "fc3887a7c21795cf11f4081d3a44016f"))
stopifnot("column names of madrid_pollution are not correct"= setequal(digest(paste(toString(sort(colnames(madrid_pollution))), "f24e2")), "8e53f3d20b826e41824f52eea67508b2"))
stopifnot("types of columns in madrid_pollution are not correct"= setequal(digest(paste(toString(sort(unlist(sapply(madrid_pollution, class)))), "f24e2")), "e37fbd2a6aa5e816890dad26ccda3c62"))
stopifnot("values in one or more numerical columns in madrid_pollution are not correct"= setequal(digest(paste(toString(if (any(sapply(madrid_pollution, is.numeric))) sort(round(sapply(madrid_pollution[, sapply(madrid_pollution, is.numeric)], sum, na.rm = TRUE), 2)) else 0), "f24e2")), "66b8fcd49fac29b1105a0f1179b34270"))
stopifnot("values in one or more character columns in madrid_pollution are not correct"= setequal(digest(paste(toString(if (any(sapply(madrid_pollution, is.character))) sum(sapply(madrid_pollution[sapply(madrid_pollution, is.character)], function(x) length(unique(x)))) else 0), "f24e2")), "9aabeace0ba955c2d749e92236cb2bc0"))
stopifnot("values in one or more factor columns in madrid_pollution are not correct"= setequal(digest(paste(toString(if (any(sapply(madrid_pollution, is.factor))) sum(sapply(madrid_pollution[, sapply(madrid_pollution, is.factor)], function(col) length(unique(col)))) else 0), "f24e2")), "b56b287ff2f1f37e3e43faf298acbfb1"))

print('Success!')

**Question 3.5**
<br> {points: 1}

Plot the new maximum EBE values versus the month they were recorded, split into side-by-side plots for each year. Again, we will use faceting (this time with `facet_grid`, more on this next week) to plot each year side-by-side. We will also use the `theme` function to rotate the axis labels to make them more readable (more on this is coming next week too!).
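
As a rough illustration of `facet_grid` and label rotation, the sketch below builds a small faceted scatter plot from made-up numbers. The tibble `toy_monthly`, its columns, and the axis label text are hypothetical.

```r
library(tidyverse)

# Hypothetical toy data: one value per month for two years
toy_monthly <- tibble(
    year  = c(2001, 2001, 2001, 2002, 2002, 2002),
    month = c("Jan", "Feb", "Mar", "Jan", "Feb", "Mar"),
    peak  = c(30, 25, 28, 22, 20, 21)
)

toy_plot <- toy_monthly |>
    ggplot(aes(x = month, y = peak)) +
    geom_point() +
    xlab("Month") +
    ylab("Peak reading (arbitrary units)") +
    facet_grid(~ year) +  # one panel per year, laid out as columns
    theme(axis.text.x = element_text(angle = 90, hjust = 1))  # rotate x labels
toy_plot
```

Writing `facet_grid(~ year)` puts the faceting variable on the right-hand side of the formula, which places the panels side by side. Note that a character column such as `month` is ordered alphabetically along the axis in this sketch.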

*Assign your answer to an object called `madrid_plot`. Remember to label your axes.*

In [None]:
#... <- ... |>
#    ggplot(aes(x = ..., y = ...)) + 
#    geom_point() +
#    xlab(...) + 
#    ylab(...) +
#    facet_grid(~ year) +
#    theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
#    theme(text = element_text(size=20))

# your code here
fail() # No Answer - remove if you provide an answer
madrid_plot

In [None]:
library(digest)
stopifnot("type of plot is not correct (if you are using two types of geoms, try flipping the order of the geom objects!)"= setequal(digest(paste(toString(sapply(seq_len(length(madrid_plot$layers)), function(i) {c(class(madrid_plot$layers[[i]]$geom))[1]})), "883f6")), "3ae3012ead12f98380254c057505ef32"))
stopifnot("variable x is not correct"= setequal(digest(paste(toString(unlist(lapply(sapply(seq_len(length(madrid_plot$layers)), function(i) {rlang::get_expr(c(madrid_plot$layers[[i]]$mapping, madrid_plot$mapping)$x)}), as.character))), "883f6")), "2e708840038ba78bfb437451fa0d394f"))
stopifnot("variable y is not correct"= setequal(digest(paste(toString(unlist(lapply(sapply(seq_len(length(madrid_plot$layers)), function(i) {rlang::get_expr(c(madrid_plot$layers[[i]]$mapping, madrid_plot$mapping)$y)}), as.character))), "883f6")), "f1e4131c08b96579f07a0696c76b1bd5"))
stopifnot("x-axis label is not descriptive, nicely formatted, or human readable"= setequal(digest(paste(toString(rlang::get_expr(c(madrid_plot$layers[[1]]$mapping, madrid_plot$mapping)$x)!= madrid_plot$labels$x), "883f6")), "c96b1aa9b1536588ffa2053e21262382"))
stopifnot("y-axis label is not descriptive, nicely formatted, or human readable"= setequal(digest(paste(toString(rlang::get_expr(c(madrid_plot$layers[[1]]$mapping, madrid_plot$mapping)$y)!= madrid_plot$labels$y), "883f6")), "c96b1aa9b1536588ffa2053e21262382"))
stopifnot("incorrect colour variable in madrid_plot, specify a correct one if required"= setequal(digest(paste(toString(rlang::get_expr(c(madrid_plot$layers[[1]]$mapping, madrid_plot$mapping)$colour)), "883f6")), "82f640173a3294dbcda0387fdd360362"))
stopifnot("incorrect shape variable in madrid_plot, specify a correct one if required"= setequal(digest(paste(toString(rlang::get_expr(c(madrid_plot$layers[[1]]$mapping, madrid_plot$mapping)$shape)), "883f6")), "82f640173a3294dbcda0387fdd360362"))
stopifnot("the colour label in madrid_plot is not descriptive, nicely formatted, or human readable"= setequal(digest(paste(toString(rlang::get_expr(c(madrid_plot$layers[[1]]$mapping, madrid_plot$mapping)$colour) != madrid_plot$labels$colour), "883f6")), "82f640173a3294dbcda0387fdd360362"))
stopifnot("the shape label in madrid_plot is not descriptive, nicely formatted, or human readable"= setequal(digest(paste(toString(rlang::get_expr(c(madrid_plot$layers[[1]]$mapping, madrid_plot$mapping)$colour) != madrid_plot$labels$shape), "883f6")), "82f640173a3294dbcda0387fdd360362"))
stopifnot("fill variable in madrid_plot is not correct"= setequal(digest(paste(toString(quo_name(madrid_plot$mapping$fill)), "883f6")), "e9c6c2b25e01cfb97ac94adf431a8fc6"))
stopifnot("fill label in madrid_plot is not informative"= setequal(digest(paste(toString((quo_name(madrid_plot$mapping$fill) != madrid_plot$labels$fill)), "883f6")), "82f640173a3294dbcda0387fdd360362"))
stopifnot("position argument in madrid_plot is not correct"= setequal(digest(paste(toString(class(madrid_plot$layers[[1]]$position)[1]), "883f6")), "eae6f6bec45519f4bb52190ab4d2f98f"))

stopifnot("select(madrid_plot$data, -'.group') should be a data frame"= setequal(digest(paste(toString('data.frame' %in% class(select(madrid_plot$data, -'.group'))), "883f7")), "e29563e376dcc1ba06b73c161608a33c"))
stopifnot("dimensions of select(madrid_plot$data, -'.group') are not correct"= setequal(digest(paste(toString(dim(select(madrid_plot$data, -'.group'))), "883f7")), "801ed7bb0d8497645b39eb0aca6784e3"))
stopifnot("column names of select(madrid_plot$data, -'.group') are not correct"= setequal(digest(paste(toString(sort(colnames(select(madrid_plot$data, -'.group')))), "883f7")), "39c24e46e2bba13bc3037a13ef1d628d"))
stopifnot("types of columns in select(madrid_plot$data, -'.group') are not correct"= setequal(digest(paste(toString(sort(unlist(sapply(select(madrid_plot$data, -'.group'), class)))), "883f7")), "18fec193051357eaed2f70807bf110ed"))
stopifnot("values in one or more numerical columns in select(madrid_plot$data, -'.group') are not correct"= setequal(digest(paste(toString(if (any(sapply(select(madrid_plot$data, -'.group'), is.numeric))) sort(round(sapply(select(madrid_plot$data, -'.group')[, sapply(select(madrid_plot$data, -'.group'), is.numeric)], sum, na.rm = TRUE), 2)) else 0), "883f7")), "ed0b8ba1222886b0a6fa53c15ccfec10"))
stopifnot("values in one or more character columns in select(madrid_plot$data, -'.group') are not correct"= setequal(digest(paste(toString(if (any(sapply(select(madrid_plot$data, -'.group'), is.character))) sum(sapply(select(madrid_plot$data, -'.group')[sapply(select(madrid_plot$data, -'.group'), is.character)], function(x) length(unique(x)))) else 0), "883f7")), "351408f3b543eecbc277e2e6935d8168"))
stopifnot("values in one or more factor columns in select(madrid_plot$data, -'.group') are not correct"= setequal(digest(paste(toString(if (any(sapply(select(madrid_plot$data, -'.group'), is.factor))) sum(sapply(select(madrid_plot$data, -'.group')[, sapply(select(madrid_plot$data, -'.group'), is.factor)], function(col) length(unique(col)))) else 0), "883f7")), "19bdb63dd185b9bb1afaf34ea34530f6"))

stopifnot("type of class(madrid_plot$facet)[1] is not character"= setequal(digest(paste(toString(class(class(madrid_plot$facet)[1])), "883f8")), "9d8ac70e3c9c2112ba6819a87abda22f"))
stopifnot("length of class(madrid_plot$facet)[1] is not correct"= setequal(digest(paste(toString(length(class(madrid_plot$facet)[1])), "883f8")), "051f41d78225b1c1df8a92a7dfc5de5d"))
stopifnot("value of class(madrid_plot$facet)[1] is not correct"= setequal(digest(paste(toString(tolower(class(madrid_plot$facet)[1])), "883f8")), "20bf32141bf59a29c5d9901d3b7c3999"))
stopifnot("letters in string value of class(madrid_plot$facet)[1] are correct but case is not correct"= setequal(digest(paste(toString(class(madrid_plot$facet)[1]), "883f8")), "7354e2b1cf8a20284432a75fe5ab470f"))

stopifnot("type of as.character(rlang::get_expr(madrid_plot$facet$params$cols)) is not character"= setequal(digest(paste(toString(class(as.character(rlang::get_expr(madrid_plot$facet$params$cols)))), "883f9")), "c725733c57276438b5ef6996c0fd404d"))
stopifnot("length of as.character(rlang::get_expr(madrid_plot$facet$params$cols)) is not correct"= setequal(digest(paste(toString(length(as.character(rlang::get_expr(madrid_plot$facet$params$cols)))), "883f9")), "82cc2910f71608743b1cf81017b8b896"))
stopifnot("value of as.character(rlang::get_expr(madrid_plot$facet$params$cols)) is not correct"= setequal(digest(paste(toString(tolower(as.character(rlang::get_expr(madrid_plot$facet$params$cols)))), "883f9")), "dd866e9972c2fe99568a6d0e08f694b9"))
stopifnot("letters in string value of as.character(rlang::get_expr(madrid_plot$facet$params$cols)) are correct but case is not correct"= setequal(digest(paste(toString(as.character(rlang::get_expr(madrid_plot$facet$params$cols))), "883f9")), "dd866e9972c2fe99568a6d0e08f694b9"))

print('Success!')

**Question 3.6**
<br> {points: 1}

Now we want to see which of the pollutants has decreased the most. Therefore, we must repeat the same thing that we did in the questions above but for every pollutant (using the original data set)! This is where `purrr`'s `map*` functions can be really helpful! 

First we will look at Madrid pollution in 2001 (filter for this year). Next, use `select` to exclude the columns that are not pollutants (such as the date). Lastly, use the `map_dfr()` function to compute the maximum value of every remaining column.
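
To see what `map_dfr()` does with a function such as `max`, consider this minimal sketch on a made-up tibble. The name `toy_wide` and its columns `a`, `b`, and `c` are hypothetical.

```r
library(tidyverse)

# Hypothetical toy data with some missing values
toy_wide <- tibble(
    a = c(1, 5, NA),
    b = c(2, 0, 4),
    c = c(9, NA, 3)
)

# Apply max() to every column; extra arguments (na.rm) are passed on to max()
toy_col_max <- toy_wide |>
    map_dfr(max, na.rm = TRUE)
toy_col_max
```

The result is a data frame with one row and one column per original column. Recent versions of `purrr` mark `map_dfr()` as superseded, but it still works and is the function this worksheet uses.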

Fill in the `...` in the cell below. Copy and paste your finished answer and replace the `fail()`.

*Assign your answer to an object called `pollution_2001`.*

In [None]:
# ... <- madrid |>
#     ...(year == 2001) |>
#     select(-..., -year, -mnth) |>
#     map_dfr(..., na.rm  = TRUE)

# your code here
fail() # No Answer - remove if you provide an answer
pollution_2001

In [None]:
library(digest)
stopifnot("pollution_2001 should be a data frame"= setequal(digest(paste(toString('data.frame' %in% class(pollution_2001)), "a0d65")), "8c7b3ad4465120adad0d07d26a67bb33"))
stopifnot("dimensions of pollution_2001 are not correct"= setequal(digest(paste(toString(dim(pollution_2001)), "a0d65")), "b9de569cf68ea12661dd9a33b2452e39"))
stopifnot("column names of pollution_2001 are not correct"= setequal(digest(paste(toString(sort(colnames(pollution_2001))), "a0d65")), "48a287b116f8f4862733de30a6c3224a"))
stopifnot("types of columns in pollution_2001 are not correct"= setequal(digest(paste(toString(sort(unlist(sapply(pollution_2001, class)))), "a0d65")), "421ce48ea7eb9054213ae6e4950cb271"))
stopifnot("values in one or more numerical columns in pollution_2001 are not correct"= setequal(digest(paste(toString(if (any(sapply(pollution_2001, is.numeric))) sort(round(sapply(pollution_2001[, sapply(pollution_2001, is.numeric)], sum, na.rm = TRUE), 2)) else 0), "a0d65")), "4b6c356965014142832d66d5d4ffa65a"))
stopifnot("values in one or more character columns in pollution_2001 are not correct"= setequal(digest(paste(toString(if (any(sapply(pollution_2001, is.character))) sum(sapply(pollution_2001[sapply(pollution_2001, is.character)], function(x) length(unique(x)))) else 0), "a0d65")), "8c378565ebf1c23f3b260e7c5479f54e"))
stopifnot("values in one or more factor columns in pollution_2001 are not correct"= setequal(digest(paste(toString(if (any(sapply(pollution_2001, is.factor))) sum(sapply(pollution_2001[, sapply(pollution_2001, is.factor)], function(col) length(unique(col)))) else 0), "a0d65")), "8c378565ebf1c23f3b260e7c5479f54e"))

print('Success!')

**Question 3.7**
<br> {points: 1}

Now repeat what you did for Question 3.6, but filter for 2006 instead. 

*Assign your answer to an object called `pollution_2006`.*

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer
pollution_2006

In [None]:
library(digest)
stopifnot("pollution_2006 should be a data frame"= setequal(digest(paste(toString('data.frame' %in% class(pollution_2006)), "3abe1")), "1d213dba682a42049363da8c1af19229"))
stopifnot("dimensions of pollution_2006 are not correct"= setequal(digest(paste(toString(dim(pollution_2006)), "3abe1")), "c947ab3873bbd0627b6d6f3ecf6ad79e"))
stopifnot("column names of pollution_2006 are not correct"= setequal(digest(paste(toString(sort(colnames(pollution_2006))), "3abe1")), "d20731c55ba38a4e36f9ab02e36e6a6f"))
stopifnot("types of columns in pollution_2006 are not correct"= setequal(digest(paste(toString(sort(unlist(sapply(pollution_2006, class)))), "3abe1")), "ad745598e60b63ad4ed675e4003c888a"))
stopifnot("values in one or more numerical columns in pollution_2006 are not correct"= setequal(digest(paste(toString(if (any(sapply(pollution_2006, is.numeric))) sort(round(sapply(pollution_2006[, sapply(pollution_2006, is.numeric)], sum, na.rm = TRUE), 2)) else 0), "3abe1")), "66723a0f10f1eb54e170612927aabb36"))
stopifnot("values in one or more character columns in pollution_2006 are not correct"= setequal(digest(paste(toString(if (any(sapply(pollution_2006, is.character))) sum(sapply(pollution_2006[sapply(pollution_2006, is.character)], function(x) length(unique(x)))) else 0), "3abe1")), "91e455c0932bf066654afc6da4165407"))
stopifnot("values in one or more factor columns in pollution_2006 are not correct"= setequal(digest(paste(toString(if (any(sapply(pollution_2006, is.factor))) sum(sapply(pollution_2006[, sapply(pollution_2006, is.factor)], function(col) length(unique(col)))) else 0), "3abe1")), "91e455c0932bf066654afc6da4165407"))

print('Success!')

**Question 3.8** 
<br> {points: 1}

Which pollutant decreased by the greatest magnitude between 2001 and 2006? Given that the two objects you just created, `pollution_2001` and `pollution_2006`, are data frames with the same columns, you should be able to subtract one from the other to find which pollutant decreased by the greatest magnitude between the two years. 
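
As a minimal sketch of subtracting one single-row data frame from another, assume two made-up tibbles, `toy_before` and `toy_after` (hypothetical names and values, unrelated to the Madrid data):

```r
library(tidyverse)

# Hypothetical one-row summaries with identical column names
toy_before <- tibble(x = 10, y = 4, z = 20)
toy_after  <- tibble(x = 7,  y = 6, z = 12)

# Arithmetic on data frames with matching columns works column by column
toy_change <- toy_after - toy_before
toy_change
```

Subtracting the earlier values from the later ones means a negative entry is a decrease, so the most negative column (`z` in this toy case) is the one that dropped the most.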

*Assign your answer to an object called `answer3.8`. Make sure to write the answer exactly as it is given in the data set.* Example: 

```
answer3.8 <- "BEN"
```

In [None]:
# Replace the fail() with your answer. 

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
library(digest)
stopifnot("type of answer3.8 is not character"= setequal(digest(paste(toString(class(answer3.8)), "8a001")), "1025546ae79122d0e55189a92e4a2c6c"))
stopifnot("length of answer3.8 is not correct"= setequal(digest(paste(toString(length(answer3.8)), "8a001")), "0e4754d8a2e7db5e5493656589b62881"))
stopifnot("value of answer3.8 is not correct"= setequal(digest(paste(toString(tolower(answer3.8)), "8a001")), "2466181f05cbe2c2ac3f22b7f894f281"))
stopifnot("letters in string value of answer3.8 are correct but case is not correct"= setequal(digest(paste(toString(answer3.8), "8a001")), "226da6915272752fbc3124f7594fb30e"))

print('Success!')

**Question 3.9**
<br> {points: 1}

Given that there were only 14 columns in the data frame above, you could use your eyes to pick out which pollutant decreased by the greatest magnitude between 2001 and 2006. But what would you do if you had 100 columns? Or 1000 columns? It would take A LONG TIME for your human eyeballs to find the biggest difference. Maybe you could use the `min` function:

In [None]:
# run this cell
pollution_2006 - pollution_2001
min(pollution_2006 - pollution_2001)

This is a step in the right direction, but you get the value and not the column name... What are we to do? Tidy our data! Our data is not in tidy format, and so it's difficult to access the values for the variable pollutant because they are stuck as column headers. Let's use `pivot_longer` to tidy our data and make it look like this:

| pollutant | value  |
|-----------|--------|
| BEN       | -33.04 |
| CO        | -6.91  |
| ...       | ...    |
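
The reshaping step can be sketched on a made-up one-row data frame. The names `toy_change`, `item`, and `change` below are hypothetical; the worksheet's own answer must use the column names shown in the table above.

```r
library(tidyverse)

# Hypothetical wide data: one row, one column per item
toy_change <- tibble(x = -3, y = 2, z = -8)

# Move the column headers into an "item" column and the cells into "change"
toy_long <- toy_change |>
    pivot_longer(cols = everything(),
                 names_to = "item",
                 values_to = "change")
toy_long
```

After pivoting, each former column becomes its own row, so the column name and its value can be filtered and sorted like any other variable.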

To answer this question, fill in the `...` in the cell below. Copy and paste your finished answer and replace the `fail()`.

*Assign your answer to an object called `pollution_diff` and ensure it has the same column names as the table pictured above.*

In [None]:
pollution_diff  <- pollution_2006 - pollution_2001
#pollution_diff  <- ... |> 
#    pivot_longer(cols = everything(), 
#           names_to = ..., 
#           values_to = ...)

# your code here
fail() # No Answer - remove if you provide an answer
pollution_diff

In [None]:
library(digest)
stopifnot("pollution_diff should be a data frame"= setequal(digest(paste(toString('data.frame' %in% class(pollution_diff)), "4f361")), "f7e7cf86ee230b15c82e3f052f9dccde"))
stopifnot("dimensions of pollution_diff are not correct"= setequal(digest(paste(toString(dim(pollution_diff)), "4f361")), "cea49e45b9f80d6bf2814dcecd839305"))
stopifnot("column names of pollution_diff are not correct"= setequal(digest(paste(toString(sort(colnames(pollution_diff))), "4f361")), "34e89f3f61b10fd01cc9c123ef8b6eb8"))
stopifnot("types of columns in pollution_diff are not correct"= setequal(digest(paste(toString(sort(unlist(sapply(pollution_diff, class)))), "4f361")), "4f8d4bac04995fb7563082439479e471"))
stopifnot("values in one or more numerical columns in pollution_diff are not correct"= setequal(digest(paste(toString(if (any(sapply(pollution_diff, is.numeric))) sort(round(sapply(pollution_diff[, sapply(pollution_diff, is.numeric)], sum, na.rm = TRUE), 2)) else 0), "4f361")), "4743aa9ca278a519e43e1e0ac2013522"))
stopifnot("values in one or more character columns in pollution_diff are not correct"= setequal(digest(paste(toString(if (any(sapply(pollution_diff, is.character))) sum(sapply(pollution_diff[sapply(pollution_diff, is.character)], function(x) length(unique(x)))) else 0), "4f361")), "3a8fc11a87eadc4078e8ff6de08ed705"))
stopifnot("values in one or more factor columns in pollution_diff are not correct"= setequal(digest(paste(toString(if (any(sapply(pollution_diff, is.factor))) sum(sapply(pollution_diff[, sapply(pollution_diff, is.factor)], function(col) length(unique(col)))) else 0), "4f361")), "bfde140a5b1d41b4d0e746ad957ca5a1"))

print('Success!')

**Question 3.10**
<br> {points: 1}

Now that you have tidy data, you can use `arrange` and `desc` to order the data in descending order. Each element of the `value` column corresponds to the change in a pollutant, so the *largest decrease* in a pollutant is the *most negative entry*, i.e., the last row of the sorted data frame. Therefore, we can take the sorted data frame and pipe it to `tail` (with the argument `n = 1`) to return only the last row of the data frame.

(The function `tail` is just like `head`, except it returns the last rows of the data frame instead of the first rows.)
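
To see how sorting in descending order and then taking the last row picks out the most negative entry, here is a small sketch on made-up data. The `scores` data frame and its columns are hypothetical and are not part of this worksheet's data.

```r
library(dplyr)
library(tibble)

# Hypothetical tidy data frame with one negative value (made-up values)
scores <- tibble(player = c("ana", "ben", "cy"),
                 score = c(12, -4, 7))

# Sort from largest to smallest, then keep only the last row,
# which holds the smallest (most negative) score
lowest <- scores |>
    arrange(desc(score)) |>
    tail(n = 1)

lowest
```

Because the rows are ordered from largest to smallest, the last row returned by `tail(n = 1)` is the one with the smallest `score`, along with its `player` label.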

To answer this question, fill in the `...` in the cell below. Copy and paste your finished answer and replace the `fail()`.

*Assign your answer to an object called `max_pollution_diff`.*

In [None]:
#... <- ... |> arrange(desc(...)) |> 
#    tail(n = 1)

# your code here
fail() # No Answer - remove if you provide an answer
max_pollution_diff

In [None]:
library(digest)
stopifnot("max_pollution_diff should be a data frame"= setequal(digest(paste(toString('data.frame' %in% class(max_pollution_diff)), "887ff")), "d7e0c002c3ea5a40bbe0f12b3bf20efb"))
stopifnot("dimensions of max_pollution_diff are not correct"= setequal(digest(paste(toString(dim(max_pollution_diff)), "887ff")), "478b8816c29d0f150ef651ba358194f8"))
stopifnot("column names of max_pollution_diff are not correct"= setequal(digest(paste(toString(sort(colnames(max_pollution_diff))), "887ff")), "f80d140400972faec46b52d48417a78f"))
stopifnot("types of columns in max_pollution_diff are not correct"= setequal(digest(paste(toString(sort(unlist(sapply(max_pollution_diff, class)))), "887ff")), "6b15ac49ead826c24915e06b0a617b4b"))
stopifnot("values in one or more numerical columns in max_pollution_diff are not correct"= setequal(digest(paste(toString(if (any(sapply(max_pollution_diff, is.numeric))) sort(round(sapply(max_pollution_diff[, sapply(max_pollution_diff, is.numeric)], sum, na.rm = TRUE), 2)) else 0), "887ff")), "22ac2b16aa3fc92d3ad3ce547625a59f"))
stopifnot("values in one or more character columns in max_pollution_diff are not correct"= setequal(digest(paste(toString(if (any(sapply(max_pollution_diff, is.character))) sum(sapply(max_pollution_diff[sapply(max_pollution_diff, is.character)], function(x) length(unique(x)))) else 0), "887ff")), "8f0aeb96e0a15395009c014a1dbf4ffc"))
stopifnot("values in one or more factor columns in max_pollution_diff are not correct"= setequal(digest(paste(toString(if (any(sapply(max_pollution_diff, is.factor))) sum(sapply(max_pollution_diff[, sapply(max_pollution_diff, is.factor)], function(col) length(unique(col)))) else 0), "887ff")), "189259674f42b618febcba94793d82d4"))

print('Success!')

At the end of this data wrangling worksheet, we'll leave you with a couple of quotes to ponder:

> “Happy families are all alike; every unhappy family is unhappy in its own way.” –– Leo Tolstoy

> “Tidy datasets are all alike, but every messy dataset is messy in its own way.” –– Hadley Wickham

*source: [Tidy data chapter](https://r4ds.had.co.nz/tidy-data.html) from R for Data Science by Garrett Grolemund & Hadley Wickham*

In [None]:
source("cleanup.R")