# Tutorial: Reading in data locally and from the web

This worksheet covers the [Reading in data locally and from the web](https://datasciencebook.ca/reading.html) chapter of the online textbook, which also lists the learning objectives for this worksheet. You should read the textbook chapter before attempting this worksheet. 

In [None]:
### Run this cell before continuing. 
library(tidyverse)
library(repr)
library(rvest)
library(stringr)
library(janitor)
options(repr.matrix.max.rows = 6)
source("cleanup.R")

## 1. Happiness Report
As you might remember from `worksheet_reading`, we practised loading data from the *Sustainable Development Solutions Network's* [World Happiness Report](http://worldhappiness.report/). That data was the output of their analysis that calculated each country's happiness score and how much each variable contributed to it. In this tutorial, we are going to look at the data at an earlier stage of the study - the aggregated/averaged values (per country and year) for many different social and health aspects that the researchers anticipated might contribute to happiness (Table2.1 from [this Excel spreadsheet](https://github.com/UBC-DSCI/dsci-100-student/raw/refs/heads/master/data/reading/WHR2018Chapter2OnlineData.xls)).

The goal for today is to produce a plot of 2017's positive affect scores against healthy life expectancy at birth, with healthy life expectancy at birth on the x-axis and positive affect on the y-axis. For this study, positive affect was defined as the average of three positive affect measures: happiness, laughter and enjoyment. We would also like to convert the **positive affect score** from a scale of 0 - 1 to a scale from 0 - 10.

1. use `filter` to subset the rows where the year is equal to 2017
2. use `mutate` to convert the "Positive affect" score from a scale of 0 - 1 to a scale from 0 - 10
3. use `select` to choose the "Healthy life expectancy at birth" column and the scaled "Positive affect" column
4. use `ggplot` to create our plot of "Healthy life expectancy at birth" (x - axis) and scaled "Positive affect" (y - axis)
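
The data-wrangling steps above can be sketched on a small made-up table. The tibble `toy_scores` and its columns `year`, `score`, and `hours` are hypothetical stand-ins, not columns of the happiness data; the sketch only illustrates how `filter`, `mutate`, and `select` chain together.

```r
library(dplyr)

# Hypothetical toy data (not taken from the World Happiness Report)
toy_scores <- tibble(
  year = c(2016, 2017, 2017),
  score = c(0.2, 0.5, 0.9),  # measured on a 0 - 1 scale
  hours = c(10, 12, 15)
)

# Keep only the rows from 2017
toy_2017 <- filter(toy_scores, year == 2017)

# Add a new column that rescales the score from 0 - 1 to 0 - 10
toy_scaled <- mutate(toy_2017, score_scaled = score * 10)

# Keep only the two columns we would want to plot
toy_reduced <- select(toy_scaled, hours, score_scaled)
toy_reduced
```

Saving each intermediate result under its own name, as here, makes it easy to inspect the data after every step when something goes wrong.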

**Tips for success:** Try going through all of the steps on your own, but don't forget to discuss with others (classmates, TAs, or an instructor) if you get stuck. If something is wrong and you can't spot the issue, be sure to **read the error message carefully**. Since there are a lot of steps involved in working with data and modifying it, feel free to look back at `worksheet_reading`. 

**Question 1.1** Multiple Choice: 
<br> {points: 1}

What is the maximum value for the "Positive affect" score (in the original data file that you read into R)?

A. 100

B. 10 

C. 1

D. 0.1

E. 5

*Assign your answer to an object called `answer1.1`. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. `"F"`).*

In [None]:
# Replace the fail() with your answer. 

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
library(digest)
stopifnot("type of answer1.1 is not character"= setequal(digest(paste(toString(class(answer1.1)), "bc78e")), "891668d9c7d4d926493054d7b3445a62"))
stopifnot("length of answer1.1 is not correct"= setequal(digest(paste(toString(length(answer1.1)), "bc78e")), "bbac0636d3444d471d14e9bf24b14051"))
stopifnot("value of answer1.1 is not correct"= setequal(digest(paste(toString(tolower(answer1.1)), "bc78e")), "a735b3698bdda400f50ff6c16aec3481"))
stopifnot("letters in string value of answer1.1 are correct but case is not correct"= setequal(digest(paste(toString(answer1.1), "bc78e")), "264b63de5b0d7d8979e289cd4e380132"))

print('Success!')

**Question 1.2** Multiple Choice: 
<br> {points: 1}

Which column's values will be used to filter the data?

A. `countries`

B. `generosity`

C. `positive affect`

D. `year`

*Assign your answer to an object called `answer1.2`. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. `"F"`).*

In [None]:
# Replace the fail() with your answer. 

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
library(digest)
stopifnot("type of answer1.2 is not character"= setequal(digest(paste(toString(class(answer1.2)), "b78f7")), "dec8b3242fa51e80474413b89d0ca755"))
stopifnot("length of answer1.2 is not correct"= setequal(digest(paste(toString(length(answer1.2)), "b78f7")), "efd31e4e68112ddcd36a3cdca73cf4e9"))
stopifnot("value of answer1.2 is not correct"= setequal(digest(paste(toString(tolower(answer1.2)), "b78f7")), "bda630e73df807885af7f7dd53c97dd1"))
stopifnot("letters in string value of answer1.2 are correct but case is not correct"= setequal(digest(paste(toString(answer1.2), "b78f7")), "f51ec949c28e390ad9433cc59ad3f668"))

print('Success!')

**Question 1.3.0**
<br> {points: 1}

Use the appropriate `read_*` function to read in the `WHR2018Chapter2OnlineData` file (look in the `tutorial_02` directory to ensure you use the correct relative path to read it in).

_Assign the data frame to an object called `happy_df_csv`._

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer
happy_df_csv

In [None]:
library(digest)
stopifnot("happy_df_csv should be a data frame"= setequal(digest(paste(toString('data.frame' %in% class(happy_df_csv)), "76797")), "ed5c953a313da8cbce4439ed3daef511"))
stopifnot("dimensions of happy_df_csv are not correct"= setequal(digest(paste(toString(dim(happy_df_csv)), "76797")), "7a611a183420891c89472f601d847973"))
stopifnot("column names of happy_df_csv are not correct"= setequal(digest(paste(toString(sort(colnames(happy_df_csv))), "76797")), "fa7cfdb21791ab4e35c7712cbf777f3d"))
stopifnot("types of columns in happy_df_csv are not correct"= setequal(digest(paste(toString(sort(unlist(sapply(happy_df_csv, class)))), "76797")), "9118026d2717d6c34f7e82878faa8408"))
stopifnot("values in one or more numerical columns in happy_df_csv are not correct"= setequal(digest(paste(toString(if (any(sapply(happy_df_csv, is.numeric))) sort(round(sapply(happy_df_csv[, sapply(happy_df_csv, is.numeric)], sum, na.rm = TRUE), 2)) else 0), "76797")), "de1c4d136c16adf2ff0078b3692fe121"))
stopifnot("values in one or more character columns in happy_df_csv are not correct"= setequal(digest(paste(toString(if (any(sapply(happy_df_csv, is.character))) sum(sapply(happy_df_csv[sapply(happy_df_csv, is.character)], function(x) length(unique(x)))) else 0), "76797")), "cfcf03c543a518bfeb59e529a954656f"))
stopifnot("values in one or more factor columns in happy_df_csv are not correct"= setequal(digest(paste(toString(if (any(sapply(happy_df_csv, is.factor))) sum(sapply(happy_df_csv[, sapply(happy_df_csv, is.factor)], function(col) length(unique(col)))) else 0), "76797")), "039feb6f70e8b8c35cf635b51f2a73ec"))

print('Success!')

**Question 1.3.1**
<br> {points: 1}

Above, you loaded the data from a file that we had already downloaded and converted to a `.csv` for you. You can also use the `readxl` R package to load Excel files directly into R. Given that the data we loaded above (`WHR2018Chapter2OnlineData.csv`) was originally sourced from an Excel file on the web, let's now read that Excel file directly into R using the `read_excel` function from that package. This Excel file has multiple sheets; the data we want is on the first one.

> **Note:**
> `read_excel` does not support putting a URL as the file path argument. So we need to first download the file and write it to disk using R's `download.file` function, and then we can read that saved Excel file into R using `read_excel`.
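
The download-then-read pattern described in the note can be sketched as follows. This is only an illustration under stated assumptions: instead of a real web address, it uses an example workbook that ships with `readxl` (located with `readxl_example`) and addresses it through a `file://` URL, which assumes a Unix-like absolute path. The names `example_url`, `local_path`, and `example_df` are hypothetical.

```r
library(readxl)

# Stand-in for a remote file: an example workbook bundled with readxl,
# addressed through a file:// URL (assumes a Unix-like absolute path)
example_url <- paste0("file://", readxl_example("datasets.xlsx"))

# Step 1: download the workbook and write it to disk.
# mode = "wb" writes the file in binary mode so the workbook is not corrupted.
local_path <- tempfile(fileext = ".xlsx")
download.file(example_url, destfile = local_path, mode = "wb")

# Step 2: read one sheet from the saved file.
# A sheet can be chosen by position (as here) or by name.
excel_sheets(local_path)
example_df <- read_excel(path = local_path, sheet = 1)
example_df
```

`excel_sheets` lists the sheet names in a workbook, which is a quick way to check which sheet holds the data you need before reading it.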

To answer the question, fill in the blanks in the code below. If you are unsure, try reading the documentation for the new functions and ask others for help!

Assign the data into an object called `happy_df`.

In [None]:
library(readxl)
url <- "https://github.com/UBC-DSCI/dsci-100-student/raw/refs/heads/master/data/reading/WHR2018Chapter2OnlineData.xls"

# download.file(..., destfile = "data/WHR2018Chapter2OnlineData.xls")
#... <- read_excel(path = ..., sheet = ...)

# your code here
fail() # No Answer - remove if you provide an answer
happy_df

In [None]:
library(digest)
stopifnot("happy_df should be a data frame"= setequal(digest(paste(toString('data.frame' %in% class(happy_df)), "8c7ca")), "efcbf6038250c0a502025b560858f822"))
stopifnot("dimensions of happy_df are not correct"= setequal(digest(paste(toString(dim(happy_df)), "8c7ca")), "e63cc41e197c2e436f72e0e90aebfcd1"))
stopifnot("column names of happy_df are not correct"= setequal(digest(paste(toString(sort(colnames(happy_df))), "8c7ca")), "749999d0e6e89524885201d408aeb876"))
stopifnot("types of columns in happy_df are not correct"= setequal(digest(paste(toString(sort(unlist(sapply(happy_df, class)))), "8c7ca")), "e3b0bdf29e034d617db8f44d37b4e825"))
stopifnot("values in one or more numerical columns in happy_df are not correct"= setequal(digest(paste(toString(if (any(sapply(happy_df, is.numeric))) sort(round(sapply(happy_df[, sapply(happy_df, is.numeric)], sum, na.rm = TRUE), 2)) else 0), "8c7ca")), "cbdbed152c9404f9383e769c2bd6c9f5"))
stopifnot("values in one or more character columns in happy_df are not correct"= setequal(digest(paste(toString(if (any(sapply(happy_df, is.character))) sum(sapply(happy_df[sapply(happy_df, is.character)], function(x) length(unique(x)))) else 0), "8c7ca")), "35f8b9ec22d344569f5252ff731e21b2"))
stopifnot("values in one or more factor columns in happy_df are not correct"= setequal(digest(paste(toString(if (any(sapply(happy_df, is.factor))) sum(sapply(happy_df[, sapply(happy_df, is.factor)], function(col) length(unique(col)))) else 0), "8c7ca")), "d3ebefa81745a5f75f39a9a4cde3fc93"))

print('Success!')

Look at the column names - they contain spaces! This is not best practice and will make it difficult to use our tidyverse functions. Run the cell below to use the `clean_names` function from the `janitor` package, which will replace all the spaces with an underscore (`_`) and make all characters lowercase so that our column names are in a standard format.
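
To see what `clean_names` does before running it on `happy_df`, here is a minimal sketch on a made-up data frame. The data frame `messy` and its column names are hypothetical.

```r
library(janitor)

# Hypothetical data frame whose column names contain spaces and capitals
messy <- data.frame(
  `Country Name` = c("A", "B"),
  `Life Ladder` = c(5.1, 6.3),
  check.names = FALSE  # keep the names exactly as written
)

colnames(messy)    # "Country Name" "Life Ladder"

cleaned <- clean_names(messy)
colnames(cleaned)  # "country_name" "life_ladder"
```

Without this step, a column such as `Country Name` would have to be wrapped in backticks every time it is used in a tidyverse function.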

In [None]:
### Run this cell before continuing. 
happy_df <- happy_df |> clean_names()
happy_df

**Question 1.3.2**
<br> {points: 1}

Using the scaffolding given in the cell below, `filter`, `mutate`, and `select` the `happy_df` data frame as needed to get it ready to create our desired scatterplot. Recall that we wanted to rescale the "Positive affect" scores so that they fall in the range 0-10 instead of 0-1. Call the new, re-scaled column `positive_affect_scaled`.

_Assign the data frame containing only the columns we need to create our plot to an object called `reduced_happy_df`._

In [None]:
# happy_step1 <- ...(happy_df, year == ...)
# happy_step2 <- mutate(happy_step1, positive_affect_scaled = ...)
# reduced_happy_df <- ...(happy_step2, ..., ...)

# your code here
fail() # No Answer - remove if you provide an answer
reduced_happy_df

In [None]:
library(digest)
stopifnot("happy_step1 should be a data frame"= setequal(digest(paste(toString('data.frame' %in% class(happy_step1)), "3ab7")), "9bfc3ff835a96608cdcee1140007218e"))
stopifnot("dimensions of happy_step1 are not correct"= setequal(digest(paste(toString(dim(happy_step1)), "3ab7")), "1138494d893299dc8d4a9a77f0afbb65"))
stopifnot("column names of happy_step1 are not correct"= setequal(digest(paste(toString(sort(colnames(happy_step1))), "3ab7")), "3d404a7b53ad25a345935326825722a6"))
stopifnot("types of columns in happy_step1 are not correct"= setequal(digest(paste(toString(sort(unlist(sapply(happy_step1, class)))), "3ab7")), "faddb1819581167cd29f555aee3b27e9"))
stopifnot("values in one or more numerical columns in happy_step1 are not correct"= setequal(digest(paste(toString(if (any(sapply(happy_step1, is.numeric))) sort(round(sapply(happy_step1[, sapply(happy_step1, is.numeric)], sum, na.rm = TRUE), 2)) else 0), "3ab7")), "30ffc4b0a2fa2dd230ab9f687d1b5726"))
stopifnot("values in one or more character columns in happy_step1 are not correct"= setequal(digest(paste(toString(if (any(sapply(happy_step1, is.character))) sum(sapply(happy_step1[sapply(happy_step1, is.character)], function(x) length(unique(x)))) else 0), "3ab7")), "9e9fbc8e8d4ff9371422416ae237d4d8"))
stopifnot("values in one or more factor columns in happy_step1 are not correct"= setequal(digest(paste(toString(if (any(sapply(happy_step1, is.factor))) sum(sapply(happy_step1[, sapply(happy_step1, is.factor)], function(col) length(unique(col)))) else 0), "3ab7")), "18cdf60592473186078a853802f6cae8"))

stopifnot("happy_step2 should be a data frame"= setequal(digest(paste(toString('data.frame' %in% class(happy_step2)), "3ab8")), "88ce917dc80be049368114b5654fa5b4"))
stopifnot("dimensions of happy_step2 are not correct"= setequal(digest(paste(toString(dim(happy_step2)), "3ab8")), "07fd07633b0233c66df7bc027f8e6f17"))
stopifnot("column names of happy_step2 are not correct"= setequal(digest(paste(toString(sort(colnames(happy_step2))), "3ab8")), "c8df45bd1c3fd47cf0b0aebc49e68878"))
stopifnot("types of columns in happy_step2 are not correct"= setequal(digest(paste(toString(sort(unlist(sapply(happy_step2, class)))), "3ab8")), "327f63a8bcfa9be076bd270857e6143b"))
stopifnot("values in one or more numerical columns in happy_step2 are not correct"= setequal(digest(paste(toString(if (any(sapply(happy_step2, is.numeric))) sort(round(sapply(happy_step2[, sapply(happy_step2, is.numeric)], sum, na.rm = TRUE), 2)) else 0), "3ab8")), "ae26838dcba591554bf078ebc5ba7a4f"))
stopifnot("values in one or more character columns in happy_step2 are not correct"= setequal(digest(paste(toString(if (any(sapply(happy_step2, is.character))) sum(sapply(happy_step2[sapply(happy_step2, is.character)], function(x) length(unique(x)))) else 0), "3ab8")), "cd9cfc3c4dcee516e6e6c0c0776558ef"))
stopifnot("values in one or more factor columns in happy_step2 are not correct"= setequal(digest(paste(toString(if (any(sapply(happy_step2, is.factor))) sum(sapply(happy_step2[, sapply(happy_step2, is.factor)], function(col) length(unique(col)))) else 0), "3ab8")), "5b3853cd705289bdfbf08ca0f943663f"))

stopifnot("reduced_happy_df should be a data frame"= setequal(digest(paste(toString('data.frame' %in% class(reduced_happy_df)), "3ab9")), "eef931df5ef2ffd82b235c14ab582e1d"))
stopifnot("dimensions of reduced_happy_df are not correct"= setequal(digest(paste(toString(dim(reduced_happy_df)), "3ab9")), "fb6872b1cacb4d1cdc6f07c1f6da3589"))
stopifnot("column names of reduced_happy_df are not correct"= setequal(digest(paste(toString(sort(colnames(reduced_happy_df))), "3ab9")), "ae41ab4a60f3b34daffbde41baa336bf"))
stopifnot("types of columns in reduced_happy_df are not correct"= setequal(digest(paste(toString(sort(unlist(sapply(reduced_happy_df, class)))), "3ab9")), "8ea41a0208a8aea09f754ce6f6f507ff"))
stopifnot("values in one or more numerical columns in reduced_happy_df are not correct"= setequal(digest(paste(toString(if (any(sapply(reduced_happy_df, is.numeric))) sort(round(sapply(reduced_happy_df[, sapply(reduced_happy_df, is.numeric)], sum, na.rm = TRUE), 2)) else 0), "3ab9")), "c185c46be0fd3745d2d040b1630464af"))
stopifnot("values in one or more character columns in reduced_happy_df are not correct"= setequal(digest(paste(toString(if (any(sapply(reduced_happy_df, is.character))) sum(sapply(reduced_happy_df[sapply(reduced_happy_df, is.character)], function(x) length(unique(x)))) else 0), "3ab9")), "16ad06c7acb0faaa0180f19f7b662bff"))
stopifnot("values in one or more factor columns in reduced_happy_df are not correct"= setequal(digest(paste(toString(if (any(sapply(reduced_happy_df, is.factor))) sum(sapply(reduced_happy_df[, sapply(reduced_happy_df, is.factor)], function(col) length(unique(col)))) else 0), "3ab9")), "16ad06c7acb0faaa0180f19f7b662bff"))

print('Success!')

**Question 1.4** 
<br> {points: 1}

Using the modified data set, `reduced_happy_df`, generate the scatterplot described above and make sure to label the axes in proper written English.

_Assign your plot to an object called `happy_plot`._
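
As a reminder of how a labelled scatterplot is put together, here is a minimal sketch on made-up data. The data frame `toy_plot_data`, its columns, and the axis labels are hypothetical and unrelated to the happiness data.

```r
library(ggplot2)

# Hypothetical toy data, unrelated to the happiness data
toy_plot_data <- data.frame(
  hours = c(10, 12, 15, 18),
  score_scaled = c(2, 5, 9, 7)
)

toy_plot <- ggplot(toy_plot_data, aes(x = hours, y = score_scaled)) +
  geom_point() +                 # draw one point per row
  xlab("Hours of practice") +    # axis labels in plain English,
  ylab("Score (out of 10)")      # rather than raw column names
toy_plot
```

Including the scale or units in an axis label, as in the y-axis label here, saves the reader from having to guess them.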

In [None]:
options(repr.plot.width = 8, repr.plot.height = 8)

#... <- ggplot(reduced_happy_df, ...(x = ..., y = ...)) + 
#     geom_...() + 
#     ...("...") + 
#     ylab("Positive affect score (out of ...)")

# your code here
fail() # No Answer - remove if you provide an answer
happy_plot

In [None]:
library(digest)
stopifnot("type of plot is not correct (if you are using two types of geoms, try flipping the order of the geom objects!)"= setequal(digest(paste(toString(sapply(seq_len(length(happy_plot$layers)), function(i) {c(class(happy_plot$layers[[i]]$geom))[1]})), "2fc0d")), "e6cb6a79ddf6bd786b05d09a92b5e8d0"))
stopifnot("variable x is not correct"= setequal(digest(paste(toString(unlist(lapply(sapply(seq_len(length(happy_plot$layers)), function(i) {rlang::get_expr(c(happy_plot$layers[[i]]$mapping, happy_plot$mapping)$x)}), as.character))), "2fc0d")), "fa41a9c993c9f6a0ab4cbfe131dba2d9"))
stopifnot("variable y is not correct"= setequal(digest(paste(toString(unlist(lapply(sapply(seq_len(length(happy_plot$layers)), function(i) {rlang::get_expr(c(happy_plot$layers[[i]]$mapping, happy_plot$mapping)$y)}), as.character))), "2fc0d")), "34daefbe7a396c06e0441ff87f02b425"))
stopifnot("x-axis label is not descriptive, nicely formatted, or human readable"= setequal(digest(paste(toString(rlang::get_expr(c(happy_plot$layers[[1]]$mapping, happy_plot$mapping)$x)!= happy_plot$labels$x), "2fc0d")), "63f1b425f1204ba11d70c22a89f5c63e"))
stopifnot("y-axis label is not descriptive, nicely formatted, or human readable"= setequal(digest(paste(toString(rlang::get_expr(c(happy_plot$layers[[1]]$mapping, happy_plot$mapping)$y)!= happy_plot$labels$y), "2fc0d")), "63f1b425f1204ba11d70c22a89f5c63e"))
stopifnot("incorrect colour variable in happy_plot, specify a correct one if required"= setequal(digest(paste(toString(rlang::get_expr(c(happy_plot$layers[[1]]$mapping, happy_plot$mapping)$colour)), "2fc0d")), "5f9622864eb0a71ccc55abbfa3596931"))
stopifnot("incorrect shape variable in happy_plot, specify a correct one if required"= setequal(digest(paste(toString(rlang::get_expr(c(happy_plot$layers[[1]]$mapping, happy_plot$mapping)$shape)), "2fc0d")), "5f9622864eb0a71ccc55abbfa3596931"))
stopifnot("the colour label in happy_plot is not descriptive, nicely formatted, or human readable"= setequal(digest(paste(toString(rlang::get_expr(c(happy_plot$layers[[1]]$mapping, happy_plot$mapping)$colour) != happy_plot$labels$colour), "2fc0d")), "5f9622864eb0a71ccc55abbfa3596931"))
stopifnot("the shape label in happy_plot is not descriptive, nicely formatted, or human readable"= setequal(digest(paste(toString(rlang::get_expr(c(happy_plot$layers[[1]]$mapping, happy_plot$mapping)$colour) != happy_plot$labels$shape), "2fc0d")), "5f9622864eb0a71ccc55abbfa3596931"))
stopifnot("fill variable in happy_plot is not correct"= setequal(digest(paste(toString(quo_name(happy_plot$mapping$fill)), "2fc0d")), "2bb6e7e409639fa12e8e79d85b080a26"))
stopifnot("fill label in happy_plot is not informative"= setequal(digest(paste(toString((quo_name(happy_plot$mapping$fill) != happy_plot$labels$fill)), "2fc0d")), "5f9622864eb0a71ccc55abbfa3596931"))
stopifnot("position argument in happy_plot is not correct"= setequal(digest(paste(toString(class(happy_plot$layers[[1]]$position)[1]), "2fc0d")), "3d6ad4460bfe1ce169a474b0d2d1c01b"))

stopifnot("happy_plot$data should be a data frame"= setequal(digest(paste(toString('data.frame' %in% class(happy_plot$data)), "2fc0e")), "b0d38254d90177ac532512b146e5bb6f"))
stopifnot("dimensions of happy_plot$data are not correct"= setequal(digest(paste(toString(dim(happy_plot$data)), "2fc0e")), "00854ca523b431a81f7512d96e9c3409"))
stopifnot("column names of happy_plot$data are not correct"= setequal(digest(paste(toString(sort(colnames(happy_plot$data))), "2fc0e")), "52d06470d8482283f3760ec39d516d06"))
stopifnot("types of columns in happy_plot$data are not correct"= setequal(digest(paste(toString(sort(unlist(sapply(happy_plot$data, class)))), "2fc0e")), "07b558ea3b42f1090f8f7d75611b2efe"))
stopifnot("values in one or more numerical columns in happy_plot$data are not correct"= setequal(digest(paste(toString(if (any(sapply(happy_plot$data, is.numeric))) sort(round(sapply(happy_plot$data[, sapply(happy_plot$data, is.numeric)], sum, na.rm = TRUE), 2)) else 0), "2fc0e")), "eb02a3f73a96dd6fa2a5e71cc7e2b680"))
stopifnot("values in one or more character columns in happy_plot$data are not correct"= setequal(digest(paste(toString(if (any(sapply(happy_plot$data, is.character))) sum(sapply(happy_plot$data[sapply(happy_plot$data, is.character)], function(x) length(unique(x)))) else 0), "2fc0e")), "6ab54e5bb20f88e1b075296b81105e52"))
stopifnot("values in one or more factor columns in happy_plot$data are not correct"= setequal(digest(paste(toString(if (any(sapply(happy_plot$data, is.factor))) sum(sapply(happy_plot$data[, sapply(happy_plot$data, is.factor)], function(col) length(unique(col)))) else 0), "2fc0e")), "6ab54e5bb20f88e1b075296b81105e52"))

print('Success!')

**Question 1.5** 
<br> {points: 3}

In a sentence or two, describe what you see in the scatterplot above. Does there appear to be a relationship between healthy life expectancy at birth and positive affect? If so, describe it.

DOUBLE CLICK TO EDIT **THIS CELL** AND REPLACE THIS TEXT WITH YOUR ANSWER.

**Question 1.6** 
<br> {points: 3}

Plot freedom to make life choices against healthy life expectancy at birth (using the unmodified `happy_df` data for the chart). **You should NOT scale the variables to be plotted.** Ensure that healthy life expectancy at birth is on the x-axis and that you give your axes human-readable labels.

_Assign your plot to an object called `happy_plot_2`._

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
# Most of the tests for this question are hidden. You have to decide whether you've created the right object.
# Here's one test to at least ensure you named the object correctly.
library(digest)
stopifnot("type of exists('happy_plot_2') is not logical"= setequal(digest(paste(toString(class(exists('happy_plot_2'))), "9def3")), "6d098cb7157ff0e17dd2c4a7abc666cf"))
stopifnot("logical value of exists('happy_plot_2') is not correct"= setequal(digest(paste(toString(exists('happy_plot_2')), "9def3")), "7913ae460c7c64a97f7066eef689a55a"))

print('Success!')

**Question 1.7**
<br> {points: 3}

In a sentence or two, describe what you see in the scatterplot above. Does there appear to be a relationship between healthy life expectancy at birth and the other variable you plotted? If so, describe it.

DOUBLE CLICK TO EDIT **THIS CELL** AND REPLACE THIS TEXT WITH YOUR ANSWER.

## 2. Whistler Snow

Skiing and snowboarding are huge in British Columbia, and some of the best slopes for snow sports are quite close to Vancouver. In fact, the famous mountain resort of Whistler is just two hours north of the city. With cold weather and plenty of snowfall, Whistler is an ideal destination for winter sports fanatics.

One thing skiers and snowboarders want is fresh snow! When are they most likely to find it? In the `data` directory, we have two year-long data sets recorded by Environment Canada at the [Whistler Roundhouse Station](http://climate.weather.gc.ca/historical_data/search_historic_data_stations_e.html?StationID=348&Year=2007&Month=3&Day=1&timeframe=2&type=bar&MeasTypeID=snow&searchType=stnProx&txtRadius=25&optProxType=navLink&txtLatDecDeg=50.128889166667&txtLongDecDeg=122.95483333333&optLimit=specDate&selRowPerPage=25&station=WHISTLER) (on Whistler Mountain). This weather station is located 1,835 m above sea level.

To answer the question "When are skiers and snowboarders most likely to find fresh snow at Whistler?", you will create a line plot with the date on the x-axis and the total snow per day in centimetres (the column named `Total Snow cm` in the data file) on the y-axis. Given that we have data for two years (2017 & 2018), we will create one plot for each year to see if there is a trend we can observe across the two years.

**Question 2.1** Multiple Choice: 
<br> {points: 1}

What are we going to plot on the y-axis?

A. total precipitation per day in centimetres

B. total snow on the ground in centimetres

C. total snow per day in centimetres

D. total rain per day in centimetres

*Assign your answer to an object called `answer2.1`. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. `"F"`).*

In [None]:
# Replace the fail() with your answer. 

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
library(digest)
stopifnot("type of answer2.1 is not character"= setequal(digest(paste(toString(class(answer2.1)), "1c278")), "019916d6f658d2864306dcb72fe7742c"))
stopifnot("length of answer2.1 is not correct"= setequal(digest(paste(toString(length(answer2.1)), "1c278")), "4827d9b95a6ae25aba7d39850f00c3c4"))
stopifnot("value of answer2.1 is not correct"= setequal(digest(paste(toString(tolower(answer2.1)), "1c278")), "d8ea58a18613d2470502a64882045aa6"))
stopifnot("letters in string value of answer2.1 are correct but case is not correct"= setequal(digest(paste(toString(answer2.1), "1c278")), "1ae652094caf170998db06c35480c2af"))

print('Success!')

**Question 2.2.0** 
<br> {points: 1}

Read in the file named `eng-daily-01012018-12312018.csv` from the `data` directory. **Make sure you preview the file to choose the correct `read_*` function and argument values to get the data into R.** 

_Assign your data frame to an object called `whistler_2018`._
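
One way to preview a file from within R is to print its first few raw lines before choosing a `read_*` function. The sketch below uses a made-up file written to a temporary location; the file contents, the names `toy_path` and `toy_weather`, and the value of `skip` are all hypothetical and will differ for the real data file.

```r
library(readr)

# Hypothetical file: two lines of metadata, then a header row and the data
toy_path <- tempfile(fileext = ".csv")
writeLines(
  c(
    "Station Name,EXAMPLE",
    "Legend: M = missing",
    "Date,Total Snow cm",
    "2018-01-01,4",
    "2018-01-02,0"
  ),
  toy_path
)

# Preview the raw text to spot the delimiter and any lines before the header
readLines(toy_path, n = 4)

# Skip the two metadata lines so that the third line is used as the header
toy_weather <- read_csv(toy_path, skip = 2, show_col_types = FALSE)
toy_weather
```

The preview shows two things at once: which character separates the values (here a comma, so `read_csv` fits) and how many lines must be skipped before the header row.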

*Note: You'll see a lot of entries of the form `NA`. This is the symbol R uses to denote missing data. Interestingly, you can do math and make comparisons with `NA`: for example,* `NA + 1 = NA`, `NA * 3 = NA`, `NA > 3 = NA`. *Most operations on `NA` return `NA`. This may seem a bit weird, but it makes things simpler in R since it often removes the need to write special code to handle missing data!*
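
The behaviour of `NA` can be checked directly in a code cell. The vector `snow` below is a hypothetical example, not part of the Whistler data. Note that summary functions such as `sum` also return `NA` when any input is missing, unless you ask them to drop missing values with `na.rm = TRUE`.

```r
NA + 1                   # NA
NA * 3                   # NA
NA > 3                   # NA

# Hypothetical measurements with one missing value
snow <- c(2, NA, 5)

sum(snow)                # NA, because one of the inputs is unknown
sum(snow, na.rm = TRUE)  # 7, after dropping the missing value
is.na(snow)              # FALSE TRUE FALSE
```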

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer
whistler_2018

In [None]:
library(digest)
stopifnot("whistler_2018 should be a data frame"= setequal(digest(paste(toString('data.frame' %in% class(whistler_2018)), "ec9b3")), "9439aad17b519fff36d6a4cb419b27d5"))
stopifnot("dimensions of whistler_2018 are not correct"= setequal(digest(paste(toString(dim(whistler_2018)), "ec9b3")), "c1c96254c0de2888cabdff82e4e7dc4b"))
stopifnot("column names of whistler_2018 are not correct"= setequal(digest(paste(toString(sort(colnames(whistler_2018))), "ec9b3")), "93bf63e7be44510bf0fc4b52f0f87580"))
stopifnot("types of columns in whistler_2018 are not correct"= setequal(digest(paste(toString(sort(unlist(sapply(whistler_2018, class)))), "ec9b3")), "e4a00f4054d43b675c5d21109204b7f1"))
stopifnot("values in one or more numerical columns in whistler_2018 are not correct"= setequal(digest(paste(toString(if (any(sapply(whistler_2018, is.numeric))) sort(round(sapply(whistler_2018[, sapply(whistler_2018, is.numeric)], sum, na.rm = TRUE), 2)) else 0), "ec9b3")), "b10859dd53d8ecf6112695bb0c6125a7"))
stopifnot("values in one or more character columns in whistler_2018 are not correct"= setequal(digest(paste(toString(if (any(sapply(whistler_2018, is.character))) sum(sapply(whistler_2018[sapply(whistler_2018, is.character)], function(x) length(unique(x)))) else 0), "ec9b3")), "23b6d268a34dbaf3993fdd02396d136a"))
stopifnot("values in one or more factor columns in whistler_2018 are not correct"= setequal(digest(paste(toString(if (any(sapply(whistler_2018, is.factor))) sum(sapply(whistler_2018[, sapply(whistler_2018, is.factor)], function(col) length(unique(col)))) else 0), "ec9b3")), "a178b3f43cfe0dd824a90e8347404205"))

print('Success!')

**Question 2.2.1** 
<br> {points: 1}

Looking at the column names of the `whistler_2018` data frame, you can see we have whitespace in our column names again. Use `clean_names` to replace the whitespace and make it easier to use our `tidyverse` functions. Store the result with the same name, `whistler_2018`.

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer
whistler_2018

In [None]:
library(digest)
stopifnot("whistler_2018 should be a data frame"= setequal(digest(paste(toString('data.frame' %in% class(whistler_2018)), "4fafb")), "25d33f7c8a924cbb531874ad0763094e"))
stopifnot("dimensions of whistler_2018 are not correct"= setequal(digest(paste(toString(dim(whistler_2018)), "4fafb")), "bf5f2e25e3b4a8d3324a49c12aaeb889"))
stopifnot("column names of whistler_2018 are not correct"= setequal(digest(paste(toString(sort(colnames(whistler_2018))), "4fafb")), "36cd1e90d6d4c651da5107bc2a4ae902"))
stopifnot("types of columns in whistler_2018 are not correct"= setequal(digest(paste(toString(sort(unlist(sapply(whistler_2018, class)))), "4fafb")), "906895ac05403cf79bdeb68da91f0ec3"))
stopifnot("values in one or more numerical columns in whistler_2018 are not correct"= setequal(digest(paste(toString(if (any(sapply(whistler_2018, is.numeric))) sort(round(sapply(whistler_2018[, sapply(whistler_2018, is.numeric)], sum, na.rm = TRUE), 2)) else 0), "4fafb")), "3b5616b5b9a13f241864b8a1b6e48db2"))
stopifnot("values in one or more character columns in whistler_2018 are not correct"= setequal(digest(paste(toString(if (any(sapply(whistler_2018, is.character))) sum(sapply(whistler_2018[sapply(whistler_2018, is.character)], function(x) length(unique(x)))) else 0), "4fafb")), "90fad10e7bbe1d211bae402edd37a40b"))
stopifnot("values in one or more factor columns in whistler_2018 are not correct"= setequal(digest(paste(toString(if (any(sapply(whistler_2018, is.factor))) sum(sapply(whistler_2018[, sapply(whistler_2018, is.factor)], function(col) length(unique(col)))) else 0), "4fafb")), "2d53a40299c57de734c258d48b819105"))

print('Success!')

**Question 2.3** 
<br> {points: 1}

Create a line plot with the date on the x-axis and the total snow per day (in cm) on the y-axis by filling in the `...` in the code below. Ensure you give your axes human-readable labels.

_Assign your plot to an object called `whistler_2018_plot`._

In [None]:
options(repr.plot.width = 12, repr.plot.height = 5)

# ... <- ggplot(..., aes(x = ..., y = ...)) + 
#     geom_line() +
#     xlab(...) +
#     ylab(...) +
#     scale_x_date(date_breaks = "1 month") + # labels every month
#     theme(axis.text.x = element_text(angle = 90, hjust = 1)) # rotates x axis labels to be vertical

# your code here
fail() # No Answer - remove if you provide an answer
whistler_2018_plot

In [None]:
library(digest)
stopifnot("type of plot is not correct (if you are using two types of geoms, try flipping the order of the geom objects!)"= setequal(digest(paste(toString(sapply(seq_len(length(whistler_2018_plot$layers)), function(i) {c(class(whistler_2018_plot$layers[[i]]$geom))[1]})), "ef4ac")), "45b08e4018c09024440be0caec7f40ad"))
stopifnot("variable x is not correct"= setequal(digest(paste(toString(unlist(lapply(sapply(seq_len(length(whistler_2018_plot$layers)), function(i) {rlang::get_expr(c(whistler_2018_plot$layers[[i]]$mapping, whistler_2018_plot$mapping)$x)}), as.character))), "ef4ac")), "4d572e7377a4b6617e96f38303fdfda9"))
stopifnot("variable y is not correct"= setequal(digest(paste(toString(unlist(lapply(sapply(seq_len(length(whistler_2018_plot$layers)), function(i) {rlang::get_expr(c(whistler_2018_plot$layers[[i]]$mapping, whistler_2018_plot$mapping)$y)}), as.character))), "ef4ac")), "a65f4d20bb78b0236685ef9daa7e8a12"))
stopifnot("x-axis label is not descriptive, nicely formatted, or human readable"= setequal(digest(paste(toString(rlang::get_expr(c(whistler_2018_plot$layers[[1]]$mapping, whistler_2018_plot$mapping)$x)!= whistler_2018_plot$labels$x), "ef4ac")), "f0dca371264c8f5f4b41f92132fcd31b"))
stopifnot("y-axis label is not descriptive, nicely formatted, or human readable"= setequal(digest(paste(toString(rlang::get_expr(c(whistler_2018_plot$layers[[1]]$mapping, whistler_2018_plot$mapping)$y)!= whistler_2018_plot$labels$y), "ef4ac")), "f0dca371264c8f5f4b41f92132fcd31b"))
stopifnot("incorrect colour variable in whistler_2018_plot, specify a correct one if required"= setequal(digest(paste(toString(rlang::get_expr(c(whistler_2018_plot$layers[[1]]$mapping, whistler_2018_plot$mapping)$colour)), "ef4ac")), "34135ebcc0bca1e8b91fe4b066fa482c"))
stopifnot("incorrect shape variable in whistler_2018_plot, specify a correct one if required"= setequal(digest(paste(toString(rlang::get_expr(c(whistler_2018_plot$layers[[1]]$mapping, whistler_2018_plot$mapping)$shape)), "ef4ac")), "34135ebcc0bca1e8b91fe4b066fa482c"))
stopifnot("the colour label in whistler_2018_plot is not descriptive, nicely formatted, or human readable"= setequal(digest(paste(toString(rlang::get_expr(c(whistler_2018_plot$layers[[1]]$mapping, whistler_2018_plot$mapping)$colour) != whistler_2018_plot$labels$colour), "ef4ac")), "34135ebcc0bca1e8b91fe4b066fa482c"))
stopifnot("the shape label in whistler_2018_plot is not descriptive, nicely formatted, or human readable"= setequal(digest(paste(toString(rlang::get_expr(c(whistler_2018_plot$layers[[1]]$mapping, whistler_2018_plot$mapping)$colour) != whistler_2018_plot$labels$shape), "ef4ac")), "34135ebcc0bca1e8b91fe4b066fa482c"))
stopifnot("fill variable in whistler_2018_plot is not correct"= setequal(digest(paste(toString(quo_name(whistler_2018_plot$mapping$fill)), "ef4ac")), "cdc878d16bb6f1f3ac1ab6283fc4e960"))
stopifnot("fill label in whistler_2018_plot is not informative"= setequal(digest(paste(toString((quo_name(whistler_2018_plot$mapping$fill) != whistler_2018_plot$labels$fill)), "ef4ac")), "34135ebcc0bca1e8b91fe4b066fa482c"))
stopifnot("position argument in whistler_2018_plot is not correct"= setequal(digest(paste(toString(class(whistler_2018_plot$layers[[1]]$position)[1]), "ef4ac")), "a801b90ef3591ee86a2d074417ffff3f"))

stopifnot("whistler_2018_plot$data should be a data frame"= setequal(digest(paste(toString('data.frame' %in% class(whistler_2018_plot$data)), "ef4ad")), "ed62f002bb0d97472d16a40f9f7c8d21"))
stopifnot("dimensions of whistler_2018_plot$data are not correct"= setequal(digest(paste(toString(dim(whistler_2018_plot$data)), "ef4ad")), "7d2856e5bc15861707e711acea3dc6f6"))
stopifnot("column names of whistler_2018_plot$data are not correct"= setequal(digest(paste(toString(sort(colnames(whistler_2018_plot$data))), "ef4ad")), "5d8de8e2d5f68a65e4521aae9caeb27d"))
stopifnot("types of columns in whistler_2018_plot$data are not correct"= setequal(digest(paste(toString(sort(unlist(sapply(whistler_2018_plot$data, class)))), "ef4ad")), "ec1e08599243e625504849b7aae5681e"))
stopifnot("values in one or more numerical columns in whistler_2018_plot$data are not correct"= setequal(digest(paste(toString(if (any(sapply(whistler_2018_plot$data, is.numeric))) sort(round(sapply(whistler_2018_plot$data[, sapply(whistler_2018_plot$data, is.numeric)], sum, na.rm = TRUE), 2)) else 0), "ef4ad")), "16bd2337f03a45aa23f983bd63e02981"))
stopifnot("values in one or more character columns in whistler_2018_plot$data are not correct"= setequal(digest(paste(toString(if (any(sapply(whistler_2018_plot$data, is.character))) sum(sapply(whistler_2018_plot$data[sapply(whistler_2018_plot$data, is.character)], function(x) length(unique(x)))) else 0), "ef4ad")), "4a93c9553691809458ed5908da3f37cb"))
stopifnot("values in one or more factor columns in whistler_2018_plot$data are not correct"= setequal(digest(paste(toString(if (any(sapply(whistler_2018_plot$data, is.factor))) sum(sapply(whistler_2018_plot$data[, sapply(whistler_2018_plot$data, is.factor)], function(col) length(unique(col)))) else 0), "ef4ad")), "1e8eb969275d0df0907c6cceaeb63c3c"))

print('Success!')

**Question 2.4** 
<br> {points: 3}

Looking at the line plot above, for 2018, of the months when it snowed, which 2 months had the **most** fresh snow?

DOUBLE CLICK TO EDIT **THIS CELL** AND REPLACE THIS TEXT WITH YOUR ANSWER.

**Question 2.5**
<br> {points: 3}

Repeat the data loading and plot creation using the file `eng-daily-01012017-12312017.csv` located in the `data` directory to visualize the same data for the year 2017. 

_Assign your plot to an object called `whistler_2017_plot`._

In [None]:
# whistler_2017 <- ...
# whistler_2017 <- whistler_2017 |> ...()

# ... <- ggplot(..., aes(x = ..., y = ...)) + 
#    geom_line() + 
#    xlab("...") + 
#    ylab("...") +
#    scale_x_date(date_breaks = "1 month") +
#    theme(axis.text.x = element_text(angle = 90, hjust = 1)) + 
#    theme(text = element_text(size = 20))

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
# Most of the tests for this question are hidden. You have to decide whether you've created the right object.
# Here's one test to at least ensure you named the object correctly.
library(digest)
stopifnot("type of exists('whistler_2017_plot') is not logical"= setequal(digest(paste(toString(class(exists('whistler_2017_plot'))), "9a7af")), "fa88680ece6774047e68f6b059fda7b2"))
stopifnot("logical value of exists('whistler_2017_plot') is not correct"= setequal(digest(paste(toString(exists('whistler_2017_plot')), "9a7af")), "35badcccb023ab9a8acde6025234c9e2"))

print('Success!')

**Question 2.6**
<br> {points: 3}

Looking at the line plot above, for 2017, of the months when it snowed, which 2 months had the **most** fresh snow?

DOUBLE CLICK TO EDIT **THIS CELL** AND REPLACE THIS TEXT WITH YOUR ANSWER.

**Question 2.7**
<br> {points: 3}

Are the months with the most fresh snow the same in 2017 as they were in 2018? **Hint:** you might want to add a code cell where you plot the two plots right after each other so you can easily compare them in one screen view.

You can combine two plots, one atop the other, by using the `plot_grid` function from the `cowplot` package:

```
library(cowplot)
plot_grid(plot1, plot2, ncol = 1)
```
Is there any advantage to looking at 2 years' worth of data? Why or why not?

DOUBLE CLICK TO EDIT **THIS CELL** AND REPLACE THIS TEXT WITH YOUR ANSWER.

## 3. Reading from a Database

In `worksheet_reading`, you'll recall that we opened a database stored in a `.db` file. This involved a lot more effort than just opening a `.csv`, `.tsv`, or any of the other plaintext / Excel formats. It was a bit of a pain to use a database in that setting since we had to use `dbplyr` to translate `tidyverse`-like commands (`filter`, `select`, etc.) into SQL commands that the database understands. We didn't run into this problem in the worksheet, but not _all_ `tidyverse` commands can currently be translated with SQLite databases. For example, with an SQLite database, we can compute a mean, but can't easily compute a median.
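To make the translation step concrete, here is a minimal sketch of how `dbplyr` turns `tidyverse`-like commands into SQL. It assumes the `dplyr` and `dbplyr` packages are installed, and it uses a simulated SQLite backend rather than a real database; the table and column names (`toy_tbl`, `name`, `year`, `score`) are made up for illustration.

```r
library(dplyr)
library(dbplyr)

# A lazy table that only pretends to live in an SQLite database,
# so no real database file is needed. All names here are hypothetical.
toy_tbl <- lazy_frame(
    name = c("a", "b"),
    year = c(2017, 2018),
    score = c(0.5, 0.7),
    con = simulate_sqlite()
)

# Ordinary dplyr verbs; nothing is computed yet.
toy_query <- toy_tbl |>
    filter(year == 2017) |>
    select(name, score)

# Print the SQL that dbplyr generated from the verbs above.
show_query(toy_query)

# The same SQL as a plain character string.
toy_sql <- as.character(sql_render(toy_query))
```

The printed query should contain a `SELECT` and a `WHERE` clause corresponding to the `select` and `filter` calls, though the exact formatting depends on the `dbplyr` version.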

**Why should we bother with databases at all then?**

Databases become really useful in a large-scale setting:

- they enable storing large datasets across multiple computers with automatic redundancy and backups
- they enable multiple users to access them simultaneously and remotely without conflicts and errors 
- they provide mechanisms for ensuring data integrity and validating input
- they provide security to keep data safe

For example: there are around [4 billion](https://www.internetlivestats.com/google-search-statistics/) Google searches conducted daily as of 2019. Can you imagine if Google stored all of the data from those queries in a single `.csv` file!? Chaos would ensue. 

To reap the real benefits of databases, we'll need to move to a more fully-powered one: [PostgreSQL](https://www.postgresql.org/). We'll begin by loading the `DBI` and `dbplyr` packages that R uses to talk to databases, as well as the `RPostgres` package that provides the interface between these packages and PostgreSQL databases (note the similarity to the `RSQLite` package from `worksheet_reading`).

In [None]:
### Run this cell before continuing. 
library(dbplyr)
library(DBI)
library(RPostgres)
library(lubridate) # This package is used to convert different time/date formats.

### Investigating Trends in Crowdfunding

[Kickstarter](https://www.kickstarter.com/) is an online crowd-funding site where people can post projects they want to do, but don't have the financial resources required to fund the project on their own. Other users of Kickstarter can pledge money to the project (also called "backing" a project) to help the project become a reality. To persuade people to back a project, the project owner usually offers rewards to the "backers" for their help with funding, which they receive once funding reaches a particular amount.

In this section, we'll investigate how the amount of funding successful projects get has changed over time. We consider a project to be successful if the amount of funds pledged exceeded the goal.

**Question 3.0**
<br>{points: 1}

Databases are often stored *remotely* (i.e., not on your computer or on this JupyterHub). Your first task is to load the Kickstarter data from a PostgreSQL database stored remotely on the UBC statistics network.


URL: `"dsci-100-student.stat.ubc.ca"`

Port: `5432`

Username: `"dsci100"`

Password: `"dsci100"`

Database Name: `"kickstarter"`

Table Name: `"projects"`

We've provided the code to do this below. Replace each `...` with the appropriate item from the list above (the port is already filled in) or with one of the object names given below.

*Note: As this database will be used by the entire class, you will only have read access (no write permissions).*

*Assign the resulting database connection object to* `connection` *and the project table data to* `project_data`.
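The general connect-then-reference pattern can be sketched with a throwaway in-memory SQLite database instead of the remote PostgreSQL server. This assumes the `RSQLite` package is installed; the names `toy_connection` and `toy_projects` and the data in the table are hypothetical and unrelated to the Kickstarter database.

```r
library(DBI)
library(dplyr)
library(dbplyr)

# Open a connection to a temporary in-memory SQLite database (an assumption
# for illustration; the exercise itself uses RPostgres::Postgres()).
toy_connection <- dbConnect(RSQLite::SQLite(), ":memory:")

# Put a small made-up table into the database so there is something to read.
dbWriteTable(toy_connection, "toy_projects",
             data.frame(id = 1:3, goal = c(100, 250, 50)))

# List the tables available through this connection.
toy_tables <- dbListTables(toy_connection)

# Create a reference to the table; no rows are pulled into R yet.
toy_projects <- tbl(toy_connection, "toy_projects")

# Ask the database to count the rows, then bring the small result into R.
toy_row_count <- toy_projects |> count() |> collect()

# Close the connection when finished.
dbDisconnect(toy_connection)
```

A PostgreSQL connection follows the same shape, except that `dbConnect` also needs the host, port, user, and password of the remote server.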

In [None]:
# ... <- dbConnect(RPostgres::Postgres(), dbname = ...,
#                 host = ..., port = 5432,
#                 user = ..., password = ...)
# ... <- tbl(connection, ...)

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
library(digest)
stopifnot("connection should be a PqConnection database connection object"= setequal(digest(paste(toString('PqConnection' %in% class(connection)), "a503f")), "ba809102edf6df8661e6b6d2d86ca3bf"))
stopifnot("connection is the wrong database"= setequal(digest(paste(toString(dbListTables(connection)), "a503f")), "c0cba42ee2f6e91fb4a23e71b45f2f82"))

stopifnot("project_data should be a data frame"= setequal(digest(paste(toString('tbl_dbi' %in% class(project_data)), "a5040")), "8d73811cc554913f405348d7138bdd19"))
stopifnot("project_data does not contain the correct number of rows"= setequal(digest(paste(toString(collect(count(project_data))$n), "a5040")), "ce8679f8ffabdd9d25bea8d700fb8af6"))
stopifnot("project_data does not contain the correct columns"= setequal(digest(paste(toString(sort(colnames(project_data))), "a5040")), "6c7494cd7383664dedeab6eba5be0e4c"))

print('Success!')

We can now use the `colnames` function to see what columns are available in the `project_data` table.

In [None]:
colnames(project_data)

**Question 3.1**
<br> {points: 1}

If we want to compare pledged and goal amounts of funding over time for successful projects in the United States, which columns should we `select` from the table?

A. `id`, `slug`, `pledged`

B. `pledged`, `goal`, `deadline`, `country`

C. `pledged`, `usd_pledged`, `location_id`

D. `currency`, `state`, `country`, `goal`

_Assign your answer to an object called `answer3.1`. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. `"F"`)._

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
library(digest)
stopifnot("type of answer3.1 is not character"= setequal(digest(paste(toString(class(answer3.1)), "56172")), "cbf55ed4511fc9d7b576190c4ffba55d"))
stopifnot("length of answer3.1 is not correct"= setequal(digest(paste(toString(length(answer3.1)), "56172")), "fe162b877dbdb8f3e35b1707c9515de1"))
stopifnot("value of answer3.1 is not correct"= setequal(digest(paste(toString(tolower(answer3.1)), "56172")), "f94f96bafabd9c3c26491899576ff7a8"))
stopifnot("letters in string value of answer3.1 are correct but case is not correct"= setequal(digest(paste(toString(answer3.1), "56172")), "0df64c3fe898af6566c4685a7ac7f8b9"))

print('Success!')

**Question 3.2**
<br> {points: 1}

Now we'll visualize the data. In order to do this, we need to take the correct subset of data from the table and use `ggplot` to plot the result. Note that we make the scatter plot slightly transparent (using `alpha = 0.01` in the code below) because there is so much data that it would otherwise be hard to see anything (*overplotting*).
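To see why transparency helps with overplotting, here is a small sketch on simulated data. The `toy_points` data frame, its size, and the `alpha` value are assumptions chosen for illustration; they are not taken from the Kickstarter data.

```r
library(ggplot2)

# Simulate many overlapping points (made-up data).
set.seed(1)
toy_points <- data.frame(x = rnorm(20000), y = rnorm(20000))

# Fully opaque points: the dense centre becomes a solid blob.
opaque_plot <- ggplot(toy_points, aes(x = x, y = y)) +
    geom_point()

# Mostly transparent points: darker regions now indicate more data.
transparent_plot <- ggplot(toy_points, aes(x = x, y = y)) +
    geom_point(alpha = 0.05)

transparent_plot
```

With the transparent version, regions where many points overlap appear darker, so the density of the data becomes visible.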

In the cell below, you'll see some lines of code (currently commented out with `#` characters). **Remove the comments and rearrange these lines of code** to plot the ratio of pledged to goal funding as a function of project deadline date for all successful (where pledged funding is greater than goal funding) projects in the United States in the dataset. You don't need to add any new code, just reorder the lines we have given you.

*Note: there is a lot of data to plot here, so give it a moment to display!*

*Hint: you'll want to put all the dataframe manipulation functions first, and then the plotting functions afterward. Also note that some lines have a `+` at the end, meaning they're in the middle of the plotting code! To not be overwhelmed trying to solve all the code at once, focus on one step at a time and uncomment only the code needed to run that one step. When that step works, move on to the next.*


In [None]:
#     geom_point(alpha = 0.01) +
# funding_over_time_plot <- ggplot(prj, aes(x = as_datetime(deadline), y = pledged / goal)) +
#     ylab('Pledged Funding / Goal Funding')
# prj <- filter(prj_unfiltered, pledged > goal & country == "US")
#     scale_y_continuous(trans = 'log10', breaks = c(1, 10, 100, 1000)) +
#     xlab('Date') +
#     theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank()) +
# prj_unfiltered <- select(project_data, 'deadline', 'pledged', 'goal', 'country')

# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
library(digest)
stopifnot("prj_unfiltered should be a data frame"= setequal(digest(paste(toString('tbl_dbi' %in% class(prj_unfiltered)), "2264")), "d6f74863295f4b3ca8c180e75dd5b7f2"))
stopifnot("prj_unfiltered does not contain the correct number of rows"= setequal(digest(paste(toString(collect(count(prj_unfiltered))$n), "2264")), "451a997c1a94bc236f56572e0a1881c1"))
stopifnot("prj_unfiltered does not contain the correct columns"= setequal(digest(paste(toString(sort(colnames(prj_unfiltered))), "2264")), "813d898bcccbb87d8126a83f6bf02091"))

stopifnot("prj should be a data frame"= setequal(digest(paste(toString('tbl_dbi' %in% class(prj)), "2265")), "dc75d30544a9e1c8091a2f3d161ae546"))
stopifnot("prj does not contain the correct number of rows"= setequal(digest(paste(toString(collect(count(prj))$n), "2265")), "53ea596acf71ab4c86b6122e2ab74e9d"))
stopifnot("prj does not contain the correct columns"= setequal(digest(paste(toString(sort(colnames(prj))), "2265")), "38c80da0a977a38545b4673fde06dba4"))

stopifnot("type of plot is not correct (if you are using two types of geoms, try flipping the order of the geom objects!)"= setequal(digest(paste(toString(sapply(seq_len(length(funding_over_time_plot$layers)), function(i) {c(class(funding_over_time_plot$layers[[i]]$geom))[1]})), "2266")), "a79a81aa7eb3abda993f3560d27d2836"))
stopifnot("variable x is not correct"= setequal(digest(paste(toString(unlist(lapply(sapply(seq_len(length(funding_over_time_plot$layers)), function(i) {rlang::get_expr(c(funding_over_time_plot$layers[[i]]$mapping, funding_over_time_plot$mapping)$x)}), as.character))), "2266")), "fb057927c3d0a8b3077a9471ef832bf2"))
stopifnot("variable y is not correct"= setequal(digest(paste(toString(unlist(lapply(sapply(seq_len(length(funding_over_time_plot$layers)), function(i) {rlang::get_expr(c(funding_over_time_plot$layers[[i]]$mapping, funding_over_time_plot$mapping)$y)}), as.character))), "2266")), "278723c529555286a68d4fcd123f8692"))
stopifnot("x-axis label is not descriptive, nicely formatted, or human readable"= setequal(digest(paste(toString(rlang::get_expr(c(funding_over_time_plot$layers[[1]]$mapping, funding_over_time_plot$mapping)$x)!= funding_over_time_plot$labels$x), "2266")), "aa32fd6c6d063d60eac18255ef5b654c"))
stopifnot("y-axis label is not descriptive, nicely formatted, or human readable"= setequal(digest(paste(toString(rlang::get_expr(c(funding_over_time_plot$layers[[1]]$mapping, funding_over_time_plot$mapping)$y)!= funding_over_time_plot$labels$y), "2266")), "aa32fd6c6d063d60eac18255ef5b654c"))
stopifnot("incorrect colour variable in funding_over_time_plot, specify a correct one if required"= setequal(digest(paste(toString(rlang::get_expr(c(funding_over_time_plot$layers[[1]]$mapping, funding_over_time_plot$mapping)$colour)), "2266")), "924c1bdd27da4b90e4455fc14c02ff1f"))
stopifnot("incorrect shape variable in funding_over_time_plot, specify a correct one if required"= setequal(digest(paste(toString(rlang::get_expr(c(funding_over_time_plot$layers[[1]]$mapping, funding_over_time_plot$mapping)$shape)), "2266")), "924c1bdd27da4b90e4455fc14c02ff1f"))
stopifnot("the colour label in funding_over_time_plot is not descriptive, nicely formatted, or human readable"= setequal(digest(paste(toString(rlang::get_expr(c(funding_over_time_plot$layers[[1]]$mapping, funding_over_time_plot$mapping)$colour) != funding_over_time_plot$labels$colour), "2266")), "924c1bdd27da4b90e4455fc14c02ff1f"))
stopifnot("the shape label in funding_over_time_plot is not descriptive, nicely formatted, or human readable"= setequal(digest(paste(toString(rlang::get_expr(c(funding_over_time_plot$layers[[1]]$mapping, funding_over_time_plot$mapping)$colour) != funding_over_time_plot$labels$shape), "2266")), "924c1bdd27da4b90e4455fc14c02ff1f"))
stopifnot("fill variable in funding_over_time_plot is not correct"= setequal(digest(paste(toString(quo_name(funding_over_time_plot$mapping$fill)), "2266")), "042710115aa4669b595d9321db8abfe4"))
stopifnot("fill label in funding_over_time_plot is not informative"= setequal(digest(paste(toString((quo_name(funding_over_time_plot$mapping$fill) != funding_over_time_plot$labels$fill)), "2266")), "924c1bdd27da4b90e4455fc14c02ff1f"))
stopifnot("position argument in funding_over_time_plot is not correct"= setequal(digest(paste(toString(class(funding_over_time_plot$layers[[1]]$position)[1]), "2266")), "8e51ae714cea683dab5687ff8971ec15"))

stopifnot("funding_over_time_plot$data should be a data frame"= setequal(digest(paste(toString('data.frame' %in% class(funding_over_time_plot$data)), "2267")), "7678dfec9f04901d6786781806060f62"))
stopifnot("dimensions of funding_over_time_plot$data are not correct"= setequal(digest(paste(toString(dim(funding_over_time_plot$data)), "2267")), "706666b1a15e7647bdf67b51f8d1e680"))
stopifnot("column names of funding_over_time_plot$data are not correct"= setequal(digest(paste(toString(sort(colnames(funding_over_time_plot$data))), "2267")), "ce55da4b6cfdd55175a19b47bf0be906"))
stopifnot("types of columns in funding_over_time_plot$data are not correct"= setequal(digest(paste(toString(sort(unlist(sapply(funding_over_time_plot$data, class)))), "2267")), "e2d271d248a1f898e20a5c6da1afd4b5"))
stopifnot("values in one or more numerical columns in funding_over_time_plot$data are not correct"= setequal(digest(paste(toString(if (any(sapply(funding_over_time_plot$data, is.numeric))) sort(round(sapply(funding_over_time_plot$data[, sapply(funding_over_time_plot$data, is.numeric)], sum, na.rm = TRUE), 2)) else 0), "2267")), "adc3110574fd6fa1f7bf66f61461c8ec"))
stopifnot("values in one or more character columns in funding_over_time_plot$data are not correct"= setequal(digest(paste(toString(if (any(sapply(funding_over_time_plot$data, is.character))) sum(sapply(funding_over_time_plot$data[sapply(funding_over_time_plot$data, is.character)], function(x) length(unique(x)))) else 0), "2267")), "d2a9c18ad6a71f582af20790a06d50dc"))
stopifnot("values in one or more factor columns in funding_over_time_plot$data are not correct"= setequal(digest(paste(toString(if (any(sapply(funding_over_time_plot$data, is.factor))) sum(sapply(funding_over_time_plot$data[, sapply(funding_over_time_plot$data, is.factor)], function(col) length(unique(col)))) else 0), "2267")), "bc4c0d9c5229ce602e70406c54235b20"))

print('Success!')

**Question 3.3**
<br> {points: 3}

Is there a relationship between the ratio of pledged/goal funding and time? If so, describe it.

Additionally, mention a pattern in the data or a characteristic of it that you may not have expected in advance.


DOUBLE CLICK TO EDIT **THIS CELL** AND REPLACE THIS TEXT WITH YOUR ANSWER.

**Question 3.4**
<br> {points: 1}

Finally, we'll save the project data to a local file in the `data/` folder called `project_data.csv`. Recall that we don't want to try to download and save the *entire dataset* (way too much data!) from the database, but only the `tbl` object named `prj`. So you will need to use the `collect` function followed by the appropriate `write_*` function.

*Assign the output of `collect` to an object called `project_df`.*
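As a rough sketch of the collect-then-write pattern, the example below uses a throwaway in-memory SQLite table rather than the Kickstarter database. It assumes the `RSQLite` and `readr` packages are installed; every name in it (`toy_connection`, `toy_projects`, `toy_funded`, `toy_funded_df`, `toy_path`) is hypothetical, and the file is written to a temporary location rather than the `data/` folder.

```r
library(DBI)
library(dplyr)
library(dbplyr)
library(readr)

# Temporary in-memory database with a small made-up table.
toy_connection <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(toy_connection, "toy_projects",
             data.frame(goal = c(100, 250, 50), pledged = c(150, 200, 75)))

# Still a lazy database table: the filter runs inside the database.
toy_funded <- tbl(toy_connection, "toy_projects") |>
    filter(pledged > goal)

# collect() pulls only the filtered rows into R as a regular data frame.
toy_funded_df <- collect(toy_funded)

# Write the collected data frame to a temporary .csv file.
toy_path <- tempfile(fileext = ".csv")
write_csv(toy_funded_df, toy_path)

dbDisconnect(toy_connection)
```

Filtering before calling `collect` means only the rows that are actually needed travel from the database to your computer.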

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer

In [None]:
library(digest)
stopifnot("project_df should be a data frame"= setequal(digest(paste(toString('data.frame' %in% class(project_df)), "9bb9d")), "26001e03cf64c3beb886023de9a837cc"))
stopifnot("dimensions of project_df are not correct"= setequal(digest(paste(toString(dim(project_df)), "9bb9d")), "edff5185408e46dc935ee906ecfde3b9"))
stopifnot("column names of project_df are not correct"= setequal(digest(paste(toString(sort(colnames(project_df))), "9bb9d")), "5650fe6a9da69292ff138487aff722d6"))
stopifnot("types of columns in project_df are not correct"= setequal(digest(paste(toString(sort(unlist(sapply(project_df, class)))), "9bb9d")), "b41c4b1935e4cd321a07aab66fc41874"))
stopifnot("values in one or more numerical columns in project_df are not correct"= setequal(digest(paste(toString(if (any(sapply(project_df, is.numeric))) sort(round(sapply(project_df[, sapply(project_df, is.numeric)], sum, na.rm = TRUE), 2)) else 0), "9bb9d")), "0ffc80028ef23eab6f505d4549a78399"))
stopifnot("values in one or more character columns in project_df are not correct"= setequal(digest(paste(toString(if (any(sapply(project_df, is.character))) sum(sapply(project_df[sapply(project_df, is.character)], function(x) length(unique(x)))) else 0), "9bb9d")), "c8af8224db3c35bc29b3cfe31ed50e23"))
stopifnot("values in one or more factor columns in project_df are not correct"= setequal(digest(paste(toString(if (any(sapply(project_df, is.factor))) sum(sapply(project_df[, sapply(project_df, is.factor)], function(col) length(unique(col)))) else 0), "9bb9d")), "8deb4828e1e23159f5d3b0254a4d1d0a"))

stopifnot("type of file.exists('data/project_data.csv') is not logical"= setequal(digest(paste(toString(class(file.exists('data/project_data.csv'))), "9bb9e")), "12cfc8caef051f80f281979451d5ddd1"))
stopifnot("logical value of file.exists('data/project_data.csv') is not correct"= setequal(digest(paste(toString(file.exists('data/project_data.csv')), "9bb9e")), "6d72cc04d7002212738d1b8c7cea3885"))

stopifnot("read_csv('data/project_data.csv', show_col_types = FALSE) should be a data frame"= setequal(digest(paste(toString('data.frame' %in% class(read_csv('data/project_data.csv', show_col_types = FALSE))), "9bb9f")), "76af0d06ae13396a8a028892106ff904"))
stopifnot("dimensions of read_csv('data/project_data.csv', show_col_types = FALSE) are not correct"= setequal(digest(paste(toString(dim(read_csv('data/project_data.csv', show_col_types = FALSE))), "9bb9f")), "418f44515b6bbe54944a0e802c3d0c8f"))
stopifnot("column names of read_csv('data/project_data.csv', show_col_types = FALSE) are not correct"= setequal(digest(paste(toString(sort(colnames(read_csv('data/project_data.csv', show_col_types = FALSE)))), "9bb9f")), "a3908f4894b9e22b5a4ee6fb528a95f7"))
stopifnot("types of columns in read_csv('data/project_data.csv', show_col_types = FALSE) are not correct"= setequal(digest(paste(toString(sort(unlist(sapply(read_csv('data/project_data.csv', show_col_types = FALSE), class)))), "9bb9f")), "0a8bbafe75664dd5f3470181e2d4587b"))
stopifnot("values in one or more numerical columns in read_csv('data/project_data.csv', show_col_types = FALSE) are not correct"= setequal(digest(paste(toString(if (any(sapply(read_csv('data/project_data.csv', show_col_types = FALSE), is.numeric))) sort(round(sapply(read_csv('data/project_data.csv', show_col_types = FALSE)[, sapply(read_csv('data/project_data.csv', show_col_types = FALSE), is.numeric)], sum, na.rm = TRUE), 2)) else 0), "9bb9f")), "f7e9e52832662092685eb6cc327816a0"))
stopifnot("values in one or more character columns in read_csv('data/project_data.csv', show_col_types = FALSE) are not correct"= setequal(digest(paste(toString(if (any(sapply(read_csv('data/project_data.csv', show_col_types = FALSE), is.character))) sum(sapply(read_csv('data/project_data.csv', show_col_types = FALSE)[sapply(read_csv('data/project_data.csv', show_col_types = FALSE), is.character)], function(x) length(unique(x)))) else 0), "9bb9f")), "4966bbbed7000f199e506e7fb1b5cd82"))
stopifnot("values in one or more factor columns in read_csv('data/project_data.csv', show_col_types = FALSE) are not correct"= setequal(digest(paste(toString(if (any(sapply(read_csv('data/project_data.csv', show_col_types = FALSE), is.factor))) sum(sapply(read_csv('data/project_data.csv', show_col_types = FALSE)[, sapply(read_csv('data/project_data.csv', show_col_types = FALSE), is.factor)], function(col) length(unique(col)))) else 0), "9bb9f")), "d94ee172da04e6c8e958fedf0d088767"))

print('Success!')

## 4 (Optional). Reading Data from the Internet

**Question 4.0**
<br> {points: 0}

More practice scraping! To keep ourselves out of legal hot water, we will get more practice scraping data using a website that was created for that purpose: http://books.toscrape.com/

Your task here is to scrape the prices of the science fiction novels on [this page](http://books.toscrape.com/catalogue/category/books/science-fiction_16/index.html) and determine the maximum, minimum and average price of science fiction novels at this bookstore. Tidy up and nicely present your results by creating a data frame called `sci_fi_stats` that has 2 columns, one called `stats` that contains the words `max`, `min` and `mean` and one called `value` that contains the calculated value for each of these.

The functions for maximum, minimum and average in R are listed in the table below:

| Calculation to perform | Function in R |
| ---------------------- | ------------- |
| maximum                | `max`         |
| minimum                | `min`         |
| average                | `mean`        |

Some other helpful hints:
- If you end up scraping some characters other than numbers, you will have to use `str_replace_all` from the `stringr` library to remove them (similar to what we did with the commas in `worksheet_reading`).
- Use `as.numeric` to convert your character type numbers to numeric type numbers before you pass them into the `max`, `min` and `mean` functions.
- If you have `NA` values in your objects that you need to pass into the `max`, `min` and `mean` functions, you will need to set the `na.rm` argument in these functions to `TRUE`.
- Use the function `c` to create the vectors that will go in your data frame. For example, to create a vector with the values 10, 16 and 13 named `ages`, we would type: `ages <- c(10, 16, 13)`.
- Use the function `tibble` to create the data frame from your vectors.
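The cleaning hints above can be sketched on a few made-up strings (no scraping involved). The strings in `toy_raw`, the regular expression, and the names `toy_numbers` and `toy_summary` are assumptions for illustration only; the text you scrape may need a different pattern.

```r
library(stringr)
library(tibble)

# Made-up strings that mix numbers with other characters.
toy_raw <- c("Price: $12.50", "Price: $7.00", "Price: $30.25")

# Remove everything that is not a digit or a decimal point,
# then convert the remaining text to numbers.
toy_numbers <- as.numeric(str_replace_all(toy_raw, "[^0-9.]", ""))

# Build a small data frame from two vectors created with c().
toy_summary <- tibble(
    measure = c("largest", "smallest"),
    amount = c(max(toy_numbers, na.rm = TRUE), min(toy_numbers, na.rm = TRUE))
)

toy_summary
```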

In [None]:
# your code here
fail() # No Answer - remove if you provide an answer
sci_fi_stats

**Question 4.1**
<br> {points: 0}

In `worksheet_reading` you had practice scraping data from the web. Now that you have the skills, should you scrape that website you have been dreaming of harvesting data from? Maybe, maybe not... You should check the website's Terms of Service first and consider the application you have planned for the data after you scrape it.

List 3 websites you might be interested in scraping data from (for fun, profit, or research/education). List their URLs as part of your answer. For each website, search for their Terms of Service page. Take note if such a page exists, and if it does, provide the link to it and tell us whether or not they allow web scraping of their website.

You can list them in this cell! Double click to edit.

### Bonus/optional additional readings on legalities of web scraping:

Here are two news stories from 2018 about web scraping and its legal implications:

- [D.C. Court: Accessing Public Information is Not a Computer Crime](https://www.eff.org/deeplinks/2018/04/dc-court-accessing-public-information-not-computer-crime)

- [Dear Canada: Accessing Publicly Available Information on the Internet Is Not a Crime](https://www.eff.org/deeplinks/2018/04/dear-canada-accessing-publicly-available-information-internet-not-crime)

In [None]:
source("cleanup.R")