# Class Activity - Intro to the Tidyverse (R-II)

## Instructions

In this module, you will work with tidyverse to practice formatting, manipulating, and summarizing data. For that, please load the tidyverse library (note that every time you start your notebook, you'll have to reload libraries):

In [None]:
library(tidyverse)

We're going to work with COVID-19 state-wise data that you'll use later (for more plotting purposes). We'll convert the date to a "computer parsable" version of date.

Execute the commands below:

In [None]:
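# Read the state-level COVID-19 history file (read_csv() already returns a tibble)
data <- read_csv(file = "all-states-history.csv")
# Keep an explicit tibble copy under the name used in the rest of the activity
data_tbl <- as_tibble(data)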
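
As a minimal sketch of what a "computer parsable" date means, the toy example below converts a text column to R's `Date` class. The column name `date` and its year-month-day text format are assumptions for illustration; in practice `read_csv()` often does this conversion automatically, so it is worth checking with `class()` before converting anything.

```r
library(dplyr)
library(tibble)

# Toy data standing in for the real file; the `date` column name and its
# text format are assumptions, not taken from all-states-history.csv.
toy <- tibble(
  state = c("PA", "PA"),
  date = c("2020-10-14", "2020-10-15")
)

# Convert the text column to the Date class so that comparisons such as
# `<` work on calendar dates rather than on character strings.
toy <- toy %>%
  mutate(date = as.Date(date, format = "%Y-%m-%d"))

class(toy$date)
```
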

| Group | State        | State Nickname       | State Code |
|-------|--------------|----------------------|------------|
| 1     | Pennsylvania | Keystone State       | PA         |
| 2     | California   | Golden State         | CA         |
| 3     | New York     | Empire State         | NY         |
| 4     | Louisiana    | Pelican State        | LA         |
| 5     | South Dakota | Mount Rushmore State | SD         |



For the purposes of Q1 - Q4, select a specific state (you can choose from the table above) that you are most interested in.

Let's compare the frequency of COVID-19 mortality in your chosen state for:
   - The 2020 "Holiday" wave, which I will roughly define here as between 10-15-2020 and 01-15-2021
   - The 2020 "Pre Holiday" wave, which I define here as records earlier than 10-15-2020

**Q1.** Create a variable named `pre` that contains:

- entries from your state only
- entries dated earlier than 10-15-2020
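
The two conditions above can be sketched on a small made-up table. This is only an illustration of `filter()`, not the answer on the real data: the toy values are invented, and the `date` column name is an assumption.

```r
library(dplyr)
library(tibble)

# Invented toy data; `date` as a column name is an assumption.
toy <- tibble(
  state = c("PA", "PA", "PA", "CA"),
  date = as.Date(c("2020-09-01", "2020-10-15", "2020-12-01", "2020-09-01")),
  deathIncrease = c(10, 20, 30, 40)
)

# Keep rows for one state AND dated strictly before the cutoff.
# Conditions separated by commas in filter() are combined with AND.
toy_pre <- toy %>%
  filter(state == "PA", date < as.Date("2020-10-15"))

toy_pre
```
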

Provide (and execute) your code below:



**Q2.** Create a variable named `post` that contains:
- entries from your state only
- entries dated between 10-15-2020 and 01-15-2021
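
A date window can be sketched with `between()` from dplyr, shown here on invented toy data. Note that `between()` includes both endpoints; whether the endpoints should count for this activity is an assumption you may want to decide for yourself.

```r
library(dplyr)
library(tibble)

# Invented toy data; `date` as a column name is an assumption.
toy <- tibble(
  state = "PA",
  date = as.Date(c("2020-10-14", "2020-10-15", "2020-12-01",
                   "2021-01-15", "2021-01-16")),
  deathIncrease = c(1, 2, 3, 4, 5)
)

# between(x, left, right) is shorthand for x >= left & x <= right.
toy_post <- toy %>%
  filter(state == "PA",
         between(date, as.Date("2020-10-15"), as.Date("2021-01-15")))

toy_post
```
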

Provide (and execute) your code below:

**Q3.** Take the code you just made and extend it to summarize the **mean number of mortalities per day** (i.e., column named `deathIncrease`), for

- the "pre" Holiday time period
- the Holiday wave time period

If you copy the code you created for Q1 and Q2, you'll overwrite `pre` and `post`. **This is completely fine in this case**.

The point of this exercise is for you to get used to extending a little bit of code *modestly*, running it, checking the output, etc.
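
As an illustration of extending a pipeline by one step, the sketch below adds `summarize()` to a filter on invented toy data. The output column name `mean_deaths` is hypothetical, and `na.rm = TRUE` is an assumption about how missing values should be handled.

```r
library(dplyr)
library(tibble)

# Invented toy data; `date` as a column name is an assumption.
toy <- tibble(
  state = c("PA", "PA", "PA", "CA"),
  date = as.Date(c("2020-09-01", "2020-09-02", "2020-12-01", "2020-09-01")),
  deathIncrease = c(10, 20, 30, 40)
)

# Same filter as before, extended by a single summarize() step that
# collapses the remaining rows into one mean value.
toy_pre_mean <- toy %>%
  filter(state == "PA", date < as.Date("2020-10-15")) %>%
  summarize(mean_deaths = mean(deathIncrease, na.rm = TRUE))

toy_pre_mean
```

The result is a one-row table. If a plain number is more convenient for later arithmetic, `pull(toy_pre_mean, mean_deaths)` extracts it.
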

Provide (and execute) your code below:



**Q4.** Using `pre` and `post`, use simple R commands to calculate:

- the difference between `post` and `pre` (i.e., post minus pre)
- the ratio of post to pre (i.e., `post / pre`)
- the difference above per 100K people: `1e5 * ((post - pre) / state_size )`

You can find the size of the states in a file called "2010_census_bystate.txt"
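
The three calculations can be sketched with plain numbers. All values here are made up for illustration, including `state_size`; they are not taken from the COVID-19 data or the census file.

```r
# Made-up mean daily deaths for the two periods (not real data)
pre <- 15
post <- 45

# Hypothetical population size for the state
state_size <- 1e6

# Difference, ratio, and difference per 100K people
diff_post_pre <- post - pre
ratio_post_pre <- post / pre
diff_per_100k <- 1e5 * ((post - pre) / state_size)

diff_post_pre
ratio_post_pre
diff_per_100k
```
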

Provide (and execute) your code below:






#######################

In the questions above, you did this for a single state. But now, let's unlock *the power of R* -- let's generate these summaries for EVERY STATE. 

**Q5.** Create a variable named `pre_allstates` which:

- contains entries for ALL states
- contains entries dated earlier than 10-15-2020
- summarizes the mean number of mortalities per day **by state**
- HINT: your table should have one row per state

You can and should reuse the code you used for **Q1 and Q3** here (plus a new variable name)! You can do this with a single-line change in code.
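
One way to get a per-state summary is to swap the state filter for `group_by()`. The sketch below shows the idea on invented toy data; the `date` column name and the output name `pre` are assumptions for illustration.

```r
library(dplyr)
library(tibble)

# Invented toy data; `date` as a column name is an assumption.
toy <- tibble(
  state = c("PA", "PA", "CA", "CA", "PA"),
  date = as.Date(c("2020-09-01", "2020-09-02", "2020-09-01",
                   "2020-09-02", "2020-12-01")),
  deathIncrease = c(10, 20, 40, 60, 100)
)

# group_by() makes summarize() return one row per state instead of
# one row for the whole table.
toy_pre_allstates <- toy %>%
  filter(date < as.Date("2020-10-15")) %>%
  group_by(state) %>%
  summarize(pre = mean(deathIncrease, na.rm = TRUE))

toy_pre_allstates
```
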

Provide (and execute) your code below:



**Q6.** Create a variable named `post_allstates` which:

- contains entries for ALL states
- contains entries dated between 10-15-2020 and 01-15-2021
- summarizes the mean number of mortalities per day **by state**
- HINT: your table should have one row per state

You can and should reuse the code you used for **Q2 and Q3** here (plus a new variable name)! You can do this with a single-line change in code.

Provide (and execute) your code below:



**Q7.** Create a new variable called `comp_allstates` that:
- joins (full) `pre_allstates` and `post_allstates` by "state" (i.e., the state abbreviation)
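
A full join keeps every state that appears in either table and fills the gaps with `NA`. The sketch below uses two invented tables whose columns `pre` and `post` are hypothetical names.

```r
library(dplyr)
library(tibble)

# Invented per-state summaries; column names `pre` and `post` are assumptions.
toy_pre_allstates <- tibble(state = c("PA", "CA"), pre = c(15, 50))
toy_post_allstates <- tibble(state = c("PA", "NY"), post = c(45, 80))

# full_join() matches rows on `state` and keeps unmatched rows from both sides.
toy_comp <- full_join(toy_pre_allstates, toy_post_allstates, by = "state")

toy_comp
```

Here `CA` has no `post` value and `NY` has no `pre` value, so those cells are `NA` in the joined table.
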

Provide (and execute) your code below:

**Q8.** Create summaries as you did in **Q4** for ALL states:
- the difference between post and pre (i.e., `post - pre`)
- the ratio of post to pre (i.e., `post / pre`)
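
With `mutate()`, the same arithmetic is applied to every row at once. This is a sketch on an invented table; the column names `pre`, `post`, `diff`, and `ratio` are hypothetical.

```r
library(dplyr)
library(tibble)

# Invented joined table with one row per state
toy_comp <- tibble(
  state = c("PA", "CA"),
  pre = c(15, 50),
  post = c(45, 70)
)

# mutate() adds new columns computed row by row from existing ones.
toy_comp <- toy_comp %>%
  mutate(
    diff = post - pre,
    ratio = post / pre
  )

toy_comp
```
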

(hint: mutate!) Provide (and execute) your code below:

**Q9.** In your directory, we have provided you a file of state population sizes from the 2010 census: 

**2010_census_bystate.txt**

- load this data set into R. This file has a header row, and its columns are separated by the tab character, i.e., "\t".
- merge this with the variable **comp_allstates** you created in **Q7**
- calculate the difference above per 100K people: `1e5 * ((post - pre) / state_size )` for all states
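
The load, merge, and per-100K steps can be sketched as follows. To keep the example self-contained, it reads tab-separated text from a string instead of the real file, and the column names `state` and `population` are assumptions; check the actual header of the census file before reusing this. For the real file, `read_tsv()` from readr is one option.

```r
library(dplyr)
library(tibble)

# Tab-separated text standing in for the census file (values are made up,
# and the column names are assumptions).
census_text <- "state\tpopulation\nPA\t1000000\nCA\t2000000\n"
toy_pop <- read.delim(text = census_text, sep = "\t", header = TRUE,
                      stringsAsFactors = FALSE)

# Invented comparison table with one row per state
toy_comp <- tibble(
  state = c("PA", "CA"),
  pre = c(15, 50),
  post = c(45, 70)
)

# Attach the population to each state, then scale the difference per 100K people.
toy_comp_pop <- toy_comp %>%
  left_join(toy_pop, by = "state") %>%
  mutate(diff_per_100k = 1e5 * ((post - pre) / population))

toy_comp_pop
```
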

(hint: mutate!) Provide (and execute) your code below:



Congrats! You've just made your first tidyverse data parsing job... and maybe even a pipeline! As a good habit, report the details of the R session you used.

Execute the below command:

In [None]:
sessionInfo()