# GCB535: Data Parsing - III 

## Instructions

In this *adventure*, you are going to work with the tidyverse to practice formatting, manipulating, and summarizing data. To do that, please load the tidyverse library (note that every time you start your notebook, you'll have to reload libraries):

In [None]:
library(tidyverse)

We're going to work with the COVID-19 state-wise data that you've been making various plots with over the last few days, and we'll convert the date to a "computer parsable" version of date.
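
As a quick illustration of what "computer parsable" buys us, here is a minimal sketch using base R's `as.Date`. The variable names (`d1`, `d2`, and so on) are made up for this example. Note that the text of this notebook writes dates as month-day-year, while the data file stores them as year-month-day.

```r
# Parse an ISO-style date string (year-month-day), as stored in the data file
d1 <- as.Date("2020-10-15", format = "%Y-%m-%d")

# Parse the same day written month-day-year, as in the text of this notebook
d2 <- as.Date("10-15-2020", format = "%m-%d-%Y")

# Date objects support comparison and subtraction
same_day   <- d1 == d2
is_earlier <- as.Date("2020-09-01") < d1
n_days     <- as.numeric(as.Date("2021-01-15") - d1)  # days between the two dates
```

Once a column holds `Date` objects rather than strings, comparisons such as `<` and `>=` behave chronologically, which is what the filtering questions below rely on.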

Execute the commands below:

In [None]:
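# Read the state-level COVID-19 history (the first row holds the column names)
data <- read.table(file="../17_R_ggplot2-I/all-states-history.csv", header=TRUE)
# Convert the data frame to a tibble for use with tidyverse verbs
data_tbl <- as_tibble(data)
# Parse the date strings (YYYY-MM-DD) into Date objects so they can be compared
data_tbl$date <- as.Date(data_tbl$date, format = '%Y-%m-%d')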

For Q1, Q2, Q3, and Q4, each group will focus its analysis on its assigned state:

| Group | State        | State Nickname       | State Code |
|-------|--------------|----------------------|------------|
| 1     | Pennsylvania | Keystone State       | PA         |
| 2     | California   | Golden State         | CA         |
| 3     | New York     | Empire State         | NY         |
| 4     | Louisiana    | Pelican State        | LA         |
| 5     | South Dakota | Mount Rushmore State | SD         |

During the breakout session, work as a group to complete as many of the tasks as you have time for!

#######################

Let's compare the frequency of COVID-19 mortality in your assigned state during two periods:
   - The "Holiday" wave, which I will roughly define here as between 10-15-2020 and 1-15-2021
   - The "Pre Holiday" wave, which I define here as records earlier than 10-15-2020

**Q1.** Create a variable named **pre** that keeps only:
- entries from your state
- entries dated earlier than 10-15-2020
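
If the filtering pattern is unfamiliar, here is a minimal sketch on made-up data. The tibble `toy`, the variable `cutoff`, and all values are hypothetical stand-ins, not the real `data_tbl`. It assumes "earlier than" means strictly before the cutoff date.

```r
library(dplyr)  # loaded as part of the tidyverse

# Hypothetical toy data standing in for data_tbl
toy <- tibble(
  date = as.Date(c("2020-09-01", "2020-10-14", "2020-10-15",
                   "2020-12-01", "2020-09-01")),
  state = c("PA", "PA", "PA", "PA", "NY"),
  deathIncrease = c(10, 20, 30, 40, 50)
)

cutoff <- as.Date("2020-10-15")

# Keep one state's rows dated strictly before the cutoff
toy_pre <- toy %>%
  filter(state == "PA", date < cutoff)
```

Separating conditions with a comma inside `filter()` requires all of them to hold, so only the two PA rows before 2020-10-15 remain.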

Provide (and execute) your code below:

**Q2.** Create a variable named **post** that keeps only:
- entries from your state
- entries dated between 10-15-2020 and 01-15-2021
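
A date window can be expressed as two comparisons. The sketch below uses made-up data (`toy`, `start`, and `end` are hypothetical names) and assumes both endpoints are included, which the question does not state explicitly.

```r
library(dplyr)  # loaded as part of the tidyverse

# Hypothetical toy data standing in for data_tbl
toy <- tibble(
  date = as.Date(c("2020-10-14", "2020-10-15", "2020-12-01",
                   "2021-01-15", "2021-01-16", "2020-12-01")),
  state = c("PA", "PA", "PA", "PA", "PA", "NY"),
  deathIncrease = c(5, 10, 20, 30, 40, 50)
)

start <- as.Date("2020-10-15")
end   <- as.Date("2021-01-15")

# Keep one state's rows inside the window (endpoints included by assumption)
toy_post <- toy %>%
  filter(state == "PA", date >= start, date <= end)
```

Switching `>=` to `>` (or `<=` to `<`) would exclude the corresponding endpoint, so it is worth deciding which definition your group wants.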

Provide (and execute) your code below:

**Q3.** Take the code you just made and extend it to summarize the **mean number of mortalities per day** (i.e., column named **deathIncrease**), for
- the "pre" Holiday time period
- the Holiday wave time period

If you copy the code you created for Q1 and Q2, you'll overwrite "pre" and "post". **This is completely fine in this case**. 

The point of this exercise is for you to get used to extending a little bit of code *modestly*, running it, checking the output, etc.
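
As one way to picture "extending a little bit of code", the sketch below adds a `summarize()` step to the end of a filter pipeline. The data and the names `toy` and `toy_mean` are made up. The `na.rm = TRUE` argument is an assumption about how you may want to treat missing values, not something the question requires.

```r
library(dplyr)  # loaded as part of the tidyverse

# Hypothetical toy data standing in for data_tbl
toy <- tibble(
  date = as.Date(c("2020-09-01", "2020-10-14", "2020-12-01")),
  state = c("PA", "PA", "PA"),
  deathIncrease = c(10, 20, 40)
)

# Same filter as before, extended by one step that collapses rows to a mean
toy_mean <- toy %>%
  filter(state == "PA", date < as.Date("2020-10-15")) %>%
  summarize(pre = mean(deathIncrease, na.rm = TRUE))
```

The result is a one-row tibble with a single column named `pre`, so it is easy to check by eye after each change.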

Provide (and execute) your code below:

**Q4.** Using **pre** and **post**, use simple R commands to calculate:
- the difference between **post** and **pre** (i.e., post minus pre)
- the ratio of post to pre (i.e., post / pre)
- the difference above per 100K people: 1e5 * ((post - pre) / state_size )

For ease, I've placed the population sizes of the states in question (from the 2010 census) in the table below.
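
To make the three formulas concrete, here is the arithmetic on made-up numbers. The values of `pre_toy`, `post_toy`, and `state_size_toy` are hypothetical and do not correspond to any real state.

```r
# Hypothetical mean deaths per day in each period
pre_toy  <- 20
post_toy <- 50

# Hypothetical population size
state_size_toy <- 1e6

diff_toy          <- post_toy - pre_toy                           # post minus pre
ratio_toy         <- post_toy / pre_toy                           # post over pre
diff_per_100K_toy <- 1e5 * ((post_toy - pre_toy) / state_size_toy)  # per 100K people
```

Scaling by population is what lets you compare a large state with a small one on the same footing.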

Provide (and execute) your code below:

You can record your results (by hand) in the table below. Double-click the cell if you'd like to fill it in as we go through the discussion:

| Group | State        | State Nickname       | State Code | Size      | diff | ratio | diff_per_100K |
|-------|--------------|----------------------|------------|-----------|------|-------|---------------|
| 1     | Pennsylvania | Keystone State       | PA         | 12702379  |      |       |               |
| 2     | California   | Golden State         | CA         | 37253956  |      |       |               |
| 3     | New York     | Empire State         | NY         | 19378102  |      |       |               |
| 4     | Louisiana    | Pelican State        | LA         | 4533372   |      |       |               |
| 5     | South Dakota | Mount Rushmore State | SD         | 814180    |      |       |               |

#######################

In the questions above, you did this one state at a time. But now, let's unlock *the power of R* and generate these summaries for EVERY STATE.

**Q5.** Create a variable named **pre_allstates** that keeps:
- entries for ALL states
- only entries dated earlier than 10-15-2020

You can and should reuse the code you used for **Q1** here (plus a new variable name)! You can do this with a single line change in code.
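
One common way to go from a single state to every state is to group rows instead of filtering them. The sketch below shows that idea on made-up data. It is one possible reading of the "single line change", not necessarily the intended one, and `toy` and `toy_pre_all` are hypothetical names.

```r
library(dplyr)  # loaded as part of the tidyverse

# Hypothetical toy data standing in for data_tbl
toy <- tibble(
  date = as.Date(c("2020-09-01", "2020-10-01", "2020-12-01",
                   "2020-09-01", "2020-10-01", "2020-12-01")),
  state = c("PA", "PA", "PA", "NY", "NY", "NY"),
  deathIncrease = c(10, 20, 90, 50, 70, 90)
)

# group_by(state) replaces the single-state filter; summarize() then
# returns one row per state instead of one row overall
toy_pre_all <- toy %>%
  group_by(state) %>%
  filter(date < as.Date("2020-10-15")) %>%
  summarize(pre = mean(deathIncrease, na.rm = TRUE))
```

The output has one row per state with columns `state` and `pre`, which is a convenient shape for joining later.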

Provide (and execute) your code below:

**Q6.** Create a variable named **post_allstates** that keeps:
- entries for ALL states
- only entries dated between 10-15-2020 and 01-15-2021

You can and should reuse the code you used for **Q2** here (plus a new variable name)! You can do this with a single line change in code. 

Provide (and execute) your code below:

**Q7.** Create a new variable called **comp_allstates** that:
- joins (full) **pre_allstates** and **post_allstates** by "state" (i.e., the state abbreviation)
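
For reference, here is a minimal sketch of a full join on made-up tables. The names `pre_toy`, `post_toy`, and `comp_toy` are hypothetical. A full join keeps every state that appears in either table and fills in `NA` where one side has no match.

```r
library(dplyr)  # loaded as part of the tidyverse

# Hypothetical per-state summaries; NY is missing from post, SD from pre
pre_toy  <- tibble(state = c("PA", "NY"), pre  = c(15, 60))
post_toy <- tibble(state = c("PA", "SD"), post = c(45, 30))

# Match rows on the shared "state" column, keeping unmatched rows from both sides
comp_toy <- full_join(pre_toy, post_toy, by = "state")
```

Here `comp_toy` has three rows (PA, NY, SD), with `post` missing for NY and `pre` missing for SD.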

Provide (and execute) your code below:

**Q8.** Create summaries as you did in **Q4** for ALL states:
- the difference between post and pre (i.e., post minus pre)
- the ratio of post to pre (i.e., post / pre)
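
As a small illustration of computing new columns row by row, the sketch below applies `mutate()` to a made-up joined table. The table `comp_toy` and the new column names `diff` and `ratio` are assumptions chosen to match the hand-filled table earlier in this notebook.

```r
library(dplyr)  # loaded as part of the tidyverse

# Hypothetical joined table with one row per state
comp_toy <- tibble(
  state = c("PA", "NY"),
  pre   = c(15, 60),
  post  = c(45, 30)
)

# mutate() adds columns computed from existing ones, one value per row
comp_toy <- comp_toy %>%
  mutate(
    diff  = post - pre,
    ratio = post / pre
  )
```

Because the arithmetic is vectorized, the same two lines work whether the table has two states or all of them.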

(hint: mutate!) Provide (and execute) your code below:

**Q9.** In your directory, we have provided you with a file of state population sizes from the 2010 census:

**2010_census_bystate.txt**

- load this data set into R. The file has a header row, and its columns are separated by the tab character, i.e., "\t".
- merge this with the variable **comp_allstates** you created in **Q7**
- calculate the difference above per 100K people: 1e5 * ((post - pre) / state_size ) for all states
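
The three steps above can be sketched end to end on made-up data. The snippet writes a tiny tab-separated file to a temporary location as a stand-in for the census file. Its column names (`state`, `state_size`) are assumptions, so check the header of the real file before reusing this pattern.

```r
library(dplyr)  # loaded as part of the tidyverse

# Write a tiny tab-separated file to a temporary path
# (hypothetical stand-in for the census file; column names are assumptions)
tmp <- tempfile(fileext = ".txt")
writeLines(c("state\tstate_size", "PA\t1000000", "NY\t2000000"), tmp)

# Read it back: header row present, columns separated by tabs
census_toy <- read.table(tmp, header = TRUE, sep = "\t", stringsAsFactors = FALSE)

# Hypothetical comparison table with one row per state
comp_toy <- tibble(state = c("PA", "NY"), pre = c(15, 60), post = c(45, 30))

# Attach each state's population, then scale the difference per 100K people
result_toy <- comp_toy %>%
  left_join(census_toy, by = "state") %>%
  mutate(diff_per_100K = 1e5 * ((post - pre) / state_size))
```

A `left_join` keeps every row of the comparison table. If the two tables name their state column differently, the `by` argument needs to say so, for example `by = c("state" = "other_name")`.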

(hint: mutate!) Provide (and execute) your code below:

Congrats! You've just completed your first tidyverse data parsing job... and maybe even a pipeline! As a good habit, report the details of the R session you used.

Execute the command below:

In [None]:
sessionInfo()