# GCB535: Data Parsing with Tidyverse

## Instructions

In this *adventure*, you are going to work with tidyverse to practice formatting, manipulating, and summarizing data. For that, please load the tidyverse library (note that every time you start your notebook, you'll have to reload libraries):

In [1]:
library(tidyverse)

“running command 'timedatectl' had status 1”
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

✔ ggplot2 3.3.6      ✔ purrr   0.3.4 
✔ tibble  3.1.7      ✔ dplyr   1.0.10
✔ tidyr   1.2.1      ✔ stringr 1.4.1 
✔ readr   2.1.2      ✔ forcats 0.5.2 

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()



We're going to work with the state-level COVID-19 data that you've been plotting over the last few days, and we'll convert the date column to a "computer parsable" date type.

Execute the commands below:

In [17]:
data <- read_csv(file="all-states-history.csv")
data_tbl <- tibble(data)


Rows: 19765 Columns: 41
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr   (1): state
dbl  (39): death, deathConfirmed, deathIncrease, deathProbable, hospitalized...
date  (1): date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


For Q1 through Q4, each group will focus its analysis on its assigned state:



| Group | State        | State Nickname       | State Code |
|-------|--------------|----------------------|------------|
| 1     | Pennsylvania | Keystone State       | PA         |
| 2     | California   | Golden State         | CA         |
| 3     | New York     | Empire State         | NY         |
| 4     | Louisiana    | Pelican State        | LA         |
| 5     | South Dakota | Mount Rushmore State | SD         |



During the breakout session, work as a group to complete as many of the tasks as you have time for!

#######################

Let's compare the frequency of COVID-19 mortality in your assigned state for:
   - The "Holiday" wave, which I will roughly define here as between 10-15-2020 and 1-15-2021
   - The "Pre Holiday" wave, which I define here as records earlier than 10-15-2020

**Q1.** Create a variable named `pre` which extracts:

- entries from your state only
- entries earlier than 10-15-2020

Provide (and execute) your code below:



| date | state | death | deathConfirmed | deathIncrease | deathProbable | hospitalized | hospitalizedCumulative | hospitalizedCurrently | hospitalizedIncrease | ⋯ | totalTestResults | totalTestResultsIncrease | totalTestsAntibody | totalTestsAntigen | totalTestsPeopleAntibody | totalTestsPeopleAntigen | totalTestsPeopleViral | totalTestsPeopleViralIncrease | totalTestsViral | totalTestsViralIncrease |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| `<date>` | `<chr>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | ⋯ | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 2020-10-14 | PA | 8411 |  | 27 |  |  |  | 749 | 0 | ⋯ | 3506849 | 30438 |  |  |  |  | 2244079 | 15733 |  | 0 |
| 2020-10-13 | PA | 8384 |  | 16 |  |  |  | 773 | 0 | ⋯ | 3476411 | 35600 |  |  |  |  | 2228346 | 16572 |  | 0 |
| 2020-10-12 | PA | 8368 |  | 18 |  |  |  | 725 | 0 | ⋯ | 3440811 | 24648 |  |  |  |  | 2211774 | 12048 |  | 0 |
| 2020-10-11 | PA | 8350 |  | 6 |  |  |  | 706 | 0 | ⋯ | 3416163 | 30831 |  |  |  |  | 2199726 | 15324 |  | 0 |
| 2020-10-10 | PA | 8344 |  | 36 |  |  |  | 732 | 0 | ⋯ | 3385332 | 48695 |  |  |  |  | 2184402 | 21316 |  | 0 |
| 2020-10-09 | PA | 8308 |  | 9 |  |  |  | 734 | 0 | ⋯ | 3336637 | 40270 |  |  |  |  | 2163086 | 15920 |  | 0 |
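If you are unsure how to combine the two conditions in Q1, here is a minimal sketch on a tiny made-up tibble. `toy` and `toy_pre` are hypothetical names and the numbers are invented, so treat this as one possible pattern, not the official solution.

```r
library(dplyr)

# Hypothetical toy data standing in for data_tbl.
toy <- tibble(
  date = as.Date(c("2020-10-13", "2020-10-14", "2020-10-15", "2020-10-14")),
  state = c("PA", "PA", "PA", "CA"),
  deathIncrease = c(1, 2, 3, 4)
)

# Conditions separated by commas inside filter() are combined with AND:
# keep rows for one state that are dated before the cutoff.
toy_pre <- toy %>%
  filter(state == "PA", date < as.Date("2020-10-15"))

toy_pre
```

Wrapping the cutoff in `as.Date()` makes the comparison date-to-date and avoids relying on implicit string conversion.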


**Q2.** Create a variable named `post` which extracts:
- entries from your state only
- entries between 10-15-2020 and 01-15-2021

Provide (and execute) your code below:

| date | state | death | deathConfirmed | deathIncrease | deathProbable | hospitalized | hospitalizedCumulative | hospitalizedCurrently | hospitalizedIncrease | ⋯ | totalTestResults | totalTestResultsIncrease | totalTestsAntibody | totalTestsAntigen | totalTestsPeopleAntibody | totalTestsPeopleAntigen | totalTestsPeopleViral | totalTestsPeopleViralIncrease | totalTestsViral | totalTestsViralIncrease |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| `<date>` | `<chr>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | ⋯ | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 2021-01-14 | PA | 18742 |  | 313 |  |  |  | 4980 | 0 | ⋯ | 8307622 | 52579 |  |  |  |  | 4089675 | 17172 |  | 0 |
| 2021-01-13 | PA | 18429 |  | 349 |  |  |  | 5069 | 0 | ⋯ | 8255043 | 56546 |  |  |  |  | 4072503 | 19227 |  | 0 |
| 2021-01-12 | PA | 18080 |  | 227 |  |  |  | 5204 | 0 | ⋯ | 8198497 | 52516 |  |  |  |  | 4053276 | 16573 |  | 0 |
| 2021-01-11 | PA | 17853 |  | 83 |  |  |  | 5232 | 0 | ⋯ | 8145981 | 39353 |  |  |  |  | 4036703 | 15727 |  | 0 |
| 2021-01-10 | PA | 17770 |  | 103 |  |  |  | 5201 | 0 | ⋯ | 8106628 | 58042 |  |  |  |  | 4020976 | 20448 |  | 0 |
| 2021-01-09 | PA | 17667 |  | 273 |  |  |  | 5298 | 0 | ⋯ | 8048586 | 72839 |  |  |  |  | 4000528 | 23463 |  | 0 |
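For a date window like the one in Q2, you need a lower and an upper bound. The sketch below uses made-up data and hypothetical names (`toy`, `toy_post`, `start`, `end`). Whether each boundary date is included is an assumption here, so pick the operators that match what your group decides.

```r
library(dplyr)

# Hypothetical toy data; values are invented for illustration.
toy <- tibble(
  date = as.Date(c("2020-10-14", "2020-10-15", "2020-12-01", "2021-01-15")),
  state = "PA",
  deathIncrease = c(1, 2, 3, 4)
)

start <- as.Date("2020-10-15")
end <- as.Date("2021-01-15")

# Assumption: the start date is included and the end date is excluded.
# Swap >= for > (or < for <=) if you want different boundaries.
toy_post <- toy %>%
  filter(state == "PA", date >= start, date < end)

toy_post
```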


**Q3.** Take the code you just made and extend it to summarize the **mean number of mortalities per day** (i.e., the column named `deathIncrease`) for

- the "pre" Holiday time period
- the Holiday wave time period

If you copy the code you created for Q1 and Q2, you'll overwrite "pre" and "post". **This is completely fine in this case**. 

The point of this exercise is for you to get used to extending a little bit of code *modestly*, running it, checking the output, and so on.

Provide (and execute) your code below:



| mort_ |
|-------|
| `<dbl>` |
| 37.21681 |


| mort_ |
|-------|
| `<dbl>` |
| 113.2967 |
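Extending a filter with a summary usually means adding one more step to the pipe. Here is a minimal sketch with invented numbers; `toy_pre`, `toy_summary`, and the column name `mean_deaths` are all hypothetical.

```r
library(dplyr)

# Hypothetical, already-filtered toy data.
toy_pre <- tibble(
  date = as.Date(c("2020-10-12", "2020-10-13", "2020-10-14")),
  state = "PA",
  deathIncrease = c(10, 20, 30)
)

# summarize() collapses all rows into a single row holding the mean.
# na.rm = TRUE skips missing values instead of returning NA.
toy_summary <- toy_pre %>%
  summarize(mean_deaths = mean(deathIncrease, na.rm = TRUE))

toy_summary
```

The result is still a tibble (one row, one column), not a bare number.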


**Q4.** Using `pre` and `post`, use simple R commands to calculate:

- the difference between `post` and `pre` (i.e., post minus pre)
- the ratio of post to pre (i.e., `post / pre`)
- the difference above per 100K people: `1e5 * ((post - pre) / state_size )`

For ease I've placed the sizes of the states in question here in the table below.

Provide (and execute) your code below:



|  |
| :-- |
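As a worked illustration of the three Q4 formulas, the sketch below plugs in the California values that appear in the example outputs further down this notebook (mean daily deaths before and during the Holiday wave, and the 2010 census population). It assumes `pre` and `post` are plain numbers; `dif`, `ratio`, and `per_100k` are hypothetical names.

```r
# California values copied from the example outputs in this notebook.
pre <- 73.95111          # mean daily deaths, pre-Holiday period
post <- 163.7032967      # mean daily deaths, Holiday wave
state_size <- 37253956   # 2010 census population

dif <- post - pre                               # difference
ratio <- post / pre                             # ratio
per_100k <- 1e5 * ((post - pre) / state_size)   # difference per 100K people

dif
ratio
per_100k
```

If your `pre` and `post` are one-row tibbles left over from Q3, something like `pull(pre, 1)` can extract the number first.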



#######################

In the questions above, we did this one state at a time. But now, let's unlock *the power of R* -- let's generate these summaries for EVERY STATE.

**Q5.** Create a variable named `pre_allstates` which:

- extracts entries for ALL states
- keeps entries earlier than 10-15-2020
- summarizes the mean number of mortalities per day **by state**

You can and should reuse the code you used for **Q1 and Q3** here (plus a new variable name)! You can do this with a single-line change in code.

Provide (and execute) your code below:



| state | mort_pre |
|-------|----------|
| `<chr>` | `<dbl>` |
| AK | 0.2869955 |
| AL | 12.1891892 |
| AR | 7.3273543 |
| AS | 0.0 |
| AZ | 25.6533333 |
| CA | 73.9511111 |
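Going from one state to every state mostly comes down to grouping before you summarize. A minimal sketch on invented data follows; `toy` and `toy_pre_allstates` are hypothetical names, and this is one way to do it, not necessarily the intended one.

```r
library(dplyr)

# Hypothetical toy data with two states; the last row falls after the cutoff.
toy <- tibble(
  date = as.Date(c("2020-10-13", "2020-10-14", "2020-10-13", "2020-10-14", "2020-10-16")),
  state = c("PA", "PA", "CA", "CA", "CA"),
  deathIncrease = c(10, 20, 30, 50, 999)
)

# group_by(state) makes summarize() return one row per state
# instead of a single overall row.
toy_pre_allstates <- toy %>%
  filter(date < as.Date("2020-10-15")) %>%
  group_by(state) %>%
  summarize(mort_pre = mean(deathIncrease, na.rm = TRUE))

toy_pre_allstates
```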


**Q6.** Create a variable named `post_allstates` which:

- extracts entries for ALL states
- keeps entries between 10-15-2020 and 01-15-2021
- summarizes the mean number of mortalities per day **by state**

You can and should reuse the code you used for **Q2 and Q3** here (plus a new variable name)! You can do this with a single-line change in code.

Provide (and execute) your code below:



| state | mort_post |
|-------|-----------|
| `<chr>` | `<dbl>` |
| AK | 1.791209 |
| AL | 35.043956 |
| AR | 28.384615 |
| AS | 0.0 |
| AZ | 55.67033 |
| CA | 163.703297 |


**Q7.** Create a new variable called `comp_allstates` that:
- joins (full) `pre_allstates` and `post_allstates` by "state" (i.e., the state abbreviation)

Provide (and execute) your code below:
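To see what a full join does, here is a minimal sketch with two tiny made-up summaries that only partly overlap. The `toy_` names and numbers are hypothetical.

```r
library(dplyr)

# Hypothetical per-state summaries; CA is missing from "post", NY from "pre".
toy_pre_allstates <- tibble(state = c("CA", "PA"), mort_pre = c(40, 15))
toy_post_allstates <- tibble(state = c("PA", "NY"), mort_post = c(25, 60))

# full_join() keeps every state found in either table and
# fills the columns that have no match with NA.
toy_comp <- full_join(toy_pre_allstates, toy_post_allstates, by = "state")

toy_comp
```

A state that appears in only one of the two tables still gets a row, which makes gaps in the data easy to spot.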

**Q8.** Create summaries as you did in **Q4** for ALL states:
- the difference between post and pre (i.e., `post - pre`)
- the ratio of post to pre (i.e., `post / pre`)

(hint: mutate!) Provide (and execute) your code below:

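| state | mort_pre | mort_post |
|-------|----------|-----------|
| `<chr>` | `<dbl>` | `<dbl>` |
| AK | 0.2869955 | 1.79120879 |
| AL | 12.18919 | 35.04395604 |
| AR | 7.327354 | 28.38461538 |
| AS | 0.0 | 0.0 |
| AZ | 25.65333 | 55.67032967 |
| CA | 73.95111 | 163.7032967 |
| CO | 9.6 | 34.65934066 |
| CT | 20.43694 | 22.12087912 |
| DC | 2.848214 | 2.24175824 |
| DE | 2.959641 | 3.67032967 |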
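The `mutate()` hint can be sketched on a tiny made-up table like this. The numbers are invented, and `ratio` is a hypothetical column name (`dif` matches the column name shown in the Q9 output).

```r
library(dplyr)

# Hypothetical joined summary table.
toy_comp <- tibble(
  state = c("CA", "PA"),
  mort_pre = c(40, 15),
  mort_post = c(60, 45)
)

# mutate() adds new columns computed row by row from existing ones.
toy_comp <- toy_comp %>%
  mutate(
    dif = mort_post - mort_pre,
    ratio = mort_post / mort_pre
  )

toy_comp
```

Keep in mind that a row where both means are 0 produces `NaN` for the ratio, because R evaluates `0 / 0` as `NaN`.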


**Q9.** In your directory, we have provided you a file of state population sizes from the 2010 census: 

**2010_census_bystate.txt**

- load this data set into R. This file has a header row, and its columns are separated by the tab character, i.e., "\t".
- merge this with the variable **comp_allstates** you created in **Q7**
- calculate the difference above per 100K people: `1e5 * ((post - pre) / state_size )` for all states

(hint: mutate!) Provide (and execute) your code below:



Rows: 56 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (2): state_name, state
dbl (1): population_2010

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


| state | mort_pre | mort_post | state_name | population_2010 | dif |
|-------|----------|-----------|------------|-----------------|-----|
| `<chr>` | `<dbl>` | `<dbl>` | `<chr>` | `<dbl>` | `<dbl>` |
| AK | 0.2869955 | 1.79120879 | Alaska | 710231 | 0.211792118 |
| AL | 12.18919 | 35.04395604 | Alabama | 4779736 | 0.478159607 |
| AR | 7.327354 | 28.38461538 | Arkansas | 2915918 | 0.722148604 |
| AS | 0.0 | 0.0 | American Samoa | 55519 | 0.0 |
| AZ | 25.65333 | 55.67032967 | Arizona | 6392017 | 0.469601322 |
| CA | 73.95111 | 163.7032967 | California | 37253956 | 0.240919879 |
| CO | 9.6 | 34.65934066 | Colorado | 5029196 | 0.498277273 |
| CT | 20.43694 | 22.12087912 | Connecticut | 3574097 | 0.047115179 |
| DC | 2.848214 | 2.24175824 | District of Columbia | 601723 | -0.100786582 |
| DE | 2.959641 | 3.67032967 | Delaware | 897934 | 0.079147066 |
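The three Q9 steps can be sketched end to end on two rows copied from the output above (Alaska and California). To stay self-contained, the sketch reads tab-separated text from a literal string instead of `2010_census_bystate.txt`; that substitution, the choice of `left_join()`, and the names `census_txt`, `census`, `comp`, and `comp_pop` are assumptions for illustration.

```r
library(dplyr)
library(readr)

# Two rows mimicking the layout of the census file (tab-separated, with a header).
# I() tells readr 2.x to treat the string as data, not as a file path.
census_txt <- I("state_name\tstate\tpopulation_2010\nAlaska\tAK\t710231\nCalifornia\tCA\t37253956\n")
census <- read_tsv(census_txt, show_col_types = FALSE)

# Per-state means copied from the example output for AK and CA.
comp <- tibble(
  state = c("AK", "CA"),
  mort_pre = c(0.2869955, 73.95111),
  mort_post = c(1.79120879, 163.7032967)
)

# Attach the population by state code, then scale the difference per 100K people.
comp_pop <- comp %>%
  left_join(census, by = "state") %>%
  mutate(dif = 1e5 * ((mort_post - mort_pre) / population_2010))

comp_pop
```

With the real file you would pass the file name to `read_tsv()` instead of the literal string. A `left_join()` keeps only the states already in your comparison table, while a `full_join()` would also keep census rows that have no COVID data.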


Congrats! You've just completed your first tidyverse data parsing job... and maybe even a pipeline! As a good habit, report the details of the R session you used.

Execute the below command:

In [30]:
sessionInfo()

R version 4.2.1 (2022-06-23)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.5 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3

locale:
 [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
 [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
 [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
[10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] forcats_0.5.2   stringr_1.4.1   dplyr_1.0.10    purrr_0.3.4    
[5] readr_2.1.2     tidyr_1.2.1     tibble_3.1.7    ggplot2_3.3.6  
[9] tidyverse_1.3.1

loaded via a namespace (and not attached):
 [1] pbdZMQ_0.3-7     tidyselect_1.1.2 repr_1.1.4       haven_2.5.1     
 [5] colorspace_2.0-3 vctrs_0.4.1      