# In Class: ggplot2 - I

## Loading packages and data

Now that we are all experts in dplyr and ggplot2, we will use our knowledge to analyze some real-world data about the COVID-19 pandemic in the US. Our goal is to compare case numbers, testing and fatality rates between several US states.
First we will load our libraries:

In [None]:
library(tidyverse)
options(repr.plot.width=10, repr.plot.height=3) #set size for plots in this notebook

Next we will load the data. The file "all-states-history.csv" contains daily reported data from each state, which we have already downloaded for you from https://covidtracking.com/.
The file is in .csv format. Luckily, R has a built-in function for loading .csv files. Use the code below to load the data and, as before, convert it to a tibble.

In [None]:
# Read the CSV into a data frame, then convert it to a tibble
data <- read.csv('all-states-history.csv')
data <- as_tibble(data)
head(data)

Note that the first column, which we will use for plotting the change in COVID-19 cases over time, is of type "chr". This time, instead of converting it to a numerical value, we will use the class "Date" to keep those values as actual dates.

In [None]:
data$date <- as.Date(data$date,format = '%Y-%m-%d')
head(data)

Note that when you look at the data now, the first column is displayed with the type `<date>`.
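As a minimal sketch of what this conversion buys us, the toy vector below (made-up dates, not taken from the dataset) shows a character vector becoming a `Date` vector that supports date arithmetic. The names `date_chr`, `date_val` and `day_span` are illustrative only.

```r
# Toy character vector of dates (illustrative values only)
date_chr <- c("2021-02-15", "2021-02-16", "2021-02-17")
class(date_chr)                      # "character"

# Convert using the same format string as in the cell above
date_val <- as.Date(date_chr, format = "%Y-%m-%d")
class(date_val)                      # "Date"

# Dates can be subtracted, which plain strings cannot
day_span <- as.numeric(date_val[3] - date_val[1])
day_span                             # 2 (days)
```

Because the values are real dates, ggplot2 can also place them on a properly spaced time axis.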

Next, we will create a filter using dplyr to subset the (larger) table into one that contains fewer states and fewer columns with the subset of variables that we will use in further analysis.

**Q0.** Modify the vector in the code below so that it contains 5 states: add one state of your choice to the four already listed!

In [None]:
state_vec = c('NY','PA','SD','CA')

**Q1.** Using dplyr,
- filter the variable `data` to contain the subset of entries which includes ONLY the states in `state_vec`
- Retain the following columns of data
    - date
    - state
    - positiveIncrease
    - totalTestResultsIncrease
    - death
- save the subsetted table to a new variable called `data_sub`

Hints: Use the `filter()` function and the `%in%` operator!
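To illustrate the hint, here is a small sketch of the same pattern on a toy table. The table `toy`, its columns, and `keep_regions` are made up for this example and are not part of the COVID dataset.

```r
library(dplyr)

# Toy table (made-up values) standing in for a larger dataset
toy <- tibble(
  region = c("A", "B", "C", "A", "B", "C"),
  day    = c(1, 1, 1, 2, 2, 2),
  cases  = c(10, 20, 30, 12, 25, 33),
  notes  = c("x", "y", "z", "x", "y", "z")
)

keep_regions <- c("A", "C")

toy_sub <- toy %>%
  filter(region %in% keep_regions) %>%   # keep rows whose region is in the vector
  select(region, day, cases)             # keep only the listed columns

toy_sub
```

`%in%` tests each value of `region` for membership in `keep_regions`, so a single condition covers any number of regions.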

Provide and execute your code below:

**Q2.** Now we are ready to plot the number of cases as a function of time! 
- Make a plot using `data_sub` as your input table. 
- Map `date` to x and `positiveIncrease` to y as the aesthetic mappings for the plot
- Use points to present your data: `geom_point()`
- Use `color=state` as the color aesthetic.
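As a rough sketch of how these pieces fit together, the example below builds a colored scatter plot from a made-up table. The names `toy`, `day`, `count` and `region` are placeholders, not columns of `data_sub`.

```r
library(ggplot2)

# Toy data (made-up values): two regions observed over four days
toy <- data.frame(
  day    = rep(1:4, times = 2),
  count  = c(3, 5, 8, 6, 2, 4, 7, 9),
  region = rep(c("A", "B"), each = 4)
)

# x, y and color all vary with the data, so they go inside aes()
p_points <- ggplot(toy, aes(x = day, y = count, color = region)) +
  geom_point()
p_points
```

ggplot2 assigns one color per distinct value of `region` and adds a legend automatically.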

Provide and execute your code below:

**Q3.** Next, to improve its clarity:
- add a line geometry - `geom_line()` - to the same plot in **Q2**
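Layers in ggplot2 are stacked with `+`. The sketch below, again on made-up data with placeholder names, shows a point layer and a line layer sharing the mappings set in `ggplot()`.

```r
library(ggplot2)

# Toy data (made-up values): two regions observed over four days
toy <- data.frame(
  day    = rep(1:4, times = 2),
  count  = c(3, 5, 8, 6, 2, 4, 7, 9),
  region = rep(c("A", "B"), each = 4)
)

# Each geom_*() call adds a layer; both inherit the mappings from ggplot()
p_both <- ggplot(toy, aes(x = day, y = count, color = region)) +
  geom_point() +
  geom_line()
p_both
```

Because `color` is mapped to a discrete variable, `geom_line()` draws one separate line per region rather than connecting all points together.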

Provide and execute your code below:

We know that the number of positive cases will strongly depend on the number of tests performed.

Next, let us look at the number of tests performed as a function of time in each of the 5 states.

**Q4.** Using a similar structure as the plot in **Q2/Q3** above,
- Create a plot for the number of tests (`totalTestResultsIncrease`) as a function of date

Provide and execute your code below:

You can see that the number of tests has been more or less increasing during the pandemic.

An important measure that has been suggested as an estimate for the severity of the pandemic across regions is the *percent positive tests*. 

Execute the code below to create a new variable called `percentPositive`:

In [None]:
data_sub$percentPositive = data_sub$positiveIncrease/data_sub$totalTestResultsIncrease

**Q5.** Now, revise the above and create:
- a plot of the percentPositive (tests) over time 

What happened in South Dakota?!

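Do note that ggplot2 warns that some rows were removed from the plot. These are rows whose `percentPositive` is missing (for example, `NaN` when both counts are zero); the suspicious outlier values themselves are still drawn.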
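A ratio like `percentPositive` can produce values that cannot be plotted normally whenever the number of tests is zero. A minimal sketch with made-up counts (not taken from the dataset) shows what R returns in those cases:

```r
# Made-up daily counts: positives and tests for four hypothetical days
positives <- c(5, 0, 3, 12)
tests     <- c(50, 0, 0, 10)

ratio <- positives / tests
ratio                 # 0.1, NaN (0/0), Inf (3/0), 1.2

is.nan(ratio)         # TRUE only for the 0/0 entry
is.finite(ratio)      # FALSE for both the NaN and the Inf entry
```

The last entry (1.2) is finite but still implausible, since it implies more positives than tests. Values like that are not flagged by R at all and have to be caught by filtering.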

We *probably* should have filtered the data before starting. (A point we will return to in the homework, below). 

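A central question is how likely an infection is to be fatal.

To investigate that, let us examine the *Infection Fatality Rate* (IFR). Here we approximate it as the percentage of reported cases that result in death. Strictly speaking, a ratio based on reported cases is the case fatality rate, which we use as a rough stand-in for the IFR.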

For this, let us create a new table that contains the cumulative number of confirmed cases and deaths for all states that report these data.

Execute the code below to generate this table.

In [None]:
data_for_ifr <- data %>% filter(date == '2021-02-17') %>% select('state','positiveCasesViral','death')
head(data_for_ifr)

**Q6.** Build a plot to compare 
- The total infected (`positiveCasesViral`) (x) to total death (`death`) (y). 
- Save the plot object as a variable: `p`

(You will get a warning message, as some states did not report these numbers on Feb 17. Make sure you understand what the warning message means.)

Wouldn't it be nice to know which state is represented in each data point?

Fortunately, ggplot can help us annotate the plot by adding another geometry - text - to the existing plot object.

While we are at it, let us also add a regression line to fit the data via the function `lm()`; the slope of this line can give us a rough idea of the IFR.

Execute the code below to annotate the previous graph. 

In [None]:
p + geom_text(aes(label=state)) + geom_smooth(method = "lm", se = FALSE)

**Q7.** Make the following modifications:
- remake the plot but change the color of the text to be red (instead of black)

Note: This is **not** an aesthetic mapping! So it should not be defined within the `aes()` parentheses.
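The difference between a mapping and a fixed setting can be sketched on a toy table. The names `toy`, `name` and `p_text`, and the choice of blue, are illustrative only.

```r
library(ggplot2)

# Toy data (made-up values) with a label for each point
toy <- data.frame(x = c(1, 2, 3), y = c(2, 4, 5), name = c("a", "b", "c"))

# label varies with the data, so it is mapped inside aes();
# color is one fixed value for every label, so it sits outside aes()
p_text <- ggplot(toy, aes(x = x, y = y)) +
  geom_text(aes(label = name), color = "blue")
p_text
```

Writing `aes(color = "blue")` instead would treat the string as a data value, producing a legend entry named "blue" and a default palette color rather than blue text.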

Finally, to get an estimate of the IFR, let's extract the slope of the fitted line. 

Unfortunately, it is very difficult to get this parameter from the *fitted line* in `geom_smooth()`.

Instead, let's call the regression directly, save the results, and review a summary of that regression output. 

So, here is a small preview of some of our future classes!

Execute the code below.

In [None]:
ft <- lm(death ~ positiveCasesViral,data = data_for_ifr)
summary(ft)

Can you identify the slope parameter? What is the IFR that we found? Does that fit what you know about COVID-19?

Don't stress out :) This is most probably much higher than the real IFR due to underreporting of cases.
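Besides reading the slope off the `summary()` printout, it can be pulled out of a fitted model with `coef()`. The sketch below uses made-up data that lie exactly on the line y = 1 + 2x; `toy`, `toy_fit` and `slope` are illustrative names.

```r
# Made-up data lying exactly on y = 1 + 2x
toy <- data.frame(x = c(1, 2, 3, 4), y = c(3, 5, 7, 9))

toy_fit <- lm(y ~ x, data = toy)

# coef() returns a named vector: "(Intercept)" and one entry per predictor
coef(toy_fit)

# The slope is the coefficient named after the predictor
slope <- unname(coef(toy_fit)["x"])
slope                                # approximately 2
```

In a model like `lm(death ~ positiveCasesViral, ...)`, the slope coefficient would carry the name of the predictor, `positiveCasesViral`.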

## Homework

Remember the plot from **Q5**? 

It had some outlier values that are *probably* due to incomplete and/or wrong reporting on specific days.

Let's remake this plot but create a table that is filtered for these outliers.

To do this, create code which does each of these steps:

(Step 1) Filter `data_sub` to report only rows with outlier `percentPositive` values. When you look at the figure generated in **Q5** you will see two types of outliers: very high `percentPositive` values and negative values. Filter `data_sub` (using an OR operator) to show only rows where `percentPositive` is larger than 0.8 or smaller than 0.

(Step 2) Now we see what is wrong: `positiveIncrease` should never exceed `totalTestResultsIncrease` (you cannot have more positive results than actual tests!). In many rows, these two variables have the same value, which is most likely a mistake in reporting. Filter `data_sub` to carry forward only those rows where `totalTestResultsIncrease` is greater than `positiveIncrease`.

(Step 3) Replot the graph made in **Q5** using the data you just filtered in Step 2.
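The two kinds of filter used in these steps can be sketched on a toy table first. All names and values below (`toy`, `positive`, `total`, `ratio`, the 0.9 cutoff) are made up for illustration and differ from the homework data.

```r
library(dplyr)

# Toy table (made-up values); the last row has a missing count
toy <- tibble(
  id       = 1:5,
  positive = c(5, 10, 7, -2, NA),
  total    = c(50, 10, 3, 20, 20)
)
toy$ratio <- toy$positive / toy$total

# `|` keeps a row when at least one of the two conditions is TRUE
flagged <- toy %>% filter(ratio > 0.9 | ratio < 0)
flagged$id   # 2 3 4

# Conditions can also compare two columns row by row;
# rows where the comparison is NA are dropped by filter()
valid <- toy %>% filter(total > positive)
valid$id     # 1 4
```

Note that the column comparison alone still keeps row 4, whose `positive` value is negative, so the two filters catch different problems.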

In [None]:
# Step 1:


In [None]:
# Step 2: 


In [None]:
# Step 3: 
