# In Class: ggplot I

## Loading packages and data

Now that we are all experts in dplyr and ggplot2, we will use our knowledge to analyze some real-world data about the COVID-19 pandemic in the US. Our goal is to compare case numbers, testing, and fatality rates between several US states.
First we will load our libraries:

In [None]:
library(tidyverse)
options(repr.plot.width=10, repr.plot.height=3) #set size for plots in this notebook

Next we will load the data. The file "all-states-history.csv" contains daily reported data from each state, which we have already downloaded for you from https://covidtracking.com/.
The file is in .csv format; luckily, R has a built-in function to load .csv files. Use the code below to load this data and convert it to a tibble, as before.

In [None]:
data <- read.csv('all-states-history.csv')
data = as_tibble(data)
head(data)

Note that the first column, which we will use for plotting the change in COVID-19 cases over time, is of type "chr". This time, instead of converting it to a numerical value, we will use the class "Date" to keep those values as actual dates.

In [None]:
data$date <- as.Date(data$date,format = '%Y-%m-%d')
head(data)

Note that when you look at the data now, the type of the first column is shown as "date".
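One reason to prefer the Date class over plain text is that R can then do arithmetic and ordering on the values. A minimal sketch on a made-up vector (`dates_chr` is a hypothetical stand-in for the date column, not part of the data set):

```r
# Made-up date strings standing in for the 'date' column
dates_chr <- c("2021-02-15", "2021-02-16", "2021-02-17")

# Convert text to Date using the same format string as above
dates <- as.Date(dates_chr, format = "%Y-%m-%d")

class(dates_chr)   # "character"
class(dates)       # "Date"

# Dates support arithmetic: differences between consecutive days
day_gaps <- as.numeric(diff(dates))
day_gaps           # 1 1
```

This is also what lets ggplot2 draw a proper time axis instead of treating each date as an unrelated label.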

Next we will use dplyr to subset the big table into a smaller one that contains only a few states and only the columns that we will use in further analysis.
Using the code below, make a vector containing 5 states by adding one state of your choice to the four states listed.

In [None]:
state_vec = c('NY','PA','SD','CA')

**Q1.** Write a dplyr filter to subset data into data_sub taking only rows reported from the states in state_vec (use filter with the operator %in%) and only the following columns: 'date','state','positiveIncrease','totalTestResultsIncrease','deathConfirmed'
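As a reminder of how `%in%` combines with `filter` and `select`, here is a small sketch on an unrelated toy table. The table `toy` and its columns are invented for illustration and are not part of the COVID data.

```r
library(dplyr)

# Invented toy table, unrelated to the COVID data
toy <- tibble(
  fruit = c("apple", "pear", "kiwi", "plum"),
  price = c(1.2, 0.8, 0.5, 0.9),
  stock = c(10, 0, 4, 7)
)

# Values we want to keep
keep <- c("apple", "kiwi")

# filter() keeps rows whose fruit appears in 'keep';
# select() then keeps only the named columns
toy_sub <- toy %>%
  filter(fruit %in% keep) %>%
  select(fruit, price)

toy_sub
```

`x %in% y` returns one TRUE/FALSE per element of `x`, which is exactly the shape `filter` expects.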

**Q2.** Now we are ready to plot the number of cases as a function of time!
Make a plot using data_sub as your data. Use date, positiveIncrease and state as your x, y and color aesthetic mappings, respectively. Use point geometry to present your data.

**Q3.** Add a line geometry to the same plot to make it even clearer.

The number of positive cases will strongly depend on the number of tests performed. Next we will take a look at the number of tests performed as a function of time in each of the 5 states.

**Q4.** Using a similar structure as the plot above, examine the number of tests (totalTestResultsIncrease) as a function of date.

You can see that the number of tests has been more or less increasing during the pandemic. An important measure that has been suggested as an estimate of the severity of the pandemic across regions is the percent positive tests. Use the code below to add a new variable, percentPositive (stored here as a fraction between 0 and 1).

In [None]:
data_sub$percentPositive = data_sub$positiveIncrease/data_sub$totalTestResultsIncrease

**Q5.** Now plot the change in percent positive tests over time.

What just happened to South Dakota?!

Note the outliers and the warning message. We probably should have filtered the data before starting; we will get back to this as part of the homework.
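To see where such values can come from, it helps to look at what R does when a ratio has a zero or negative component. A minimal sketch with made-up numbers (`positive` and `tests` are hypothetical vectors, not taken from the data):

```r
# Made-up daily counts, including reporting glitches
positive <- c(5, 3, 0, -2)
tests    <- c(100, 0, 0, 50)

ratio <- positive / tests
ratio              # 0.05  Inf  NaN  -0.04

# Only finite values can be placed on a plot axis
ok <- is.finite(ratio)
ok                 # TRUE FALSE FALSE TRUE
```

Division by zero gives `Inf` or `NaN` rather than an error, and ggplot2 drops such rows with a warning when plotting.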

One question that is on everyone's mind is how likely an infection is to be fatal.

To investigate this, we will estimate the Infection Fatality Rate (IFR), approximated here as the percentage of reported cases that result in death. For this we will create a new table that contains the cumulative number of confirmed cases and deaths for all states that report these data.

Use the code below to generate this table.

In [None]:
data_for_ifr <- data %>% filter(date == '2021-02-17') %>% select('state','positiveCasesViral','death')
head(data_for_ifr)

**Q6.** Build a plot to compare the total infected (positiveCasesViral) to total death (death). **This time, save the plot object as a variable 'p'**

(You will get a warning message, as some states did not report these numbers on Feb 17. Make sure you understand what the warning message means.)

Wouldn't it be nice to know which state is represented in each data point?

Using ggplot we can easily annotate the plot by adding another geometry, text, to the existing plot object. Use the code below to annotate the previous graph. We are also adding a regression line to fit the data, as the slope of this line can give us a rough idea of the IFR.

In [None]:
p + geom_text(aes(label=state)) + geom_smooth(method = "lm", se = FALSE)

**Q7.** To better highlight the text, color it red by adding a text color to the code above. Note that this is not an aesthetic mapping, so it should not be defined within the aes parentheses!
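The difference between mapping an aesthetic and setting it to a fixed value can be sketched on R's built-in `mtcars` data set (used here only for illustration; it is unrelated to the COVID data):

```r
library(ggplot2)

# Mapped: colour depends on a variable, so it goes inside aes()
# and ggplot2 builds a legend for it
p_mapped <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(aes(colour = factor(cyl)))

# Fixed: every point gets the same colour, so it goes outside aes()
# as an ordinary argument of the geometry
p_fixed <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(colour = "blue")

p_mapped
p_fixed
```

Putting a constant such as `"blue"` inside `aes()` would instead be treated as a one-level variable and would not give the colour you asked for.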

One last thing: we would like to extract the slope of the fitted line to get an estimate of the IFR. Unfortunately, it is not trivial to get these parameters from the line fitted by geom_smooth, so as a preview of some of our future classes, we will directly fit a regression line to the data and take a look at the fitted parameters using the code below.

In [None]:
ft <- lm(death ~ positiveCasesViral,data = data_for_ifr)
summary(ft)

Can you identify the slope parameter? What is the IFR that we got? Does that fit what you know about COVID-19?

Don't stress out :) This is most probably much higher than the real IFR due to underreporting of cases...
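If you would rather pull the slope out of a fitted model than read it off the summary, `coef()` returns the fitted parameters as a named vector. A minimal sketch on made-up numbers (the data frame `toy` is hypothetical and constructed so that deaths = 1 + 0.02 * cases):

```r
# Made-up data lying exactly on the line deaths = 1 + 0.02 * cases
toy <- data.frame(
  cases  = c(100, 200, 300, 400),
  deaths = c(3, 5, 7, 9)
)

fit <- lm(deaths ~ cases, data = toy)

# Named vector: "(Intercept)" and one entry per predictor
coef(fit)

# The slope is the coefficient named after the predictor
slope <- unname(coef(fit)["cases"])
slope   # 0.02, i.e. 2 deaths per 100 cases in this toy example
```

The same call on `ft` above would give the slope for the real data.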

## Homework

Remember the plot from Q5? It had some weird values that are most probably due to incomplete or wrong reporting on specific days. Filter the table used in Q5 and replot the graph to get rid of the outliers and the warning messages. Follow these steps:

(1) Filter data_sub to show only the rows resulting in outlier percentPositive values. When you look at the figure generated in Q5 you will see two types of outliers: very high percentPositive values and negative values. Filter data_sub (using an OR operator) to show only rows where percentPositive is larger than 0.8 or smaller than 0.

(2) Now we see what is wrong: positiveIncrease should always be smaller than totalTestResultsIncrease (you cannot have more positive results than actual tests!). In many cases these two variables have the same value, which is obviously a mistake in reporting. Filter data_sub to keep only rows where totalTestResultsIncrease is greater than positiveIncrease.

(3) Redo Q5 with your filtered data.
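As a reminder of the OR operator inside `filter`, here is a sketch on an unrelated toy table (`readings` and its columns are invented for illustration):

```r
library(dplyr)

# Invented sensor readings, unrelated to the COVID data
readings <- tibble(
  sensor = c("a", "b", "c", "d"),
  temp   = c(21.5, -3.0, 35.2, 18.0)
)

# '|' keeps a row if either condition is TRUE
extreme <- readings %>%
  filter(temp > 30 | temp < 0)

extreme
```

Separating conditions with a comma in `filter` means AND, so the single `|` is needed whenever either condition alone should be enough.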

In [None]:
# Step 1:


In [None]:
# Step 2: 

In [None]:
# Step 3: 