# In Class: ggplot II

## Loading packages and data

In this session we will continue our analysis of COVID-19 data in the US, but this time we will look more locally at individual case data from Montgomery County, Pennsylvania (data that one of your course directors may or may not have been obsessively analyzing since March 2020).

We will look at the number of cases across townships, the age distribution of these cases and how outcomes are associated with age.

Unlike the previous session, where we mostly used geoms with 'identity' stats (e.g. geom_point), here we will use geoms that compute a stat from the data before plotting, including counts, bins and densities.
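To make the idea of a computed stat concrete, here is a minimal sketch using mpg, a dataset that ships with ggplot2 (it is not part of this session's data). The choice of the hwy column and of 10 bins is arbitrary and only for illustration. layer_data() returns the values the geom actually draws, so you can see that the bar heights were calculated from the rows rather than read from a column.

```r
library(ggplot2)

# Histogram of highway mileage in the built-in mpg dataset.
# geom_histogram does not plot hwy directly: it bins the values and counts rows per bin.
p <- ggplot(mpg) + geom_histogram(aes(x = hwy), bins = 10)

# Inspect the computed values: one row per bin, with its center (x) and its row count.
binned <- layer_data(p)
binned[, c("x", "count")]
```

Every row of mpg falls into exactly one bin, so the counts add up to the number of rows in the dataset.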

We have already downloaded the data for you. As before, load the data using the code below and take a look at it so you can get an idea of how it is structured.



In [None]:
library(tidyverse)
options(repr.plot.width=10, repr.plot.height=3) #set size for plots in this notebook
data <- as_tibble(read.csv('data.csv'))
head(data)

As before, run the code below to convert the column DateReported to Class "Date".

In [None]:
data$DateReported <- as.Date(data$DateReported, format = '%Y-%m-%d')
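If you want to see what this conversion does, here is a small sketch on two made-up date strings (hypothetical values, not taken from the dataset). Once the strings are parsed, R treats them as dates, so ordering and date arithmetic work.

```r
# Two made-up strings in the same year-month-day layout as DateReported
raw_dates <- c("2020-03-15", "2020-11-02")

# Parse them using the same format string as above
parsed <- as.Date(raw_dates, format = "%Y-%m-%d")

class(raw_dates)  # character
class(parsed)     # Date
diff(parsed)      # time difference between the two dates, in days
```

You can run class(data$DateReported) in the same way to confirm that the conversion worked on the real column.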

In this dataset every line is a single confirmed COVID-19 case. The first thing we would like to do is plot the number of cases in the county as a function of time. In the last in-class session we had the total number of cases reported each day as a variable, so it was very intuitive to plot that number against the date using geom_point.

Here, we need to count the number of rows reported each day. You already did this two classes ago! (Hint: you want a bar plot.)

**Q1** Write the ggplot code to produce a bar plot of the number of cases reported per day.

**Q2** What is the stat that is being used in this plot?

Next we would like to investigate how many cases have been reported in each township (indicated in the column Name). To do that we will produce a new table (count_data) containing the name of each township and a column (n) with the number of cases (which is the number of rows reported by each township).

**Q3** Use dplyr with group_by and summarise to do that. If you cannot remember how, take a look at RIII. There you used summarise with the function first; this time you will use summarise with a different function, one that counts the number of rows. Look at the summarise manual page if needed.

Before moving forward, let's add another piece of data. It would be interesting to compare the number of confirmed cases with the population size of each township and calculate the percent of the population that had a confirmed infection. For that we will need to add a column to count_data with the population size of each township, and another column with the ratio of these two numbers.

Use the code below to load a table with population size.

In [None]:
pop_size <- as_tibble(read.csv('montco_2010_census.csv'))

(This data is from the 2010 census, but not much happens in the suburbs....)

Now you will have to join the two tables and add another column (prec_infected) with the ratio.

**Q4** We have provided some of this code; add the missing part to calculate the ratio.

In [None]:
count_data <- count_data %>% inner_join(pop_size) %>% mutate(___________________)
head(count_data)

(Note the variable that was used to join the two tables. This time it was easy because the two tables share exactly one variable; in other cases you might need to specify which variable to use for the join.)

Before we choose four townships for further analysis, let's compare the population size of each township with the percent of confirmed infections.

**Q5** Produce a scatter (dot) plot to compare these two variables. Is there a clear relationship?
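As an illustration of specifying the join variable yourself, here is a minimal sketch with two made-up tables whose key columns have different names. The tables, the column names Township and Population, and all values are hypothetical and are not taken from the course files.

```r
library(dplyr)

# Made-up case counts, keyed by a column called Name
cases <- tibble(Name = c("A", "B", "C"), n = c(10, 20, 5))

# Made-up population table, keyed by a differently named column
population <- tibble(Township = c("A", "B", "D"), Population = c(1000, 4000, 800))

# Tell inner_join which column in each table holds the key.
# Only keys present in both tables (A and B) are kept.
joined <- cases %>% inner_join(population, by = c("Name" = "Township"))
joined
```

When the key columns share a name, as in this session, inner_join finds them on its own and prints a message saying which variable it joined by.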

Now let's choose the four townships with the highest **number** of infections. Use the code below to sort count_data by the total number of confirmed infections.

In [None]:
count_data <- count_data %>% arrange(n)
head(count_data)

Oops... the code sorted the table by column n in ascending order.

**Q6** Modify the code so that the order is reversed (you might need to google the command arrange to do that).

In [None]:
count_data <- count_data %>% arrange(______)
head(count_data)

Great! Now that we know which four townships have the highest number of infections, we will focus on them for additional analysis.

**Q7**
Replot Q1 for these four townships such that each one has a separate plot. You will have to:
(1) Filter the data to keep only these four townships.
(2) Use facet_wrap with two columns to plot the data for the four townships separately.
(Remember that you can use %>% to input data directly into ggplot. Try to write this in one command without saving any additional variables.)
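If the filter, pipe and facet pattern is hard to picture, here is a sketch of it on mpg, a dataset bundled with ggplot2, rather than on the COVID data. The four car classes and the cyl column are arbitrary choices for illustration; your answer will use different columns and values. The plot is stored in p here only so it can be printed explicitly.

```r
library(tidyverse)

# Keep four categories, pipe the filtered table straight into ggplot,
# and draw one panel per category in a grid that is two columns wide.
p <- mpg %>%
  filter(class %in% c("compact", "midsize", "pickup", "suv")) %>%
  ggplot() +
  geom_bar(aes(x = cyl)) +
  facet_wrap(~class, ncol = 2)

p
```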

Each bar in our plot contains all the cases reported on that specific day. We can easily add information to these bars by dividing them according to an additional categorical variable. For example, say we would like to know which age groups the cases reported on each day belong to.

**Q8** Extend your code from Q7 by adding one more aesthetic mapping to your bar geometry, assigning the variable Age_Range to fill.

**Q9** Take a close look: do you see a difference in age composition between the first and second wave in Lower Merion?

Last, we all know that there is a strong relationship between age and the outcome of SARS-CoV-2 infection. Let's check whether this is reflected in the data. In the original Montco dataset (data) there is a column called "Hospitalized" which contains some outcome data.

Use the code below to make a density plot of the age distribution for the different outcomes.

In [None]:
ggplot(data) + geom_density(aes(x=Age,fill=Hospitalized))

This is nice, but it is a bit hard to see the distribution of the cases that were not hospitalized. We can add a parameter to the geom that makes the fill semi-transparent (use google to find it).

**Q10** Add this to the plot above. Did you just add an aesthetic mapping?

# Homework

For the last part we will look at the rate of vaccination across different US states.

Load the data from the file 'us-daily-covid-vaccine-doses-administered.csv'. It contains the number of reported shots per day for each US state. Start by plotting the number of shots per day over time for your four favorite states.

Do a back-of-the-envelope calculation: assuming the current rate (say, the mean rate over the last 7 days) continues, at what point will these states reach herd immunity (~75%)?
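One possible way to set up this calculation is sketched below. Every number is made up for illustration (the population, the doses given so far and the seven daily counts), and the sketch assumes two doses per person, which you may want to change. Replace these placeholders with values you compute from the file and look up for each state.

```r
# All inputs below are hypothetical placeholders, not real data
population       <- 1e7       # residents of the state
doses_given      <- 4e6       # total shots reported so far
last7            <- c(50000, 52000, 48000, 51000, 49000, 50000, 50000)  # shots per day, last 7 days
doses_per_person <- 2         # assumption: a two-dose vaccine
target           <- 0.75      # herd immunity threshold from the question

daily_rate      <- mean(last7)                                   # mean shots per day
doses_needed    <- target * population * doses_per_person        # shots required to reach the target
doses_remaining <- doses_needed - doses_given                    # shots still to be given
days_needed     <- doses_remaining / daily_rate                  # days at the current rate

days_needed
Sys.Date() + ceiling(days_needed)   # projected date, counted from today
```

With these placeholder numbers, 11 million doses remain at 50,000 per day, which works out to 220 days.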
