In [None]:
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
from datascience import *
import matplotlib

%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

In [None]:
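def plot_by_month(counties, col):
    """Makes overlaid line plots of
    the attribute in the column labeled COL
    for all the counties in the array COUNTIES"""

    for county in counties:
        # Match the county name exactly: are.contained_in(county) on a string
        # would also match any name that is a substring of it
        dta = covid_us.where("County", are.equal_to(county))
        plt.plot(dta['Date'], dta[col], label=county)
    # Format the axes once, after all the lines have been drawn
    plt.xticks(rotation=70)
    plt.legend()
    plt.title(col)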

# Covid-19

## The Data Science Life Cycle - Table of Contents

<a href='#section 0'>Background Knowledge: Spread of Disease</a>

<a href='#subsection 1a'>Formulating a question or problem</a> 

<a href='#subsection 1b'>Acquiring and preparing data</a>

<a href='#subsection 1c'>Conducting exploratory data analysis</a>

<a href='#subsection 1d'>Using prediction and inference to draw conclusions</a>
<br><br>

## Background<a id='section 0'></a>


In March 2020, our lives were turned upside down as COVID-19 spread throughout the United States. The Centers for Disease Control and Prevention (CDC) collects data to help health scientists better understand how diseases spread.

Making comparisons between counties and states can help us understand how rapidly a virus spreads, how restrictions on public gatherings affect that spread, and how fatality changes as the medical profession learns to treat the virus and as people get vaccinated. 

## Formulating a question or problem <a id='subsection 1a'></a>

It is important to ask questions that will be informative and that will avoid misleading results. There are many different questions we could ask about COVID-19. For example, many researchers use data to predict outcomes under intervention techniques such as social distancing.

<div class="alert alert-info">
<b>Question:</b> Take some time to formulate questions you have about this pandemic and the data you would need to answer them. In addition, add a link to an article you found interesting, with a description of why it interested you. 
   </div>

**Your questions:** *here*

**Data you would need:** *here*


**Article:** *link*

## Acquiring and preparing data <a id='subsection 1b'></a>

You will be looking at data from the COVID-19 Data Repository at Johns Hopkins University. You can find the raw data [here](https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_time_series). 

You will be investigating the cumulative number of cases, new cases, and fatalities each month for counties in states across the US, from March 2020 to May 2021.

The following table, `covid_data`, contains the data collected for each month from March 2020 through May 2021 for every county in the United States.

In [None]:
covid_data = Table.read_table("data/covid_timeseries.csv")

Here are some of the important fields in our data set that you will focus on:

|Variable Name   | Description |
|:---|:---|
|Admin2 | County name |
|Province_State | State name |
|month| Reporting month represented as the last day of the month, e.g., 3.31.20 |
|total_cases | Cumulative number of COVID cases |
|month_cases| New cases reported in the month |
|total_fatalities | Cumulative number of fatal COVID cases |
|month_fatalities| New fatal cases reported in the month |
|Population | Population in the county |

Let's take a look at the data.

In [None]:
# Run this cell to show the first ten rows of the data
covid_data

<div class="alert alert-info">
<b>Question:</b> We want to learn more about the dataset. First, how many total rows are in this table? 
</div>

In [None]:
covid_data...

In [None]:
#KEY
covid_data.num_rows

**Your answer here:**

<div class="alert alert-info">
<b>Question:</b> What does each row represent?
   </div>    

**Your answer here:**

<div class="alert alert-info">
<b>Question:</b> This table has many columns that are not particularly informative for our investigation. Which ones can we ignore? Which ones do we need to keep for our analysis?  
</div>

**Your answer here:**

Before we eliminate these columns, let's take a look at some of them to confirm that we don't need them.

It looks like `iso3` has only the value "USA" and that `Country_Region` is always "US". Let's check that this is the case by grouping on each of these columns. Run the cell below to create a table with the number of times each value in the `iso3` column appears in our dataset.

In [None]:
covid_data.group('iso3')
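
To make the idea behind `group` concrete, here is a minimal sketch in plain Python that tallies how often each distinct value occurs in a column. The list `iso3_values` is made-up illustrative data, not the real column, and the names `value_counts` and `is_constant` are hypothetical.

```python
from collections import Counter

# Made-up stand-in for the values in one column of a table
iso3_values = ["USA", "USA", "USA", "PRI", "USA", "GUM"]

# Count how many times each distinct value appears,
# which is the kind of tally that group('iso3') reports
value_counts = Counter(iso3_values)

for value, count in value_counts.items():
    print(value, count)

# A column carries no information when it has exactly one distinct value
is_constant = len(value_counts) == 1
print("Constant column:", is_constant)
```

If the tally shows a single value, the column can be dropped without losing information. If it shows several, the column may be useful for filtering rows.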

<div class="alert alert-info">
<b>Question:</b> Now, do the same for the column <code>Country_Region</code>.
   </div> 

In [None]:
covid_data.group('...')

In [None]:
#KEY
covid_data.group('Country_Region')

<div class="alert alert-info">
<b>Question:</b> What did you learn? 
Try searching on the Internet to find out about these iso3 codes. 
What are they?
   </div> 

**Your answer here:**

We are primarily interested in the COVID cases in the states. 
Select the rows that correspond to states.

In [None]:
covid_us = covid_data.where('iso3','USA')
covid_us

Now how many rows remain?

In [None]:
covid_us.num_rows

For our purposes, we will not be using the columns `iso3`, `Country_Region`, `Lat`, `Long_`, and `Combined_Key`.

Keep the column `FIPS` because it uniquely identifies a county. For example, Montana and Wyoming both have a county called "Big Horn". 

Later, we will make maps, and then the columns `Lat` and `Long_` will be useful, but until then, drop them. 
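
The following sketch, in plain Python, shows why a county name alone is not a safe identifier. The rows are illustrative only: the case counts are made up, and the FIPS codes are included just to show two distinct identifiers for the same county name.

```python
from collections import defaultdict

# Illustrative rows: (FIPS, County, State, new cases); case counts are made up
rows = [
    (30003, "Big Horn", "Montana", 12),
    (56003, "Big Horn", "Wyoming", 7),
    (30003, "Big Horn", "Montana", 20),
]

# Totalling by county name alone merges two different counties
by_name = defaultdict(int)
for fips, county, state, cases in rows:
    by_name[county] += cases

# Totalling by FIPS code keeps the two counties apart
by_fips = defaultdict(int)
for fips, county, state, cases in rows:
    by_fips[fips] += cases

print(dict(by_name))
print(dict(by_fips))
```

Grouping by name yields one combined total, while grouping by `FIPS` yields one total per county.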

<div class="alert alert-info">
<b>Question:</b> Fill the array "cols_to_drop" with the labels of the columns we seek to remove from our dataset.
   </div> 

In [None]:
cols_to_drop = make_array("...", "...", "...", "...", "...")

covid_us = covid_us.drop(cols_to_drop)

covid_us.show(10)

In [None]:
#KEY
cols_to_drop = make_array("iso3", "Country_Region", "Lat", "Long_", "Combined_Key")

covid_us = covid_us.drop(cols_to_drop)

covid_us

Let's give the remaining columns simpler, more meaningful names.

In [None]:
old_names = make_array('Admin2', 'Province_State', 'month')
new_names = make_array('County', 'State', 'Date')

In [None]:
covid_us = covid_us.relabel(old_names, new_names)

In [None]:
covid_us

<div class="alert alert-info">
<b>Question:</b> It's important to evaluate our data source. What do you know about Johns Hopkins University? What motivations do they have for collecting this data? What data is missing?
   </div>

**Your answer here:**

One additional change is to reformat the dates in our dataset. This will allow us to plot specific columns in our data, such as `cases_new` or `fatalaties_new`, and see how they change over time. Simply run the cell below, which formats the `Date` column. 

In [None]:
# Converting date into datetime object
covid_us_pd = covid_us.to_df()
date = pd.to_datetime(covid_us_pd.Date, format='%m/%y')
covid_us['Date'] = date.dt.strftime('%m/%Y')
covid_us
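
The two format strings used in the cell above can be tried on their own with Python's standard `datetime` module. This is a minimal sketch: the values in `raw_dates` are hypothetical and only assume that the raw labels are written as month/two-digit-year.

```python
from datetime import datetime

# Hypothetical raw labels in month/two-digit-year form
raw_dates = ["03/20", "12/20", "05/21"]

# Parse with '%m/%y' (two-digit year), then write back with '%m/%Y' (four-digit year)
formatted = [datetime.strptime(d, "%m/%y").strftime("%m/%Y") for d in raw_dates]
print(formatted)
```

The lowercase `%y` reads a two-digit year and the uppercase `%Y` writes a four-digit year, which is why a label such as `03/20` becomes `03/2020`.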

### Cases per 100,000 people

There is more than one way to measure the severity of the pandemic. Rather than looking at raw counts, we may want to adjust them for how many people live in the county. For example, a county with 6,000 people, half of whom are sick, would have 3,000 infected people. Compared to Los Angeles County, this is not a lot of cases. However, it is a lot if we think about it in terms of percentages. For this reason, we also want to compare rates. We could calculate the percentage of cases in the population:

$$100 \times \frac{\text{cases}}{\text{population}}$$


The percentage represents the average number of cases per 100 people. When percentages are small, we often use rates per 10,000 or 100,000 people, i.e.,

$$100000 \times \frac{\text{cases}}{\text{population}}$$
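
As a worked example of these two formulas, the sketch below computes both rates for the small county described above and for a large county. The helper `rate_per` is a hypothetical name, and the large county's figures (10,000,000 people, 50,000 cases) are made up for illustration.

```python
def rate_per(cases, population, per=100_000):
    """Return the number of cases per PER people."""
    return per * cases / population

# The small county from the text: 6,000 people, 3,000 of them sick
small_pct = rate_per(3_000, 6_000, per=100)
small_per100k = rate_per(3_000, 6_000)

# A made-up large county: 10,000,000 people and 50,000 cases
large_pct = rate_per(50_000, 10_000_000, per=100)
large_per100k = rate_per(50_000, 10_000_000)

print(small_pct, small_per100k)
print(large_pct, large_per100k)
```

The large county has far more cases in absolute terms, yet its rate is a small fraction of the small county's rate. A value like 0.5 percent is also easier to read as 500 per 100,000.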

Let's calculate this statistic for our entire dataset by adding a new column called `cases_per100k`.

As a first step, we drop the rows for counties that don't have a value for population (it is recorded as 0). If you want, you can dig deeper and see which counties these are. It's just a handful.

In [None]:
covid_us = covid_us.where('...', are....(0))
covid_us

In [None]:
#KEY
covid_us = covid_us.where('Population', are.not_equal_to(0))
covid_us
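
Rows with a population of 0 have to go before any rate is computed, because the rate divides by the population. Here is a minimal sketch of that filter-then-divide order in plain Python. All rows and names are made up for illustration.

```python
# Made-up rows: (county, new cases, population)
rows = [
    ("County A", 120, 40_000),
    ("No population", 15, 0),   # population recorded as 0
    ("County B", 30, 5_000),
]

# Keep only the rows whose population is not zero
with_population = [row for row in rows if row[2] != 0]

# The division is now safe for every remaining row
rates = {name: 100_000 * cases / pop for name, cases, pop in with_population}
print(rates)
```

Filtering first means no remaining row can trigger a division by zero.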

<div class="alert alert-info">
<b>Question:</b> Add a column called "cases_per100k" that has the number of new cases in a county per 100,000 people, that is, 100,000 times the number of new cases divided by the population of the county.
   </div>

In [None]:
# Which column belongs in the numerator, and which in the denominator?
cases_per100k_array = 100000 * covid_us.column('...') / covid_us.column('...')

#Create a new column called CASES_PER100K in our new table
covid_us = covid_us.with_columns('...', cases_per100k_array)

In [None]:
#KEY

# Which column belongs in the numerator, and which in the denominator?
cases_per100k_array = 100000 * covid_us.column('cases_new') / covid_us.column('Population')

#Create a new column called CASES_PER100K in our new table
covid_us = covid_us.with_columns('cases_per100k', cases_per100k_array)

Now that we have added our `cases_per100k` column, we are ready to begin our Exploratory Data Analysis (EDA) using our new and improved `covid_us` table. Run the following cell to see our finalized table!

In [None]:
covid_us

## Conducting exploratory data analysis <a id='subsection 1c'></a>

Often when we begin our explorations, we first narrow down the data to explore. For example, we might choose a particular month to examine, or a particular state, or both. To get us started, let's narrow our explorations to the first month, March 2020. Of course, you may choose to examine a different month.

Visualizations help us to understand what the data is telling us. 

Also, the method of comparison is a common and powerful tool to help us understand the data. For example, we might want to compare the counties with the most confirmed cases via a bar chart. 

### Cases in March, 2020


To explore the counties that had the highest number of cases in March 2020, we will need to first select the rows in the table that correspond to March, 2020. 

<div class="alert alert-info">
<b>Question:</b> Fill in the code below to extract entries corresponding only to March 2020. 
   </div>

In [None]:
covid_mar20 = covid_us.where('...', '...')

In [None]:
#KEY
covid_mar20 = covid_us.where('Date', '03/2020')

In [None]:
covid_mar20

<div class="alert alert-info">
<b>Question:</b> Next, sort the dataset to show the counties with the highest number of new cases for that month.   
    
   </div>

In [None]:
new_cases_sorted = covid_mar20.sort('...', descending=...)
new_cases_sorted

In [None]:
#KEY
new_cases_sorted = covid_mar20.sort('cases_new', descending=True)
new_cases_sorted

<div class="alert alert-info">
<b>Question:</b> Now, cut down the table to only have the top twenty rows from <code>new_cases_sorted</code> above.
   </div>

In [None]:
top_twenty = new_cases_sorted...(np.arange(20))
top_twenty

In [None]:
#KEY
top_twenty = new_cases_sorted.take(np.arange(20))
top_twenty
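
The sort-then-take pattern used in the last two steps can be sketched in plain Python as follows. The county labels and counts are made up, and `k` is set to 3 only to keep the output short.

```python
# Made-up (county, new cases) pairs
new_cases = [("A", 40), ("B", 310), ("C", 5), ("D", 125), ("E", 77)]

# Sort from most to fewest cases, like sort(..., descending=True)
sorted_cases = sorted(new_cases, key=lambda pair: pair[1], reverse=True)

# Keep the first k rows, like take(np.arange(k))
k = 3
top_k = sorted_cases[:k]
print(top_k)
```

The order matters: taking the first rows only yields the largest values if the table has already been sorted in descending order.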

<div class="alert alert-info">
<b>Question:</b> Next, create a bar chart to compare the top twenty counties by number of new cases in March 2020.
   </div>

In [None]:
top_twenty...("...", "...")

In [None]:
#KEY
top_twenty.barh("County", "cases_new")

<div class="alert alert-info">
<b>Question:</b> Do you recognize the counties? Where are most of these counties located? Why might this be the case?
</div>

**Your answer here:**

Let's find the 20 counties that have the highest number of cases per 100,000 people. 

<div class="alert alert-info">
<b>Question:</b> Which 20 counties have the highest number of cases per 100,000 people?
</div>

In [None]:
cases_per100k_sorted = covid_mar20.sort('cases_per100k', descending=True)
cases_per100k_sorted

In [None]:
top_twenty_per100k = cases_per100k_sorted.take(np.arange(20))
top_twenty_per100k.barh("County", "cases_per100k")

<div class="alert alert-info">
<b>Question:</b> What are some possible reasons for the disparities in the counties shown in these two bar plots? Hint: Think about the size of the counties.
   </div>

**Your answer here:**

### Monthly changes 

These data have the number of new cases of COVID each month from March 2020 through May 2021. Another possible exploration is to see how a county's cases change over time.

Let's start by exploring one county in California.

<div class="alert alert-info">
<b>Question:</b> First, return a table that only has the data for California counties. 
   </div>

In [None]:
ca_counties = covid_us.where("...", "...")
ca_counties

In [None]:
#KEY
ca_counties = covid_us.where("State", are.equal_to("California"))
ca_counties

There are numerous counties, and each county appears several times, once for each month. To visualize the data, it is a good idea to restrict to just a single county.

<div class="alert alert-info">
<b>Question:</b> Pick a California county and enter its name in the blank below. Then run the cell to see the data for just that county.
   </div>

In [None]:
selected_county = make_array("...")

# Table of rows of only the county you chose
my_county = ca_counties.where("County", are.contained_in(selected_county))
my_county

In [None]:

#KEY
selected_county = make_array("Los Angeles")

# Table of rows of only the county you chose
my_county = ca_counties.where("County", are.contained_in(selected_county))
my_county

The function `plot_by_month` has been created for this project, to draw line plots of a quantitative variable versus the months in the `Date` column. It takes two arguments:
- an array of county names
- the label of the column containing the variable to plot

The function draws overlaid line plots of the specified variable, one plot for each county in the array.
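
Before drawing each line, `plot_by_month` selects one county's rows and pulls out two sequences: the dates and the values to plot. That selection step can be sketched in plain Python as below. The rows and the helper name `series_by_county` are hypothetical, and the sketch leaves out the plotting itself.

```python
# Made-up rows: (county, date label, new cases)
rows = [
    ("County A", "03/2020", 10), ("County A", "04/2020", 45),
    ("County B", "03/2020", 3), ("County B", "04/2020", 12),
    ("County C", "03/2020", 8),
]

def series_by_county(rows, counties):
    """Return {county: (dates, values)} for each requested county."""
    series = {}
    for county in counties:
        # Keep only the rows that belong to this county
        matching = [(date, value) for name, date, value in rows if name == county]
        dates = [date for date, _ in matching]
        values = [value for _, value in matching]
        series[county] = (dates, values)
    return series

lines = series_by_county(rows, ["County A", "County B"])
for county, (dates, values) in lines.items():
    print(county, dates, values)
```

Each entry in `lines` corresponds to one line in the overlaid plot, with the dates on the horizontal axis and the values on the vertical axis.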

As a starting point, let the county array be `selected_county`, which has just one element: the county you chose above. Run the cell below to see the plot of new cases for that county.

<div class="alert alert-info">
<b>Question:</b> For the county you picked above, draw a plot of the number of new cases every month from March 2020 to May 2021.
   </div>

In [None]:
plot_by_month(selected_county, ...)

In [None]:
#KEY
plot_by_month(selected_county, 'cases_new')

<div class="alert alert-info">
<b>Question:</b> Can you use your knowledge about the context to describe the peaks in the cases? 
   </div>

**Your answer here:**

## Using prediction and inference to draw conclusions <a id='subsection 1d'></a>

Now that we have some experience making and reading visualizations, let's compare a few counties over time. 

Settle on a few counties to examine. They could all be in California, or in different states. 

Decide whether the comparison should be of new cases, cumulative cases, new cases per 100,000, or cumulative cases per 100,000. 

Remember that the `cases_per100k` column holds new cases per 100,000 people. If you want to examine a variable that is not yet in `covid_us`, such as cumulative cases per 100,000, you will first have to compute its values and add them as a new column.
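
As a minimal sketch of how a cumulative rate could be derived for one county, the plain-Python example below builds a running total from monthly new cases and scales it to a rate per 100,000 people. The monthly counts and the population are made up, and the variable names are hypothetical.

```python
from itertools import accumulate

# Made-up monthly new cases for one county, in date order
monthly_new = [100, 250, 400, 150]
population = 50_000

# A running total of new cases gives the cumulative cases at each month
cumulative = list(accumulate(monthly_new))

# Scale each running total to a rate per 100,000 people
cumulative_per100k = [100_000 * c / population for c in cumulative]

print(cumulative)
print(cumulative_per100k)
```

A cumulative series never decreases, so its line plot answers a different question than a plot of new cases, which rises and falls with each wave.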

<div class="alert alert-info">
<b>Question:</b> Make line plots for the counties you have selected and compare them across time. Use the first cell below to identify the necessary code. After that, use as many cells as you need for your line plots.
   </div>

In [None]:
selected_counties = make_array("...", "...", "...", "...", "...")

column_to_compare = '...'

plot_by_month(selected_counties, column_to_compare)

In [None]:
#KEY
selected_counties = make_array("Los Angeles", "Alameda", "San Bernardino", "Kern", "Queens")

column_to_compare = 'cases_new'

plot_by_month(selected_counties, column_to_compare)

In [None]:
column_to_compare = 'cases_per100k'

plot_by_month(selected_counties, column_to_compare);

<div class="alert alert-info">
<b>Question:</b> After seeing these visualizations, tell us something interesting about this data. Tell us what you learned about the counties that you chose. What outside information about these counties do you think can explain what you see?
   </div>

**Your answer here:**