## Homework 7

## Data Wrangling – Understanding U.S. Covid Statistics

## Logistics

**Due date**: The homework is due 17:00 (5:00pm) on Tuesday, March 5.

You will submit your work on [MarkUs](https://markus-ds.teach.cs.toronto.edu).
To submit your work:

1. Download this file (`Homework_7.ipynb`) from JupyterHub. (See [our JupyterHub Guide](../../../guides/jupyterhub_guide.ipynb) for detailed instructions.)
2. Submit this file to MarkUs under the **hw7** assignment. (See [our MarkUs Guide](../../../guides/markus_guide.ipynb) for detailed instructions.)

All homeworks will take place in a Jupyter notebook (like this one). When you are done, you will download this notebook and submit it to MarkUs.


# Introduction

The [GenderSci Lab](https://www.genderscilab.org) is "dedicated to generating feminist concepts, methods and theories for scientific research on sex and gender."

One of their research projects explores the impact of COVID-19 on women and men.
In this homework, we are using a dataset based on the information in their [US Gender/Sex COVID-19 Data Tracker](https://genderscilab.org/gender-and-sex-in-covid19/#DataTable). (You may need to search for "US Gender/Sex COVID-19 Data Tables".)

The table shows various pieces of information about US state COVID-19 cases and deaths counted by sex, including the total case count, male case count, and female case count, as well as the death counts and percentages. Here's a snippet:

![US Gender/Sex COVID-19 Data Tables](tableclip.png)

We have added one more column of data to this, the state population. You'll find out more about this data below.

## Question

The question you're answering in this homework: 

> __Do states with large populations have a higher COVID-19 rate than states with low populations?__

## Problem 1: Read in the data file

The dataset we'll work with on this homework is a modified version of a dataset called `covid_raw.csv` you've seen earlier in the course.

To begin, read in the dataset using `pandas`, storing the result as a `DataFrame` called `covid_raw_data`.
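As a rough illustration of what `pd.read_csv` returns, here is a minimal sketch that reads from an in-memory string instead of a file. The variable names `csv_text` and `toy_data` are hypothetical, and the two rows are a cut-down excerpt with only three of the real file's columns.

```python
import io

import pandas as pd

# Hypothetical two-row excerpt; the real file has many more columns and rows.
csv_text = (
    "State,Total_cases,pop\n"
    "Alaska,132645.0,731545\n"
    "Arizona,1166060.0,7278717\n"
)

# read_csv accepts a file path or any file-like object, such as StringIO.
toy_data = pd.read_csv(io.StringIO(csv_text))

print(toy_data.shape)          # (number of rows, number of columns)
print(list(toy_data.columns))  # column names taken from the header line
```

For the homework itself you pass the file name (a string) to `pd.read_csv` rather than a `StringIO` object.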

In [1]:
# This import statement is provided to you; do not change it.
import pandas as pd

# Write your code here.
covid_raw_data = pd.read_csv("covid_raw.csv")
covid_raw_data.head()

|   | Unnamed: 0 | State | Date | Total_cases | Male_cases | Female_cases | Male_cases_pct | Female_cases_pct | Male_cases_rate | Female_cases_rate | Total_deaths | Male_deaths | Female_deaths | Male_deaths_pct | Female_deaths_pct | Male_deaths_rate | Female_deaths_rate | pop | source |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Alaska | 30-Oct | 132645.0 | 67287.0 | 64852.0 | 50.73 | 48.89 | 17450.9 | 18374.95 | 699.0 | 425.0 | 274.0 | 60.8 | 39.2 | 110.22 | 77.63 | 731545 | https://coronavirus-response-alaska-dhss.hub.a... |
| 1 | 2 | Arizona | 30-Oct | 1166060.0 | 561976.0 | 599451.0 | 48.19 | 51.41 | 16272.94 | 17160.29 | 21153.0 | 12392.0 | 8748.0 | 58.58 | 41.36 | 358.83 | 250.43 | 7278717 | https://www.azdhs.gov/preparedness/epidemiolog... |
| 2 | 4 | California | 30-Oct | 4647587.0 | 2221547.0 | 2356327.0 | 47.8 | 50.7 | 11419.62 | 11964.09 | 71519.0 | 41696.0 | 29537.0 | 58.3 | 41.3 | 214.33 | 149.97 | 39512223 | https://update.covid19.ca.gov/ |
| 3 | 5 | Colorado | 30-Oct | 740461.0 | 360012.0 | 370008.0 | 48.62 | 49.97 | 12946.21 | 13453.33 | 8186.0 | 4508.0 | 3666.0 | 55.07 | 44.78 | 162.11 | 133.28 | 5758736 | https://covid19.colorado.gov/case-data |
| 4 | 6 | Connecticut | 30-Oct | 402583.0 | 192749.0 | 208210.0 | 47.88 | 51.72 | 11032.32 | 11350.47 | 8764.0 | 4338.0 | 4416.0 | 49.5 | 50.39 | 248.29 | 240.74 | 3565287 | https://portal.ct.gov/Coronavirus |


## Problem 2: Cleaning the data

You'll now perform three different data cleaning operations.
At each step, we've specified a variable to store the result in, so that all of your work can be autograded.
Note that, as we saw in lecture, each of these steps creates a new `DataFrame` rather than modifying an existing one. (That makes it easier for you to check your work at each step.)

1. Extract just the `'State'`, `'Total_cases'`, and `'pop'` columns from `covid_raw_data`, storing the resulting `DataFrame` in a variable called `covid_pop_data`.
    The columns must appear in the order listed in this question.
    You are encouraged, but not required, to create a new list variable to store the column names, just like we did in lecture.
2. Take `covid_pop_data` and rename the `'Total_cases'` column to `'Total Cases'` and the `'pop'` column to `'Population'`, storing the resulting `DataFrame` in a variable called `covid_renamed_data`.
3. Finally, take `covid_renamed_data` and use the `DataFrame.convert_dtypes()` method to automatically convert each column into its most appropriate type, storing the resulting `DataFrame` in a variable called `covid_final_data`. You will use `covid_final_data` for the rest of this notebook.
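
To see that each of these operations returns a new `DataFrame` and leaves its input alone, here is a small sketch on a toy table. The names `toy_raw`, `toy_selected`, `toy_renamed`, and `toy_converted` are hypothetical and are not the variable names the autograder expects.

```python
import pandas as pd

# Toy stand-in for the raw data, with one extra column that gets dropped.
toy_raw = pd.DataFrame({
    "State": ["Alaska", "Arizona"],
    "Date": ["30-Oct", "30-Oct"],
    "Total_cases": [132645.0, 1166060.0],
    "pop": [731545, 7278717],
})

# Step 1: selecting with a list of names keeps those columns, in that order.
column_names = ["State", "Total_cases", "pop"]
toy_selected = toy_raw[column_names]

# Step 2: rename maps old column names to new ones and returns a new DataFrame.
toy_renamed = toy_selected.rename(
    columns={"Total_cases": "Total Cases", "pop": "Population"}
)

# Step 3: convert_dtypes picks a more specific type per column; whole-number
# floats such as 132645.0 end up in the nullable integer type Int64.
toy_converted = toy_renamed.convert_dtypes()

print(list(toy_raw.columns))       # still the original four columns
print(list(toy_selected.columns))  # still the old names after the rename
print(toy_converted.dtypes)
```

Printing the column names of the earlier variables after each step is a quick way to confirm that nothing was modified in place.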

In [2]:
# Write your code here
covid_pop_data = covid_raw_data[['State', 'Total_cases', 'pop']]
covid_pop_data.head()

covid_renamed_data = covid_pop_data.rename(columns={'Total_cases': 'Total Cases', 'pop': 'Population'})
covid_renamed_data.head()

covid_final_data = covid_renamed_data.convert_dtypes()
covid_final_data.head()

|   | State | Total Cases | Population |
|---|---|---|---|
| 0 | Alaska | 132645 | 731545 |
| 1 | Arizona | 1166060 | 7278717 |
| 2 | California | 4647587 | 39512223 |
| 3 | Colorado | 740461 | 5758736 |
| 4 | Connecticut | 402583 | 3565287 |


## Problem 3: Calculating on a `Series`

### Problem 3a: Extracting a `Series`

Extract the `'Total Cases'` column from `covid_final_data` as a `Series`, and store it in a variable called `total_cases`.

In [3]:
# Write your code here
total_cases = covid_final_data['Total Cases']

### Problem 3b: Calculating a summary statistic

Use `total_cases` to calculate the *average* number of cases per state, and store the result in a variable called `average_cases_per_state`.

**Note**: you shouldn't need to calculate the sum and count separately; there's a `Series` method that will calculate the average of a numerical `Series` for you in one step!
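
As a quick check of what the one-step method does, this sketch compares it against the manual sum-and-count calculation on a toy `Series`. The name `toy_cases` is hypothetical, and its three values are the first three case counts from the table above, not the full dataset.

```python
import pandas as pd

# Three case counts only, so this is not the average over all states.
toy_cases = pd.Series([132645, 1166060, 4647587])

# The long way: add everything up, then divide by the number of entries.
manual_average = toy_cases.sum() / toy_cases.count()

# The one-step way.
method_average = toy_cases.mean()

print(manual_average)
print(method_average)
```

Both lines should print the same number, which is why the separate sum and count are unnecessary.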

In [4]:
# Write your code here
average_cases_per_state = total_cases.mean()
average_cases_per_state

869570.0

### Problem 3c: Interpret

Why is the average number of cases per state not a particularly useful statistic when analysing COVID cases? (**1 pt**)


**Sample solution**: it doesn't take into account the size of the state.

## Problem 4: Data Transformation

### Problem 4a: Creating a new `Series`

Create a new `Series` called `case_rates` that contains the percentage of COVID cases within each state relative to that state's population, rounded to two decimal places.

> For example, Alaska has 132645 cases and a population of 731545, and so its "case rate" would be
>
> $$
> \frac{132645}{731545} \times 100 = 18.13217232022637
> $$
>
> rounded to two decimal places, or 18.13.

To perform this data transformation, you'll need to extract the two relevant columns from `covid_final_data` and then combine them appropriately.
As we did in lecture, take advantage of operations like `+` and `*` and `Series` methods like `.round()` to operate on and combine numerical `Series`, rather than using for loops.

You *may*, but are not required to, create additional variables to store intermediate steps in this calculation.
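
Here is a minimal sketch of element-wise arithmetic on two `Series`, with no for loop. The names `toy_cases`, `toy_populations`, and `toy_rates` are hypothetical, and the values are the Alaska and Arizona rows only.

```python
import pandas as pd

# Alaska and Arizona only; both Series share the default index 0, 1.
toy_cases = pd.Series([132645, 1166060])
toy_populations = pd.Series([731545, 7278717])

# Division and multiplication apply entry by entry, matched up by index.
toy_rates = (toy_cases / toy_populations * 100).round(2)

print(toy_rates)  # 18.13 for Alaska, 16.02 for Arizona
```

The arithmetic is matched up by index label, which is why both columns should come from the same `DataFrame`.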

In [5]:
# Write your code here
total_cases = covid_final_data["Total Cases"]
populations = covid_final_data["Population"]
base_rates = total_cases / populations * 100

case_rates = base_rates.round(2)
case_rates

0     18.13
1     16.02
2     11.76
3     12.86
4     11.29
5     14.78
6       9.1
7      11.9
8      5.72
9      16.3
10    13.38
11    15.13
12    16.56
13    14.92
14    16.63
15    16.32
16     7.76
17     9.27
18     11.3
19    13.96
20    11.44
21    16.45
22    14.23
23     9.74
24    11.73
25    13.04
26    14.09
27    13.22
28     16.2
29     8.62
30    12.22
31    14.86
32    17.42
33    17.46
34    18.74
35    12.11
36    17.15
37     6.44
38    10.83
39     9.53
40    17.78
dtype: Float64

### Problem 4b: Add the column to the dataset

Once you're confident you've computed the `case_rates` `Series` correctly, add it to `covid_final_data` with the name `"Case Rate (%)"`.

**Warning**: unlike all of the previous steps, this will modify `covid_final_data`, rather than creating a new `DataFrame`.
If you do it incorrectly or fix a problem in Problem 4a and want to restart, you should *re-run all above cells* to get a fresh version of `covid_final_data`.
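
To make the warning concrete, this sketch contrasts column assignment, which changes the existing `DataFrame`, with `DataFrame.assign`, which returns a new one. The homework asks for the assignment form; `assign` appears here only for comparison, and the names `toy_data`, `toy_rates`, `toy_with_rate`, and `columns_before` are hypothetical.

```python
import pandas as pd

toy_data = pd.DataFrame({
    "State": ["Alaska", "Arizona"],
    "Total Cases": [132645, 1166060],
    "Population": [731545, 7278717],
})
toy_rates = (toy_data["Total Cases"] / toy_data["Population"] * 100).round(2)

# assign returns a new DataFrame with the extra column; toy_data is untouched.
toy_with_rate = toy_data.assign(**{"Case Rate (%)": toy_rates})
columns_before = list(toy_data.columns)

# Column assignment adds the column to toy_data itself.
toy_data["Case Rate (%)"] = toy_rates

print(columns_before)          # three columns
print(list(toy_data.columns))  # four columns after the assignment
```

Because the assignment form changes the original object, re-running the earlier cells is the simplest way to get back to a clean state.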

In [6]:
# Write your code here
covid_final_data["Case Rate (%)"] = case_rates

covid_final_data.head()

|   | State | Total Cases | Population | Case Rate (%) |
|---|---|---|---|---|
| 0 | Alaska | 132645 | 731545 | 18.13 |
| 1 | Arizona | 1166060 | 7278717 | 16.02 |
| 2 | California | 4647587 | 39512223 | 11.76 |
| 3 | Colorado | 740461 | 5758736 | 12.86 |
| 4 | Connecticut | 402583 | 3565287 | 11.29 |


## Problem 5: Large and Small State Analysis

Now we can do some work to compare large and small states.
For the purposes of this homework, we define a **large state** to be a state whose population is **greater than 10,000,000**, and a **small state** to be a state whose population is **less than 1,000,000**.

### Problem 5a: Filtering the data

We'll now filter our dataset in two different ways to obtain just the large states and small states (separately).
To do this:

1. Create a boolean `Series` called `is_large` that contains one boolean entry per state, which is `True` if the state's population is greater than 10,000,000, and `False` otherwise.
2. Use `is_large` to filter `covid_final_data` to obtain a new `DataFrame` called `covid_large_state_data` which contains only the rows for the large states.
3. Repeat Step 1, but to create a `Series` called `is_small` that contains `True` when the state's population is less than 1,000,000, and `False` otherwise.
4. Repeat Step 2, but to create a `DataFrame` called `covid_small_state_data` which contains only the rows for the small states.
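
The pattern in these steps, build a boolean `Series` and then index with it, can be sketched on a three-row toy table. The names `toy_data`, `toy_is_large`, `toy_is_small`, `toy_large`, and `toy_small` are hypothetical.

```python
import pandas as pd

toy_data = pd.DataFrame({
    "State": ["Alaska", "California", "Colorado"],
    "Population": [731545, 39512223, 5758736],
})

# Comparing a Series to a number gives one True/False entry per row.
toy_is_large = toy_data["Population"] > 10_000_000
toy_is_small = toy_data["Population"] < 1_000_000

# Indexing with a boolean Series keeps only the rows where it is True.
toy_large = toy_data[toy_is_large]
toy_small = toy_data[toy_is_small]

print(toy_large)
print(toy_small)
```

Colorado falls between the two thresholds, so it appears in neither result. The filtered rows also keep their original index labels.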

In [7]:
# We've provided these variables for you to use (so that you don't need to manually count the 0's!)
large_threshold = 10000000
small_threshold = 1000000

# Write your code here
is_large = covid_final_data['Population'] > large_threshold
covid_large_state_data = covid_final_data[is_large]
display(covid_large_state_data.head())

is_small = covid_final_data['Population'] < small_threshold
covid_small_state_data = covid_final_data[is_small]
display(covid_small_state_data.head())

|   | State | Total Cases | Population | Case Rate (%) |
|---|---|---|---|---|
| 2 | California | 4647587 | 39512223 | 11.76 |
| 7 | Georgia | 1263757 | 10617423 | 11.9 |
| 10 | Illinois | 1695524 | 12671821 | 13.38 |
| 25 | New York | 2537145 | 19453561 | 13.04 |
| 26 | North Carolina | 1477514 | 10488084 | 14.09 |


|   | State | Total Cases | Population | Case Rate (%) |
|---|---|---|---|---|
| 0 | Alaska | 132645 | 731545 | 18.13 |
| 5 | Delaware | 143950 | 973764 | 14.78 |
| 6 | District of Columbia | 64240 | 705749 | 9.1 |
| 33 | South Dakota | 154482 | 884659 | 17.46 |
| 37 | Vermont | 40191 | 623989 | 6.44 |


### Problem 5b: Computing average rates

Finally, compute the *average case rate* for the large states (store in a variable called `large_state_avg_rate`) and the *average case rate* for the small states (store in a variable called `small_state_avg_rate`).

Do not perform any rounding.

In [8]:
# Write your code here
large_state_avg_rate = covid_large_state_data["Case Rate (%)"].mean()
small_state_avg_rate = covid_small_state_data["Case Rate (%)"].mean()

print(large_state_avg_rate)
print(small_state_avg_rate)

12.715
13.948333333333332


## Problem 6: Analysis

Recall the central research question introduced at the top of this notebook:

> __Do states with large populations have a higher COVID-19 rate than states with low populations?__

In [9]:
# Run the following cells
print("Large state min: {} / Large state max: {}".format(covid_large_state_data["Case Rate (%)"].min(), covid_large_state_data["Case Rate (%)"].max()))
print("Small state min: {} / Small state max: {}".format(covid_small_state_data["Case Rate (%)"].min(), covid_small_state_data["Case Rate (%)"].max()))

Large state min: 11.76 / Large state max: 14.09
Small state min: 6.44 / Small state max: 18.13


Based on your work on this homework, write 1-2 paragraphs answering each of the following questions:

1. Can you conclude whether states with large populations have a higher or lower COVID-19 rate than states with small populations? How confident are you in this conclusion? **(2 pt)**
2. What other data would you want to collect to help answer the original research question? What other analyses would you want to perform? **(2 pt)**

**Sample solutions**

**1.** Large states have a lower COVID-19 rate on average (1 mark). However, the ranges of the case rates for large and small states overlap quite a bit, and the minimum rate for small states (6.44) is lower than the minimum for large states (11.76), so this conclusion is not a confident one (1 mark).

**2.** (Any reasonable answer is acceptable.) The standard deviation of the rates in each group, to help understand how spread out they are. A statistical test of significance. Data from more states. More granular divisions between large and small states.
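
As one example of the first suggestion, this sketch computes a standard deviation with `Series.std()`. It uses only the five rates displayed for each group in Problem 5a, which is an assumption made for illustration: the full groups may contain more states, so these are not the values for the whole dataset. The names `large_rates_shown` and `small_rates_shown` are hypothetical.

```python
import pandas as pd

# Only the rows displayed by .head() in Problem 5a, not the complete groups.
large_rates_shown = pd.Series([11.76, 11.9, 13.38, 13.04, 14.09])
small_rates_shown = pd.Series([18.13, 14.78, 9.1, 17.46, 6.44])

# Series.std() computes the sample standard deviation by default.
large_std = large_rates_shown.std()
small_std = small_rates_shown.std()

print(large_std)
print(small_std)
```

On these displayed rows the small-state rates are far more spread out than the large-state rates, which matches the min and max values printed in Problem 6.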