## Part 2: Hypothesis Tests for Chi-Squared Analysis of 2x2 Tables

We are again using the *NYTimes* COVID-19 data. This time our dataset contains cumulative case and death counts for each US state, rather than daily counts of new cases and deaths.

***

### Load packages and dataset

In [None]:
# import packages

import pandas as pd
import numpy as np

from scipy.stats import chi2_contingency

In [None]:
# load dataset with cumulative counts

us_state_cuml = pd.read_csv("datasets/covid-19-data-master/us-states.csv")

display(us_state_cuml.head())

## Task 1: Compare the proportion of COVID-19 deaths to COVID-19 survivals by April 1 in New York and California

Let us first extract the case and death counts for all states as of April 1st (2020-04-01).

**Run the cell below in which we do this.**

In [None]:
april_data = us_state_cuml[us_state_cuml.date == "2020-04-01"]

# set index
april_data = april_data.set_index("state")

print(april_data.shape)
display(april_data.head())

Now we will extract the rows that contain the data for our two states of interest for this task: New York and California.

**Complete the code below to do this.**

In [None]:
ny_data = 
ca_data = 

print(ny_data)
print(ca_data)

<p>
<details><summary>Click to show solution</summary><br>

```python
ny_data = april_data.loc["New York"]
ca_data = april_data.loc["California"]
```

</details>
</p>

Now, we will pull out our values of interest:
1. The number of COVID-19 patients who had died by April 1st in each state.
2. The number of COVID-19 patients who had not died by April 1st in each state (the number of cases *minus* the number of deaths).

**Run the cell below to do this.**

In [None]:
ny_dead = ny_data.deaths
ny_not_dead = ny_data.cases - ny_data.deaths
print(ny_dead, ny_not_dead)

ca_dead = ca_data.deaths
ca_not_dead = ca_data.cases - ca_data.deaths
print(ca_dead, ca_not_dead)

Next, **calculate and print out the proportion of deaths to cases for each state.**

In [None]:
ny_prop = ny_data.deaths/ny_data.cases
ca_prop = ca_data.deaths/ca_data.cases

print(ny_prop)
print(ca_prop)

Looking at the raw data obtained above, it is tempting to say, "Whoa, NY had 1941 deaths by April 1st and CA had only 212, and the proportion of deaths to cases is higher in New York, so people are dying at a higher rate in NY than in CA!" But ***this is not necessarily true!*** To determine whether the difference in proportions is statistically significant, we should perform a chi-square test.

For this test, our hypotheses will be:
> $H_0$: there is no association between dying of COVID-19 and the state in which one lives.<br>
> $H_A$: there is an association.
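
For reference, the statistic behind this test compares the observed counts with the counts we would expect if $H_0$ were true. In the standard Pearson form (the notation is introduced here, not taken from the notebook: $O_{ij}$ is the observed count in row $i$ and column $j$, $E_{ij}$ is the corresponding expected count, and $N$ is the grand total of the table):

$$
E_{ij} = \frac{(\text{row } i \text{ total}) \times (\text{column } j \text{ total})}{N},
\qquad
\chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}
$$

For a 2x2 table this statistic has $(2-1)(2-1) = 1$ degree of freedom. The larger $\chi^2$ is, the further the observed table lies from what $H_0$ predicts.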

To perform a chi-square test in Python we use the following code:
>```python
> chi2, p, dof, expected = chi2_contingency(contig_table)
>```

[Read more about the `chi2_contingency` function here](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chi2_contingency.html).

This function performs the chi-square test for a provided contingency table and then returns 4 values: the chi-square test statistic, the p-value, the degrees of freedom and a table of expected counts.
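
To make the four return values concrete, here is a minimal sketch on a small made-up table (`toy_table` and its counts are hypothetical, not taken from the COVID-19 data). It also shows one detail worth knowing: for 2x2 tables, `chi2_contingency` applies Yates' continuity correction by default, which can be switched off with `correction=False`.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x2 table of counts (rows: groups, columns: outcomes).
toy_table = np.array([[20, 80],
                      [30, 70]])

# Default call: for 2x2 tables SciPy applies Yates' continuity correction.
chi2_corr, p_corr, dof, expected = chi2_contingency(toy_table)

# Same test without the correction (plain Pearson statistic).
chi2_raw, p_raw, _, _ = chi2_contingency(toy_table, correction=False)

print("dof:", dof)
print("expected:\n", expected)
print("chi2 with correction:", chi2_corr)
print("chi2 without correction:", chi2_raw)
```

The corrected statistic is slightly smaller than the uncorrected one, so its p-value is slightly larger. With counts as large as the ones in this notebook, the difference is usually negligible.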

So, first we need a contingency table. The code below creates this contingency table. **Run the cell.**

In [None]:
contig_table1 = pd.DataFrame([[ny_dead, ny_not_dead],
                             [ca_dead, ca_not_dead]], columns=["dead", "not dead"], index=["NY", "CA"])
display(contig_table1)

**Now run the cell below to perform the chi-square test.**

In [None]:
chi2, p, dof, expected = chi2_contingency(contig_table1)

print("chi2:", chi2)
print("p:", p)
print("dof:", dof)
print("expected:", expected)

With a p-value of `0.353`, well above common significance levels such as 0.05, we do not have sufficient evidence to reject our null hypothesis. Among confirmed cases in New York and California, there does not appear to be a statistically significant association between dying of COVID-19 and the state in which one lives.
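
The decision rule behind this conclusion can be written out explicitly. The sketch below assumes a significance level of `alpha = 0.05`, a common convention that this notebook does not itself specify, and `chi2_decision` is a hypothetical helper name.

```python
def chi2_decision(p_value, alpha=0.05):
    """Return a short verdict for a hypothesis test at significance level alpha."""
    # alpha = 0.05 is an assumed convention, not a value given in the notebook.
    if p_value < alpha:
        return "reject H0"
    return "fail to reject H0"


# p-value reported for the NY vs. CA test above
print(chi2_decision(0.353))
```

Note that "fail to reject" is not the same as showing the two states have equal death proportions; it only means the data do not provide strong evidence of a difference.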

## Task 2: Compare the proportion of COVID-19 deaths to COVID-19 survivals by April 1 in New York and Washington

Follow the same steps as we used above to perform the same hypothesis test, but this time for New York and Washington.

In [None]:
# obtain April data for states of interest
ny_data = 
wa_data = 

# obtain data points of interest
ny_dead = ny_data.deaths
ny_not_dead = ny_data.cases - ny_data.deaths
print(ny_dead, ny_not_dead)

wa_dead = wa_data.deaths
wa_not_dead = wa_data.cases - wa_data.deaths
print(wa_dead, wa_not_dead)

Next, **calculate and print out the proportion of deaths to cases for each state.**

**Create the appropriate contingency table.**

**Perform a chi-square test.**

***What conclusion can we make?***

<p>
<details><summary>Click to show answer</summary><br>

```
chi2: 99.79827316939875
p: 1.687373143636742e-23
dof: 1
expected: [[ 2053.23055087 81835.76944913]
 [  136.76944913  5451.23055087]]
```

With a p-value of about `1.7e-23`, we reject the null hypothesis: among confirmed cases, the proportion of deaths differs significantly between New York and Washington.

</details>
</p>

<p>
<details><summary>Click to show solutions</summary><br>

```python
# obtain April data for states of interest

ny_data = april_data.loc["New York"]
wa_data = april_data.loc["Washington"]

# obtain data points of interest

ny_dead = ny_data.deaths
ny_not_dead = ny_data.cases - ny_data.deaths
print(ny_dead, ny_not_dead)

wa_dead = wa_data.deaths
wa_not_dead = wa_data.cases - wa_data.deaths
print(wa_dead, wa_not_dead)

# calculate proportions

ny_prop = ny_data.deaths/ny_data.cases
wa_prop = wa_data.deaths/wa_data.cases

print(ny_prop)
print(wa_prop)

# create contingency table

contig_table2 = pd.DataFrame([[ny_dead, ny_not_dead],
                             [wa_dead, wa_not_dead]], columns=["dead", "not dead"], index=["NY", "WA"])
display(contig_table2)

# chi-square test

chi2, p, dof, expected = chi2_contingency(contig_table2)

print("chi2:", chi2)
print("p:", p)
print("dof:", dof)
print("expected:", expected)

```

</details>
</p>
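
The `expected` array shown in the answer above can be reproduced from the table margins alone, which is a useful check on what `chi2_contingency` returns. In the sketch below, the row and column totals were inferred by summing the rows and columns of that printed array rather than read from the dataset, so treat the literal numbers as an assumption.

```python
import numpy as np

# Margins inferred from the printed `expected` array for NY vs. WA.
row_totals = np.array([83889.0, 5588.0])   # NY cases, WA cases
col_totals = np.array([2190.0, 87287.0])   # total dead, total not dead
n = row_totals.sum()                       # grand total

# Under H0, each expected count is (row total * column total) / grand total.
expected_by_hand = np.outer(row_totals, col_totals) / n
print(expected_by_hand)
```

Because the expected counts depend only on the margins, any two tables with the same row and column totals share the same `expected` array; the test statistic measures how far the observed cells sit from it.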

## Task 3: We can also do a similar chi-square test using a contingency table of COVID-19 cases and state population.

We will do this for New York and California.

In [None]:
# load state population dataset

usa_pop = pd.read_csv("datasets/usa-pop-2019.csv", index_col="State")

display(usa_pop.head())

In [None]:
# obtain populations of these states

ny_pop = usa_pop.loc["New York", "2019_pop"]
ca_pop = usa_pop.loc["California", "2019_pop"]

In [None]:
# create table of NY vs CA

# the following variables have already been defined above
    # ny_data
    # ca_data
# and we can access the number of cases for each state like this:
    # ny_data.cases
    # ca_data.cases

ny_ca = pd.DataFrame([[ny_data.cases, ny_pop],
                      [ca_data.cases, ca_pop]], columns=["cases", "population"], index=["NY", "CA"])
display(ny_ca)
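
One thing to keep in mind about this table: every case is also counted in the state's population, so the two columns are not mutually exclusive categories. A common alternative, shown here only as a sketch with made-up numbers for two hypothetical regions `A` and `B`, is to tabulate cases against non-cases (population minus cases). The answers given for this task are based on the table as built above, not on this variant.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical counts for two regions (not taken from the datasets above).
cases = {"A": 800, "B": 100}
population = {"A": 200_000, "B": 400_000}

# Each person appears exactly once: either as a case or as a non-case.
table = pd.DataFrame(
    [[cases[r], population[r] - cases[r]] for r in ("A", "B")],
    columns=["cases", "non-cases"],
    index=["A", "B"],
)

chi2, p, dof, expected = chi2_contingency(table)
print(table)
print("chi2:", chi2, "p:", p, "dof:", dof)
```

When cases are a very small fraction of the population, as they are here, the two versions of the table tend to give nearly identical results.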

In [None]:
# compute and print proportions of cases to population



<p>
<details><summary>Click to show answer</summary><br>

```
0.004312269614802144
0.000248429454348848
```

</details>
</p>

In [None]:
## chi-square test



***What conclusion can we make?***

<p>
<details><summary>Click to show answer</summary><br>

```
chi2: 135065.92463718733
p: 0.0
dof: 1
expected: [[3.09985200e+04 1.95064515e+07]
 [6.27064800e+04 3.94593325e+07]]
```

The p-value is effectively zero, so we reject the null hypothesis: the proportion of confirmed cases relative to population differs significantly between New York and California.

</details>
</p>
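
A printed p-value of `0.0` does not mean the probability is exactly zero; it means the value is too small to represent as a floating-point number. As a rough sketch of how to see its magnitude, the code below works on the log scale. It relies on an identity that holds only for 1 degree of freedom: a chi-square variable with 1 degree of freedom is the square of a standard normal, so $P(\chi^2_1 > x) = 2\,\Phi(-\sqrt{x})$. The variable names are illustrative.

```python
import numpy as np
from scipy.special import log_ndtr
from scipy.stats import chi2 as chi2_dist

# Test statistic and degrees of freedom as printed in the answer above.
stat, dof = 135065.92463718733, 1

# Direct upper-tail probability: underflows to 0.0 in floating point.
p_value = chi2_dist.sf(stat, dof)

# Log of the same probability, valid for dof == 1 only:
# log P(chi2_1 > x) = log(2) + log(Phi(-sqrt(x)))
log_p_value = np.log(2.0) + log_ndtr(-np.sqrt(stat))

print("p:", p_value)
print("natural log of p:", log_p_value)
```

The log value is a large negative but finite number, which confirms that the true p-value is positive, just far below the smallest representable float.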