## Part 1: Hypothesis Tests for Comparing Two Means

For this lab, we will be using [COVID-19 data](https://github.com/nytimes/covid-19-data) collected by the *New York Times*.

These data include daily counts of COVID-19 cases and deaths in all states and counties of the US (including US territories).

***

### Load packages and dataset

In [None]:
# import packages

import pandas as pd
import numpy as np
from matplotlib import pyplot as plt

from scipy.stats import ttest_ind

In [None]:
# load dataset with NEW counts recorded each day

us_state_perday = pd.read_csv("datasets/state_data_perday.csv")

display(us_state_perday.head())

***
## Task 1: Comparing mean case diagnoses and deaths in New York between first half and second half of March

For this section we will use data on the number of new COVID-19 cases and deaths reported in New York over the most recent four-week period in the dataset.

**Run the first two Task 1 cells, in which we select the New York rows from this dataset and then split them into separate dataframes for each of the two-week segments.**

In [None]:
# obtain entries for NY

new_york = us_state_perday[us_state_perday.state == "New York"]

# obtain entries for the past 4 weeks
# last 4 weeks = last 28 days = last 28 entries

new_york_4weeks = new_york.tail(28)

print(new_york_4weeks.shape)
display(new_york_4weeks)
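
The `tail(28)` step relies on the data having exactly one row per day for the state. A minimal sketch of how that assumption could be checked is shown below. It uses a made-up stand-in dataframe, and the `date` column name is an assumption (it is not confirmed by the cells above).

```python
import pandas as pd

# hypothetical stand-in for the New York rows; the "date" column name is an assumption
demo = pd.DataFrame({
    "date": pd.date_range("2020-03-05", periods=28, freq="D").astype(str),
    "cases": range(28),
})

last_28 = demo.tail(28)
dates = pd.to_datetime(last_28["date"])

# one row per day means consecutive dates differ by exactly one day
gaps = dates.diff().dropna()
is_daily = (gaps == pd.Timedelta(days=1)).all()

print(dates.min().date(), "to", dates.max().date(), "| consecutive days:", is_daily)
```

If the check fails (missing or duplicated days), the last 28 rows would not cover exactly four weeks and the two halves would not be comparable 14-day periods.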

In [None]:
# split into two dataframes each corresponding to 14 days

ny_half1 = new_york_4weeks.iloc[0:14, :]
ny_half2 = new_york_4weeks.iloc[14:, :]

print(ny_half1.shape, ny_half2.shape)
display(ny_half1, ny_half2)

### Task 1a: Difference between mean number of *cases* in 1st half of March and mean number of *cases* in 2nd half of March (NY)

Now, we will compare the means of the number of cases reported each day between the 1st half of March and the 2nd half of March.

**Run the cell below** in which we pull out the case values in our dataframes that correspond to each 14-day period. We then compute the mean of the daily reported cases and print these mean values out.

In [None]:
# pull out case values for each 14-day period

ny_half1_case_vals = ny_half1["cases"]
ny_half2_case_vals = ny_half2["cases"]

# compute mean daily cases for each 14-day period

mean_ny_half1 = np.mean(ny_half1_case_vals)
mean_ny_half2 = np.mean(ny_half2_case_vals)

print(mean_ny_half1)
print(mean_ny_half2)

In the cell below we will create a *box plot* to observe the distributions of the daily reported case counts in New York. These daily case counts are split between half 1 (Mar5-Mar18) and half 2 (Mar19-Apr1) so that we can observe differences in the distribution between these two time periods.

**Run the cell below.**

In [None]:
# create a box plot to observe distributions of daily case counts
# for 1st half and 2nd half of March respectively

fig, axs = plt.subplots(figsize=(12,7))
axs.boxplot([ny_half1_case_vals, ny_half2_case_vals])
plt.title("Number of COVID cases reported per day for last 4 weeks (NY)", fontsize=20)
axs.set_xticklabels(["Mar5-Mar18", "Mar19-Apr1"])
axs.set_ylabel("Number of cases reported per day", fontsize=18)
axs.tick_params(labelsize=15)
plt.show()

It appears that the distributions of daily reported cases differ between these two time periods. Let's perform a hypothesis test to determine whether the mean number of daily reported cases differs significantly between the two time periods. For this hypothesis test, our hypotheses are as follows:
> $H_0$: there *is no difference* in the mean number of daily reported cases between the time periods ($\mu_1=\mu_2$).<br>
> $H_A$: there *is a difference* in the mean number of daily reported cases between the time periods ($\mu_1\ne\mu_2$).

We will use a two-sample T-test that does not assume our two samples have equal variances (Welch's T-test). To perform this hypothesis test in Python we use the code:
>```python
> t_val, p_val = ttest_ind(sample1_vals, sample2_vals, equal_var=False)
>```

[Read more about the `ttest_ind` function here](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html).
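
To see what `ttest_ind(..., equal_var=False)` computes, here is a minimal sketch that works out Welch's test statistic, the Welch-Satterthwaite degrees of freedom, and the two-sided p-value by hand, then compares them with SciPy's result. The helper name `welch_t_test` and the sample values are made up for illustration; they are not part of the lab dataset.

```python
import numpy as np
from scipy import stats

def welch_t_test(sample1, sample2):
    """Two-sided Welch's t-test computed by hand (illustrative helper)."""
    x1 = np.asarray(sample1, dtype=float)
    x2 = np.asarray(sample2, dtype=float)
    n1, n2 = len(x1), len(x2)

    # squared standard error of each sample mean, using the unbiased variance
    v1 = x1.var(ddof=1) / n1
    v2 = x2.var(ddof=1) / n2

    # test statistic: difference in sample means over its standard error
    t_stat = (x1.mean() - x2.mean()) / np.sqrt(v1 + v2)

    # Welch-Satterthwaite approximation of the degrees of freedom
    dof = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))

    # two-sided p-value from Student's t distribution
    p_value = 2 * stats.t.sf(abs(t_stat), dof)
    return t_stat, p_value, dof

# made-up daily counts, NOT taken from the NYT dataset
rng = np.random.default_rng(0)
demo_half1 = rng.poisson(lam=50, size=14)
demo_half2 = rng.poisson(lam=60, size=14)

t_manual, p_manual, dof = welch_t_test(demo_half1, demo_half2)
t_scipy, p_scipy = stats.ttest_ind(demo_half1, demo_half2, equal_var=False)

print("manual:", t_manual, p_manual)
print("scipy: ", t_scipy, p_scipy)
```

The two printed lines should agree up to floating-point error. Note that the statistic is negative whenever the first sample's mean is smaller than the second's.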

**Run the cell below to perform this T-test.**

In [None]:
## TWO-SAMPLE T-TEST (WELCH'S T-TEST) ##

# calculate test statistic and p-value

t_val, p_val = ttest_ind(ny_half1_case_vals, ny_half2_case_vals, equal_var=False)

print("Test statistic:", t_val)
print("p-value:", p_val)

$\therefore$ with a significance level of $\alpha = 0.05$ do we ***reject*** or ***fail to reject*** the null hypothesis (i.e. is our p-value less than or greater than our significance level)?

<p>
<details><summary>Click to show answer</summary><br>

```
Test statistic: -9.714027593945213
p-value: 1.9538601309935193e-07
```

`1.9538601309935193e-07` is less than 0.05, so we ***reject*** the null hypothesis. There is statistically significant evidence that the mean number of daily reported cases in the 1st half of March is *different from* the mean number of daily reported cases in the 2nd half of March in New York. Specifically, because the sample mean for the 2nd half is greater (the test statistic is negative), the evidence suggests that the mean number of daily reported cases was higher in the most recent two weeks than in the two weeks before.

</details>
</p>
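
The decision rule used above can be sketched in code. The helper name `summarize_test` and the example values are hypothetical and chosen only for illustration. The sketch relies on the fact that the statistic returned by `ttest_ind(sample1, sample2)` has the sign of `mean(sample1) - mean(sample2)`.

```python
def summarize_test(t_val, p_val, alpha=0.05):
    """Turn a two-sided t-test result into a decision and a direction (illustrative helper)."""
    # compare the p-value with the significance level
    decision = "reject H0" if p_val < alpha else "fail to reject H0"

    # the sign of the statistic tells us which sample mean is larger
    if t_val < 0:
        direction = "sample 1 mean is lower than sample 2 mean"
    elif t_val > 0:
        direction = "sample 1 mean is higher than sample 2 mean"
    else:
        direction = "sample means are equal"
    return decision, direction

# made-up values for illustration only
print(summarize_test(t_val=-2.4, p_val=0.03))
print(summarize_test(t_val=0.7, p_val=0.49))
```

A p-value above $\alpha$ leads to "fail to reject" rather than "accept": the test cannot show that the two means are equal, only that the data do not provide enough evidence of a difference.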

### Task 1b: Difference between mean number of *deaths* in 1st half of March and mean number of *deaths* in 2nd half of March (NY)

Now, we will compare the means of the number of *deaths* reported each day between the 1st half of March and the 2nd half of March.

**Complete the code in the cell below** in order to pull out the death values in our dataframes that correspond to each 14-day period. Then compute the mean of the daily reported deaths in each time period and print these mean values out.

In [None]:
# pull out death values for each 14-day period

ny_half1_death_vals = 
ny_half2_death_vals = 

# compute mean daily deaths for each 14-day period

mean_ny_half1 = 
mean_ny_half2 = 

print(mean_ny_half1)
print(mean_ny_half2)

<p>
<details><summary>Click to show answer</summary><br>

```
1.9285714285714286
136.71428571428572
```

</details>
</p>

<p>
<details><summary>Click to show solution</summary><br>

```python
# pull out death values for each 14-day period

ny_half1_death_vals = ny_half1["deaths"]
ny_half2_death_vals = ny_half2["deaths"]

# compute mean daily deaths for each 14-day period

mean_ny_half1 = np.mean(ny_half1_death_vals)
mean_ny_half2 = np.mean(ny_half2_death_vals)

print(mean_ny_half1)
print(mean_ny_half2)
```

</details>
</p>

**In the cell below complete the code to create a *box plot* to observe the distributions of the daily reported deaths in New York.**

In [None]:
# create a box plot to observe distributions of daily death counts
# for 1st half and 2nd half of March respectively

fig, axs = plt.subplots(figsize=(12,7))

axs.boxplot([sample1_vals_here, sample2_vals_here]) ##fill in correct variables

plt.title("Number of COVID deaths reported per day for last 4 weeks (NY)", fontsize=20)
axs.set_xticklabels(["Mar5-Mar18", "Mar19-Apr1"])
axs.set_ylabel("Number of deaths reported per day", fontsize=18)
axs.tick_params(labelsize=15)
plt.show()

<p>
<details><summary>Click to show solution</summary><br>

```python
fig, axs = plt.subplots(figsize=(12,7))

axs.boxplot([ny_half1_death_vals, ny_half2_death_vals])

plt.title("Number of COVID deaths reported per day for last 4 weeks (NY)", fontsize=20)
axs.set_xticklabels(["Mar5-Mar18", "Mar19-Apr1"])
axs.set_ylabel("Number of deaths reported per day", fontsize=18)
axs.tick_params(labelsize=15)
plt.show()
```

</details>
</p>

**Finally, complete the code in the cell below to perform a T-test.**

In [None]:
# calculate test statistic and p-value

##fill in correct variables
t_val, p_val = ttest_ind(sample1_vals_here, sample2_vals_here, equal_var=False)

print("Test statistic:", t_val)
print("p-value:", p_val)

$\therefore$ with a significance level of $\alpha = 0.05$ do we ***reject*** or ***fail to reject*** the null hypothesis (i.e. is our p-value less than or greater than our significance level)?

<p>
<details><summary>Click to show answer</summary><br>

```
Test statistic: -4.0600322328917064
p-value: 0.00134719357246435
```

`0.00134719357246435` is less than 0.05, so we ***reject*** the null hypothesis. There is statistically significant evidence that the mean number of daily reported deaths in the 1st half of March is *different from* the mean number of daily reported deaths in the 2nd half of March in New York.

</details>
</p>


<p>
<details><summary>Click to show solution</summary><br>

```python
t_val, p_val = ttest_ind(ny_half1_death_vals, ny_half2_death_vals, equal_var=False)
```

</details>
</p>

## Task 2: Comparing mean deaths in Nebraska between first half and second half of March

Practice the hypothesis test carried out above on a new state, Nebraska.

**Complete the code in the cells below.**

In [None]:
# obtain entries for NE

nebraska = 

# obtain entries for the past 4 weeks
# last 4 weeks = last 28 days = last 28 entries

nebraska_4weeks = 

print(nebraska_4weeks.shape)
display(nebraska_4weeks)

In [None]:
# split into two dataframes each corresponding to 14 days

ne_half1 = 
ne_half2 = 

print(ne_half1.shape, ne_half2.shape)
display(ne_half1, ne_half2)

In [None]:
# pull out death values for each 14-day period

ne_half1_death_vals = 
ne_half2_death_vals = 

# compute mean daily deaths for each 14-day period

mean_ne_half1 = 
mean_ne_half2 = 

print(mean_ne_half1)
print(mean_ne_half2)

In [None]:
# create a box plot to observe distributions of daily death counts
# for 1st half and 2nd half of March respectively

fig, axs = plt.subplots(figsize=(12,7))
axs.boxplot([sample1_vals_here, sample2_vals_here])
plt.title("Number of COVID deaths reported per day for last 4 weeks (NE)", fontsize=20)
axs.set_xticklabels(["Mar5-Mar18", "Mar19-Apr1"])
axs.set_ylabel("Number of deaths reported per day", fontsize=18)
axs.tick_params(labelsize=15)
plt.show()

In [None]:
# calculate test statistic and p-value for hypothesis test of difference in means

t_val, p_val = ttest_ind(sample1_vals_here, sample2_vals_here, equal_var=False)

print("Test statistic:", t_val)
print("p-value:", p_val)

<p>
<details><summary>Click to show answer</summary><br>

```
Test statistic: -2.7993260539543776
p-value: 0.014678471312875647
```

`0.014678471312875647` is less than 0.05, so we ***reject*** the null hypothesis. There is statistically significant evidence that the mean number of daily reported deaths in the 1st half of March is *different from* the mean number of daily reported deaths in the 2nd half of March in Nebraska.

</details>
</p>
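
Since Task 1 and Task 2 repeat the same steps, they could be wrapped in one function for reuse with other states or columns. The sketch below is one possible way to do that, not part of the lab itself. The name `compare_halves` is hypothetical, and the demo dataframe is made up rather than read from `state_data_perday.csv`.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

def compare_halves(per_day, state, column, days=28):
    """Welch's t-test between the two halves of a state's last `days` rows (illustrative helper)."""
    # keep the most recent `days` entries for the requested state
    recent = per_day[per_day.state == state].tail(days)

    # split the daily values into two equal-length periods
    half = days // 2
    first = recent[column].iloc[:half]
    second = recent[column].iloc[half:]

    # two-sided test that does not assume equal variances
    t_val, p_val = ttest_ind(first, second, equal_var=False)
    return first.mean(), second.mean(), t_val, p_val

# made-up data: 14 low-count days followed by 14 higher-count days
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "state": ["Demo State"] * 28,
    "deaths": np.concatenate([rng.poisson(2, 14), rng.poisson(20, 14)]),
})

mean1, mean2, t_val, p_val = compare_halves(demo, "Demo State", "deaths")
print("means:", mean1, mean2)
print("Test statistic:", t_val)
print("p-value:", p_val)
```

With the lab's dataframe, a call such as `compare_halves(us_state_perday, "Nebraska", "deaths")` would follow the same steps as the cells above.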