
# Confidence Intervals (cont.) with COVID-19 Hospitalization Data

The COVID-19 Hospitalization Surveillance Network (COVID-NET) is a network that conducts active, population-based surveillance for laboratory-confirmed COVID-19-associated hospitalizations among children and adults.

COVID-NET is the CDC’s source for important data on rates of hospitalizations associated with COVID-19. **The monthly hospitalization rates represent the number of COVID-19 hospitalizations (not percent) per 100,000 population in the surveillance area.**
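To make the "per 100,000" unit concrete, here is a minimal sketch of the arithmetic. The counts and the function name `rate_per_100k` are hypothetical and are not taken from COVID-NET.

```python
def rate_per_100k(hospitalizations, population):
    """Return the number of hospitalizations per 100,000 people."""
    return hospitalizations * 100_000 / population

# Hypothetical surveillance area: 150 hospitalizations among 500,000 residents.
example_rate = rate_per_100k(150, 500_000)
print(example_rate)  # 30.0
```

A rate of 30 therefore means 30 hospitalizations for every 100,000 people in the surveillance area, not 30 percent.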

We will continue using this data to conduct hypothesis tests and construct confidence intervals.


## Load the data

In [2]:
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt
import numpy as np

In [3]:
file_path = "https://raw.githubusercontent.com/UM-Data-Science-101/data-FA2025/refs/heads/main/Monthly_Rates_of_Laboratory-Confirmed_COVID-19_Hospitalizations_from_the_COVID-NET_Surveillance_System_Cleaned.csv"
df = pd.read_csv(file_path)

In [4]:
df.shape

(41787, 9)

In [5]:
df.columns

Index(['State', 'Season', 'AgeCategory_Legend', 'Sex_Label', 'Race_Label',
       'MonthlyRate', 'Type', 'Year', 'Month'],
      dtype='object')

In [6]:
df.head()

| | State | Season | AgeCategory_Legend | Sex_Label | Race_Label | MonthlyRate | Type | Year | Month |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Utah | 2021-22 | 6mo-<12 months | All | All | 39.7 | Crude Rate | 2021 | 11 |
| 1 | Utah | 2021-22 | 6mo-<12 months | All | All | 0.0 | Crude Rate | 2021 | 10 |
| 2 | Utah | 2021-22 | 6mo-<12 months | All | All | 53.0 | Crude Rate | 2022 | 2 |
| 3 | Utah | 2021-22 | 6mo-<12 months | All | All | 74.0 | Crude Rate | 2022 | 6 |
| 4 | Utah | 2021-22 | 6mo-<12 months | All | All | 29.6 | Crude Rate | 2022 | 8 |


## Confidence Intervals

### Question: Is the average monthly COVID-19 hospitalization rate for White, non-Hispanic Californians during the 2022-23 season 20 per 100,000? Use $\alpha = 0.0015$.

Step 1: State the null and alternative hypotheses.

The null hypothesis is that the average monthly COVID-19 hospitalization rate for White, non-Hispanic Californians during the 2022-23 season is 20 per 100,000. The alternative hypothesis is that this average rate is not 20 per 100,000.

<details>

$$
\begin{aligned}
H_0 &: \text{The average monthly COVID-19 hospitalization rate for White Californians during the 2022-23 season is 20 per 100,000} \\
H_1 &: \text{The average monthly COVID-19 hospitalization rate for White Californians during the 2022-23 season is not 20 per 100,000}
\end{aligned}
$$
</details>
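In symbols, writing $\mu$ for the true mean monthly rate in this group (a notation introduced here for illustration), the two-sided hypotheses can be sketched as:

$$
H_0: \mu = 20 \qquad \text{versus} \qquad H_1: \mu \neq 20
$$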

Step 2: Create a subset of df (Hint: You only want the entries where the State is California, the Season is 2022-23, the Age Category is All, the Sex Label is All, the Race Label is White, non-Hispanic, and the Type of rate is Crude Rate).

In [7]:
subset = df[
    (df['State'] == 'California') &
    (df['Season'] == '2022-23') &
    (df['AgeCategory_Legend'] == 'All') &
    (df['Sex_Label'] == 'All') &
    (df['Race_Label'] == 'White, non-Hispanic') &
    (df['Type'] == 'Crude Rate')
]

<details>

````
subset = df[
    (df['State'] == 'California') &
    (df['Season'] == '2022-23') &
    (df['AgeCategory_Legend'] == 'All') &
    (df['Sex_Label'] == 'All') &
    (df['Race_Label'] == 'White, non-Hispanic') &
    (df['Type'] == 'Crude Rate')
]
````
</details>

Step 3: Calculate the SEM.

In [8]:
californian_rates = subset['MonthlyRate']
# Sample standard deviation (pandas uses ddof=1 by default) divided by sqrt(n)
sem = californian_rates.std() / np.sqrt(len(californian_rates))

<details>

````
californian_rates = subset['MonthlyRate']
sem = californian_rates.std() / np.sqrt(len(californian_rates))
````
</details>

Step 4: Calculate the confidence interval.

In [9]:
middle = californian_rates.mean()
# Sample mean plus or minus 3 standard errors
(middle - 3*sem, middle + 3*sem)

(np.float64(19.009951103279338), np.float64(36.07338223005398))

<details>

````
xbar = californian_rates.mean()
(xbar - 3*sem, xbar + 3*sem)
````

</details>
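Steps 3 and 4 compute $\bar{x} \pm k \cdot \text{SEM}$ with $\text{SEM} = s / \sqrt{n}$, and can be wrapped into one small helper. The sketch below uses only the standard library and made-up rates; the function name `mean_interval` and the sample values are assumptions for illustration, not part of the COVID-NET data. `statistics.stdev` computes the sample standard deviation (dividing by $n - 1$), which matches the default behavior of pandas `Series.std()`.

```python
import math
import statistics

def mean_interval(values, k=3):
    """Return (mean - k*SEM, mean + k*SEM) for a sequence of numbers."""
    n = len(values)
    xbar = statistics.mean(values)
    sem = statistics.stdev(values) / math.sqrt(n)  # sample SD / sqrt(n)
    return (xbar - k * sem, xbar + k * sem)

# Made-up monthly rates, for illustration only.
toy_rates = [10.0, 20.0, 30.0, 40.0]
lower, upper = mean_interval(toy_rates, k=3)
print(lower, upper)
```

The interval is always centered on the sample mean, and a larger `k` widens it.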

Step 5: What does this confidence interval tell us? Does the hypothesized rate of 20 fall inside or outside of the interval? What conclusion can you draw?

The confidence interval tells us that we are 99.7% confident that the true mean hospitalization rate is between 19.0 and 36.1 per 100,000. The hypothesized rate of 20 falls inside the confidence interval, so we cannot say that the true mean rate is significantly different from 20.

<details>

This confidence interval means that we are 99.7% confident that the true mean hospitalization rate lies between 19.01 and 36.07 per 100,000. The hypothesized rate of 20 falls inside of the confidence interval. Therefore, we cannot say that the true mean rate is significantly different from 20.

</details>
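The multiplier 3 and the "99.7%" figure are linked through the normal approximation: about 99.7% of a normal distribution lies within 3 standard deviations of its mean. The following sketch checks this with the standard library's `statistics.NormalDist` and applies the inside/outside decision rule to the interval reported in Step 4. The helper names `coverage` and `contains` are hypothetical.

```python
from statistics import NormalDist

def coverage(k):
    """Probability that a standard normal value lies within k SDs of 0."""
    z = NormalDist()  # mean 0, standard deviation 1
    return z.cdf(k) - z.cdf(-k)

def contains(interval, value):
    """True if value lies inside the closed interval (lower, upper)."""
    lower, upper = interval
    return lower <= value <= upper

print(round(coverage(2), 4))  # about 0.9545
print(round(coverage(3), 4))  # about 0.9973

# Interval reported in Step 4 (rounded) and the hypothesized rate of 20.
reported_interval = (19.01, 36.07)
print(contains(reported_interval, 20))  # True -> fail to reject H0
```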

### Question: Is the average monthly COVID-19 hospitalization rate for male Michiganders between 2021 and 2023 greater than 30 per 100,000? Use $\alpha = 0.05$.

Conduct a hypothesis test on your own using confidence intervals.

In [11]:
subset = df[
    (df['State'] == 'Michigan') &
    (df['Year'].isin([2021, 2022, 2023])) &
    (df['AgeCategory_Legend'] == 'All') &
    (df['Sex_Label'] == 'Male') &
    (df['Race_Label'] == 'All') &
    (df['Type'] == 'Crude Rate')
]
michigan_rates = subset['MonthlyRate']
sem = michigan_rates.std() / np.sqrt(len(michigan_rates))

import math

middle = michigan_rates.mean()
(middle - 2*sem, math.inf)

(np.float64(12.616601664264316), np.float64(33.983398335735686))

We can be 95% confident that the true average monthly COVID-19 hospitalization rate for male Michiganders between 2021 and 2023 was at least 27.40 per 100,000. Because the hypothesized value of 30 lies above this lower bound, and therefore inside the interval, the evidence does not support the claim that the rate exceeded 30. We fail to reject the null hypothesis.

<details>

$$
\begin{aligned}
H_0 &: \text{The average monthly rate for male Michiganders between 2021 and 2023 is less than or equal to 30 per 100,000} \\
H_1 &: \text{The average monthly rate for male Michiganders between 2021 and 2023 is greater than 30 per 100,000}
\end{aligned}
$$

````
subset = df[
    (df['State'] == 'Michigan') &
    (df['Year'].isin([2021, 2022, 2023])) &
    (df['AgeCategory_Legend'] == 'All') &
    (df['Sex_Label'] == 'Male') &
    (df['Race_Label'] == 'All') &
    (df['Type'] == 'Crude Rate')
]

michigan_rates = subset['MonthlyRate']
sem = michigan_rates.std() / np.sqrt(len(michigan_rates))

import math

xbar = michigan_rates.mean()
(xbar - 2*sem, math.inf)
````
We can be 95% confident that the true average monthly COVID-19 hospitalization rate for male Michiganders between 2021 and 2023 was at least 27.40 per 100,000. Because the hypothesized value of 30 lies above this lower bound, and therefore inside the interval, the evidence does not support the claim that the rate exceeded 30. We fail to reject the null hypothesis.

</details>
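For a one-sided "greater than" claim, only the lower bound matters: the null hypothesis is rejected when the hypothesized value falls below the lower bound, that is, outside the interval. Here is a minimal sketch of that rule, using the standard library and made-up rates. The names `lower_bound` and `supports_greater_than` are hypothetical.

```python
import math
import statistics

def lower_bound(values, k=2):
    """Lower end of the one-sided interval (mean - k*SEM, infinity)."""
    sem = statistics.stdev(values) / math.sqrt(len(values))
    return statistics.mean(values) - k * sem

def supports_greater_than(bound, hypothesized):
    """Reject H0 (mean <= hypothesized) only if the whole interval is above it."""
    return bound > hypothesized

# Made-up rates, for illustration only.
toy_rates = [28.0, 31.0, 35.0, 26.0, 30.0]
bound = lower_bound(toy_rates)
print((bound, math.inf))
print(supports_greater_than(bound, 30))  # False: 30 is inside the interval
print(supports_greater_than(27.40, 30))  # False: lower bound from the written answer
```

Under a normal approximation, a one-sided bound at 2 SEM leaves roughly 2.3% in the tail, so a multiplier of 2 is somewhat more conservative than $\alpha = 0.05$ strictly requires.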