# Team Members
- [David Cherney](https://www.linkedin.com/in/dmcherney/)
- [Shania Thomas](https://www.linkedin.com/in/shania-thomas-atx22/)
- [Thomas Prich](https://www.linkedin.com/in/thomas-prich/)
- [Afolabi Cardoso](https://www.linkedin.com/in/afolabi-cardoso/)

---
## Content

[Problem Statement](#Problem-Statement) | [Data](#DATA) | [Methodology](#Methodology) | [Conclusions and Recommendations](#Conclusions-and-Recommendations)

---

##  Problem Statement

Do dynamics-specific models or generic time series models provide better next-day predictions of COVID-19 case load for a state population?

### Background
<a href="https://en.wikipedia.org/wiki/Autoregressive_integrated_moving_average">ARIMA models</a> are a general tool for time series prediction that take into account past trends in a variety of ways. <a href="https://en.wikipedia.org/wiki/Compartmental_models_in_epidemiology#The_SIR_model_without_vital_dynamics">SIR models</a>, by contrast, aim to model the dynamics of the spread of disease in a finite population. By comparing SIR and ARIMA models using national COVID-19 data from the CDC, we want to determine whether SIR or ARIMA is a better predictor of new COVID-19 infections.

The Federal Department of Health has put together a team of data scientists to help investigate the COVID-19 outbreak in 2020 and 2021. By comparing SIR and ARIMA predictions with the actual data for each state, we look to determine whether the SIR model overpredicts the total number of COVID-19 cases.

---
## DATA

[United States COVID-19 Cases and Deaths by State over Time](https://data.cdc.gov/Case-Surveillance/United-States-COVID-19-Cases-and-Deaths-by-State-o/9mfq-cb36)


[Population data from census.gov](https://www.census.gov/data/datasets/time-series/demo/popest/2010s-state-total.html#par_textimage_1873399417)

---
## Methodology

 
See <a href="https://github.com/davidcherney/COVID-Infection-Predictions-from-SIR-and-ARIMA/blob/master/Intro_to_SIR_math.pdf">Intro_to_SIR_math.pdf</a> for an introduction to SIR models. 

---
### Data Gathering and Cleaning
Using the [requests](https://docs.python-requests.org/en/latest/#) Python library, we collected
- the cumulative number of COVID-19 cases in each state on each day from Jan 23, 2020 to Dec 29, 2021 from [data.cdc.gov](https://data.cdc.gov/Case-Surveillance/United-States-COVID-19-Cases-and-Deaths-by-State-o/9mfq-cb36).
- the cumulative number of people with a completed vaccination schedule in each state on each day from

We collected data from multiple government sources, the bulk of it from [data.cdc.gov](https://data.cdc.gov/Case-Surveillance/United-States-COVID-19-Cases-and-Deaths-by-State-o/9mfq-cb36). Using the [requests](https://docs.python-requests.org/en/latest/#) Python library, we fetched the total number of reported COVID-19 cases in all states from Jan 23, 2020 to Dec 29, 2021. We also gathered the total number of vaccinations administered from the day the vaccine was made public. This process can be seen in the notebook titled [01 data gathering and cleaning](http://localhost:8890/lab/tree/code/01%20data%20gathering%20and%20cleaning.ipynb).
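As a rough illustration of the gathering step, the sketch below queries the CDC dataset with requests. It is not the notebook's code: the JSON endpoint URL, the `state` filter column, and the helper names `build_params` and `fetch_state_cases` are assumptions based on the usual Socrata API layout for dataset `9mfq-cb36`, so check the dataset page for the current API details before relying on it.

```python
import requests

# Assumed Socrata JSON endpoint for dataset 9mfq-cb36 (not taken from the notebook).
CDC_URL = "https://data.cdc.gov/resource/9mfq-cb36.json"


def build_params(state, limit=50000):
    """Build query parameters for one state (hypothetical helper).

    Assumes the table has `state` and `submission_date` columns.
    """
    return {"state": state, "$order": "submission_date", "$limit": limit}


def fetch_state_cases(state, limit=50000, timeout=30):
    """Download the rows for one state as a list of dicts (hypothetical helper)."""
    response = requests.get(CDC_URL, params=build_params(state, limit), timeout=timeout)
    response.raise_for_status()  # fail loudly on HTTP errors
    return response.json()
```

Calling `fetch_state_cases("TX")` would return one record per reporting day, which can then be passed to `pandas.DataFrame` for cleaning.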


---

### Data Feature Engineering

- The data from [data.cdc.gov](https://data.cdc.gov/Case-Surveillance/United-States-COVID-19-Cases-and-Deaths-by-State-o/9mfq-cb36) reports only the total number of people infected to date, a cumulative sum stored in the tot_cases column. The SIR model, however, needs the number of people with an active infection on each day. To engineer this feature, we applied the pandas .diff(14) method to the tot_cases column and assigned the result to an I_actual column, under the assumption that a COVID-19 infection lasts 14 days.

- Using the pandas to_datetime function, we converted the submission_date column to datetime format and set it as the index.

- Using the states, I_actual, and total vaccinated columns, we created a new dataframe in which each state's number of currently infected people on each day (I) and number of vaccinated people on each day (V) are their own columns. This dataset is called shabang (because it made us smile in a needed time) and is stored in the data folder.
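The feature engineering steps above can be sketched as follows. This is a minimal illustration on a made-up table, not the project's notebook code: the toy values, the `state` column name, the per-state `groupby`, and the variable name `I_by_state` are assumptions added so the example runs on its own.

```python
import pandas as pd

# Toy stand-in for the CDC table; all values are made up for illustration.
dates = pd.date_range("2020-03-01", periods=20).strftime("%m/%d/%Y")
raw = pd.DataFrame({
    "submission_date": list(dates) * 2,
    "state": ["TX"] * 20 + ["NY"] * 20,
    "tot_cases": list(range(0, 200, 10)) + list(range(0, 400, 20)),
})

# Convert the date strings to datetimes and sort so each state is in time order.
raw["submission_date"] = pd.to_datetime(raw["submission_date"], format="%m/%d/%Y")
raw = raw.sort_values(["state", "submission_date"])

# Active infections: cases reported in the last 14 days (assumed infection length).
# Grouping by state (an assumption here) keeps the difference from crossing states.
raw["I_actual"] = raw.groupby("state")["tot_cases"].diff(14)

# One column per state, indexed by date.
I_by_state = raw.pivot(index="submission_date", columns="state", values="I_actual")
```

The first 14 rows of each state are NaN because there is no value 14 days earlier to subtract from.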

From those engineered features, I and V, we ran one-day-ahead predictions for each state to generate
- the SIR prediction of the number of infections, I_SIR, in each state on each day,
- the SIR prediction of the number of susceptible people, S_SIR, in each state on each day,
- the ARIMA prediction of the number of infections, I_arima, in each state on each day.

Further, from that data we generated
- the binary feature H-I for each day in each state, set to 1 if S had reached the herd immunity threshold on that day in that state.
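The one-day SIR predictions and the herd immunity flag described above can be sketched as follows. This is a textbook forward-Euler step of the basic SIR model, not necessarily the exact update used in the project: the parameter values `beta` and `gamma`, the function names, and the omission of vaccinations V are all assumptions made to keep the example short.

```python
def sir_one_day(S, I, N, beta, gamma):
    """Advance a basic SIR model by one day (vaccination is ignored here).

    beta: assumed daily transmission rate; gamma: assumed daily recovery rate.
    Returns the predicted (S, I) for the next day.
    """
    new_infections = beta * S * I / N
    recoveries = gamma * I
    return S - new_infections, I + new_infections - recoveries


def herd_immunity_flag(S, N, beta, gamma):
    """Return 1 if S is at or below N * gamma / beta, else 0.

    Below this level the basic SIR model predicts that infections stop growing.
    """
    return 1 if S <= N * gamma / beta else 0


# Hypothetical inputs: population of 1000, 10 infected, 14-day infections.
S_SIR, I_SIR = sir_one_day(S=990, I=10, N=1000, beta=0.3, gamma=1 / 14)
```

With these inputs the threshold is about 238 susceptible people, so a day with S = 990 gets a flag of 0 and a day with S = 200 gets a flag of 1.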

---
### Visualizations
Click [here](https://public.tableau.com/app/profile/thomas.prich/viz/Magic8BallsGroupProject/Dashboard1) to view our Tableau dashboard.

---
## Conclusions and Recommendations

In our comparison of SIR and ARIMA models, SIR consistently overshoots the actual values. Therefore, we recommend that the Federal Department of Health use SIR to show worst-case scenarios and ARIMA for more accurate predictions. The downside is that ARIMA's best predictions are short term.
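One way to quantify "overshoots" is the mean signed error (bias) of each model, alongside the mean absolute error. The sketch below shows the arithmetic on made-up numbers; they are not results from this project, and the function names are hypothetical.

```python
def mean_bias(predicted, actual):
    """Average of (prediction - actual); positive means overprediction."""
    return sum(p - a for p, a in zip(predicted, actual)) / len(actual)


def mean_absolute_error(predicted, actual):
    """Average size of the error, ignoring its direction."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


# Made-up values for illustration only, not project results.
toy_actual = [100, 120, 150]
toy_sir = [130, 160, 210]
toy_arima = [95, 125, 148]

sir_bias = mean_bias(toy_sir, toy_actual)
arima_bias = mean_bias(toy_arima, toy_actual)
arima_mae = mean_absolute_error(toy_arima, toy_actual)
```

A model that consistently overshoots has a clearly positive bias, while a model with errors in both directions has a bias near zero even if its absolute error is not.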

A possible reason why SIR consistently overpredicts is that it does not take into account changes in behavior such as:

- Social distancing
- Remote work/school
- Masking
- Hand washing
- Reduced travel
- Closure of public places 

and other COVID-19-related precautions. These precautions very likely prevented the U.S. from reaching the infection numbers predicted by SIR. SIR also does not take into account how far apart populations are. For example, rural parts of the country are less densely populated, so statewide spread there will be slower than in dense cities and states such as New York.

SIR also does not account for variants of the virus, which have different rates of transmission (R0).

Based on this, SIR isn't the best choice for actual predictions. However, it is appropriate for predicting worst-case scenarios. Next steps would include testing SIR models on "medium" time frames such as 1 week or 1 month, and further adjusting the SIR parameters to account for changes in behavior.