# COVID-19 Regression Analysis Project

# 0. Objectives and motivation

The objective of this project is to identify which policies had the greatest impact on COVID-19 cases / deaths in the first year of the pandemic (February - December of 2020). This notebook serves as a high-level summary of the project's methodology and results.

# 1. Data Sources


This project uses 2 datasets:


Notebooks 01 and 02 provide a detailed discussion of the data cleaning pipeline. Notebook 03 demos a few visualizations (this was originally a visualization project before the regression / correlation analysis took off). 

# 2. Correlations

The first attempt at a correlation analysis compared the number of cases on the day a policy was implemented with the number of cases 14 days later, expressed as an average rate of change: $(\text{cases at day of implementation} - \text{cases after day of implementation}) / \Delta t$, where $\Delta t$ is 14 days. It also looked at the average acceleration in cases / deaths: $(\text{cases at day of implementation} - \text{cases after day of implementation}) / \Delta t ^ 2$.
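As a minimal sketch of the two quantities above, the snippet below computes them for one policy in one county. The function name `case_change_metrics`, the dict-of-dates layout, and the sample numbers are assumptions made for illustration, not the project's actual code or data. The sign convention follows the formulas as written (cases at implementation minus cases afterwards), so a rise in cases gives negative values.

```python
from datetime import date, timedelta


def case_change_metrics(cases_by_date, policy_date, delta_t=14):
    """Return (average rate, average acceleration) over delta_t days.

    Both use the difference (cases at implementation - cases delta_t days later),
    matching the formulas in the text.
    """
    cases_at_start = cases_by_date[policy_date]
    cases_at_end = cases_by_date[policy_date + timedelta(days=delta_t)]
    diff = cases_at_start - cases_at_end
    return diff / delta_t, diff / delta_t ** 2


# Made-up series: 100 cases on the policy date, growing by 10 per day.
policy_date = date(2020, 4, 1)
cases = {policy_date + timedelta(days=i): 100 + 10 * i for i in range(15)}

rate, accel = case_change_metrics(cases, policy_date)
print(rate, accel)
```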

This analysis did not produce any notable results, so I decided to try some more sophisticated models.

# 3. Regression analysis

The next step was to fit a multiple linear regression model for the number of cases / deaths, where the input variables represent how long ago a policy was implemented.

$$
c_{\text{new}} = \omega_1 p_{1, 0-5} + \omega_2 p_{1, 6-10} + \omega_3 p_{1, 11-999} + \omega_4 p_{2, 0-5} + \dots
$$

where $\omega_n$ represents the coefficients / weights and $p_{i, \text{start}-\text{end}}$ is 1 if policy $i$ was implemented between *start* and *end* days ago (inclusive), and 0 otherwise. The set of start-end ranges is referred to as `bins_list` throughout the project. For example, let $p_1$ represent the start of a policy related to gyms and $p_2$ represent the end of a policy related to gyms, and let the `bins_list` in this analysis be [(0-3), (4-10), (11-999)]. If, in a given county, gyms were closed 30 days ago and reopened 2 days ago, then: $p_{1, 0-3} = p_{1, 4-10} = 0$,
$p_{1, 11-999} = 1$, $p_{2, 0-3} = 1$, and $p_{2, 4-10} = p_{2, 11-999} = 0$.
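The gym example can be reproduced with a short sketch of the bin encoding. The helper `encode_policy_bins` is a hypothetical name used only here, and representing "this event has not happened yet" as `None` is an assumption; the project's own implementation may differ.

```python
def encode_policy_bins(days_since, bins_list):
    """Return one 0/1 indicator per (start, end) bin, with inclusive bounds.

    days_since is the number of days since the policy event, or None if the
    event has not happened yet (in which case every indicator is 0).
    """
    return [
        1 if days_since is not None and start <= days_since <= end else 0
        for start, end in bins_list
    ]


bins_list = [(0, 3), (4, 10), (11, 999)]

p1 = encode_policy_bins(30, bins_list)  # gyms closed 30 days ago
p2 = encode_policy_bins(2, bins_list)   # gyms reopened 2 days ago
print(p1, p2)  # [0, 0, 1] [1, 0, 0]
```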

Additionally, each policy event is categorized by whether it marks the start or end of a policy and by whether the policy was implemented at the state or county level.

In an earlier version of this project, I attempted to run the regression analysis using all the available policies. That produced *number of policies* x *number of bins* features and led to some serious computational bottlenecks. In this version, I am doing runs with only a single policy, and will experiment with two- or three-policy runs in later versions.
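A single-policy run of this kind of model might look like the sketch below, which builds the 0/1 bin features and fits the weights by ordinary least squares with NumPy. Everything here is illustrative: the `design_matrix` helper, the `days_since` values, and the "true" weights are made up, and the target is generated synthetically so that the fit can be checked. It is not the project's actual pipeline or data.

```python
import numpy as np

bins_list = [(0, 5), (6, 10), (11, 999)]


def design_matrix(days_since_values, bins_list):
    """One row per observation, one 0/1 column per bin (inclusive bounds)."""
    rows = [
        [1.0 if d is not None and start <= d <= end else 0.0
         for start, end in bins_list]
        for d in days_since_values
    ]
    return np.array(rows)


# Hypothetical observations: days since the policy was implemented
# (None = not implemented yet, so all features are 0).
days_since = [None, 0, 3, 5, 6, 8, 10, 11, 20, 40]
X = design_matrix(days_since, bins_list)

# Synthetic target built from made-up weights, with no intercept term,
# mirroring the form of the regression equation above.
true_weights = np.array([5.0, -2.0, -8.0])
y = X @ true_weights

# Ordinary least squares: one weight per bin.
weights, _, rank, _ = np.linalg.lstsq(X, y, rcond=None)
print(weights)
```

With a single policy, the feature count equals the number of bins (three here). With all policies included it grows to *number of policies* x *number of bins*, which is the size that caused trouble in the earlier version.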

# 4. Results

See notebook 07 for a more comprehensive discussion.

- TODO: include visuals
- TODO: include a brief summary of the key findings

# 5. Next steps

- take the policies most highly correlated with changes in cases / deaths and experiment with different bin sets: optimize over 2-bin sets ([0-2, 3-999], [0-3, 4-999], etc.), then do the same with 3, 4, etc. bins
- introduce multiple policies into the analysis
- explore other models: there may be a way to treat the data for each policy as a token into a transformer or LSTM
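
For the first item above, one possible way to enumerate candidate bin sets is to choose the cut points between bins. This is only a sketch: the function name `candidate_bin_sets`, the `max_cut` search limit, and the use of 999 as the open-ended last day are assumptions based on the bin examples in this notebook.

```python
from itertools import combinations


def candidate_bin_sets(n_bins, max_cut=14, last_day=999):
    """Yield every set of n_bins contiguous bins covering days 0..last_day.

    Each cut point is the first day of a new bin; cut points are chosen
    from 1..max_cut.
    """
    for cuts in combinations(range(1, max_cut + 1), n_bins - 1):
        starts = (0,) + cuts
        ends = tuple(c - 1 for c in cuts) + (last_day,)
        yield list(zip(starts, ends))


two_bin_sets = list(candidate_bin_sets(2, max_cut=4))
three_bin_sets = list(candidate_bin_sets(3, max_cut=4))
for bin_set in two_bin_sets:
    print(bin_set)
```

Each candidate could then be passed to a single-policy regression run and the runs compared on a chosen fit metric.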