# Covid 19 Correlation Analysis Project

# 0. Objectives and motivation

The objective of this project is to identify which policies had the greatest impact on COVID-19 cases / deaths in the first year of the pandemic (February through December 2020). This notebook serves as a high-level summary of the project's methodology and results.

# 1. Data Sources


This project uses 2 datasets:


Notebooks 01 and 02 provide a detailed discussion of the data cleaning pipeline. Notebook 03 demos a few visualizations (this was originally a visualization project before the regression / correlation analysis took off). 

# 2. Average increase in cases / deaths after policy


The first attempt at this analysis compared the number of cases / deaths on the day a policy was implemented with the number 14 days later. Additionally, I looked at the "acceleration" in cases / deaths: that is, the difference between the velocity of cases / deaths (computed with a 1-day delta) on the day the policy was implemented and the velocity 14 days later.
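The two quantities described above can be sketched as follows. This is a minimal illustration, not the code from notebook 04: the function names, the use of a backward difference for the velocity, and the toy series are all assumptions.

```python
def change_after_policy(series, policy_idx, window=14):
    """Difference between the value `window` days after the policy and the value on the policy day."""
    return series[policy_idx + window] - series[policy_idx]


def velocity(series, idx, delta=1):
    """Rate of change at `idx`, using a backward difference over `delta` days (assumed convention)."""
    return (series[idx] - series[idx - delta]) / delta


def acceleration_after_policy(series, policy_idx, window=14, delta=1):
    """Difference between the velocity `window` days after the policy and the velocity on the policy day."""
    return velocity(series, policy_idx + window, delta) - velocity(series, policy_idx, delta)


# Toy cumulative counts (synthetic, for illustration only): value on day d is d squared.
cases = [d * d for d in range(20)]
policy_day = 3

diff = change_after_policy(cases, policy_day)
accel = acceleration_after_policy(cases, policy_day)
```

A positive `accel` means the series was growing faster 14 days after the policy than on the day it was implemented.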

Almost none of the results were significant, and the few that were are likely due to spurious correlations (see notebook 04). 

# 3. Regression analysis

The next step was to fit a multiple linear regression model for the number of new cases / deaths, where the input variables represent how long ago a policy was implemented.

$$
c_{new} = \omega_1 p_{1, 0-5} + \omega_2 p_{1, 6-10} + \omega_3 p_{1, 11-999} + \omega_4 p_{2, 0-5} + ...
$$

where $\omega_n$ are the coefficients / weights and $p_{i, \text{start}-\text{end}}$ is 1 if policy $i$ was implemented between *start* and *end* days ago (inclusive) and 0 otherwise. For example, let $p_1$ represent the start of a policy related to gyms and $p_2$ represent the end of a policy related to gyms. For this example, let the set of bins be [(0-3), (4-10), (11-999)]. If, in a given county, gyms were closed 30 days ago and reopened 2 days ago, then: $p_{1, 0-3} = p_{1, 4-10} = 0$,
$p_{1, 11-999} = 1$, $p_{2, 0-3} = 1$, and $p_{2, 4-10} = p_{2, 11-999} = 0$.
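The gym example above can be reproduced with a small helper. This is a sketch under assumptions: the function name `bin_indicators`, the inclusive bin boundaries, and the convention that a policy not yet implemented yields all zeros are illustrative choices, not necessarily what the project's notebooks do.

```python
# Bin set from the gym example: (start, end) in days, both ends inclusive.
BINS = [(0, 3), (4, 10), (11, 999)]


def bin_indicators(days_since, bins=BINS):
    """Return one 0/1 indicator per bin for a policy implemented `days_since` days ago.

    `days_since=None` is treated as "policy not implemented yet" (assumed convention)
    and yields all zeros.
    """
    if days_since is None:
        return [0] * len(bins)
    return [1 if start <= days_since <= end else 0 for start, end in bins]


p1 = bin_indicators(30)  # gyms closed 30 days ago
p2 = bin_indicators(2)   # gyms reopened 2 days ago
```

Here `p1` corresponds to $(p_{1, 0-3}, p_{1, 4-10}, p_{1, 11-999})$ and `p2` to $(p_{2, 0-3}, p_{2, 4-10}, p_{2, 11-999})$.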

Additionally, each policy is categorized by whether it marks the start or the end of a measure and by whether it was implemented at the state or county level.

In an earlier version of this project, I attempted to run the regression using all the available policies at once, which gave *number of policies* x *number of bins* features. This caused serious bottlenecks as the number of bins grew (e.g. generating dataframes with 500+ columns). In this version, each run uses only a single policy; I will experiment with 2- or 3-policy runs in later versions.
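A single-policy run of this kind might look like the sketch below. It uses `numpy.linalg.lstsq` rather than whatever regression library the notebooks actually use, the helper name `fit_ols` is hypothetical, and the data is synthetic. The intercept (bias) column is an assumption, inferred from the plan in section 5 to re-run the analysis without a bias term.

```python
import numpy as np


def fit_ols(X, y):
    """Fit y ~ X @ w + b by ordinary least squares; return (weights, bias, r_squared)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Append a column of ones so the last coefficient acts as the bias term.
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    pred = A @ coef
    ss_res = float(np.sum((y - pred) ** 2))
    ss_tot = float(np.sum((y - y.mean()) ** 2))
    return coef[:-1], float(coef[-1]), 1.0 - ss_res / ss_tot


# Synthetic rows: one policy, three bin indicators per observation.
# Rows of all zeros are observations made before the policy existed.
X = [[0, 0, 0], [0, 0, 0],
     [1, 0, 0], [1, 0, 0],
     [0, 1, 0], [0, 1, 0],
     [0, 0, 1], [0, 0, 1]]
# Noise-free toy target: 10 + 5*p_bin1 + 2*p_bin2 - 3*p_bin3.
y = [10, 10, 15, 15, 12, 12, 7, 7]

weights, bias, r2 = fit_ols(X, y)
```

With one policy and three bins the design matrix has only four columns, which is what keeps single-policy runs cheap compared with the 500+ column version.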

# 4. Results

All of the bin+policy combinations resulted in $R^2$ values < ~0.1. The results for each dependent variable are summarized below:


| dependent variable         | approx highest $R^2$ |
| ------------------         | -------------------- |
| new cases                  | 0.35                 |
| new cases (7 day average)  | 0.1                  |
| new deaths                 | 0.05                 |
| new deaths (7 day average) | 0.025                |

Generally speaking, almost every variable was statistically significant (at the 99% confidence level), with many p-values being effectively zero. An analysis of these p-values is in progress.

# 5. Next steps / Ideas

- finish analyzing the p-values. Use the significant coefficients to analyze which policies had the greatest effect.
    - change the p-value analysis to account for the fact that many hypothesis tests have been conducted (e.g. Tukey's HSD)
- re-run the analysis without a bias term
- take the most significant policies and experiment with different bin sets. Optimize for 2 bins ([0-2, 3-999], [0-3, 4-999], etc.) and do the same with 3, 4, etc. bins
- introduce multiple policies into the analysis
- explore other models: there may be a way to treat the data for each policy as a token fed into a transformer or LSTM. We can then take the best performing model and feed it one policy at a time to study the predicted effect of each policy in isolation.
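As one possible starting point for the multiple-testing item above, here is a sketch of the Holm-Bonferroni step-down procedure. Note that this is a different technique from Tukey's HSD, chosen here only because it applies directly to a list of regression p-values. The function name and the p-values are illustrative, and `alpha=0.01` mirrors the 99% level used in section 4.

```python
def holm_reject(p_values, alpha=0.01):
    """Return one boolean per p-value: True where the null hypothesis is rejected
    under the Holm-Bonferroni step-down procedure at family-wise level `alpha`."""
    m = len(p_values)
    # Visit p-values from smallest to largest, remembering their original positions.
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        # The k-th smallest p-value (rank = k - 1) is compared with alpha / (m - rank).
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # once one test fails, all larger p-values fail too
    return reject


# Illustrative p-values only, not results from this project.
p_vals = [0.0001, 0.004, 0.02, 0.0009]
decisions = holm_reject(p_vals, alpha=0.01)
```

Coefficients whose entry in `decisions` is `True` would be the ones carried forward when ranking policies by effect.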