# Run regression analysis

Last edit: 2023-11-25


Run a linear regression using a "bin" model. The independent variables are one-hot encoded by the number of days elapsed since a given policy was implemented: each bin covers a range of days, and a row has a 1 in the bin that contains its date and 0 elsewhere. The choice of bins is treated as a hyperparameter to optimize. The transformed dataset has this structure:

| state | county | date | num_new_cases | policy name: 0-2 | policy name: 3-5 | policy name: 6-999 |
| ------------- | ------------ | -------  | ----------------- | ------ | ------ | ------ |
| state   | county  | date - 1 | # of new cases | 0 | 0 | 0 |
| state   | county  | date (policy enacted) | # of new cases | 1 | 0 | 0 |
| state   | county  | date + 1 | # of new cases | 1 | 0 | 0 |
| state   | county  | date + 2 | # of new cases | 1 | 0 | 0 |
| state   | county  | date + 3 | # of new cases | 0 | 1 | 0 |
| state   | county  | date + 4 | # of new cases | 0 | 1 | 0 |
| state   | county  | date + 5 | # of new cases | 0 | 1 | 0 |
| state   | county  | date + 6 | # of new cases | 0 | 0 | 1 |
| state   | county  | date + 7 | # of new cases | 0 | 0 | 1 |
| state   | county  | date + 8 | # of new cases | 0 | 0 | 1 |
|    |   | ... |  |  |  | |
| state   | county  | today | # of new cases | 0 | 0 | 1 |
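
The encoding in the table can be sketched in a few lines. This is a minimal illustration, not the project's actual implementation: the helper `encode_days_since_policy` and the `BINS` constant are hypothetical names, and the bin edges are assumed to be inclusive day ranges as the table suggests.

```python
from typing import List, Optional, Tuple

# Bin edges copied from the table above; each tuple is assumed to be an
# inclusive (start, end) range of days since the policy was enacted.
BINS: List[Tuple[int, int]] = [(0, 2), (3, 5), (6, 999)]


def encode_days_since_policy(
    days_since: Optional[int], bins: List[Tuple[int, int]] = BINS
) -> List[int]:
    """Return a one-hot row with a 1 in the bin that contains days_since.

    Days before enactment (negative values or None) fall in no bin,
    so the row is all zeros.
    """
    if days_since is None or days_since < 0:
        return [0] * len(bins)
    return [1 if start <= days_since <= end else 0 for start, end in bins]


# Reproduce the first rows of the table: the day before enactment through day 8.
rows = {day: encode_days_since_policy(day) for day in range(-1, 9)}
for day, row in rows.items():
    print(f"{day:+d}: {row}")
```

Keeping the bins as a plain list of ranges makes it easy to swap in a different bin set when tuning.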


In previous runs, the dataset included all policies, producing a feature matrix with $\text{num policies} \times \text{num bins}$ features, which became very computationally expensive.
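
To make the feature count concrete, the sketch below builds one column name per (policy, bin) pair. The policy names are placeholders, not entries from the real policy list, and the naming scheme is an assumption for illustration only.

```python
from itertools import product

# Hypothetical policy names; the bin labels are taken from the table above.
policies = ["policy_a", "policy_b", "policy_c"]
bin_labels = ["0-2", "3-5", "6-999"]

# One feature column per (policy, bin) pair, so the total is
# num policies x num bins.
feature_columns = [
    f"{policy}_{label}" for policy, label in product(policies, bin_labels)
]
print(len(feature_columns), feature_columns)
```

The column count grows linearly in both the number of policies and the number of bins, so finer bins across every policy quickly widen the feature matrix.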

In this iteration, we run linear regression with sklearn on three different bin "sets", which differ in the number of bins and in bin sizes.
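
As a rough sketch of that comparison, the snippet below fits `sklearn.linear_model.LinearRegression` once per bin set. Everything in it is an assumption for illustration: the three bin sets in `BIN_SETS`, the helper `one_hot_bins`, and the synthetic case counts stand in for the real processed data and the bin sets actually used.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical bin sets with different numbers and sizes of bins
# (inclusive day ranges since the policy was enacted).
BIN_SETS = {
    "coarse": [(0, 6), (7, 999)],
    "medium": [(0, 2), (3, 5), (6, 999)],
    "fine": [(0, 1), (2, 3), (4, 5), (6, 13), (14, 999)],
}


def one_hot_bins(days_since, bins):
    """One column per bin; rows before enactment (negative days) are all zeros."""
    days_since = np.asarray(days_since)
    columns = [(days_since >= lo) & (days_since <= hi) for lo, hi in bins]
    return np.column_stack(columns).astype(float)


# Synthetic stand-in data: cases drop by about 20 one week after enactment.
rng = np.random.default_rng(0)
days_since = np.arange(-10, 50)
new_cases = (
    100.0 - 20.0 * (days_since >= 7) + rng.normal(0.0, 5.0, size=days_since.size)
)

results = {}
for name, bins in BIN_SETS.items():
    X = one_hot_bins(days_since, bins)
    model = LinearRegression().fit(X, new_cases)
    results[name] = {
        "n_features": X.shape[1],
        "r2": model.score(X, new_cases),  # in-sample R^2, not a held-out score
        "coef": model.coef_,
    }
    print(name, results[name]["n_features"], round(results[name]["r2"], 3))
```

Because the pre-enactment rows are all zeros, they act as the baseline absorbed by the intercept, and each coefficient reads as the change in new cases relative to that baseline. A fair comparison between bin sets would use held-out data rather than the in-sample score shown here.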

In [5]:
from covid_project.policy_mappings import policy_dict_v2
from covid_project.data_utils import get_processed_data, get_all_policies
from tqdm.notebook import tqdm

import numpy as np