# Run regression analysis

Outside of this notebook, I ran a script to generate about 200 datasets for the regression analysis. Generating them took about 2 hours, and they occupy about 30 GB of disk space.

The regression models here use a "bin" model, in which the independent variables are one-hot encoded according to how many days have passed since a given policy was implemented. The choice of bins is treated as a hyperparameter to optimize. The table below illustrates the encoding for bins of 0-2, 3-5, and 6-999 days.

| state | county | date | num_new_cases | policy name<br>0-2 days | policy name<br>3-5 days | policy name<br>6-999 days |
| ----- | ------ | ---- | ------------- | ------ | ------ | ------ |
| state   | county  | date - 1 | # of new cases | 0 | 0 | 0 |
| state   | county  | policy enacted today | # of new cases | 1 | 0 | 0 |
| state   | county  | date + 1 | # of new cases | 1 | 0 | 0 |
| state   | county  | date + 2 | # of new cases | 1 | 0 | 0 |
| state   | county  | date + 3 | # of new cases | 0 | 1 | 0 |
| state   | county  | date + 4 | # of new cases | 0 | 1 | 0 |
| state   | county  | date + 5 | # of new cases | 0 | 1 | 0 |
| state   | county  | date + 6 | # of new cases | 0 | 0 | 1 |
| state   | county  | date + 7 | # of new cases | 0 | 0 | 1 |
| state   | county  | date + 8 | # of new cases | 0 | 0 | 1 |
|    |   | ... |  |  |  | |
| state   | county  | today | # of new cases | 0 | 0 | 1 |
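
The encoding in the table can be sketched in a few lines. This is only an illustration, not the project's own preprocessing code: the helper name `encode_days_since_policy` is hypothetical, and it assumes bins are inclusive on both ends and that days before the policy takes effect map to all zeros, as the first row of the table suggests.

```python
def encode_days_since_policy(days_since, bins):
    """One-hot encode the number of days since a policy was enacted.

    bins is a list of (start, end) tuples, inclusive on both ends.
    A negative value (policy not yet enacted) yields all zeros.
    """
    if days_since is None or days_since < 0:
        return [0] * len(bins)
    return [1 if start <= days_since <= end else 0 for start, end in bins]


bins = [(0, 2), (3, 5), (6, 999)]
for day in (-1, 0, 2, 3, 6):
    print(day, encode_days_since_policy(day, bins))
# -1 [0, 0, 0]
# 0 [1, 0, 0]
# 2 [1, 0, 0]
# 3 [0, 1, 0]
# 6 [0, 0, 1]
```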

Trial runs in the notebook used to develop the models indicate that most of these models will have very low R-squared values (<< 0.1) and very low p-values (<< 0.01). The low R-squared values are unsurprising, since we are using so few features to model such a complicated phenomenon. The low p-values, however, tell a much more interesting story: the fitted relationships are statistically significant even though they explain little of the variance, so the models are worth looking into further.
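
To see how a very low R-squared and a very low p-value can coexist, here is a small sketch on synthetic data. Everything in it is made up for illustration (the sample size, the effect size, and the noise level are assumptions, not values from this project), and the p-value uses a normal approximation that is reasonable only because the sample is large.

```python
import math
import random

random.seed(0)
n = 20_000

# Synthetic data: a binary indicator with a small true effect buried in large noise.
x = [random.randint(0, 1) for _ in range(n)]
y = [1.0 * xi + random.gauss(0, 5) for xi in x]

mean_x = sum(x) / n
mean_y = sum(y) / n
sxx = sum((xi - mean_x) ** 2 for xi in x)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
syy = sum((yi - mean_y) ** 2 for yi in y)

# Ordinary least squares fit for a single regressor.
slope = sxy / sxx
intercept = mean_y - slope * mean_x
ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
r_squared = 1 - ss_res / syy

# Standard error of the slope and a two-sided p-value (normal approximation).
se_slope = math.sqrt(ss_res / (n - 2) / sxx)
t_stat = slope / se_slope
p_value = math.erfc(abs(t_stat) / math.sqrt(2))

print(f"R^2 = {r_squared:.4f}, t = {t_stat:.1f}, p = {p_value:.2e}")
```

With many observations, even an effect that explains about 1% of the variance is easy to distinguish from zero. A low p-value therefore signals a detectable association, not a large one.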

In [35]:
### Run this command to generate the datasets used (if you're on Linux or Mac).
### If you're on Windows, run the command in the terminal instead.

#!python scripts/generate_dataset_for_regression.py

# 0. Imports

In [2]:
from covid_project.regression_funcs import fit_ols_model_single_policy
from covid_project.data_utils import get_all_policies, get_processed_data
from covid_project.policy_mappings import policy_dict_v1
from tqdm.notebook import tqdm
import os
import json

# 1. Run models

In [3]:
all_bins = [
        [(0, 14), (15, 999)],
        [(0, 14), (15, 28), (29, 999)],
        [(0, 7), (8, 14), (15, 999)],
        [(0, 7), (8, 14), (15, 28), (29, 60), (61, 999)],
    ]

all_policies = get_all_policies(policy_dict = policy_dict_v1,
                                min_samples = 3)

dep_vars = [
    'new_cases_1e6',
    'new_deaths_1e6',
    'new_cases_7day_1e6',
    'new_deaths_7day_1e6',
]
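
As a rough sketch of the size of this grid, the combinations can be enumerated with `itertools.product`. The policy count of 50 is read off the progress-bar output at the end of this notebook rather than computed here, so treat it as an assumption.

```python
from itertools import product

all_bins = [
    [(0, 14), (15, 999)],
    [(0, 14), (15, 28), (29, 999)],
    [(0, 7), (8, 14), (15, 999)],
    [(0, 7), (8, 14), (15, 28), (29, 60), (61, 999)],
]
dep_vars = [
    'new_cases_1e6',
    'new_deaths_1e6',
    'new_cases_7day_1e6',
    'new_deaths_7day_1e6',
]
n_policies = 50  # assumption: taken from the "running models" progress bar

# One result file per (bins, dependent variable) pair.
combos = list(product(all_bins, dep_vars))
print(len(combos))               # 16
print(len(combos) * n_policies)  # 800
```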

In [4]:
def run_model_on_policies(bins,
                          all_policies,
                          dep_var,
                          pbar=True):
    """Fit the regression model for every policy and return the results keyed by policy.

    Policies whose processed data cannot be read are reported and skipped.
    Set pbar=False to hide the progress bar.
    """
    results = dict()
    for policy in tqdm(all_policies, desc='running models', disable=not pbar):
        suc, data = get_processed_data(policy, bins)
        if not suc:
            print(f"[ERROR] data read failed: bins={bins}, policy={policy}, var={dep_var}")
            continue
        res = fit_ols_model_single_policy(data,
                                          policy,
                                          dep_var,
                                          True)
        results[policy] = res
    return results

In [5]:
def run_batch_of_models(all_bins,
                        all_policies,
                        dep_vars,
                        save_path="./data/regression_results/"):
    """Fit models for every (bins, dependent variable) pair and save each result set as JSON."""
    os.makedirs(save_path, exist_ok=True)

    for bins_list in tqdm(all_bins, desc="looping through bins"):
        for var in tqdm(dep_vars, desc="looping through dependent variables"):
            results = run_model_on_policies(bins=bins_list,
                                            all_policies=all_policies,
                                            dep_var=var,
                                            pbar=True)
            # Build a name such as "new_cases_1e6_bins=0-14_15-999.json"
            bins_label = "_".join(f"{start}-{end}" for start, end in bins_list)
            filename = f"{var}_bins={bins_label}.json"
            full_path = os.path.join(save_path, filename)

            with open(full_path, "w") as f:
                json.dump(results, f, indent=2)
In [6]:
run_batch_of_models(all_bins=all_bins,
                    all_policies=all_policies,
                    dep_vars=dep_vars,)

looping through bins:   0%|          | 0/4 [00:00<?, ?it/s]

looping through dependent variables:   0%|          | 0/4 [00:00<?, ?it/s]

running models:   0%|          | 0/50 [00:00<?, ?it/s]
