## Old discussion
Future work: modify the existing regression models. Instead of a bin model, set up a regressor variable that ramps up when a policy starts and ramps down when it stops.
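
As a rough illustration of the ramp idea, here is a minimal sketch of such a regressor. It is not part of the existing scripts: the function name `ramp_regressor`, the linear shape of the ramp, and the `ramp_days` parameter are all assumptions made for this example.

```python
def ramp_regressor(day, start_day, end_day=None, ramp_days=14):
    """Return a value in [0, 1] describing how 'active' a policy is on `day`.

    All day arguments are integer day indices. `ramp_days` is a hypothetical
    tuning parameter: the number of days the effect takes to phase in or out.
    """
    if day < start_day:
        # The policy has not started yet.
        return 0.0
    if end_day is None or day <= end_day:
        # Ramp up linearly after the policy starts, then hold at 1.
        return min(1.0, (day - start_day) / ramp_days)
    # Ramp down linearly from the level reached when the policy stopped.
    level_at_end = min(1.0, (end_day - start_day) / ramp_days)
    return max(0.0, level_at_end - (day - end_day) / ramp_days)


# Policy starts on day 10 and stops on day 30.
values = [ramp_regressor(d, 10, end_day=30) for d in (5, 17, 30, 37, 60)]
```

A single continuous column like this would replace the set of one-hot bin columns for a policy, so the model would estimate one coefficient per policy instead of one per bin.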

First: think of new ways to visualize the existing results.

# Linear Regression Version 1

In this notebook, I build linear regression models of policy impacts. They use what I call a "bin" model: the independent variables are one-hot encoded according to how many days have passed since the given policy was implemented. The choice of bins is treated as a hyperparameter optimization task.

Here is the schema for the datasets:

| state | county | date | num_new_cases | policy name: 0-7 | policy name: 8-14 | policy name: 15-999 |
| ----- | ------ | ---- | ------------- | ---------------- | ----------------- | ------------------- |
| state   | county  | date - 1 | # of new cases | 0 | 0 | 0 |
| state   | county  | policy enacted today | # of new cases | 1 | 0 | 0 |
| state   | county  | date + 1 | # of new cases | 1 | 0 | 0 |
| state   | county  | date + 2 | # of new cases | 1 | 0 | 0 |
| state   | county  | date + 3 | # of new cases | 0 | 1 | 0 |
| state   | county  | date + 4 | # of new cases | 0 | 1 | 0 |
| state   | county  | date + 5 | # of new cases | 0 | 1 | 0 |
| state   | county  | date + 6 | # of new cases | 0 | 0 | 1 |
| state   | county  | date + 7 | # of new cases | 0 | 0 | 1 |
| state   | county  | date + 8 | # of new cases | 0 | 0 | 1 |
|    |   | ... |  |  |  | |
| state   | county  | today | # of new cases | 0 | 0 | 1 |
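
To make the encoding concrete, here is a minimal sketch of how one row's bin indicators could be derived from the number of days since the policy was enacted. The helper `encode_bins` is hypothetical and is not the function used in `linreg_single_policy.py`; it also assumes that the bin bounds are inclusive on both ends. The example rows in the table above switch bins every three days, presumably to keep the table short, whereas the sketch follows the bin boundaries from the header (0-7, 8-14, 15-999).

```python
def encode_bins(days_since_policy, bins):
    """One-hot encode a day offset against a list of inclusive (low, high) bins.

    `days_since_policy` is None or negative before the policy is enacted,
    in which case every indicator is 0.
    """
    if days_since_policy is None or days_since_policy < 0:
        return [0] * len(bins)
    return [int(low <= days_since_policy <= high) for low, high in bins]


bins = [(0, 7), (8, 14), (15, 999)]

# Indicator columns for a few day offsets relative to the enactment date.
rows = {offset: encode_bins(offset, bins) for offset in (-1, 0, 7, 8, 14, 15, 200)}
```

With non-overlapping bins, at most one indicator is 1 in any row, and all are 0 before the policy starts.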



Run this script to generate the datasets. It should take about 4 hours with the current selection of bins.

In [1]:
# !python ./scripts/linreg_single_policy.py --run_what generate_dataset

Run this script to fit the linear regression models. It should take about an hour.

In [2]:
# !python ./scripts/linreg_single_policy.py --run_what run_models

In [3]:
from covid_project.regression_funcs_bins import (
    BINS,
    collect_all_regression_results_to_df,
    plot_rsquared_heatmap,
)
from IPython.display import display
import matplotlib.pyplot as plt
from tqdm.auto import tqdm  # progress bars

# Directory holding the regression results written by the script above.
path = "./data/regression_results_single_policy_bins/"

df = collect_all_regression_results_to_df(path)

# 1. Check significance

In [None]:

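def pivot_df_to_pvalues(data, dep_var, bins='[(0, 14), (15, 999)]'):
    """Return a styled policy-by-bin table of p-values for one dependent variable."""
    def _color_sig_values(val, p=0.05):
        # Color p-values below the significance threshold green; leave others unstyled.
        return 'color: green' if val < p else ''

    # Keep one row per (policy, bin) for the requested dependent variable.
    data = data[data['dep_var'] == dep_var]
    data = data[['policy', 'bins_list', 'bin', 'p_value']]
    data = data.drop_duplicates()
    data = data.set_index('policy')
    # `bins` is compared as a string because `bins_list` stores str(bins).
    data = data[data['bins_list'] == bins]
    # Pivot so that each bin becomes a column of p-values.
    data = data.pivot(columns='bin')['p_value']
    data = data.sort_index()
    # Styler.map requires pandas >= 2.1; older versions use Styler.applymap.
    data = data.style.map(_color_sig_values)
    return data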

## 1.1 New cases

In [None]:
for b in BINS:
    d = pivot_df_to_pvalues(df, 'new_cases_1e6', str(b))
    display(d)

## 1.2 New cases (7-day average)

In [None]:
for b in BINS:
    d = pivot_df_to_pvalues(df, 'new_cases_7day_1e6', str(b))
    display(d)

## 1.3 New deaths

In [None]:
for b in BINS:
    d = pivot_df_to_pvalues(df, 'new_deaths_1e6', str(b))
    display(d)

## 1.4 New deaths (7-day average)

In [None]:
for b in BINS:
    d = pivot_df_to_pvalues(df, 'new_deaths_7day_1e6', str(b))
    display(d)

# 2. Check R-squared

In [None]:

dep_vars = [
    'new_cases_1e6',
    'new_cases_7day_1e6',
    'new_deaths_1e6',
    'new_deaths_7day_1e6'
]

for var in dep_vars:
    _, bins_ids = plot_rsquared_heatmap(
        data=df,
        dep_var=var,
        sort_values=True,
        ax=None,
    )

In [None]:
bins_ids

# 3. Model Diagnostics

In [None]:
from covid_project.regression_funcs_bins import get_single_policy_regression_data, fit_ols_model_single_policy
bins = [(0, 7), (8, 14), (15, 28), (29, 60), (61, 999)]
dep_var = 'new_cases_1e6'


In [None]:


def plot_pred_vs_residuals(policy, dep_var, bins, ax):
    """Scatter residuals against predictions for one policy on the given axes."""
    succ, data = get_single_policy_regression_data(policy, bins)
    if not succ:
        # No regression data for this policy and bin scheme: leave the axes empty.
        return ax
    results = fit_ols_model_single_policy(
        data,
        policy,
        dep_var,
        True,
        True)
    ax.scatter(results['predictions'], results['residuals'])
    ax.set_title(policy)
    ax.set_ylabel("residuals")
    ax.set_xlabel("predictions")
    return ax

## 3.1 Linear relationship

In [None]:
policies = df[df['bins_list'] == str(bins)]['policy'].unique()

# Round the row count up so that every policy gets a subplot.
ncols = 4
nrows = max(1, -(-len(policies) // ncols))
fig, axes = plt.subplots(nrows=nrows, ncols=ncols, figsize=[16, 3 * nrows])

for i, ax in enumerate(tqdm(axes.flatten())):
    if i >= len(policies):
        # Hide the unused axes in the last row.
        ax.axis('off')
        continue
    policy = policies[i]
    plot_pred_vs_residuals(policy, 'new_cases_1e6', bins, ax)

fig.tight_layout()

## 3.2 Independence
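
One common way to check independence of the residuals is the Durbin-Watson statistic, which looks for first-order autocorrelation. The sketch below is an assumption about how this section could be filled in and is not taken from the project code. It expects the residuals of a single time series in date order; for this panel of counties that would probably mean computing it per county. The statsmodels function `statsmodels.stats.stattools.durbin_watson` computes the same quantity.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic for a sequence of residuals in time order.

    Values near 2 suggest no first-order autocorrelation, values toward 0
    suggest positive autocorrelation, and values toward 4 suggest negative
    autocorrelation.
    """
    # Sum of squared differences between consecutive residuals.
    numerator = sum((b - a) ** 2 for a, b in zip(residuals, residuals[1:]))
    # Sum of squared residuals.
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Toy residuals, made up for illustration only.
dw_alternating = durbin_watson([1.0, -1.0, 1.0, -1.0])  # sign flips every step
dw_constant = durbin_watson([1.0, 1.0, 1.0, 1.0])       # perfectly persistent
```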

## 3.3 Multicollinearity

## 3.4 Heteroskedasticity

# 4. Analyze predictions