# Model development

This is the notebook where I am experimenting with different ML models to predict new COVID cases and deaths.

First, run the script to load and process the data.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_squared_error, explained_variance_score
import pandas as pd

# import covid_data_script
%run covid_data_script.py

covid data script loaded


In [2]:
# retrieve and load the cleaned data
df, df2 = retrieve_data(load_local=True)

# preprocess df2
df2_preprocessed = prep_policy_data(df2)

  mask |= (ar1 == a)


# Models and hyperparameters to explore

**Machine Learning Models**

- Linear Regression
- Ridge Regression
- Lasso Regression
- ElasticNet
- Stochastic Gradient Descent
- Decision Tree
- Random Forest

**Machine learning models that I'm not familiar with but might want to try**
- XGBoost

**Neural networks**
- Multilayer Perceptron
- Convolutional NN? (update 2/19: CNN wouldn't make sense for this application)

Technically, this is time-series data, but I'm not currently *processing* it as such.

**The Big Issue**: According to best practice, one should do the train-test split before any model selection work (to prevent overfitting and picking up patterns that are actually random). The problem is that the primary feature engineering step requires the full dataset, which probably means I need to change the way I'm implementing the feature engineering.

**Possible solution**: Instead of using scikit-learn's `train_test_split`, implement my own version that splits by a random selection of counties. This means that data from different points in the pandemic, but from the same county, will be either all in the test set or all in the train set. Do the same thing for K-fold cross-validation.
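For reference, scikit-learn also ships group-aware splitters that do this kind of county-level split. The following is a minimal sketch on a toy frame, not the approach used in the rest of this notebook; the toy data and the names `toy`, `toy_train`, and `toy_test` are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy stand-in for the case data: 5 hypothetical counties, 4 days each.
toy = pd.DataFrame({
    "full_loc_name": np.repeat(["A", "B", "C", "D", "E"], 4),
    "new_cases_1e6": np.arange(20, dtype=float),
})

# One split that holds out 20% of the groups (counties), not 20% of the rows.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(toy, groups=toy["full_loc_name"]))

toy_train = toy.iloc[train_idx]
toy_test = toy.iloc[test_idx]

# No county should appear on both sides of the split.
shared = set(toy_train["full_loc_name"]) & set(toy_test["full_loc_name"])
print("test counties:", sorted(toy_test["full_loc_name"].unique()))
print("counties in both sets:", shared)
```

Passing the county name as `groups` is what keeps every row of a county on the same side of the split.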

**Final note**: Also try RNN-LSTM

In [3]:
# get the processed dataframe with default bins
# df3 = join_policies(case_df=df, policy_df=df2_preprocessed)
# df3.head()

Implement a custom version of `train_test_split` that splits by county.

In [4]:
df.head()

Unnamed: 0,uid,location_type,fips_code,county,state,date,full_loc_name,total_population,cumulative_cases,cumulative_cases_1e6,cumulative_deaths,cumulative_deaths_1e6,new_cases,new_deaths,new_cases_1e6,new_deaths_1e6,new_cases_7day,new_deaths_7day,new_cases_7day_1e6,new_deaths_7day_1e6
31440,84001001,county,1001,autauga,Alabama,2020-01-22,"Autauga, Alabama",55200,0,0.0,0,0.0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
31441,84001001,county,1001,autauga,Alabama,2020-01-23,"Autauga, Alabama",55200,0,0.0,0,0.0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
31442,84001001,county,1001,autauga,Alabama,2020-01-24,"Autauga, Alabama",55200,0,0.0,0,0.0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
31443,84001001,county,1001,autauga,Alabama,2020-01-25,"Autauga, Alabama",55200,0,0.0,0,0.0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
31444,84001001,county,1001,autauga,Alabama,2020-01-26,"Autauga, Alabama",55200,0,0.0,0,0.0,0,0,0.0,0.0,0.0,0.0,0.0,0.0


In [7]:
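def county_split(df, test_size, split_type="train_test"):
    """Split the case data by county so that no county spans both sets.

    Parameters
    -----------
    df: DataFrame
        case data with a 'full_loc_name' column
    test_size: float
        fraction of counties (not rows) that go into the test set
    split_type: {"train_test", "train_validate"}
        currently unused; the "train_validate" variant, which would split the
        processed frame on ('info', 'full_loc'), is not implemented yet

    Returns
    -----------
    df_test, df_train: DataFrame
        note the order: test set first, then train set
    """
    # Get a list of all unique counties and shuffle it in place
    all_counties = df['full_loc_name'].unique()
    np.random.shuffle(all_counties)

    # the first test_size fraction of the shuffled counties is the test set
    n_test = int(len(all_counties) * test_size)
    counties_test = all_counties[:n_test]
    counties_train = all_counties[n_test:]

    df_test = df[df['full_loc_name'].isin(counties_test)]
    df_train = df[df['full_loc_name'].isin(counties_train)]

    return df_test, df_train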
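Because `np.random.shuffle` draws from the global random state, each run of the cell above produces a different split. A minimal sketch of a seeded variant follows; `county_split_seeded`, its `seed` and `group_col` parameters, and the toy frame are hypothetical names introduced here, not part of `covid_data_script.py`.

```python
import numpy as np
import pandas as pd

def county_split_seeded(df, test_size, seed=0, group_col="full_loc_name"):
    """Hypothetical seeded variant of county_split; returns (df_test, df_train)."""
    rng = np.random.default_rng(seed)
    # Copy the unique counties into a plain object array before permuting.
    counties = np.array(df[group_col].unique(), dtype=object)
    counties = rng.permutation(counties)
    n_test = int(len(counties) * test_size)
    test_mask = df[group_col].isin(counties[:n_test])
    return df[test_mask], df[~test_mask]

# Toy frame: 10 made-up counties with 3 rows each.
toy = pd.DataFrame({
    "full_loc_name": np.repeat([f"county_{i}" for i in range(10)], 3),
    "new_cases_1e6": np.arange(30, dtype=float),
})

test_a, train_a = county_split_seeded(toy, test_size=0.2, seed=42)
test_b, train_b = county_split_seeded(toy, test_size=0.2, seed=42)

# Same seed, same split; and the two sets never share a county.
print(test_a.equals(test_b))
print(set(test_a["full_loc_name"]) & set(train_a["full_loc_name"]))
```

A fixed seed makes it possible to compare models across notebook restarts on the same held-out counties.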

In [8]:
df_test, df_train = county_split(df, test_size=0.2)
df_train

Unnamed: 0,uid,location_type,fips_code,county,state,date,full_loc_name,total_population,cumulative_cases,cumulative_cases_1e6,cumulative_deaths,cumulative_deaths_1e6,new_cases,new_deaths,new_cases_1e6,new_deaths_1e6,new_cases_7day,new_deaths_7day,new_cases_7day_1e6,new_deaths_7day_1e6
31440,84001001,county,1001,autauga,Alabama,2020-01-22,"Autauga, Alabama",55200,0,0.00,0,0.00,0,0,0.00,0.0,0.00,0.0,0.000000,0.0
31441,84001001,county,1001,autauga,Alabama,2020-01-23,"Autauga, Alabama",55200,0,0.00,0,0.00,0,0,0.00,0.0,0.00,0.0,0.000000,0.0
31442,84001001,county,1001,autauga,Alabama,2020-01-24,"Autauga, Alabama",55200,0,0.00,0,0.00,0,0,0.00,0.0,0.00,0.0,0.000000,0.0
31443,84001001,county,1001,autauga,Alabama,2020-01-25,"Autauga, Alabama",55200,0,0.00,0,0.00,0,0,0.00,0.0,0.00,0.0,0.000000,0.0
31444,84001001,county,1001,autauga,Alabama,2020-01-26,"Autauga, Alabama",55200,0,0.00,0,0.00,0,0,0.00,0.0,0.00,0.0,0.000000,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1266241,84056045,county,56045,weston,Wyoming,2021-02-13,"Weston, Wyoming",7100,617,8690.14,5,70.42,0,0,0.00,0.0,0.67,0.0,9.436620,0.0
1266242,84056045,county,56045,weston,Wyoming,2021-02-14,"Weston, Wyoming",7100,617,8690.14,5,70.42,0,0,0.00,0.0,0.33,0.0,4.647887,0.0
1266243,84056045,county,56045,weston,Wyoming,2021-02-15,"Weston, Wyoming",7100,617,8690.14,5,70.42,0,0,0.00,0.0,0.00,0.0,0.000000,0.0
1266244,84056045,county,56045,weston,Wyoming,2021-02-16,"Weston, Wyoming",7100,617,8690.14,5,70.42,0,0,0.00,0.0,0.00,0.0,0.000000,0.0


# Feature engineering on training set only

- All this will go into a function once I know what I'm doing

Now that we have the split, apply the feature engineering to the training set.

In [9]:
df_train_proc = join_policies(case_df=df_train, policy_df=df2_preprocessed)

data shaped
bins: [(0, 6), (7, 13), (14, 999)]
time elapsed: 84.44930720329285


In [10]:
df_train_proc

Unnamed: 0_level_0,info,info,info,info,info,entertainment - start - county,entertainment - start - county,entertainment - start - county,houses of worship - start - state,houses of worship - start - state,...,outdoor and recreation - stop - county,manufacturing - stop - county,manufacturing - stop - county,manufacturing - stop - county,personal care - stop - state,personal care - stop - state,personal care - stop - state,personal care - start - county,personal care - start - county,personal care - start - county
Unnamed: 0_level_1,state,county,full_loc,date,new_cases_1e6,0-6,7-13,14-999,0-6,7-13,...,14-999,0-6,7-13,14-999,0-6,7-13,14-999,0-6,7-13,14-999
31440,Alabama,autauga,"Autauga, Alabama",2020-01-22,0.00,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
31441,Alabama,autauga,"Autauga, Alabama",2020-01-23,0.00,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
31442,Alabama,autauga,"Autauga, Alabama",2020-01-24,0.00,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
31443,Alabama,autauga,"Autauga, Alabama",2020-01-25,0.00,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
31444,Alabama,autauga,"Autauga, Alabama",2020-01-26,0.00,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1266241,Wyoming,weston,"Weston, Wyoming",2021-02-13,0.00,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
1266242,Wyoming,weston,"Weston, Wyoming",2021-02-14,0.00,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
1266243,Wyoming,weston,"Weston, Wyoming",2021-02-15,0.00,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
1266244,Wyoming,weston,"Weston, Wyoming",2021-02-16,0.00,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0


Implement linear regression

In [11]:
X = df_train_proc.loc[:, df_train_proc.columns[5:]].values
y = df_train_proc.loc[:, ('info', 'new_cases_1e6')].values
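The positional slice `columns[5:]` relies on the five `info` columns always coming first. A small sketch of selecting the features by column level instead is shown below, on a toy frame that mimics the two-level column layout; the policy column name is made up, and the real frame has many more columns.

```python
import pandas as pd

# Toy frame with the same two-level column layout as df_train_proc.
cols = pd.MultiIndex.from_tuples([
    ("info", "state"), ("info", "county"), ("info", "full_loc"),
    ("info", "date"), ("info", "new_cases_1e6"),
    ("hypothetical policy - start - state", "0-6"),
    ("hypothetical policy - start - state", "7-13"),
])
toy = pd.DataFrame(
    [["Alabama", "autauga", "Autauga, Alabama", "2020-01-22", 0.0, 0, 0],
     ["Alabama", "autauga", "Autauga, Alabama", "2020-01-23", 1.5, 1, 0]],
    columns=cols,
)

# Features: everything that is not under the top-level 'info' label.
X_by_name = toy.drop(columns="info", level=0).values
# Target: a single column addressed by its (level 0, level 1) tuple.
y_toy = toy[("info", "new_cases_1e6")].values

# The positional version used in the notebook gives the same features here.
X_by_pos = toy.loc[:, toy.columns[5:]].values
print((X_by_name == X_by_pos).all())
```

Selecting by label keeps working if `join_policies` ever adds or reorders `info` columns, whereas the positional slice would silently pick up the wrong columns.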

In [12]:
X

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [13]:
y

array([ 0.  ,  0.  ,  0.  , ...,  0.  ,  0.  , 14.08])

In [14]:
model = LinearRegression()
model.fit(X, y)

LinearRegression()

In [15]:
model.score(X, y)

0.0871585475712563

Now implement K-fold cross validation (shuffling counties, not individual datapoints)

**Q:** What metrics should we use to evaluate performance? This sounds like a good excuse to go through sklearn's metrics library. Stick with R^2 for now (it is what `model.score` returns for sklearn regressors).
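As a point of comparison for the hand-written loop, scikit-learn's `GroupKFold` and `cross_validate` can run county-grouped K-fold with several metrics at once. This is only a sketch on synthetic data; the array shapes, coefficients, and the choice of MAE as a third metric are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GroupKFold, cross_validate

rng = np.random.default_rng(0)

# Synthetic stand-in: 20 "counties" with 30 rows each and 4 binary features.
n_counties, n_days = 20, 30
groups = np.repeat(np.arange(n_counties), n_days)
X_toy = rng.integers(0, 2, size=(n_counties * n_days, 4))
y_toy = X_toy @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(size=len(X_toy))

cv_scores = cross_validate(
    LinearRegression(),
    X_toy,
    y_toy,
    groups=groups,                # rows of one county stay in the same fold
    cv=GroupKFold(n_splits=5),
    scoring={
        "R^2": "r2",
        "MSE": "neg_mean_squared_error",
        "MAE": "neg_mean_absolute_error",
    },
)

# sklearn reports errors as negated scores, so flip the sign back.
r2_per_fold = cv_scores["test_R^2"]
mse_per_fold = -cv_scores["test_MSE"]
print(r2_per_fold.round(3), mse_per_fold.round(3))
```

Note that `GroupKFold` assigns groups to folds deterministically rather than shuffling them, so it is not an exact replacement for a shuffled county split.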

In [16]:
def train_model(df_train_proc, model_in, K=10, verbose=True):
    """Function to train models using K-fold cross validation
    Parameters
    -----------
    df_train_proc: DataFrame
        processed dataframe (after selecting bins and joining with cases)
    model_in: sklearn estimator instance
        unfitted model object, e.g. LinearRegression(); it is refit on every fold
    K: integer
        number of cross-validation folds
    verbose: Boolean
        detailed outputs

    Returns
    -----------
    R_scores: list
        R^2 on the validation counties, one value per fold
    """
    R_scores = []
    # batch size = int(number of counties / K)
    counties = df_train_proc[('info', 'full_loc')].unique()

    # shuffle the counties
    np.random.shuffle(counties)
    batch_size = int(len(counties) / K)
    
    if verbose: 
        print("K = ", K)
        print("batch size = ", batch_size)

    for k in range(K): 
        # select the train and validation portion
        df_train = df_train_proc[~df_train_proc[
            ('info', 'full_loc')].isin(counties[k*batch_size:(k+1)*batch_size])]
        df_validate = df_train_proc[df_train_proc[
            ('info', 'full_loc')].isin(counties[k*batch_size:(k+1)*batch_size])]

        # Implement and train the model
        X_train = df_train.loc[:, df_train.columns[5:]].values
        y_train = df_train.loc[:, ('info', 'new_cases_1e6')].values 

        X_validate = df_validate.loc[:, df_validate.columns[5:]].values
        y_validate = df_validate.loc[:, ('info', 'new_cases_1e6')].values

        model = model_in
        model.fit(X_train, y_train)

        # Then output the results
        R_score = model.score(X_validate, y_validate)
        
        if verbose: 
            print("test = ", k, "score = ", R_score)
        R_scores.append(R_score)
    return R_scores
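One detail of the fold construction above: with `batch_size = int(len(counties) / K)`, any counties left over after `K * batch_size` are never used for validation (they stay in the training portion of every fold). A minimal sketch of the difference, using `np.array_split` on made-up county names:

```python
import numpy as np

# 23 made-up counties and K = 5 folds, so the division leaves a remainder.
counties = np.array([f"county_{i}" for i in range(23)])
K = 5

# Fixed batch size, as in train_model: 3 counties are never validated on.
batch_size = int(len(counties) / K)
fixed_folds = [counties[k * batch_size:(k + 1) * batch_size] for k in range(K)]
n_covered_fixed = sum(len(fold) for fold in fixed_folds)

# np.array_split spreads the remainder over the first folds instead.
even_folds = np.array_split(counties, K)
n_covered_even = sum(len(fold) for fold in even_folds)
fold_sizes = [len(fold) for fold in even_folds]

print(n_covered_fixed, n_covered_even, fold_sizes)
```

Whether this matters depends on how many counties are left over relative to the fold size; it only changes which counties are ever validated on.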

In [17]:
R_OLS = train_model(df_train_proc, model_in=LinearRegression(), K=10, verbose=True)

K =  10
batch size =  251
test =  0 score =  0.08696784308945726
test =  1 score =  0.10199879162525571
test =  2 score =  0.096760287287608
test =  3 score =  0.08870535294447013
test =  4 score =  0.08211539449691418
test =  5 score =  0.08645003108310312
test =  6 score =  0.09758751731913007
test =  7 score =  0.07712857352630187
test =  8 score =  0.06244226155234078
test =  9 score =  0.09776465580252836


In [18]:
from sklearn.linear_model import Ridge

R_Ridge = train_model(df_train_proc, model_in=Ridge(), K=10, verbose=True)

K =  10
batch size =  251
test =  0 score =  0.09515591070122587
test =  1 score =  0.09747234449492304
test =  2 score =  0.054526284601025155
test =  3 score =  0.08285990365481022
test =  4 score =  0.09794697544642494
test =  5 score =  0.10460737700044598
test =  6 score =  0.10889602256030562
test =  7 score =  0.06953231216681999
test =  8 score =  0.10871204043355742
test =  9 score =  0.08222489664090626


In [19]:
from sklearn.linear_model import Lasso

R_Lasso = train_model(df_train_proc, model_in=Lasso(), K=10, verbose=True)

K =  10
batch size =  251
test =  0 score =  0.0583586834583989
test =  1 score =  0.07924207965499508
test =  2 score =  0.0762169875428913
test =  3 score =  0.09385219804693823
test =  4 score =  0.08343628049296725
test =  5 score =  0.1017800732645302
test =  6 score =  0.07251207798023862
test =  7 score =  0.04798858393835426
test =  8 score =  0.07171877080196265
test =  9 score =  0.10147043060019989


In [20]:
from sklearn.tree import DecisionTreeRegressor

R_Desc_tree = train_model(df_train_proc, model_in=DecisionTreeRegressor(), K=10, verbose=True)

K =  10
batch size =  251
test =  0 score =  0.09170984966553963
test =  1 score =  0.12216387106051796
test =  2 score =  0.10379691890670162
test =  3 score =  0.11569165232044298
test =  4 score =  0.1401220344145746
test =  5 score =  0.17803467567543052
test =  6 score =  0.13689340529741123
test =  7 score =  0.14819135879201828
test =  8 score =  0.11370743857002863
test =  9 score =  0.09829322291847509


Wrap all this into a function

In [39]:
def train_model(df_train_proc, model_in, metrics_dict, K=10, verbose=True):
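    """Function to train models using K-fold cross validation
    Parameters
    -----------
    df_train_proc: DataFrame
        processed dataframe (after selecting bins and joining with cases)
    model_in: sklearn estimator instance
        unfitted model object, e.g. LinearRegression(); it is refit on every fold
    metrics_dict: dict
        maps a metric name to a function called as metric(y_true, y_pred)
    K: integer
        number of cross-validation folds
    verbose: Boolean
        detailed outputs

    Returns
    -----------
    results_dict: dict
        maps each metric name to a list of K per-fold scores
    params: dict
        output of model.get_params() for the model that was trained
    """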
    results_dict = {metric: [] for metric in metrics_dict.keys()}

    counties = df_train_proc[('info', 'full_loc')].unique()

    # shuffle the counties
    np.random.shuffle(counties)
    batch_size = int(len(counties) / K)
    
    if verbose: 
        print("K = ", K)
        print("batch size = ", batch_size)

    for k in range(K): 
        # select the train and validation portion
        df_train = df_train_proc[~df_train_proc[
            ('info', 'full_loc')].isin(counties[k*batch_size:(k+1)*batch_size])]
        df_validate = df_train_proc[df_train_proc[
            ('info', 'full_loc')].isin(counties[k*batch_size:(k+1)*batch_size])]

        # Implement and train the model
        X_train = df_train.loc[:, df_train.columns[5:]].values
        y_train = df_train.loc[:, ('info', 'new_cases_1e6')].values 

        X_validate = df_validate.loc[:, df_validate.columns[5:]].values
        y_validate = df_validate.loc[:, ('info', 'new_cases_1e6')].values

        model = model_in
        model.fit(X_train, y_train)

        # compute scores
        for metric in metrics_dict.keys():
            score = metrics_dict[metric](y_validate, model.predict(X_validate))
            results_dict[metric].append(score)
            
        if verbose: 
            results = [(str(metric) + ": " + str(results_dict[metric][k])) for metric in metrics_dict.keys()]
            print("fold: ", k, "scores: ", results)

    return results_dict, model.get_params()

# metrics_dict must already be defined (see the cell that defines models_dict and metrics_dict)
results, params = train_model(df_train_proc, model_in=LinearRegression(), metrics_dict=metrics_dict)

K =  10
batch size =  251
fold:  0 scores:  ['R^2: 0.12830553989051752', 'MSE: 1841.5917133041496']
fold:  1 scores:  ['R^2: 0.10539552509821304', 'MSE: 2423.744242949214']
fold:  2 scores:  ['R^2: 0.06675681939154843', 'MSE: 3843.3312806579424']
fold:  3 scores:  ['R^2: 0.10376128640123794', 'MSE: 2265.783135114561']
fold:  4 scores:  ['R^2: 0.0923673623672876', 'MSE: 2516.406349167967']
fold:  5 scores:  ['R^2: 0.07643476208349143', 'MSE: 3237.2006653049834']
fold:  6 scores:  ['R^2: 0.06585578521741775', 'MSE: 4244.7109766622525']
fold:  7 scores:  ['R^2: 0.10348062631856647', 'MSE: 2340.635047387664']
fold:  8 scores:  ['R^2: 0.06135241353317111', 'MSE: 4179.622893311203']
fold:  9 scores:  ['R^2: 0.10962632690941443', 'MSE: 2336.0078076654204']


In [40]:
results

{'R^2': [0.12830553989051752,
  0.10539552509821304,
  0.06675681939154843,
  0.10376128640123794,
  0.0923673623672876,
  0.07643476208349143,
  0.06585578521741775,
  0.10348062631856647,
  0.06135241353317111,
  0.10962632690941443],
 'MSE': [1841.5917133041496,
  2423.744242949214,
  3843.3312806579424,
  2265.783135114561,
  2516.406349167967,
  3237.2006653049834,
  4244.7109766622525,
  2340.635047387664,
  4179.622893311203,
  2336.0078076654204]}
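Lists of per-fold scores like these are easier to compare once reduced to a mean and a spread. A small sketch with pandas follows; the numbers in `toy_results` are placeholders chosen for the example, not values from this notebook.

```python
import pandas as pd

# Placeholder per-fold scores with the same layout as the results dict above.
toy_results = {
    "R^2": [0.10, 0.08, 0.12],
    "MSE": [2400.0, 3000.0, 2100.0],
}

# One row per fold, one column per metric.
per_fold = pd.DataFrame(toy_results)
per_fold.index.name = "fold"

# Mean and sample standard deviation of each metric across folds.
summary = per_fold.agg(["mean", "std"])
print(summary)
```

The standard deviation across folds gives a rough sense of whether a difference between two models is larger than the fold-to-fold noise.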

In [54]:
# models and metrics to evaluate (r2_score and mean_squared_error are imported in the first cell)

models_dict = {
    'OLS': LinearRegression(),
    'Ridge': Ridge()
}
metrics_dict = {
    'R^2': r2_score,
    'MSE': mean_squared_error
}

# Note - this function will go into another function that processes all the data with different bins
def run_models(df_train_proc, models_dict, metrics_dict, K=10, verbose=True): 
    
    # declare an empty dictionary to hold all results
    results = {}
    
    # loop through all the models passed
    for model in models_dict.keys():
        
        # declare empty dictionary for results from this one run
        model_results = {}
        scores, params = train_model(df_train_proc=df_train_proc, 
                                     model_in=models_dict[model], 
                                     metrics_dict=metrics_dict, 
                                     K=K, 
                                     verbose=verbose)
        
        # save the results in a dictionary
        model_results['params'] = params
        model_results['scores'] = scores
        
        results[model] = model_results
    return results
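Since `models_dict` holds estimator instances and `train_model` fits the instance it is given, the same object is refit on every fold and is left fitted in the dictionary afterwards. One option, sketched here on synthetic data, is to treat each instance as a template and fit a fresh copy per fold with `sklearn.base.clone`; the `template` and `fold_models` names and the two-fold split are illustrative only.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 3))
y_toy = X_toy @ np.array([1.0, 2.0, 3.0])

# The instance stored in the dictionary acts as an unfitted template.
template = Ridge(alpha=0.5)

fold_models = []
for train_rows in (slice(0, 25), slice(25, 50)):
    # clone() copies the hyperparameters but none of the fitted state.
    model = clone(template)
    model.fit(X_toy[train_rows], y_toy[train_rows])
    fold_models.append(model)

print(hasattr(template, "coef_"), fold_models[0].get_params()["alpha"])
```

This keeps every fold's fitted model available for inspection and leaves the template untouched between runs.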

In [55]:
results = run_models(df_train_proc, models_dict, metrics_dict)

K =  10
batch size =  251
fold:  0 scores:  ['R^2: 0.08502940473134468', 'MSE: 3078.9774524573622']
fold:  1 scores:  ['R^2: 0.0798397229013128', 'MSE: 3167.766020543073']
fold:  2 scores:  ['R^2: 0.09915641294144761', 'MSE: 2492.4163449064567']
fold:  3 scores:  ['R^2: 0.09576105682579705', 'MSE: 2680.1514159696812']
fold:  4 scores:  ['R^2: 0.096109152562458', 'MSE: 2673.3516089421314']
fold:  5 scores:  ['R^2: 0.0910841219493036', 'MSE: 2826.15632316663']
fold:  6 scores:  ['R^2: 0.0879138667756103', 'MSE: 2618.2010680468475']
fold:  7 scores:  ['R^2: 0.07175497722187829', 'MSE: 3724.902630599902']
fold:  8 scores:  ['R^2: 0.10826575295179452', 'MSE: 2206.617015407437']
fold:  9 scores:  ['R^2: 0.06158237897095542', 'MSE: 3766.007542297555']
K =  10
batch size =  251
fold:  0 scores:  ['R^2: 0.08331404930453168', 'MSE: 3108.4965645342445']
fold:  1 scores:  ['R^2: 0.0629810077760592', 'MSE: 3853.986087938123']
fold:  2 scores:  ['R^2: 0.08236665916064811', 'MSE: 3031.7935433993266']

In [56]:
results

{'OLS': {'params': {'copy_X': True,
   'fit_intercept': True,
   'n_jobs': None,
   'normalize': False,
   'positive': False},
  'scores': {'R^2': [0.08502940473134468,
    0.0798397229013128,
    0.09915641294144761,
    0.09576105682579705,
    0.096109152562458,
    0.0910841219493036,
    0.0879138667756103,
    0.07175497722187829,
    0.10826575295179452,
    0.06158237897095542],
   'MSE': [3078.9774524573622,
    3167.766020543073,
    2492.4163449064567,
    2680.1514159696812,
    2673.3516089421314,
    2826.15632316663,
    2618.2010680468475,
    3724.902630599902,
    2206.617015407437,
    3766.007542297555]}},
 'Ridge': {'params': {'alpha': 1.0,
   'copy_X': True,
   'fit_intercept': True,
   'max_iter': None,
   'normalize': False,
   'random_state': None,
   'solver': 'auto',
   'tol': 0.001},
  'scores': {'R^2': [0.08331404930453168,
    0.0629810077760592,
    0.08236665916064811,
    0.09705893901016449,
    0.06563832421299132,
    0.11340379724829197,
    0.09395

Now, automate the data preprocessing for a variety of bins

In [16]:
def county_split(df, test_size, split_type="train_test"):
    """
    split_type = {"train_test", "train_validate"}
    """
    
    all_counties = df['full_loc_name'].unique()

    # shuffle the list
    np.random.shuffle(all_counties)

    # split the data
    counties_test = all_counties[: int(len(all_counties)*test_size)]
    counties_train = all_counties[int(len(all_counties)*test_size) :]

    df_test = df[df['full_loc_name'].isin(counties_test)]
    df_train = df[df['full_loc_name'].isin(counties_train)]
    
    return df_test, df_train

df_test, df_train = county_split(df, test_size=0.2)
df_train

Unnamed: 0,uid,location_type,fips_code,county,state,date,full_loc_name,total_population,cumulative_cases,cumulative_cases_1e6,cumulative_deaths,cumulative_deaths_1e6,new_cases,new_deaths,new_cases_1e6,new_deaths_1e6,new_cases_7day,new_deaths_7day,new_cases_7day_1e6,new_deaths_7day_1e6
31833,84001003,county,1003,baldwin,Alabama,2020-01-22,"Baldwin, Alabama",208107,0,0.00,0,0.00,0,0,0.00,0.0,0.00,0.0,0.000000,0.0
31834,84001003,county,1003,baldwin,Alabama,2020-01-23,"Baldwin, Alabama",208107,0,0.00,0,0.00,0,0,0.00,0.0,0.00,0.0,0.000000,0.0
31835,84001003,county,1003,baldwin,Alabama,2020-01-24,"Baldwin, Alabama",208107,0,0.00,0,0.00,0,0,0.00,0.0,0.00,0.0,0.000000,0.0
31836,84001003,county,1003,baldwin,Alabama,2020-01-25,"Baldwin, Alabama",208107,0,0.00,0,0.00,0,0,0.00,0.0,0.00,0.0,0.000000,0.0
31837,84001003,county,1003,baldwin,Alabama,2020-01-26,"Baldwin, Alabama",208107,0,0.00,0,0.00,0,0,0.00,0.0,0.00,0.0,0.000000,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1266241,84056045,county,56045,weston,Wyoming,2021-02-13,"Weston, Wyoming",7100,617,8690.14,5,70.42,0,0,0.00,0.0,0.67,0.0,9.436620,0.0
1266242,84056045,county,56045,weston,Wyoming,2021-02-14,"Weston, Wyoming",7100,617,8690.14,5,70.42,0,0,0.00,0.0,0.33,0.0,4.647887,0.0
1266243,84056045,county,56045,weston,Wyoming,2021-02-15,"Weston, Wyoming",7100,617,8690.14,5,70.42,0,0,0.00,0.0,0.00,0.0,0.000000,0.0
1266244,84056045,county,56045,weston,Wyoming,2021-02-16,"Weston, Wyoming",7100,617,8690.14,5,70.42,0,0,0.00,0.0,0.00,0.0,0.000000,0.0


In [10]:
df_train

Unnamed: 0,uid,location_type,fips_code,county,state,date,full_loc_name,total_population,cumulative_cases,cumulative_cases_1e6,cumulative_deaths,cumulative_deaths_1e6,new_cases,new_deaths,new_cases_1e6,new_deaths_1e6,new_cases_7day,new_deaths_7day,new_cases_7day_1e6,new_deaths_7day_1e6
31440,84001001,county,1001,autauga,Alabama,2020-01-22,"Autauga, Alabama",55200,0,0.00,0,0.00,0,0,0.00,0.0,0.00,0.0,0.000000,0.0
31441,84001001,county,1001,autauga,Alabama,2020-01-23,"Autauga, Alabama",55200,0,0.00,0,0.00,0,0,0.00,0.0,0.00,0.0,0.000000,0.0
31442,84001001,county,1001,autauga,Alabama,2020-01-24,"Autauga, Alabama",55200,0,0.00,0,0.00,0,0,0.00,0.0,0.00,0.0,0.000000,0.0
31443,84001001,county,1001,autauga,Alabama,2020-01-25,"Autauga, Alabama",55200,0,0.00,0,0.00,0,0,0.00,0.0,0.00,0.0,0.000000,0.0
31444,84001001,county,1001,autauga,Alabama,2020-01-26,"Autauga, Alabama",55200,0,0.00,0,0.00,0,0,0.00,0.0,0.00,0.0,0.000000,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1266241,84056045,county,56045,weston,Wyoming,2021-02-13,"Weston, Wyoming",7100,617,8690.14,5,70.42,0,0,0.00,0.0,0.67,0.0,9.436620,0.0
1266242,84056045,county,56045,weston,Wyoming,2021-02-14,"Weston, Wyoming",7100,617,8690.14,5,70.42,0,0,0.00,0.0,0.33,0.0,4.647887,0.0
1266243,84056045,county,56045,weston,Wyoming,2021-02-15,"Weston, Wyoming",7100,617,8690.14,5,70.42,0,0,0.00,0.0,0.00,0.0,0.000000,0.0
1266244,84056045,county,56045,weston,Wyoming,2021-02-16,"Weston, Wyoming",7100,617,8690.14,5,70.42,0,0,0.00,0.0,0.00,0.0,0.000000,0.0


In [17]:
import os

def train_model(df_train_proc, model_in, metrics_dict, K=10, verbose=True, save_output=True, filename="log.txt"):
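    """Function to train models using K-fold cross validation
    Parameters
    -----------
    df_train_proc: DataFrame
        processed dataframe (after selecting bins and joining with cases)
    model_in: sklearn estimator instance
        unfitted model object, e.g. LinearRegression(); it is refit on every fold
    metrics_dict: dict
        maps a metric name to a function called as metric(y_true, y_pred)
    K: integer
        number of cross-validation folds
    verbose: Boolean
        detailed outputs
    save_output: Boolean
        if True, append the progress messages to the log file
    filename: string
        path of the log file

    Returns
    -----------
    results_dict: dict
        maps each metric name to a list of K per-fold scores
    params: dict
        output of model.get_params() for the model that was trained
    """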
    results_dict = {metric: [] for metric in metrics_dict.keys()}

    counties = df_train_proc[('info', 'full_loc')].unique()

    # shuffle the counties
    np.random.shuffle(counties)
    batch_size = int(len(counties) / K)
    
    msg1 = f"number of cross-validation folds: {K}"
    msg2 = f"num counties in validation set: {batch_size}"
    
    if verbose: 
        print(msg1)
        print(msg2)
    if save_output: 
        with open(filename, "a") as log: 
            # write() does not add a line break, so end each message explicitly
            log.write(msg1 + "\n")
            log.write(msg2 + "\n")

    for k in range(K): 
        # select the train and validation portion
        df_train = df_train_proc[~df_train_proc[
            ('info', 'full_loc')].isin(counties[k*batch_size:(k+1)*batch_size])]
        df_validate = df_train_proc[df_train_proc[
            ('info', 'full_loc')].isin(counties[k*batch_size:(k+1)*batch_size])]

        # Implement and train the model
        X_train = df_train.loc[:, df_train.columns[5:]].values
        y_train = df_train.loc[:, ('info', 'new_cases_1e6')].values 

        X_validate = df_validate.loc[:, df_validate.columns[5:]].values
        y_validate = df_validate.loc[:, ('info', 'new_cases_1e6')].values

        model = model_in
        model.fit(X_train, y_train)

        # compute scores
        for metric in metrics_dict.keys():
            score = metrics_dict[metric](y_validate, model.predict(X_validate))
            results_dict[metric].append(score)
        
        results = [(str(metric) + ": " + str(results_dict[metric][k])) for metric in metrics_dict.keys()]
        
        msg = f"fold: {k}, scores: {results}"
        if verbose: 
            print(msg)
        if save_output: 
            with open(filename, "a") as log: 
                log.write(msg + "\n")

    return results_dict, model.get_params()

def run_models(df_train_proc, models_dict, metrics_dict, K=10, verbose=True, save_output=True, filename="log.txt"): 
    
    # declare an empty dictionary to hold all results
    results = {}
    
    # loop through all the models passed
    for model in models_dict.keys():
        msg = f"running models: {model}"
        if verbose: 
            print(msg)
        if save_output: 
            with open(filename, "a") as log: 
                log.write(msg + "\n")

        # declare empty dictionary for results from this one run
        model_results = {}
        scores, params = train_model(df_train_proc=df_train_proc, 
                                     model_in=models_dict[model], 
                                     metrics_dict=metrics_dict, 
                                     K=K, 
                                     verbose=verbose, 
                                     save_output=save_output, 
                                     filename=filename)
        
        # save the results in a dictionary
        model_results['params'] = params
        model_results['scores'] = scores
        
        results[model] = model_results
    return results

def run_features_models(df_train, df2_preprocessed, bins_dict, models_dict, metrics_dict, K=10, verbose=True,
                        save_output=True, filename="log.txt", overwrite=True): 
    
    results = {}
    
    if overwrite and os.path.exists(filename):
        os.remove(filename)
    
    for i, key in enumerate(bins_dict):
        bins_list = bins_dict[key]
        
        msg = f"bins: {bins_list}"
        if verbose: 
            print(msg)
        if save_output: 
            with open(filename, "a") as log: 
                log.write(msg + "\n")
                
        df_train_proc = join_policies(case_df=df_train, 
                                      policy_df=df2_preprocessed, 
                                      output=True, 
                                      bins_list=bins_list, 
                                      state_output=False)
        
        models_results = run_models(df_train_proc=df_train_proc, 
                                    models_dict=models_dict,
                                    metrics_dict=metrics_dict, 
                                    K=K,
                                    verbose=verbose, 
                                    save_output=save_output, 
                                    filename=filename)
        models_results['bins'] = bins_list
        
        results[i] = models_results
    
    return results    
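The three functions above repeat the same print-and-append pattern for every message. As a possible simplification, Python's standard `logging` module can send one message to both the console and a file. This is only a sketch; `make_run_logger`, its parameters, and the temporary log path are hypothetical and not part of the notebook's code.

```python
import logging
import os
import tempfile

def make_run_logger(filename, name="run_log", overwrite=True, verbose=True):
    """Hypothetical helper: one logger that writes to a file and, optionally, the console."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    # Drop handlers left over from earlier runs of the same notebook cell.
    for old_handler in list(logger.handlers):
        logger.removeHandler(old_handler)
        old_handler.close()
    # mode "w" truncates the file, which plays the role of overwrite=True.
    logger.addHandler(logging.FileHandler(filename, mode="w" if overwrite else "a"))
    if verbose:
        logger.addHandler(logging.StreamHandler())
    return logger

# Write to a temporary file so the sketch does not touch real logs.
log_path = os.path.join(tempfile.mkdtemp(), "test_run.txt")
logger = make_run_logger(log_path, verbose=False)
logger.info("bins: %s", [(0, 3), (4, 10), (11, 999)])
logger.info("running models: %s", "OLS")

for handler in logger.handlers:
    handler.flush()
with open(log_path) as f:
    logged_lines = f.read().splitlines()
print(logged_lines)
```

Each `logger.info` call ends its message with a line break, so the log file stays one message per line without any manual formatting.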

In [18]:
bins_dict = {
    1: [(0, 3), (4, 10), (11, 999)], 
    2: [(0, 5), (6, 20), (21, 999)]
}

models_dict = {
    'OLS': LinearRegression(),
    'Ridge': Ridge()
}

metrics_dict = {
    'R^2': r2_score,
    'MSE': mean_squared_error
}

#df_train = pd.read_csv("df_train.csv", index_col=0)
results = run_features_models(df_train, df2_preprocessed, bins_dict, models_dict, metrics_dict, K=10, verbose=True, 
                             save_output=True, filename="test_run.txt", overwrite=True)

bins: [(0, 3), (4, 10), (11, 999)]
data shaped
bins: [(0, 3), (4, 10), (11, 999)]
time elapsed: 105.47064447402954
running models: OLS
number of cross-validation folds: 10
num counties in validation set: 251
fold: 0, scores: ['R^2: 0.09021119601377114', 'MSE: 2784.514438273059']
fold: 1, scores: ['R^2: 0.08950285727662444', 'MSE: 2641.1314637806604']
fold: 2, scores: ['R^2: 0.09529017197439571', 'MSE: 2491.7468160082594']
fold: 3, scores: ['R^2: 0.10400291366620218', 'MSE: 2431.3908189641675']
fold: 4, scores: ['R^2: 0.08228963682040624', 'MSE: 2932.171067874669']
fold: 5, scores: ['R^2: 0.0757728894936529', 'MSE: 3201.895234002477']
fold: 6, scores: ['R^2: 0.1188110669700484', 'MSE: 2075.0722477406357']
fold: 7, scores: ['R^2: 0.08477258629783546', 'MSE: 2774.602213091949']
fold: 8, scores: ['R^2: 0.0863549334008662', 'MSE: 2830.392444997944']
fold: 9, scores: ['R^2: 0.09767570952475813', 'MSE: 2607.749016365591']
running models: Ridge
number of cross-validation folds: 10
num counties

In [14]:
bins_list = [(0, 3), (4, 10), (11, 999)]
df_train_proc = join_policies(case_df=df_train, 
                              policy_df=df2_preprocessed, 
                              output=True, 
                              bins_list=bins_list, 
                              state_output=False)

data shaped
bins: [(0, 3), (4, 10), (11, 999)]
time elapsed: 99.93020415306091


In [15]:
df_train_proc

Unnamed: 0_level_0,info,info,info,info,info,entertainment - start - county,entertainment - start - county,entertainment - start - county,houses of worship - start - state,houses of worship - start - state,...,outdoor and recreation - stop - county,manufacturing - stop - county,manufacturing - stop - county,manufacturing - stop - county,personal care - stop - state,personal care - stop - state,personal care - stop - state,personal care - start - county,personal care - start - county,personal care - start - county
Unnamed: 0_level_1,state,county,full_loc,date,new_cases_1e6,0-3,4-10,11-999,0-3,4-10,...,11-999,0-3,4-10,11-999,0-3,4-10,11-999,0-3,4-10,11-999
31440,Alabama,autauga,"Autauga, Alabama",2020-01-22,0.00,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
31441,Alabama,autauga,"Autauga, Alabama",2020-01-23,0.00,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
31442,Alabama,autauga,"Autauga, Alabama",2020-01-24,0.00,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
31443,Alabama,autauga,"Autauga, Alabama",2020-01-25,0.00,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
31444,Alabama,autauga,"Autauga, Alabama",2020-01-26,0.00,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1266241,Wyoming,weston,"Weston, Wyoming",2021-02-13,0.00,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1266242,Wyoming,weston,"Weston, Wyoming",2021-02-14,0.00,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1266243,Wyoming,weston,"Weston, Wyoming",2021-02-15,0.00,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1266244,Wyoming,weston,"Weston, Wyoming",2021-02-16,0.00,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
