# Model development

This is the notebook where I am experimenting with different ML models to predict new COVID cases and deaths.

First, run the script to load and process the data.

In [113]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression


# import covid_data_script
%run covid_data_script.py

covid data script loaded


In [114]:
# retrieve and load the cleaned data
df, df2 = retrieve_data(load_local=True)

# preprocess df2
df2_preprocessed = prep_policy_data(df2)



# Models and hyperparameters to explore

**Machine Learning Models**

- Linear Regression
- Ridge Regression
- Lasso Regression
- ElasticNet
- Stochastic Gradient Descent
- Decision Tree
- Random Forest

**Machine learning models that I'm not familiar with but might want to try**
- XGBoost

**Neural networks**
- Multilayer Perceptron
- Convolutional NN? 

Technically, this is time-series data, but I'm not currently *processing* it as such.
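One possible way to start treating the data as a time series is to add lagged values per county. The sketch below is only an illustration on a made-up toy frame: the column names mirror the case data, but the values and the `new_cases_lag1` feature are assumptions, not something this notebook currently builds.

```python
import pandas as pd

# Toy stand-in for the case data: two hypothetical counties, four days each.
toy = pd.DataFrame({
    "full_loc_name": ["A"] * 4 + ["B"] * 4,
    "date": pd.to_datetime(
        ["2020-03-01", "2020-03-02", "2020-03-03", "2020-03-04"] * 2),
    "new_cases_1e6": [0.0, 1.0, 2.0, 3.0, 10.0, 20.0, 30.0, 40.0],
})

# Sort so that shifting moves values forward in time within each county.
toy = toy.sort_values(["full_loc_name", "date"])

# Previous day's value for the same county; the first day of each county is NaN.
toy["new_cases_lag1"] = toy.groupby("full_loc_name")["new_cases_1e6"].shift(1)

print(toy)
```

Grouping before shifting keeps one county's history from leaking into the next county's first row.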

**The Big Issue**: According to best practice, one should do the train-test split before any model selection work (to prevent overfitting and picking up patterns that are actually random). The problem is that the primary feature engineering step requires the full dataset, which probably means I need to change the way I'm implementing the feature engineering.

**Possible solution**: Instead of using scikit-learn's `train_test_split`, implement my own and do the split by a random selection of counties. This means that data from different points in the pandemic, but from the same county, will either all be in the test set or all be in the train set. Do the same thing for K-fold cross validation.
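For reference, scikit-learn also ships a group-aware splitter, `GroupShuffleSplit`, that does this kind of split. The sketch below runs it on a small made-up frame (the county names and values are hypothetical) to show that each county lands entirely on one side. It is an alternative to a hand-written split, not what the rest of this notebook uses.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy stand-in for the case data: 10 hypothetical counties, 5 rows each.
toy = pd.DataFrame({
    "full_loc_name": np.repeat([f"County {i}" for i in range(10)], 5),
    "new_cases_1e6": np.arange(50, dtype=float),
})

# One shuffled split; test_size is the fraction of counties, not rows.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(toy, groups=toy["full_loc_name"]))

toy_train = toy.iloc[train_idx]
toy_test = toy.iloc[test_idx]

# Counties that appear on both sides of the split (should be none).
shared = set(toy_train["full_loc_name"]) & set(toy_test["full_loc_name"])
print(len(shared), toy_test["full_loc_name"].nunique())
```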

**Final note**: Also try RNN-LSTM

In [115]:
# get the processed dataframe with default bins
# df3 = join_policies(case_df=df, policy_df=df2_preprocessed)
# df3.head()

Implement a custom version of `train_test_split`.

In [116]:
df.head()

Unnamed: 0,uid,location_type,fips_code,county,state,date,full_loc_name,total_population,cumulative_cases,cumulative_cases_1e6,cumulative_deaths,cumulative_deaths_1e6,new_cases,new_deaths,new_cases_1e6,new_deaths_1e6,new_cases_7day,new_deaths_7day,new_cases_7day_1e6,new_deaths_7day_1e6
30720,84001001,county,1001,autauga,Alabama,2020-01-22,"Autauga, Alabama",55200,0,0.0,0,0.0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
30721,84001001,county,1001,autauga,Alabama,2020-01-23,"Autauga, Alabama",55200,0,0.0,0,0.0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
30722,84001001,county,1001,autauga,Alabama,2020-01-24,"Autauga, Alabama",55200,0,0.0,0,0.0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
30723,84001001,county,1001,autauga,Alabama,2020-01-25,"Autauga, Alabama",55200,0,0.0,0,0.0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
30724,84001001,county,1001,autauga,Alabama,2020-01-26,"Autauga, Alabama",55200,0,0.0,0,0.0,0,0,0.0,0.0,0.0,0.0,0.0,0.0


In [117]:
def county_split(df, test_size, split_type="train_test"):
    """Split the rows of df by county, so that no county is in both sets.

    test_size is the fraction of counties (not rows) assigned to the test set.
    split_type = {"train_test", "train_validate"} is not used yet.
    Returns (df_test, df_train), in that order.
    """
    # Get an array of all unique counties
    all_counties = df['full_loc_name'].unique()

    # shuffle the counties in place
    np.random.shuffle(all_counties)

    # split the counties, then select the matching rows
    n_test = int(len(all_counties) * test_size)
    counties_test = all_counties[:n_test]
    counties_train = all_counties[n_test:]

    df_test = df[df['full_loc_name'].isin(counties_test)]
    df_train = df[df['full_loc_name'].isin(counties_train)]

    return df_test, df_train

In [118]:
df_test, df_train = county_split(df, test_size=0.2)
df_train

Unnamed: 0,uid,location_type,fips_code,county,state,date,full_loc_name,total_population,cumulative_cases,cumulative_cases_1e6,cumulative_deaths,cumulative_deaths_1e6,new_cases,new_deaths,new_cases_1e6,new_deaths_1e6,new_cases_7day,new_deaths_7day,new_cases_7day_1e6,new_deaths_7day_1e6
30720,84001001,county,1001,autauga,Alabama,2020-01-22,"Autauga, Alabama",55200,0,0.00,0,0.00,0,0,0.00,0.00,0.00,0.00,0.000000,0.000000
30721,84001001,county,1001,autauga,Alabama,2020-01-23,"Autauga, Alabama",55200,0,0.00,0,0.00,0,0,0.00,0.00,0.00,0.00,0.000000,0.000000
30722,84001001,county,1001,autauga,Alabama,2020-01-24,"Autauga, Alabama",55200,0,0.00,0,0.00,0,0,0.00,0.00,0.00,0.00,0.000000,0.000000
30723,84001001,county,1001,autauga,Alabama,2020-01-25,"Autauga, Alabama",55200,0,0.00,0,0.00,0,0,0.00,0.00,0.00,0.00,0.000000,0.000000
30724,84001001,county,1001,autauga,Alabama,2020-01-26,"Autauga, Alabama",55200,0,0.00,0,0.00,0,0,0.00,0.00,0.00,0.00,0.000000,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1237243,84056045,county,56045,weston,Wyoming,2021-02-04,"Weston, Wyoming",7100,613,8633.80,5,70.42,1,1,14.08,14.08,0.67,0.33,9.436620,4.647887
1237244,84056045,county,56045,weston,Wyoming,2021-02-05,"Weston, Wyoming",7100,613,8633.80,5,70.42,0,0,0.00,0.00,0.67,0.33,9.436620,4.647887
1237245,84056045,county,56045,weston,Wyoming,2021-02-06,"Weston, Wyoming",7100,614,8647.89,5,70.42,1,0,14.08,0.00,0.83,0.33,11.690141,4.647887
1237246,84056045,county,56045,weston,Wyoming,2021-02-07,"Weston, Wyoming",7100,616,8676.06,5,70.42,2,0,28.17,0.00,1.17,0.33,16.478873,4.647887


# Feature engineering on training set only

- All this will go into a function once I know what I'm doing

Now that we have the split, apply the feature engineering

In [119]:
df_train_proc = join_policies(case_df=df_train, policy_df=df2_preprocessed)

data shaped
bins: [(0, 6), (7, 13), (14, 999)]
time elapsed: 83.57977247238159


In [120]:
df_train_proc

Unnamed: 0_level_0,info,info,info,info,info,entertainment - start - county,entertainment - start - county,entertainment - start - county,houses of worship - start - state,houses of worship - start - state,...,outdoor and recreation - stop - county,manufacturing - stop - county,manufacturing - stop - county,manufacturing - stop - county,personal care - stop - state,personal care - stop - state,personal care - stop - state,personal care - start - county,personal care - start - county,personal care - start - county
Unnamed: 0_level_1,state,county,full_loc,date,new_cases_1e6,0-6,7-13,14-999,0-6,7-13,...,14-999,0-6,7-13,14-999,0-6,7-13,14-999,0-6,7-13,14-999
30720,Alabama,autauga,"Autauga, Alabama",2020-01-22,0.00,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
30721,Alabama,autauga,"Autauga, Alabama",2020-01-23,0.00,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
30722,Alabama,autauga,"Autauga, Alabama",2020-01-24,0.00,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
30723,Alabama,autauga,"Autauga, Alabama",2020-01-25,0.00,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
30724,Alabama,autauga,"Autauga, Alabama",2020-01-26,0.00,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1237243,Wyoming,weston,"Weston, Wyoming",2021-02-04,14.08,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
1237244,Wyoming,weston,"Weston, Wyoming",2021-02-05,0.00,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
1237245,Wyoming,weston,"Weston, Wyoming",2021-02-06,14.08,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
1237246,Wyoming,weston,"Weston, Wyoming",2021-02-07,28.17,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0


Implement linear regression

In [128]:
X = df_train_proc.loc[:, df_train_proc.columns[5:]].values
y = df_train_proc.loc[:, ('info', 'new_cases_1e6')].values

In [129]:
X

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [130]:
y

array([ 0.  ,  0.  ,  0.  , ..., 14.08, 28.17, 28.17])

In [131]:
model = LinearRegression()
model.fit(X, y)

LinearRegression()

In [132]:
model.score(X, y)

0.09597636145181387

Now implement K-fold cross validation (shuffling counties, not individual datapoints)
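As a point of comparison, scikit-learn's `GroupKFold` together with `cross_val_score` gives K-fold cross validation in which a group (here, a county) never straddles the train and validation sides. The sketch below uses synthetic data, so the features, target and group labels are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GroupKFold, cross_val_score

rng = np.random.default_rng(0)

# Synthetic stand-in: 20 hypothetical counties with 30 rows each.
groups = np.repeat(np.arange(20), 30)
X = rng.integers(0, 2, size=(600, 4)).astype(float)
y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + rng.normal(scale=0.1, size=600)

# Each fold holds out whole counties; the default score for regressors is R^2.
cv = GroupKFold(n_splits=5)
scores = cross_val_score(LinearRegression(), X, y, groups=groups, cv=cv)

print(scores.mean(), scores.std())
```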

**Q:** What metrics should we use to evaluate performance? This sounds like a good excuse to go through sklearn's metrics library. Stick with R² for now (this is what `score()` returns for sklearn regressors).
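A few candidates from `sklearn.metrics`, computed side by side on made-up numbers (the `y_true` and `y_pred` values below are illustrative only, not model output). MAE and RMSE are in the units of the target, which may be easier to interpret than R² for cases per million.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up targets and predictions, purely for illustration.
y_true = np.array([0.0, 14.0, 28.0, 28.0])
y_pred = np.array([7.0, 14.0, 21.0, 28.0])

mae = mean_absolute_error(y_true, y_pred)           # mean of |error|
rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # root of mean squared error
r2 = r2_score(y_true, y_pred)                       # 1 - SS_res / SS_tot

print(mae, rmse, r2)
```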

In [141]:
from sklearn.base import clone

def train_model(df_train_proc, model_in, K=10, verbose=True):
    """Train a model using K-fold cross validation over counties.

    Parameters
    ----------
    df_train_proc: DataFrame
        processed dataframe (after selecting bins and joining with cases)
    model_in: sklearn estimator instance
        unfitted model, e.g. LinearRegression()
    K: integer
        number of cross-validation folds
    verbose: Boolean
        print the score of each fold

    Returns
    -------
    R_scores: list of float
        R^2 on the validation counties of each fold
    """
    R_scores = []
    counties = df_train_proc[('info', 'full_loc')].unique()

    # shuffle the counties
    np.random.shuffle(counties)
    # counties left over after K full batches are never used for validation
    batch_size = int(len(counties) / K)

    if verbose:
        print("K = ", K)
        print("batch size = ", batch_size)

    for k in range(K):
        # rows belonging to the validation counties of this fold
        in_fold = df_train_proc[('info', 'full_loc')].isin(
            counties[k*batch_size:(k+1)*batch_size])
        df_train = df_train_proc[~in_fold]
        df_validate = df_train_proc[in_fold]

        # features start at column 5; the first five are the 'info' columns
        X_train = df_train.loc[:, df_train.columns[5:]].values
        y_train = df_train.loc[:, ('info', 'new_cases_1e6')].values

        X_validate = df_validate.loc[:, df_validate.columns[5:]].values
        y_validate = df_validate.loc[:, ('info', 'new_cases_1e6')].values

        # fit a fresh copy so no state carries over between folds
        model = clone(model_in)
        model.fit(X_train, y_train)

        # score() returns R^2 for sklearn regressors
        R_score = model.score(X_validate, y_validate)

        if verbose:
            print("test = ", k, "score = ", R_score)
        R_scores.append(R_score)
    return R_scores
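One thing to note about `train_model`: because `batch_size` is `int(len(counties) / K)`, any counties left over when the count is not divisible by K stay in the training portion of every fold and are never validated on. A minimal sketch of one way around this, assuming `np.array_split` is acceptable for building the folds (the county names below are hypothetical):

```python
import numpy as np

K = 5
# 23 hypothetical counties, which is not divisible by K.
counties = np.array([f"County {i}" for i in range(23)])

rng = np.random.default_rng(0)
rng.shuffle(counties)

# Fixed batch size: the remainder never appears in a validation fold.
batch_size = int(len(counties) / K)
n_never_validated = len(counties) - K * batch_size

# array_split spreads the remainder across the first folds instead.
folds = np.array_split(counties, K)
fold_sizes = [len(f) for f in folds]

print(n_never_validated, fold_sizes)
```

Each element of `folds` could then be passed to `isin` in place of the `counties[k*batch_size:(k+1)*batch_size]` slice.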

In [144]:
R_OLS = train_model(df_train_proc, model_in=LinearRegression(), K=10, verbose=True)

K =  10
batch size =  251
test =  0 score =  0.11060428006312528
test =  1 score =  0.0942813393959252
test =  2 score =  0.09496151915554674
test =  3 score =  0.10344527579274776
test =  4 score =  0.10113755303227479
test =  5 score =  0.07882658627709183
test =  6 score =  0.10028583542930558
test =  7 score =  0.09842316793946071
test =  8 score =  0.09787256255161059
test =  9 score =  0.07985011647030904


In [146]:
from sklearn.linear_model import Ridge

R_Ridge = train_model(df_train_proc, model_in=Ridge(), K=10, verbose=True)

K =  10
batch size =  251
test =  0 score =  0.09847505051021388
test =  1 score =  0.09939946785052678
test =  2 score =  0.10561974966428522
test =  3 score =  0.09752468133356396
test =  4 score =  0.08840448249830024
test =  5 score =  0.10382033144185865
test =  6 score =  0.10505164965487823
test =  7 score =  0.10065795446428893
test =  8 score =  0.10625447118052811
test =  9 score =  0.06436540205511365


In [147]:
from sklearn.linear_model import Lasso

R_Lasso = train_model(df_train_proc, model_in=Lasso(), K=10, verbose=True)

K =  10
batch size =  251
test =  0 score =  0.05285810351205267
test =  1 score =  0.0693284534143539
test =  2 score =  0.08837286809110134
test =  3 score =  0.09397246867182596
test =  4 score =  0.11472298145901483
test =  5 score =  0.07224126245499773
test =  6 score =  0.08885682560793295
test =  7 score =  0.08667649183387882
test =  8 score =  0.09166949113836098
test =  9 score =  0.10065113981062068


In [148]:
from sklearn.tree import DecisionTreeRegressor

R_Desc_tree = train_model(df_train_proc, model_in=DecisionTreeRegressor(), K=10, verbose=True)

K =  10
batch size =  251
test =  0 score =  0.10200835100095895
test =  1 score =  0.16151591557022216
test =  2 score =  0.1559417402524591
test =  3 score =  0.1457276764964578
test =  4 score =  0.12067647486917377
test =  5 score =  0.1698705121340297
test =  6 score =  0.14138246185988756
test =  7 score =  0.09793471935375175
test =  8 score =  0.1357704770008188
test =  9 score =  0.14268561625964526
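To compare models it helps to reduce each list of fold scores to a mean and a spread. The sketch below does this for the linear regression and decision tree scores printed above, copied here rounded to eight decimals. The `summarize` helper is a hypothetical name introduced for this example. Since `train_model` reshuffles the counties on every call, the folds differ between models, so only the aggregate numbers are comparable, not individual folds.

```python
import numpy as np

# Per-fold R^2 values copied (rounded) from the outputs above.
r_ols = [0.11060428, 0.09428134, 0.09496152, 0.10344528, 0.10113755,
         0.07882659, 0.10028584, 0.09842317, 0.09787256, 0.07985012]
r_tree = [0.10200835, 0.16151592, 0.15594174, 0.14572768, 0.12067647,
          0.16987051, 0.14138246, 0.09793472, 0.13577048, 0.14268562]

def summarize(scores):
    """Return (mean, sample standard deviation) of per-fold scores."""
    scores = np.asarray(scores, dtype=float)
    return scores.mean(), scores.std(ddof=1)

mean_ols, std_ols = summarize(r_ols)
mean_tree, std_tree = summarize(r_tree)

print(f"OLS:           {mean_ols:.3f} +/- {std_ols:.3f}")
print(f"Decision tree: {mean_tree:.3f} +/- {std_tree:.3f}")
```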
