<a class="anchor" id="top"></a>
# Modeling Notebook
**Authors: Ainesh Pandey, Demian Gass, Gabriel Gilling**

In this notebook, we read in the prepped datasets and start modeling on our selected outcome variables.

## Table of Contents

[Step 1: Import Required Packages](#step-1) <br>
[Step 2: Read and Prepare Datasets](#step-2) <br>
[Step 3: Modeling](#step-3) <br>

<a class="anchor" id="step-1"></a>

## Import Required Packages

In [5]:
import pandas as pd
import numpy as np
from scipy import stats
import pickle

from imblearn.under_sampling import NearMiss
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from lightgbm import LGBMClassifier

import warnings
warnings.filterwarnings('ignore')

import time

from my_path import PATH

[Back to Top](#top)

<a class="anchor" id="step-2"></a>

## Read and Prepare Datasets

### Input Datasets

**Deltas Dataframe**: These are the primary inputs to our model.

In [3]:
df_deltas = pd.read_csv("Data/delta_df.csv")

# Normalize numeric features in deltas dataframe
for column in df_deltas.columns:
    if df_deltas[column].dtype == 'float64':
        df_deltas[column] = (df_deltas[column] - df_deltas[column].min()) / (df_deltas[column].max() - df_deltas[column].min())

print('Rows:',    df_deltas.shape[0])
print('Columns:', df_deltas.shape[1])
df_deltas.head()

Rows: 9289
Columns: 209


Unnamed: 0,PublicID,V1BA01_LB_delta_V2BA01_LB,V2BA01_LB_delta_V3BA01_LB,V1BA01_KG_delta_V2BA01_KG,V2BA01_KG_delta_V3BA01_KG,V2BA02b2_delta_V3BA02b2,V2BA02a2_delta_V3BA02a2,V2BA02a1_delta_V3BA02a1,V2BA02b1_delta_V3BA02b1,V1CA06_delta_V3CA06,V1CA05_delta_V3CA05,V1CA03_delta_V3CA03,V1CA02_delta_V3CA02,V1CA08_delta_V3CA08,V1CA04_delta_V3CA04,V1CA09_delta_V3CA09,V1CA07_delta_V3CA07,V1CA10_delta_V3CA10,V1CA01_delta_V3CA01,V1KB01_delta_V3KB01,V1KB03c_delta_V3KB03c,V1KB05_delta_V3KB05,V1KA04_HR_delta_V3KA04_HR,V1KA02_HR_delta_V3KA02_HR,V1KA01_MIN_delta_V3KA01_MIN,V1KB03b_delta_V3KB03b,V1KA02_AMPM_delta_V3KA02_AMPM,V1KA04_MIN_delta_V3KA04_MIN,V1KA03_MIN_delta_V3KA03_MIN,V1KA03_HR_delta_V3KA03_HR,V1KA03_AMPM_delta_V3KA03_AMPM,V1KB04c_delta_V3KB04c,V1KA01_AMPM_delta_V3KA01_AMPM,V1KB02a_delta_V3KB02a,V1KA01_HR_delta_V3KA01_HR,V1KB04b_delta_V3KB04b,V1KB02b_delta_V3KB02b,V1KB06_delta_V3KB06,V1KB03a_delta_V3KB03a,V1KA02_MIN_delta_V3KA02_MIN,V1KB02d_delta_V3KB02d,V1KB02c_delta_V3KB02c,V1KB04a_delta_V3KB04a,V1LF01a_delta_V3LF01a,V1LB03_delta_V3LB03,V1LA04_delta_V3LA04,V1LA05b_AMPM_delta_V3LA05b_AMPM,V1LA05b_MIN_delta_V3LA05b_MIN,V1LB06_delta_V3LB06,V1LB10a_delta_V3LB10a,V1LB01_delta_V3LB01,V1LC01g_delta_V3LC01g,V1LA05b_HR_delta_V3LA05b_HR,V1LE02_delta_V3LE02,V1LE05a_delta_V3LE05a,V1LC01b_delta_V3LC01b,V1LD01b_delta_V3LD01b,V1LB07a_delta_V3LB07a,V1LD01_delta_V3LD01,V1LB09b_delta_V3LB09b,V1LE01c_delta_V3LE01c,V1LE05_delta_V3LE05,V1LA05a_HR_delta_V3LA05a_HR,V1LF02_delta_V3LF02,V1LB07b_delta_V3LB07b,V1LE03_delta_V3LE03,V1LA03_delta_V3LA03,V1LA06b_MIN_delta_V3LA06b_MIN,V1LB10b_delta_V3LB10b,V1LC01c_delta_V3LC01c,V1LA06b_HR_delta_V3LA06b_HR,V1LA01a_delta_V3LA01a,V1LA07_delta_V3LA07,V1LA01b_delta_V3LA01b,V1LC01e_delta_V3LC01e,V1LE01a_delta_V3LE01a,V1LA05a_AMPM_delta_V3LA05a_AMPM,V1LE01_delta_V3LE01,V1LA02b_delta_V3LA02b,V1LD01a_delta_V3LD01a,V1LC01h_delta_V3LC01h,V1LA05a_MIN_delta_V3LA05a_MIN,V1LA06b_AMPM_delta_V3LA06b_AMPM,V1LE04_delta_V3LE04,V1LC01d_delta_V3LC01d,V1LB05_delta_V3LB05,V1LB
04_delta_V3LB04,V1LA06a_MIN_delta_V3LA06a_MIN,V1LB08a_delta_V3LB08a,V1LB09a_delta_V3LB09a,V1LF01b_delta_V3LF01b,V1LA06a_AMPM_delta_V3LA06a_AMPM,V1LE01b_delta_V3LE01b,V1LB02_delta_V3LB02,V1LC01i_delta_V3LC01i,V1LA06a_HR_delta_V3LA06a_HR,V1LA01_delta_V3LA01,V1LC01f_delta_V3LC01f,V1LA02a_delta_V3LA02a,V1LD01c_delta_V3LD01c,V1LF01c_delta_V3LF01c,V1LB08b_delta_V3LB08b,V1LC01a_delta_V3LC01a,U1CD05_delta_U2CD05,U2CD05_delta_U3CD05,U1CB04b_delta_U2CB04b,U2CB04b_delta_U3CB04b,U1CD01_delta_U2CD01,U2CD01_delta_U3CD01,U1CB03_delta_U2CB03,U2CB03_delta_U3CB03,U1CC02_delta_U2CC02,U2CC02_delta_U3CC02,U1CC12_delta_U2CC12,U2CC12_delta_U3CC12,U1CC09_delta_U2CC09,U2CC09_delta_U3CC09,U1CD08_delta_U2CD08,U2CD08_delta_U3CD08,U1CD06_delta_U2CD06,U2CD06_delta_U3CD06,U1CB04c_delta_U2CB04c,U2CB04c_delta_U3CB04c,U1CC06_delta_U2CC06,U2CC06_delta_U3CC06,U1CB04a_delta_U2CB04a,U2CB04a_delta_U3CB04a,U1CC01_delta_U2CC01,U2CC01_delta_U3CC01,U1CC04_delta_U2CC04,U2CC04_delta_U3CC04,U1CA02_DY_delta_U2CA02_DY,U2CA02_DY_delta_U3CA02_DY,U1CD11_delta_U2CD11,U2CD11_delta_U3CD11,U1CC07_delta_U2CC07,U2CC07_delta_U3CC07,U1CD09_delta_U2CD09,U2CD09_delta_U3CD09,U1CD03_delta_U2CD03,U2CD03_delta_U3CD03,U1CD02_delta_U2CD02,U2CD02_delta_U3CD02,U1CD04_delta_U2CD04,U2CD04_delta_U3CD04,U1CB04d_delta_U2CB04d,U2CB04d_delta_U3CB04d,U1CD10_delta_U2CD10,U2CD10_delta_U3CD10,U1CC10_delta_U2CC10,U2CC10_delta_U3CC10,U1CC05_delta_U2CC05,U2CC05_delta_U3CC05,U1CC03_delta_U2CC03,U2CC03_delta_U3CC03,U1CB01_delta_U2CB01,U2CB01_delta_U3CB01,U1CD13_delta_U2CD13,U2CD13_delta_U3CD13,U1CC11_delta_U2CC11,U2CC11_delta_U3CC11,U1CD09a_delta_U2CD09a,U2CD09a_delta_U3CD09a,U1CC13_delta_U2CC13,U2CC13_delta_U3CC13,U1CD07_delta_U2CD07,U2CD07_delta_U3CD07,U1CC08_delta_U2CC08,U2CC08_delta_U3CC08,U1CD12_delta_U2CD12,U2CD12_delta_U3CD12,U1CC09a_delta_U2CC09a,U2CC09a_delta_U3CC09a,U1CB02_delta_U2CB02,U2CB02_delta_U3CB02,U1CA02_WK_delta_U2CA02_WK,U2CA02_WK_delta_U3CA02_WK,U2AC03c_delta_U3AC03c,U2AB04_delta_U3AB04,U2AC02_delta_U3AC02,U2AC03d_delta_U3AC03d,
U2AB02_delta_U3AB02,U2AB05_delta_U3AB05,U2AC03e_delta_U3AC03e,U2AA04_delta_U3AA04,U2AC03f_delta_U3AC03f,U2AA02_DY_delta_U3AA02_DY,U2AC03a_delta_U3AC03a,U2AB01_delta_U3AB01,U2AC03b_delta_U3AC03b,U2AA02_WK_delta_U3AA02_WK,U2AC01_delta_U3AC01,U2AB03_delta_U3AB03,U2AB07_delta_U3AB07,U2BA02_DY_delta_U3BA02_DY,U2BA04_delta_U3BA04,U2BC02_delta_U3BC02,U2BB03_delta_U3BB03,U2BB04_delta_U3BB04,U2BC03b_delta_U3BC03b,U2BC01_delta_U3BC01,U2BC03e_delta_U3BC03e,U2BC03c_delta_U3BC03c,U2BB02_delta_U3BB02,U2BB05_delta_U3BB05,U2BC03a_delta_U3BC03a,U2BA02_WK_delta_U3BA02_WK,U2BC03d_delta_U3BC03d,U2BB01_delta_U3BB01
0,00001U,0.51268,0.516508,0.522239,0.663086,0.514788,0.5745,0.440188,0.339237,0-3.0,0-3.0,0-2.0,0-2.0,0-3.0,0-3.0,0-3.0,0-3.0,0-4.0,0-1.0,0-2.0,0-3.0,0.520566,0.508436,0.472849,0.502378,0-2.0,0-1.0,0.491935,0.502271,0.471739,0-1.0,D-4.0,0-1.0,0-4.0,0.505661,D-3.0,0-3.0,0.363529,0-3.0,0.491284,0-3.0,0-1.0,D-1.0,M-M,0-0,0.612364,0-0,0.494075,0-0,0.805934,0-0,0-0,0.390001,0-0,0-0,0-0,M-M,D-D,M-M,0.294765,M-M,M-M,0.539761,M-M,D-D,0-0,0.738607,0.469542,0.45954,0-0,0.495013,0-0,0-0,0.649214,0-0,0-0,0-0,M-M,0.518427,M-M,0-0,0.469123,0-0,0-0,0-0,0-0,0-0,0.48979,D-D,0.632884,M-M,0-0,0-0,0-0,0-0,0.522161,M-M,0-0,0.556651,M-M,M-M,D-D,0-0,0.402477,0.426625,S-S,S-S,M-M,M-M,S-S,S-S,0.58733,0.620834,M-M,M-M,M-M,M-M,0.508805,0.490593,0.794132,0.476815,S-S,S-S,0.399309,0.480069,S-S,S-S,M-M,M-M,0.504052,0.508796,0.492826,0.502878,0-0,0-0,0.54744,0.52993,M-M,M-M,0.498806,0.46868,0.484482,0.484906,0.226622,0.483672,S-S,S-S,0-0,0-0,0-0,0-0,0.210555,0.186867,0.624775,0.499804,0-0,0-0,0-0,0-0,0-0,0-0,0.510512,0.544375,0-0,0-0,0.498574,0.554709,0.509151,0.505634,M-M,M-M,0.684467,0.371689,0-0,0-0,0.496271,0.573151,S-S,0.546891,S-0.0,S-S,0.556432,0.528562,S-S,0-1.0,S-S,0.740694,S-S,0-1.0,S-S,0.580595,0-0,0.059347,0.482712,0.502593,0-0,S-S,0-0,0.580346,S-S,0-0,S-S,S-S,0.519531,M-M,S-S,0.453007,S-S,0-0
1,00004O,0.472393,0.482693,0.522239,0.663086,0.514788,0.5745,0.601595,0.3367,3.0-2.0,3.0-3.0,2.0-1.0,1.0-1.0,3.0-3.0,3.0-3.0,2.0-3.0,2.0-3.0,3.0-3.0,1.0-1.0,3.0-4.0,2.0-3.0,0.551703,0.527128,0.619115,0.550328,3.0-2.0,1.0-1.0,0.764985,0.351591,0.571552,1.0-1.0,7.0-3.0,2.0-2.0,3.0-3.0,0.503678,7.0-6.0,4.0-4.0,0.409332,4.0-4.0,0.296955,1.0-1.0,1.0-1.0,7-4.0,D-D,3.0-5.0,0.609442,1.0-1.0,0.495957,5.0-5.0,0.808713,4.0-5.0,3.0-4.0,0.389927,4.0-4.0,0-5.0,3.0-2.0,M-M,7.0-5,2-2,0.289252,M-M,2.0-1.0,0.578268,2.0-2,1-7.0,1.0-1.0,0.734926,0.855687,0.454917,3.0-1.0,0.544049,0-0,2.0-2.0,0.649214,4.0-4.0,0-0,2.0-2.0,D-2,0.539543,M-M,1.0-3.0,0.529579,1.0-1.0,1.0-2.0,3.0-4.0,1.0-1.0,3.0-4.0,0.498072,6.0-5,0.605237,D-D,1.0-1.0,0-0,4.0-5.0,2.0-1.0,0.474096,2.0-2.0,2.0-2.0,0.543977,M-M,2-2,6.0-6.0,3.0-3.0,0.364425,0.426625,S-S,S-S,1.0-2.0,M-1.0,0.0-0.0,S-0.0,0.60974,0.620834,2.0-2.0,M-2.0,2.0-2.0,M-2.0,0.513085,0.490593,0.751279,0.476815,S-S,S-S,0.4878,0.480069,S-S,S-S,2.0-2.0,M-2.0,0.493572,0.508796,0.498829,0.502878,1.0-1.0,0-1.0,0.622815,0.52993,2.0-2.0,M-2.0,0.575438,0.46868,0.446076,0.484906,0.221589,0.483672,S-S,S-S,1.0-1.0,0-1.0,1.0-1.0,0-1.0,0.209653,0.186867,0.73547,0.499804,1.0-1.0,0-1.0,1.0-1.0,0-1.0,1.0-1.0,0-1.0,0.510512,0.544375,1.0-1.0,0-1.0,0.34573,0.554709,0.505211,0.505634,2.0-2.0,M-2.0,0.684467,0.371689,2.0-2.0,0-2.0,0.364039,0.573151,S-S,0.637255,0.0-0.0,S-S,0.784081,0.719782,S-S,2.0-2.0,S-S,0.738879,S-S,1.0-1.0,S-S,0.75303,2.0-2.0,0.078868,0.482712,0.499938,2.0-2.0,0.0-0.0,2.0-2.0,0.580346,S-S,2.0-2.0,S-S,S-S,0.54175,2.0-2.0,S-S,0.585123,S-S,1.0-1.0
2,00007I,0.510677,0.521473,0.522239,0.663086,0.514788,0.5745,0.431082,0.395469,3.0-2.0,4.0-4.0,3.0-3.0,1.0-1.0,3.0-3.0,3.0-3.0,4.0-3.0,3.0-4.0,4.0-4.0,1.0-1.0,0-0,0-0,0.520566,0.508436,0.472849,0.502378,0-0,0-0,0.491935,0.502271,0.471739,0-0,D-D,0-0,0-0,0.505661,D-D,0-0,0.363529,0-0,0.491284,0-0,0-0,D-D,2-2,4.0-2.0,0.605333,1.0-1.0,0.495957,5.0-4.0,0.821356,5.0-5.0,1.0-1.0,0.366654,5.0-3.0,0-0,2.0-3.0,1-D,1.0-6,1-1,0.284087,M-M,2.0-2.0,0.495153,2.0-D,1-1.0,1.0-4.0,0.740032,0.470852,0.435161,1.0-1.0,0.909559,0-0,2.0-1.0,0.649214,2.0-3.0,0-0,2.0-1.0,D-2,0.507025,1-1.0,1.0-1.0,0.482536,1.0-2.0,2.0-4.0,1.0-1.0,1.0-1.0,5.0-4.0,0.500386,4.0-1,0.614637,2-2,1.0-1.0,0-0,5.0-4.0,1.0-1.0,0.429951,2.0-2.0,1.0-1.0,0.516193,D-D,2-D,4.0-3.0,2.0-2.0,0.402477,0.426625,S-S,S-S,M-M,M-M,0.0-S,S-0.0,0.58733,0.620834,2.0-M,M-2.0,2.0-M,M-2.0,0.508805,0.490593,0.794132,0.476815,S-S,S-S,0.399309,0.480069,S-S,S-S,M-M,M-M,0.504052,0.508796,0.492826,0.502878,1.0-0,0-1.0,0.54744,0.52993,2.0-M,M-2.0,0.498806,0.46868,0.484482,0.484906,0.226622,0.483672,S-S,S-S,1.0-0,0-1.0,1.0-0,0-1.0,0.210555,0.186867,0.624775,0.499804,1.0-0,0-1.0,2.0-0,0-2.0,1.0-0,0-1.0,0.510512,0.544375,2.0-0,0-2.0,0.498574,0.554709,0.509151,0.505634,2.0-M,M-2.0,0.684467,0.371689,2.0-0,0-2.0,0.496271,0.573151,S-S,0.501886,0.0-0.0,S-S,0.543974,0.541368,S-S,2.0-1.0,S-S,0.738589,S-S,1.0-1.0,S-S,0.606818,2.0-0,0.055333,0.482712,0.666791,2.0-2.0,0.0-0.0,2.0-2.0,0.580346,S-S,2.0-2.0,S-S,S-S,0.513992,2.0-2.0,S-S,0.418275,S-S,1.0-1.0
3,00008G,0.563604,0.554183,0.522239,0.663086,0.514788,0.5745,0.447863,0.33119,3.0-3.0,3.0-2.0,3.0-2.0,1.0-1.0,4.0-3.0,3.0-2.0,4.0-4.0,4.0-4.0,4.0-4.0,1.0-1.0,3.0-1.0,1.0-2.0,0.578304,0.696367,0.380532,0.64377,3.0-2.0,1.0-1.0,0.119559,0.630634,0.333303,1.0-1.0,1.0-1.0,2.0-1.0,3.0-3.0,0.829215,1.0-3.0,2.0-2.0,0.363564,3.0-3.0,0.18237,3.0-2.0,2.0-2.0,1-1.0,2-2,4.0-2.0,0.730271,2.0-1.0,0.495957,5.0-4.0,0.813747,4.0-2.0,1.0-4.0,0.33489,5.0-5.0,0-0,3.0-4.0,M-1,1.0-3,2-1,0.326489,M-M,2.0-2.0,0.498693,2.0-2,1-1.0,1.0-4.0,0.750243,0.470852,0.454917,2.0-4.0,0.453608,1.0-1.0,1.0-2.0,0.656031,3.0-4.0,0-0,2.0-2.0,D-2,0.520032,M-1.0,1.0-1.0,0.476736,1.0-1.0,1.0-4.0,4.0-4.0,1.0-1.0,4.0-2.0,0.495759,1.0-3,0.646323,2-2,1.0-1.0,0-0,5.0-4.0,1.0-1.0,0.520795,1.0-1.0,1.0-2.0,0.558365,M-1,2-2,1.0-3.0,3.0-4.0,0.402477,0.426625,S-S,S-S,2.0-M,M-2.0,0.0-S,S-0.0,0.58733,0.620834,2.0-M,M-2.0,2.0-M,M-2.0,0.508805,0.490593,0.794132,0.476815,S-S,S-S,0.399309,0.480069,S-S,S-S,2.0-M,M-2.0,0.504052,0.508796,0.492826,0.502878,1.0-0,0-1.0,0.54744,0.52993,2.0-M,M-2.0,0.498806,0.46868,0.484482,0.484906,0.226622,0.483672,S-S,S-S,1.0-0,0-1.0,1.0-0,0-1.0,0.210555,0.186867,0.624775,0.499804,1.0-0,0-1.0,2.0-0,0-2.0,1.0-0,0-1.0,0.510512,0.544375,2.0-0,0-2.0,0.498574,0.554709,0.509151,0.505634,2.0-M,M-2.0,0.684467,0.371689,2.0-0,0-2.0,0.496271,0.573151,S-S,0.705715,0.0-0.0,S-S,0.68552,0.630575,S-S,2.0-2.0,S-S,0.826112,S-S,1.0-1.0,S-S,0.730303,2.0-2.0,0.067288,0.482712,0.499814,2.0-2.0,0.0-0.0,2.0-2.0,0.580346,S-S,2.0-2.0,S-S,S-S,0.547669,2.0-2.0,S-S,0.519989,S-S,1.0-1.0
4,00015J,0.530073,0.483132,0.522239,0.663086,0.514788,0.5745,0.579686,0.421487,4.0-4.0,4.0-4.0,4.0-4.0,1.0-1.0,4.0-4.0,1.0-1.0,4.0-4.0,4.0-4.0,4.0-4.0,1.0-1.0,0-0,0-0,0.520566,0.508436,0.472849,0.502378,0-0,0-0,0.491935,0.502271,0.471739,0-0,D-D,0-0,0-0,0.505661,D-D,0-0,0.363529,0-0,0.491284,0-0,0-0,D-D,2-2,1.0-1.0,0.633406,1.0-2.0,0.243776,2.0-2.0,0.804967,1.0-4.0,1.0-2.0,0.41787,5.0-5.0,0-4.0,1.0-1.0,M-M,3.0-4,2-2,0.278381,M-M,2.0-1.0,0.502234,2.0-2,3-2.0,2.0-3.0,0.727959,0.188341,0.484662,1.0-1.0,0.453608,1.0-1.0,2.0-2.0,0.643141,4.0-4.0,0-0,2.0-2.0,2-2,0.507025,M-M,1.0-2.0,0.482536,1.0-1.0,2.0-3.0,3.0-2.0,1.0-1.0,3.0-3.0,0.500386,3.0-4,0.640412,2-2,1.0-1.0,0-0,3.0-4.0,2.0-2.0,0.52335,1.0-1.0,1.0-1.0,0.565346,M-M,2-2,3.0-4.0,1.0-3.0,0.412681,0.426625,S-S,S-S,2.0-2.0,M-2.0,0.0-0.0,S-0.0,0.605388,0.620834,2.0-2.0,M-2.0,2.0-2.0,M-2.0,0.525839,0.490593,0.834683,0.476815,S-S,S-S,0.311059,0.480069,S-S,S-S,1.0-1.0,M-1.0,0.493464,0.508796,0.582748,0.502878,1.0-1.0,0-1.0,0.512751,0.52993,2.0-1.0,M-2.0,0.566368,0.46868,0.667047,0.484906,0.208319,0.483672,S-S,S-S,1.0-1.0,0-1.0,1.0-1.0,0-1.0,0.210304,0.186867,0.717389,0.499804,1.0-1.0,0-1.0,2.0-1.0,0-2.0,1.0-1.0,0-1.0,0.510512,0.544375,2.0-1.0,0-2.0,0.565546,0.554709,0.514487,0.505634,2.0-2.0,M-2.0,0.684467,0.371689,2.0-2.0,0-2.0,0.569705,0.573151,S-S,0.506493,0.0-0.0,S-S,0.62464,0.564749,S-S,2.0-1.0,S-S,0.739023,S-S,1.0-1.0,S-S,0.618181,2.0-0,0.054203,0.482712,0.58324,2.0-2.0,0.0-0.0,2.0-2.0,0.580346,S-S,2.0-2.0,S-S,S-S,0.602954,2.0-2.0,S-S,0.316562,S-S,1.0-1.0
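
The normalization loop above applies min-max scaling, `(x - min) / (max - min)`, to every `float64` column and leaves the string-coded delta columns (such as `0-0` or `M-M`) untouched. The sketch below reproduces that step on a toy frame. The helper name `min_max_scale` and the choice to map a constant column to `0.0` (instead of the `NaN` that a zero range would otherwise produce) are assumptions made for illustration, not part of the notebook.

```python
import pandas as pd

def min_max_scale(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with every float64 column rescaled to [0, 1]."""
    out = df.copy()
    for column in out.columns:
        if out[column].dtype == 'float64':
            col_min = out[column].min()
            col_range = out[column].max() - col_min
            if col_range == 0:
                # Assumption: a constant column carries no signal, so map it to 0.0
                # (multiplying by 0.0 keeps any missing values as NaN)
                out[column] = out[column] * 0.0
            else:
                out[column] = (out[column] - col_min) / col_range
    return out

# Made-up values, for illustration only
toy = pd.DataFrame({
    'weight_delta': [2.0, 4.0, 6.0],        # float64: rescaled
    'constant':     [1.5, 1.5, 1.5],        # float64 with zero range: set to 0.0
    'category':     ['0-0', 'M-M', 'S-S'],  # non-float column: left untouched
})
scaled = min_max_scale(toy)
print(scaled)
```

Without the zero-range guard, a column whose values are all identical would be divided by zero and become entirely `NaN`.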


**Covariates Dataframe**: We will adjust our models for selected demographic and socio-economic variables.

In [4]:
with open('Data/df_covariates.pkl', 'rb') as f:
    df_covariates = pickle.load(f)
    
# Normalize numeric features in covariates dataframe
for column in df_covariates.columns:
    if df_covariates[column].dtype == 'float64':
        df_covariates[column] = (df_covariates[column] - df_covariates[column].min()) / (df_covariates[column].max() - df_covariates[column].min())

print('Rows:',    df_covariates.shape[0])
print('Columns:', df_covariates.shape[1])
df_covariates.head()

Rows: 9289
Columns: 17


Unnamed: 0,GAwks_screen,Age_at_V1,eRace,BMI,Education,GravCat,SmokeCat1,SmokeCat2,Ins_Govt,Ins_Mil,Ins_Comm,Ins_Pers,Ins_Othr,V1AF14,V1AG01,V1AG11,PublicID
0,0.75,0.5,7,0.265029,6.0,1.0,1.0,2.0,2.0,2.0,1.0,2.0,2.0,11,1,2,00001U
1,0.75,0.25,6,0.180132,3.0,1.0,2.0,2.0,2.0,2.0,1.0,2.0,2.0,5,2,2,00004O
2,0.625,0.1875,5,0.147336,3.0,3.0,1.0,1.0,1.0,2.0,2.0,2.0,2.0,4,1,1,00007I
3,0.625,0.53125,5,0.239248,2.0,1.0,2.0,2.0,2.0,2.0,1.0,1.0,2.0,10,1,2,00008G
4,0.75,0.59375,5,0.114749,6.0,1.0,2.0,2.0,2.0,2.0,1.0,2.0,2.0,12,1,2,00015J


**Base Dataframe**: Combine the inputs (the deltas and the covariates) into the base dataset.

In [5]:
df_base = df_deltas.merge(df_covariates, on='PublicID', how='inner')

display(df_base.shape)
df_base.head()

(9289, 225)

Unnamed: 0,PublicID,V1BA01_LB_delta_V2BA01_LB,V2BA01_LB_delta_V3BA01_LB,V1BA01_KG_delta_V2BA01_KG,V2BA01_KG_delta_V3BA01_KG,V2BA02b2_delta_V3BA02b2,V2BA02a2_delta_V3BA02a2,V2BA02a1_delta_V3BA02a1,V2BA02b1_delta_V3BA02b1,V1CA06_delta_V3CA06,V1CA05_delta_V3CA05,V1CA03_delta_V3CA03,V1CA02_delta_V3CA02,V1CA08_delta_V3CA08,V1CA04_delta_V3CA04,V1CA09_delta_V3CA09,V1CA07_delta_V3CA07,V1CA10_delta_V3CA10,V1CA01_delta_V3CA01,V1KB01_delta_V3KB01,V1KB03c_delta_V3KB03c,V1KB05_delta_V3KB05,V1KA04_HR_delta_V3KA04_HR,V1KA02_HR_delta_V3KA02_HR,V1KA01_MIN_delta_V3KA01_MIN,V1KB03b_delta_V3KB03b,V1KA02_AMPM_delta_V3KA02_AMPM,V1KA04_MIN_delta_V3KA04_MIN,V1KA03_MIN_delta_V3KA03_MIN,V1KA03_HR_delta_V3KA03_HR,V1KA03_AMPM_delta_V3KA03_AMPM,V1KB04c_delta_V3KB04c,V1KA01_AMPM_delta_V3KA01_AMPM,V1KB02a_delta_V3KB02a,V1KA01_HR_delta_V3KA01_HR,V1KB04b_delta_V3KB04b,V1KB02b_delta_V3KB02b,V1KB06_delta_V3KB06,V1KB03a_delta_V3KB03a,V1KA02_MIN_delta_V3KA02_MIN,V1KB02d_delta_V3KB02d,V1KB02c_delta_V3KB02c,V1KB04a_delta_V3KB04a,V1LF01a_delta_V3LF01a,V1LB03_delta_V3LB03,V1LA04_delta_V3LA04,V1LA05b_AMPM_delta_V3LA05b_AMPM,V1LA05b_MIN_delta_V3LA05b_MIN,V1LB06_delta_V3LB06,V1LB10a_delta_V3LB10a,V1LB01_delta_V3LB01,V1LC01g_delta_V3LC01g,V1LA05b_HR_delta_V3LA05b_HR,V1LE02_delta_V3LE02,V1LE05a_delta_V3LE05a,V1LC01b_delta_V3LC01b,V1LD01b_delta_V3LD01b,V1LB07a_delta_V3LB07a,V1LD01_delta_V3LD01,V1LB09b_delta_V3LB09b,V1LE01c_delta_V3LE01c,V1LE05_delta_V3LE05,V1LA05a_HR_delta_V3LA05a_HR,V1LF02_delta_V3LF02,V1LB07b_delta_V3LB07b,V1LE03_delta_V3LE03,V1LA03_delta_V3LA03,V1LA06b_MIN_delta_V3LA06b_MIN,V1LB10b_delta_V3LB10b,V1LC01c_delta_V3LC01c,V1LA06b_HR_delta_V3LA06b_HR,V1LA01a_delta_V3LA01a,V1LA07_delta_V3LA07,V1LA01b_delta_V3LA01b,V1LC01e_delta_V3LC01e,V1LE01a_delta_V3LE01a,V1LA05a_AMPM_delta_V3LA05a_AMPM,V1LE01_delta_V3LE01,V1LA02b_delta_V3LA02b,V1LD01a_delta_V3LD01a,V1LC01h_delta_V3LC01h,V1LA05a_MIN_delta_V3LA05a_MIN,V1LA06b_AMPM_delta_V3LA06b_AMPM,V1LE04_delta_V3LE04,V1LC01d_delta_V3LC01d,V1LB05_delta_V3LB05,V1LB
04_delta_V3LB04,V1LA06a_MIN_delta_V3LA06a_MIN,V1LB08a_delta_V3LB08a,V1LB09a_delta_V3LB09a,V1LF01b_delta_V3LF01b,V1LA06a_AMPM_delta_V3LA06a_AMPM,V1LE01b_delta_V3LE01b,V1LB02_delta_V3LB02,V1LC01i_delta_V3LC01i,V1LA06a_HR_delta_V3LA06a_HR,V1LA01_delta_V3LA01,V1LC01f_delta_V3LC01f,V1LA02a_delta_V3LA02a,V1LD01c_delta_V3LD01c,V1LF01c_delta_V3LF01c,V1LB08b_delta_V3LB08b,V1LC01a_delta_V3LC01a,U1CD05_delta_U2CD05,U2CD05_delta_U3CD05,U1CB04b_delta_U2CB04b,U2CB04b_delta_U3CB04b,U1CD01_delta_U2CD01,U2CD01_delta_U3CD01,U1CB03_delta_U2CB03,U2CB03_delta_U3CB03,U1CC02_delta_U2CC02,U2CC02_delta_U3CC02,U1CC12_delta_U2CC12,U2CC12_delta_U3CC12,U1CC09_delta_U2CC09,U2CC09_delta_U3CC09,U1CD08_delta_U2CD08,U2CD08_delta_U3CD08,U1CD06_delta_U2CD06,U2CD06_delta_U3CD06,U1CB04c_delta_U2CB04c,U2CB04c_delta_U3CB04c,U1CC06_delta_U2CC06,U2CC06_delta_U3CC06,U1CB04a_delta_U2CB04a,U2CB04a_delta_U3CB04a,U1CC01_delta_U2CC01,U2CC01_delta_U3CC01,U1CC04_delta_U2CC04,U2CC04_delta_U3CC04,U1CA02_DY_delta_U2CA02_DY,U2CA02_DY_delta_U3CA02_DY,U1CD11_delta_U2CD11,U2CD11_delta_U3CD11,U1CC07_delta_U2CC07,U2CC07_delta_U3CC07,U1CD09_delta_U2CD09,U2CD09_delta_U3CD09,U1CD03_delta_U2CD03,U2CD03_delta_U3CD03,U1CD02_delta_U2CD02,U2CD02_delta_U3CD02,U1CD04_delta_U2CD04,U2CD04_delta_U3CD04,U1CB04d_delta_U2CB04d,U2CB04d_delta_U3CB04d,U1CD10_delta_U2CD10,U2CD10_delta_U3CD10,U1CC10_delta_U2CC10,U2CC10_delta_U3CC10,U1CC05_delta_U2CC05,U2CC05_delta_U3CC05,U1CC03_delta_U2CC03,U2CC03_delta_U3CC03,U1CB01_delta_U2CB01,U2CB01_delta_U3CB01,U1CD13_delta_U2CD13,U2CD13_delta_U3CD13,U1CC11_delta_U2CC11,U2CC11_delta_U3CC11,U1CD09a_delta_U2CD09a,U2CD09a_delta_U3CD09a,U1CC13_delta_U2CC13,U2CC13_delta_U3CC13,U1CD07_delta_U2CD07,U2CD07_delta_U3CD07,U1CC08_delta_U2CC08,U2CC08_delta_U3CC08,U1CD12_delta_U2CD12,U2CD12_delta_U3CD12,U1CC09a_delta_U2CC09a,U2CC09a_delta_U3CC09a,U1CB02_delta_U2CB02,U2CB02_delta_U3CB02,U1CA02_WK_delta_U2CA02_WK,U2CA02_WK_delta_U3CA02_WK,U2AC03c_delta_U3AC03c,U2AB04_delta_U3AB04,U2AC02_delta_U3AC02,U2AC03d_delta_U3AC03d,
U2AB02_delta_U3AB02,U2AB05_delta_U3AB05,U2AC03e_delta_U3AC03e,U2AA04_delta_U3AA04,U2AC03f_delta_U3AC03f,U2AA02_DY_delta_U3AA02_DY,U2AC03a_delta_U3AC03a,U2AB01_delta_U3AB01,U2AC03b_delta_U3AC03b,U2AA02_WK_delta_U3AA02_WK,U2AC01_delta_U3AC01,U2AB03_delta_U3AB03,U2AB07_delta_U3AB07,U2BA02_DY_delta_U3BA02_DY,U2BA04_delta_U3BA04,U2BC02_delta_U3BC02,U2BB03_delta_U3BB03,U2BB04_delta_U3BB04,U2BC03b_delta_U3BC03b,U2BC01_delta_U3BC01,U2BC03e_delta_U3BC03e,U2BC03c_delta_U3BC03c,U2BB02_delta_U3BB02,U2BB05_delta_U3BB05,U2BC03a_delta_U3BC03a,U2BA02_WK_delta_U3BA02_WK,U2BC03d_delta_U3BC03d,U2BB01_delta_U3BB01,GAwks_screen,Age_at_V1,eRace,BMI,Education,GravCat,SmokeCat1,SmokeCat2,Ins_Govt,Ins_Mil,Ins_Comm,Ins_Pers,Ins_Othr,V1AF14,V1AG01,V1AG11
0,00001U,0.51268,0.516508,0.522239,0.663086,0.514788,0.5745,0.440188,0.339237,0-3.0,0-3.0,0-2.0,0-2.0,0-3.0,0-3.0,0-3.0,0-3.0,0-4.0,0-1.0,0-2.0,0-3.0,0.520566,0.508436,0.472849,0.502378,0-2.0,0-1.0,0.491935,0.502271,0.471739,0-1.0,D-4.0,0-1.0,0-4.0,0.505661,D-3.0,0-3.0,0.363529,0-3.0,0.491284,0-3.0,0-1.0,D-1.0,M-M,0-0,0.612364,0-0,0.494075,0-0,0.805934,0-0,0-0,0.390001,0-0,0-0,0-0,M-M,D-D,M-M,0.294765,M-M,M-M,0.539761,M-M,D-D,0-0,0.738607,0.469542,0.45954,0-0,0.495013,0-0,0-0,0.649214,0-0,0-0,0-0,M-M,0.518427,M-M,0-0,0.469123,0-0,0-0,0-0,0-0,0-0,0.48979,D-D,0.632884,M-M,0-0,0-0,0-0,0-0,0.522161,M-M,0-0,0.556651,M-M,M-M,D-D,0-0,0.402477,0.426625,S-S,S-S,M-M,M-M,S-S,S-S,0.58733,0.620834,M-M,M-M,M-M,M-M,0.508805,0.490593,0.794132,0.476815,S-S,S-S,0.399309,0.480069,S-S,S-S,M-M,M-M,0.504052,0.508796,0.492826,0.502878,0-0,0-0,0.54744,0.52993,M-M,M-M,0.498806,0.46868,0.484482,0.484906,0.226622,0.483672,S-S,S-S,0-0,0-0,0-0,0-0,0.210555,0.186867,0.624775,0.499804,0-0,0-0,0-0,0-0,0-0,0-0,0.510512,0.544375,0-0,0-0,0.498574,0.554709,0.509151,0.505634,M-M,M-M,0.684467,0.371689,0-0,0-0,0.496271,0.573151,S-S,0.546891,S-0.0,S-S,0.556432,0.528562,S-S,0-1.0,S-S,0.740694,S-S,0-1.0,S-S,0.580595,0-0,0.059347,0.482712,0.502593,0-0,S-S,0-0,0.580346,S-S,0-0,S-S,S-S,0.519531,M-M,S-S,0.453007,S-S,0-0,0.75,0.5,7,0.265029,6.0,1.0,1.0,2.0,2.0,2.0,1.0,2.0,2.0,11,1,2
1,00004O,0.472393,0.482693,0.522239,0.663086,0.514788,0.5745,0.601595,0.3367,3.0-2.0,3.0-3.0,2.0-1.0,1.0-1.0,3.0-3.0,3.0-3.0,2.0-3.0,2.0-3.0,3.0-3.0,1.0-1.0,3.0-4.0,2.0-3.0,0.551703,0.527128,0.619115,0.550328,3.0-2.0,1.0-1.0,0.764985,0.351591,0.571552,1.0-1.0,7.0-3.0,2.0-2.0,3.0-3.0,0.503678,7.0-6.0,4.0-4.0,0.409332,4.0-4.0,0.296955,1.0-1.0,1.0-1.0,7-4.0,D-D,3.0-5.0,0.609442,1.0-1.0,0.495957,5.0-5.0,0.808713,4.0-5.0,3.0-4.0,0.389927,4.0-4.0,0-5.0,3.0-2.0,M-M,7.0-5,2-2,0.289252,M-M,2.0-1.0,0.578268,2.0-2,1-7.0,1.0-1.0,0.734926,0.855687,0.454917,3.0-1.0,0.544049,0-0,2.0-2.0,0.649214,4.0-4.0,0-0,2.0-2.0,D-2,0.539543,M-M,1.0-3.0,0.529579,1.0-1.0,1.0-2.0,3.0-4.0,1.0-1.0,3.0-4.0,0.498072,6.0-5,0.605237,D-D,1.0-1.0,0-0,4.0-5.0,2.0-1.0,0.474096,2.0-2.0,2.0-2.0,0.543977,M-M,2-2,6.0-6.0,3.0-3.0,0.364425,0.426625,S-S,S-S,1.0-2.0,M-1.0,0.0-0.0,S-0.0,0.60974,0.620834,2.0-2.0,M-2.0,2.0-2.0,M-2.0,0.513085,0.490593,0.751279,0.476815,S-S,S-S,0.4878,0.480069,S-S,S-S,2.0-2.0,M-2.0,0.493572,0.508796,0.498829,0.502878,1.0-1.0,0-1.0,0.622815,0.52993,2.0-2.0,M-2.0,0.575438,0.46868,0.446076,0.484906,0.221589,0.483672,S-S,S-S,1.0-1.0,0-1.0,1.0-1.0,0-1.0,0.209653,0.186867,0.73547,0.499804,1.0-1.0,0-1.0,1.0-1.0,0-1.0,1.0-1.0,0-1.0,0.510512,0.544375,1.0-1.0,0-1.0,0.34573,0.554709,0.505211,0.505634,2.0-2.0,M-2.0,0.684467,0.371689,2.0-2.0,0-2.0,0.364039,0.573151,S-S,0.637255,0.0-0.0,S-S,0.784081,0.719782,S-S,2.0-2.0,S-S,0.738879,S-S,1.0-1.0,S-S,0.75303,2.0-2.0,0.078868,0.482712,0.499938,2.0-2.0,0.0-0.0,2.0-2.0,0.580346,S-S,2.0-2.0,S-S,S-S,0.54175,2.0-2.0,S-S,0.585123,S-S,1.0-1.0,0.75,0.25,6,0.180132,3.0,1.0,2.0,2.0,2.0,2.0,1.0,2.0,2.0,5,2,2
2,00007I,0.510677,0.521473,0.522239,0.663086,0.514788,0.5745,0.431082,0.395469,3.0-2.0,4.0-4.0,3.0-3.0,1.0-1.0,3.0-3.0,3.0-3.0,4.0-3.0,3.0-4.0,4.0-4.0,1.0-1.0,0-0,0-0,0.520566,0.508436,0.472849,0.502378,0-0,0-0,0.491935,0.502271,0.471739,0-0,D-D,0-0,0-0,0.505661,D-D,0-0,0.363529,0-0,0.491284,0-0,0-0,D-D,2-2,4.0-2.0,0.605333,1.0-1.0,0.495957,5.0-4.0,0.821356,5.0-5.0,1.0-1.0,0.366654,5.0-3.0,0-0,2.0-3.0,1-D,1.0-6,1-1,0.284087,M-M,2.0-2.0,0.495153,2.0-D,1-1.0,1.0-4.0,0.740032,0.470852,0.435161,1.0-1.0,0.909559,0-0,2.0-1.0,0.649214,2.0-3.0,0-0,2.0-1.0,D-2,0.507025,1-1.0,1.0-1.0,0.482536,1.0-2.0,2.0-4.0,1.0-1.0,1.0-1.0,5.0-4.0,0.500386,4.0-1,0.614637,2-2,1.0-1.0,0-0,5.0-4.0,1.0-1.0,0.429951,2.0-2.0,1.0-1.0,0.516193,D-D,2-D,4.0-3.0,2.0-2.0,0.402477,0.426625,S-S,S-S,M-M,M-M,0.0-S,S-0.0,0.58733,0.620834,2.0-M,M-2.0,2.0-M,M-2.0,0.508805,0.490593,0.794132,0.476815,S-S,S-S,0.399309,0.480069,S-S,S-S,M-M,M-M,0.504052,0.508796,0.492826,0.502878,1.0-0,0-1.0,0.54744,0.52993,2.0-M,M-2.0,0.498806,0.46868,0.484482,0.484906,0.226622,0.483672,S-S,S-S,1.0-0,0-1.0,1.0-0,0-1.0,0.210555,0.186867,0.624775,0.499804,1.0-0,0-1.0,2.0-0,0-2.0,1.0-0,0-1.0,0.510512,0.544375,2.0-0,0-2.0,0.498574,0.554709,0.509151,0.505634,2.0-M,M-2.0,0.684467,0.371689,2.0-0,0-2.0,0.496271,0.573151,S-S,0.501886,0.0-0.0,S-S,0.543974,0.541368,S-S,2.0-1.0,S-S,0.738589,S-S,1.0-1.0,S-S,0.606818,2.0-0,0.055333,0.482712,0.666791,2.0-2.0,0.0-0.0,2.0-2.0,0.580346,S-S,2.0-2.0,S-S,S-S,0.513992,2.0-2.0,S-S,0.418275,S-S,1.0-1.0,0.625,0.1875,5,0.147336,3.0,3.0,1.0,1.0,1.0,2.0,2.0,2.0,2.0,4,1,1
3,00008G,0.563604,0.554183,0.522239,0.663086,0.514788,0.5745,0.447863,0.33119,3.0-3.0,3.0-2.0,3.0-2.0,1.0-1.0,4.0-3.0,3.0-2.0,4.0-4.0,4.0-4.0,4.0-4.0,1.0-1.0,3.0-1.0,1.0-2.0,0.578304,0.696367,0.380532,0.64377,3.0-2.0,1.0-1.0,0.119559,0.630634,0.333303,1.0-1.0,1.0-1.0,2.0-1.0,3.0-3.0,0.829215,1.0-3.0,2.0-2.0,0.363564,3.0-3.0,0.18237,3.0-2.0,2.0-2.0,1-1.0,2-2,4.0-2.0,0.730271,2.0-1.0,0.495957,5.0-4.0,0.813747,4.0-2.0,1.0-4.0,0.33489,5.0-5.0,0-0,3.0-4.0,M-1,1.0-3,2-1,0.326489,M-M,2.0-2.0,0.498693,2.0-2,1-1.0,1.0-4.0,0.750243,0.470852,0.454917,2.0-4.0,0.453608,1.0-1.0,1.0-2.0,0.656031,3.0-4.0,0-0,2.0-2.0,D-2,0.520032,M-1.0,1.0-1.0,0.476736,1.0-1.0,1.0-4.0,4.0-4.0,1.0-1.0,4.0-2.0,0.495759,1.0-3,0.646323,2-2,1.0-1.0,0-0,5.0-4.0,1.0-1.0,0.520795,1.0-1.0,1.0-2.0,0.558365,M-1,2-2,1.0-3.0,3.0-4.0,0.402477,0.426625,S-S,S-S,2.0-M,M-2.0,0.0-S,S-0.0,0.58733,0.620834,2.0-M,M-2.0,2.0-M,M-2.0,0.508805,0.490593,0.794132,0.476815,S-S,S-S,0.399309,0.480069,S-S,S-S,2.0-M,M-2.0,0.504052,0.508796,0.492826,0.502878,1.0-0,0-1.0,0.54744,0.52993,2.0-M,M-2.0,0.498806,0.46868,0.484482,0.484906,0.226622,0.483672,S-S,S-S,1.0-0,0-1.0,1.0-0,0-1.0,0.210555,0.186867,0.624775,0.499804,1.0-0,0-1.0,2.0-0,0-2.0,1.0-0,0-1.0,0.510512,0.544375,2.0-0,0-2.0,0.498574,0.554709,0.509151,0.505634,2.0-M,M-2.0,0.684467,0.371689,2.0-0,0-2.0,0.496271,0.573151,S-S,0.705715,0.0-0.0,S-S,0.68552,0.630575,S-S,2.0-2.0,S-S,0.826112,S-S,1.0-1.0,S-S,0.730303,2.0-2.0,0.067288,0.482712,0.499814,2.0-2.0,0.0-0.0,2.0-2.0,0.580346,S-S,2.0-2.0,S-S,S-S,0.547669,2.0-2.0,S-S,0.519989,S-S,1.0-1.0,0.625,0.53125,5,0.239248,2.0,1.0,2.0,2.0,2.0,2.0,1.0,1.0,2.0,10,1,2
4,00015J,0.530073,0.483132,0.522239,0.663086,0.514788,0.5745,0.579686,0.421487,4.0-4.0,4.0-4.0,4.0-4.0,1.0-1.0,4.0-4.0,1.0-1.0,4.0-4.0,4.0-4.0,4.0-4.0,1.0-1.0,0-0,0-0,0.520566,0.508436,0.472849,0.502378,0-0,0-0,0.491935,0.502271,0.471739,0-0,D-D,0-0,0-0,0.505661,D-D,0-0,0.363529,0-0,0.491284,0-0,0-0,D-D,2-2,1.0-1.0,0.633406,1.0-2.0,0.243776,2.0-2.0,0.804967,1.0-4.0,1.0-2.0,0.41787,5.0-5.0,0-4.0,1.0-1.0,M-M,3.0-4,2-2,0.278381,M-M,2.0-1.0,0.502234,2.0-2,3-2.0,2.0-3.0,0.727959,0.188341,0.484662,1.0-1.0,0.453608,1.0-1.0,2.0-2.0,0.643141,4.0-4.0,0-0,2.0-2.0,2-2,0.507025,M-M,1.0-2.0,0.482536,1.0-1.0,2.0-3.0,3.0-2.0,1.0-1.0,3.0-3.0,0.500386,3.0-4,0.640412,2-2,1.0-1.0,0-0,3.0-4.0,2.0-2.0,0.52335,1.0-1.0,1.0-1.0,0.565346,M-M,2-2,3.0-4.0,1.0-3.0,0.412681,0.426625,S-S,S-S,2.0-2.0,M-2.0,0.0-0.0,S-0.0,0.605388,0.620834,2.0-2.0,M-2.0,2.0-2.0,M-2.0,0.525839,0.490593,0.834683,0.476815,S-S,S-S,0.311059,0.480069,S-S,S-S,1.0-1.0,M-1.0,0.493464,0.508796,0.582748,0.502878,1.0-1.0,0-1.0,0.512751,0.52993,2.0-1.0,M-2.0,0.566368,0.46868,0.667047,0.484906,0.208319,0.483672,S-S,S-S,1.0-1.0,0-1.0,1.0-1.0,0-1.0,0.210304,0.186867,0.717389,0.499804,1.0-1.0,0-1.0,2.0-1.0,0-2.0,1.0-1.0,0-1.0,0.510512,0.544375,2.0-1.0,0-2.0,0.565546,0.554709,0.514487,0.505634,2.0-2.0,M-2.0,0.684467,0.371689,2.0-2.0,0-2.0,0.569705,0.573151,S-S,0.506493,0.0-0.0,S-S,0.62464,0.564749,S-S,2.0-1.0,S-S,0.739023,S-S,1.0-1.0,S-S,0.618181,2.0-0,0.054203,0.482712,0.58324,2.0-2.0,0.0-0.0,2.0-2.0,0.580346,S-S,2.0-2.0,S-S,S-S,0.602954,2.0-2.0,S-S,0.316562,S-S,1.0-1.0,0.75,0.59375,5,0.114749,6.0,1.0,2.0,2.0,2.0,2.0,1.0,2.0,2.0,12,1,2
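
The inner merge keeps only participants that appear in both frames, and the resulting shape is consistent with the inputs: 209 + 17 - 1 = 225 columns, because the shared `PublicID` key is counted once. The sketch below shows one way to guard such a merge against duplicated keys using the `validate` argument of `DataFrame.merge`. The toy frames and their values are made up for illustration.

```python
import pandas as pd

# Made-up rows standing in for df_deltas and df_covariates
deltas = pd.DataFrame({'PublicID': ['00001U', '00004O', '00007I'],
                       'V1BA01_KG_delta_V2BA01_KG': [0.52, 0.47, 0.51]})
covariates = pd.DataFrame({'PublicID': ['00001U', '00004O', '00007I'],
                           'BMI': [0.27, 0.18, 0.15]})

# validate='one_to_one' raises pandas.errors.MergeError if either key has duplicates
base = deltas.merge(covariates, on='PublicID', how='inner', validate='one_to_one')

# An inner join on one shared key yields (left columns + right columns - 1) columns
assert base.shape == (3, deltas.shape[1] + covariates.shape[1] - 1)
print(base)
```

If a `PublicID` were repeated in either frame, the merge would fail loudly instead of silently multiplying rows.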


### Targets Datasets

**Target Dataframe**: The target variables we will predict.

In [6]:
with open('Target Pickles/targets_df.pkl', 'rb') as f:
    df_targets = pickle.load(f)

print('Rows:',    df_targets.shape[0])
print('Columns:', df_targets.shape[1])
df_targets.head()

Rows: 9289
Columns: 19


Unnamed: 0,PEgHTN,ChronHTN,CMAD01a,CMAD01b,CMAD01c,CMAD01d,CMAD01e,CMAD01f,CMAD01g,CMAD01h,CMAE04a1c,CMAE04a2c,CMAE04a3c,CMAE04a4c,CMAE04a5c,Stillbirth,Miscarriage,Termination,PublicID
0,,,,,,,,,,,,,,,,,,,00001U
1,0.0,0.0,,,,,,,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,00004O
2,0.0,0.0,,,,,,,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,00007I
3,0.0,0.0,,,,,,,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,00008G
4,0.0,0.0,,,,,,,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,00015J
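
The first row of the output above shows that a participant can be missing every target, and several target columns are empty in all five rows shown. A quick check of missingness and class balance for a single target can be sketched as follows. The toy values are made up, and `PEgHTN` is used only because it is the first target column listed.

```python
import numpy as np
import pandas as pd

# Made-up labels for illustration; the real data lives in df_targets
toy_targets = pd.DataFrame({
    'PublicID': ['a', 'b', 'c', 'd', 'e'],
    'PEgHTN':   [np.nan, 0.0, 0.0, 1.0, 0.0],
})

target = 'PEgHTN'
n_missing = int(toy_targets[target].isna().sum())    # rows without a label
labeled = toy_targets.dropna(subset=[target])         # rows usable for modeling
class_counts = labeled[target].value_counts().to_dict()

print('missing:', n_missing)
print('class counts:', class_counts)
```

The class counts indicate how imbalanced a target is before any undersampling is applied.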


### Auxiliary Datasets

#### Variables Dictionary

In [7]:
variables_df = pd.read_excel(PATH + '/nuMoM2b_Codebook_NICHD Data Challenge.xlsx',
                              sheet_name='nuMoM2b_Variables',
                              header=1,
                              usecols=['Variable Name', 'Variable Label', 'Variable Type', 'Variable Code List\n(if Coded)'],
                              engine='openpyxl')
variables_df.columns = ['Variable Name', 'Variable Label', 'Variable Type', 'Variable Code List']

variables_df.head()

Unnamed: 0,Variable Name,Variable Label,Variable Type,Variable Code List
0,PublicID,Public nuMoM2b ID,Character,
1,A02_Complete,(A02) Data entry status,Character,
2,A02_Complete_1,(A02) Data entry status,Character,
3,A02_Status,(A02) Validation status,Character,
4,A02_Status_1,(A02) Validation status,Character,
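
The codebook is used later to attach a human-readable label to each variable name. A minimal sketch of that lookup, using rows copied from the output above, is shown below. The `label_lookup` name and the use of `Series.map` are illustrative choices; the notebook itself performs the lookup with a left merge.

```python
import pandas as pd

toy_variables = pd.DataFrame({
    'Variable Name':  ['PublicID', 'A02_Complete', 'A02_Status'],
    'Variable Label': ['Public nuMoM2b ID', '(A02) Data entry status', '(A02) Validation status'],
})

# Build a name -> label lookup; drop_duplicates guards against repeated names
label_lookup = (toy_variables.drop_duplicates('Variable Name')
                             .set_index('Variable Name')['Variable Label'])

# 'NOT_IN_CODEBOOK' is a hypothetical name used to show the unmatched case
names = pd.Series(['A02_Status', 'PublicID', 'NOT_IN_CODEBOOK'])
labels = names.map(label_lookup)   # names absent from the codebook become NaN
print(labels.tolist())
```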


### Results Datasets

**Feature Importance Dictionary**: We will store the feature importances for each target variable in this dictionary.

In [8]:
feature_results_dict = {}

**Model Results Dictionary**: We will store the model results for each target variable in this dictionary.

In [9]:
model_results_dict = {}

[Back to Top](#top)

<a class="anchor" id="step-3"></a>

## Modeling

We will try the following modeling approaches.
- `Logistic Regression`: EXPLANATION FOR CHOICE
- `Random Forest`: EXPLANATION FOR CHOICE
- `Light Gradient-Boosted Model`: EXPLANATION FOR CHOICE

### Helper Functions

These will be used in the master function.

In [11]:
# The master function will use this helper function to prep the data for each target variable
def prep_and_split_data(target, df_base, df_targets):
    
    try:
        # append target feature to base dataframe
        df = df_base.merge(df_targets[['PublicID', target]], on='PublicID').drop('PublicID', axis=1)

        # drop rows missing the output feature
        # print('  Num rows before dropping:', df.shape[0])
        # print('  Num missing values:', df[target].isna().sum())
        df = df.dropna(subset=[target])
        # print('  Num rows after dropping:', df.shape[0])

        # split into X and y
        X = df.drop([target], axis = 1)
        y = df[target]

        # drop correlated features
        # print('  Num columns before dropping correlated features:', X.shape[1])
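        # correlate numeric columns only; recent pandas versions raise on string columns
        corr = X.select_dtypes(include='number').corr()
        # np.bool was removed from NumPy, so build the upper-triangle mask with the builtin bool
        upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))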
        to_drop = [column for column in upper.columns if any(upper[column] > 0.85)]
        X.drop(to_drop, axis=1, inplace=True)
        # print('  Num columns after dropping correlated features:', X.shape[1])

        # one-hot encode X
        X_dummied = pd.get_dummies(X, prefix_sep='__')
        # print('  Num columns after one-hot encoding:', X_dummied.shape[1])
        
        # balance the classes with NearMiss undersampling (the train/test split happens at the return below)
        X_dummied, y = NearMiss(version=3, n_neighbors_ver3=3).fit_resample(X_dummied, y)
        # print('  Num rows after Near-Miss undersampling:', X_dummied.shape[0])
        print('  Dataframe shape after cleaning:', str(X_dummied.shape))

        # save dataframe for access during analysis phase
        filename = 'Target Pickles/df_' + target + '.pkl'
        with open(filename, 'wb') as f:
            pickle.dump(pd.concat([X_dummied, y], axis=1), f)
        # print('  Dataframe saved:', filename)

        return train_test_split(X_dummied, y, test_size=0.3, random_state=42, stratify = y)
    
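    except Exception as e:
        # report the failure instead of silently swallowing it; the caller still receives None
        print('  Failed to prep data for target', target, '-', e)
        return None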
    
#     # if too many rows for modeling, reduce
#     if X_dummied.shape[0] > 1000:
#         print('  Num rows after further undersampling: 1000')
#         print()
#         return train_test_split(X_dummied, y, train_size=700, test_size=300, random_state=42, stratify = y)
#     else:
#         print()
#         return train_test_split(X_dummied, y, test_size=0.3, random_state=42, stratify = y)
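
As a minimal sketch of the correlated-feature step in `prep_and_split_data`, the snippet below applies the same upper-triangle rule to a toy dataframe. The helper name `drop_correlated` and the toy values are illustrative assumptions, not part of the notebook. Note that the rule compares the signed correlation against the threshold, so strongly negative correlations are kept.

```python
import numpy as np
import pandas as pd

def drop_correlated(X, threshold=0.85):
    """Drop the later column of every pair whose correlation exceeds threshold."""
    corr = X.corr()
    # keep only the strict upper triangle so each pair is inspected once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return X.drop(columns=to_drop), to_drop

# toy data (made up): 'b' is almost exactly 2 * 'a', 'c' is weakly related to both
toy = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0],
    'b': [2.0, 4.0, 6.0, 8.1],
    'c': [4.0, 1.0, 3.0, 2.0],
})
reduced, dropped = drop_correlated(toy)
print(dropped)
```

Because only the upper triangle is inspected, the first column of a correlated pair survives and the later one is dropped.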

In [12]:
# The master function will use this function to initialize the feature_results dataframe for each target variable
def initialize_feature_results(X_dummied):
    
    # initialize the feature results dataframe
    feature_results = pd.DataFrame(columns=['Feature', 'Variable1 Name', 'Variable1 Desc', 'Variable2 Name', 'Variable2 Desc',
                                            'LogR_FeatureImportance', 'RanF_FeatureImportance', 'LGBM_FeatureImportance'])
    
    # populate the "Feature" column
    feature_results['Feature'] = X_dummied.columns

    # extract the individual features from the delta columns, nan for second feature if not a delta column
    feature_split = [x.split('__')[0].split('_delta_') for x in feature_results['Feature']]
    for x in feature_split:
        if(len(x) == 1): x.append(np.nan)
    feature_results['Variable1 Name'] = [x[0] for x in feature_split]
    feature_results['Variable2 Name'] = [x[1] for x in feature_split]

    # extract the feature labels for all features
    feature_results['Variable1 Desc'] = feature_results[['Variable1 Name']].merge(variables_df[['Variable Name', 'Variable Label']], how='left',
                                                                                  left_on='Variable1 Name', right_on='Variable Name')['Variable Label']
    feature_results['Variable2 Desc'] = feature_results[['Variable2 Name']].merge(variables_df[['Variable Name', 'Variable Label']], how='left',
                                                                                  left_on='Variable2 Name', right_on='Variable Name')['Variable Label']

    
    return feature_results
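
The name parsing inside `initialize_feature_results` can be seen in isolation with the sketch below. The helper `split_feature_name` is a hypothetical name introduced here, and the `__1.0` and `Age__3` suffixes and names are made-up examples of what one-hot encoding with `prefix_sep='__'` might produce.

```python
import numpy as np

def split_feature_name(feature):
    """Return (variable1, variable2) for a possibly one-hot-encoded column name."""
    # strip the one-hot suffix that get_dummies appends after '__'
    base = feature.split('__')[0]
    # delta columns look like '<var1>_delta_<var2>'
    parts = base.split('_delta_')
    if len(parts) == 1:
        # not a delta column: there is no second variable
        parts.append(np.nan)
    return parts[0], parts[1]

v1, v2 = split_feature_name('V1CA06_delta_V3CA06__1.0')
single, missing = split_feature_name('Age__3')
print(v1, v2, single, missing)
```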

### Functions for Modeling Approaches

***
**Logistic Regression**
***

In [13]:
# create function that runs and tunes logistic regression, outputs results
def model_logisticregression_tuned(X_train, y_train, X_test, y_test):
    
    # create parameter grid to fine-tune model
    param_grid = { 'C': [0.0001, 0.001, 0.01, 0.1, 0.5, 1.0, 10.0] }
    
    # run the cross-validated grid search to identify the best parameters for the model
    CV_clf = GridSearchCV(estimator=LogisticRegression(penalty='l1', solver='liblinear', random_state=42),
                          scoring='f1', param_grid=param_grid, n_jobs=-1, verbose=0).fit(X_train, y_train)
    
    # extract the best parameters, as selected by the grid search
    best_params = CV_clf.best_params_
    best_C = best_params['C']
    
    # create the final LogisticRegression with the same penalty and solver used in the grid search
    best_clf = LogisticRegression(penalty='l1', solver='liblinear', random_state=42, C=best_C).fit(X_train, y_train)
    
    # predict on the test set
    y_pred = best_clf.predict(X_test)
    
    # Create dataframe for parameters and feature importances
    features_df = pd.DataFrame()
    features_df['Parameter'] = X_train.columns.to_list()
    features_df['Feature Importance'] = best_clf.coef_[0]
    
    # return the features dataframe and a classification report
    return [features_df, classification_report(y_test, y_pred, output_dict = True)]
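
As an alternative to rebuilding the classifier by hand, `GridSearchCV` refits the best configuration on the full training data by default (`refit=True`) and exposes it as `best_estimator_`. The sketch below shows this on synthetic data from `make_classification`. The toy dataset and the shortened `C` grid are assumptions made for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# synthetic binary classification data, for illustration only
X_toy, y_toy = make_classification(n_samples=120, n_features=10, random_state=0)

search = GridSearchCV(
    estimator=LogisticRegression(penalty='l1', solver='liblinear', random_state=42),
    param_grid={'C': [0.01, 0.1, 1.0]},
    scoring='f1',
).fit(X_toy, y_toy)

# already refit on all of X_toy with the winning C, keeping penalty and solver
best_clf = search.best_estimator_
print(search.best_params_, best_clf.coef_.shape)
```

Reusing `best_estimator_` keeps every non-tuned setting (here the L1 penalty and the `liblinear` solver) identical between the search and the final model.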

***
**Random Forest**
***

In [14]:
# create function that runs and tunes random forest, outputs results
def model_randomforest_tuned(X_train, y_train, X_test, y_test):
    
    # create parameter grid to fine-tune model
    param_grid = { 
        'n_estimators': range(100, 600, 100),
        # 'sqrt' is what 'auto' meant for classifiers; 'auto' was removed in scikit-learn 1.3
        'max_features': ['sqrt', 'log2', 0.2, 0.25, 0.33, 0.5],
        'max_depth' : [None, 4, 6, 8],
        'criterion' : ['gini', 'entropy']
    }
    
    # run the cross-validated grid search to identify the best parameters for the model
    CV_rfc = GridSearchCV(estimator=RandomForestClassifier(random_state=42), scoring='f1',
                          param_grid=param_grid, n_jobs=-1, verbose=0).fit(X_train, y_train)
    
    # extract the best parameters, as selected by the grid search
    best_params = CV_rfc.best_params_
    best_n_estimators = best_params['n_estimators']
    best_max_features = best_params['max_features']
    best_max_depth = best_params['max_depth']
    best_criterion = best_params['criterion']
    
    # create the final RandomForestClassifier
    best_rfc = RandomForestClassifier(random_state=42,
                                      max_features=best_max_features,
                                      n_estimators=best_n_estimators,
                                      max_depth=best_max_depth,
                                      criterion=best_criterion).fit(X_train, y_train)
    
    # predict on the test set
    y_pred = best_rfc.predict(X_test)
    
    # Create dataframe for parameters and feature importances
    features_df = pd.DataFrame()
    features_df['Parameter'] = X_train.columns.to_list()
    features_df['Feature Importance'] = best_rfc.feature_importances_
    
    # return the features dataframe and a classification report
    return [features_df, classification_report(y_test, y_pred, output_dict = True)]

***
**LightGBM**
***

In [15]:
# create function that runs lgbm, outputs results
def model_lgbm_tuned(X_train, y_train, X_test, y_test):
    
    # create parameter grid to fine-tune model
    param_grid = {
        'colsample_bytree': [0.8, 1.0],
        'max_depth': [15, 20, -1],
        'num_leaves': [10, 20, 31],
        'reg_alpha': [0, 0.5, 1.0],
        'reg_lambda': [0, 0.5, 1.0],
        'min_split_gain': [0, 0.2, 0.4],
        'subsample': [0.8, 1.0]
    }
    
    # run the cross-validated grid search to identify the best parameters for the model
    CV_lgb = GridSearchCV(estimator=LGBMClassifier(random_state=42), scoring='f1',
                          param_grid=param_grid, n_jobs=-1, verbose=0).fit(X_train, y_train)
    
    # extract the best parameters, as selected by the grid search
    best_params = CV_lgb.best_params_
    best_colsample_bytree = best_params['colsample_bytree']
    best_max_depth = best_params['max_depth']
    best_num_leaves = best_params['num_leaves']
    best_reg_alpha = best_params['reg_alpha']
    best_reg_lambda = best_params['reg_lambda']
    best_min_split_gain = best_params['min_split_gain']
    best_subsample = best_params['subsample']
    
    # create the final LGBMClassifier
    best_lgb = LGBMClassifier(random_state=42,
                              colsample_bytree=best_colsample_bytree,
                              max_depth=best_max_depth,
                              num_leaves=best_num_leaves,
                              reg_alpha=best_reg_alpha,
                              reg_lambda=best_reg_lambda,
                              min_split_gain=best_min_split_gain,
                              subsample=best_subsample).fit(X_train, y_train)
    
    # predict on the test set
    y_pred = best_lgb.predict(X_test)

    # create dataframe for parameters and feature importances
    features_df = pd.DataFrame()
    features_df['Parameter'] = best_lgb.feature_name_
    features_df['Feature Importance'] = best_lgb.feature_importances_

    # return the features dataframe and a classification report
    return [features_df, classification_report(y_test, y_pred, output_dict = True)]

### Master Modeling Function

In [16]:
def run_models_tuned(target, df_base, df_targets):
    try:
        print('Modeling for target =', target)

        # call the prep_and_split_data() helper function to extract the training and test sets
        X_train, X_test, y_train, y_test = prep_and_split_data(target, df_base, df_targets)

        # initialize model results dictionary and feature results dataframe for selected target variable
        model_results = {}
        feature_results = initialize_feature_results(X_train)

        # Run the logistic regression model, extract the results
        print('  Training and tuning Logistic Regression model...')
        t0 = time.time()
        logr_features, model_results['LogR'] = model_logisticregression_tuned(X_train, y_train, X_test, y_test)
        feature_results['LogR_FeatureImportance'] = feature_results[['Feature']].merge(logr_features, how='inner',
                                                                                       left_on='Feature', right_on='Parameter')['Feature Importance']
        t1 = time.time()
        print('    Done running Logistic Regression model in', str(int((t1-t0)/60)), 'mins and', str(int((t1-t0)%60)), 'secs')

        # Run the random forest model, extract the results
        print('  Training and tuning Random Forest model...')
        t0 = time.time()
        ranf_features, model_results['RanF'] = model_randomforest_tuned(X_train, y_train, X_test, y_test)
        feature_results['RanF_FeatureImportance'] = feature_results[['Feature']].merge(ranf_features, how='inner',
                                                                                       left_on='Feature', right_on='Parameter')['Feature Importance']
        t1 = time.time()
        print('    Done running Random Forest model in', str(int((t1-t0)/60)), 'mins and', str(int((t1-t0)%60)), 'secs')

        # Run the LGBM model, extract the results
        print('  Training and tuning LGBM model...')
        t0 = time.time()
        lgbm_features, model_results['LGBM'] = model_lgbm_tuned(X_train, y_train, X_test, y_test)
        feature_results['LGBM_FeatureImportance'] = feature_results[['Feature']].merge(lgbm_features, how='inner',
                                                                                       left_on='Feature', right_on='Parameter')['Feature Importance']
        t1 = time.time()
        print('    Done running LGBM model in', str(int((t1-t0)/60)), 'mins and', str(int((t1-t0)%60)), 'secs')

        # Save results back to dictionaries
        model_results_dict[target]   = model_results
        feature_results_dict[target] = feature_results
        print('Modeling successful!')
        print()
        
    except Exception as e:
        # report which error occurred so that failures can be diagnosed
        print('Modeling failed, due to error:', repr(e))
        print()

In [17]:
for target in df_targets.columns.drop('PublicID').sort_values():
    run_models_tuned(target, df_base, df_targets)

Modeling for target = CMAD01a
  Dataframe shape after cleaning: (168, 2407)
  Training and tuning Logistic Regression model...
    Done running Logistic Regression model in 0 mins and 1 secs
  Training and tuning Random Forest model...
    Done running Random Forest model in 1 mins and 58 secs
  Training and tuning LGBM model...
    Done running LGBM model in 6 mins and 23 secs
Modeling successful!

Modeling for target = CMAD01b
  Dataframe shape after cleaning: (156, 2407)
  Training and tuning Logistic Regression model...
    Done running Logistic Regression model in 0 mins and 0 secs
  Training and tuning Random Forest model...
    Done running Random Forest model in 1 mins and 55 secs
  Training and tuning LGBM model...
    Done running LGBM model in 5 mins and 41 secs
Modeling successful!

Modeling for target = CMAD01c
  Dataframe shape after cleaning: (162, 2407)
  Training and tuning Logistic Regression model...
    Done running Logistic Regression model in 0 mins and 0 secs
  T

In [18]:
with open('Results/model_results_dict.pkl', 'wb') as f:
    pickle.dump(model_results_dict, f)

In [19]:
with open('Results/feature_results_dict.pkl', 'wb') as f:
    pickle.dump(feature_results_dict, f)
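
For the analysis phase, the nested structure of `model_results_dict` (target, then model, then the `classification_report` dictionary) can be flattened into one table. The sketch below is one possible way to do that. The helper `summarize_f1` is a hypothetical name, and the toy dictionary holds placeholder numbers, not results from this notebook.

```python
import pandas as pd

def summarize_f1(model_results_dict, key='macro avg'):
    """Build a targets-by-models table of F1 scores from classification reports."""
    rows = {
        target: {model: report[key]['f1-score'] for model, report in models.items()}
        for target, models in model_results_dict.items()
    }
    return pd.DataFrame.from_dict(rows, orient='index')

# placeholder structure mimicking classification_report(..., output_dict=True)
toy_results = {
    'TARGET_A': {
        'LogR': {'macro avg': {'f1-score': 0.50}},
        'RanF': {'macro avg': {'f1-score': 0.75}},
    },
    'TARGET_B': {
        'LogR': {'macro avg': {'f1-score': 0.25}},
        'RanF': {'macro avg': {'f1-score': 1.00}},
    },
}
summary = summarize_f1(toy_results)
print(summary)
```

With the real pickle, the same function could be called on the object returned by `pickle.load`.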

[Back to Top](#top)