# Introduction to the babysaver module

This is an introduction to the `babysaver` module written by Team Babies as part of DSSG 2015 for our project with the Illinois Department of Human Services (IDHS).

First, we need to connect to our PostgreSQL server and add the location of the `babysaver` module to the system path.

In [1]:
import pandas as pd
import numpy as np
import psycopg2
from sqlalchemy import create_engine
import json

with open('/mnt/data/predicting-adverse-births/passwords/psql_psycopg2.password', 'r') as f:
    params = json.load(f)

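try:
    conn = psycopg2.connect(**params)
    conn.autocommit = True  # commit each statement immediately
    cur = conn.cursor()

except psycopg2.Error as e:
    # catch only database errors so that unrelated bugs are not hidden
    print('Unable to connect to database: %s' % e)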

with open('/mnt/data/predicting-adverse-births/passwords/psql_engine.password', 'r') as f:
    engine = create_engine(f.read())

babysaver_parent = '/mnt/data/predicting-adverse-births/babies/' # path to a local clone of the babies repo
import sys
sys.path.insert(0, babysaver_parent)
from babysaver import features
from babysaver import models
from babysaver.models import WeightedQuestions
from babysaver import evaluation
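The password files themselves are not shown here. As a minimal sketch, assuming the psycopg2 file is a flat JSON object of keyword arguments for `psycopg2.connect`, it could be read and checked before connecting. The `load_db_params` helper and the required key names below are illustrative assumptions, not part of the project:

```python
import json


def load_db_params(path, required=("dbname", "user", "password", "host")):
    """Read connection keyword arguments from a JSON file.

    Hypothetical helper: the required key names are assumptions about
    what the password file contains.
    """
    with open(path, "r") as f:
        params = json.load(f)
    # Fail early with a readable message instead of a connection error later.
    missing = [key for key in required if key not in params]
    if missing:
        raise KeyError("missing connection parameters: %s" % ", ".join(missing))
    return params
```

The returned dictionary could then be passed on as `psycopg2.connect(**load_db_params(path))`.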

## Getting the data

The next step is to gather our data. A data configuration file specifies what to extract from our database: which questions, from which assessments, over which time frame, and for which populations, along with any additional features and the outcome. The `config_writer()` function writes a dictionary of these values to a config file, so you don't have to enter them in a spreadsheet. The `data_getter()` function then reads the config file and can optionally create all two-way interaction terms and apply a basic imputation strategy (for example, filling every missing question value with 0).

In [2]:
config_add1 = {
               'Features': None, 
               'Include 707G?': 'Y', 
               '707G Questions': range(35,52), 
               '707G Start Date': '2014-07-01', 
               '707G End Date': None,
               'Include 711?': 'N', 
               '711 Questions': [], 
               '711 Start Date': None, 
               '711 End Date': None, 
               'Include FCM?': 'Y',  
               'Include BBO?': 'Y', 
               'Include other?': 'Y', 
               'Outcome': 'ADVB1_OTC'
              }

features.config_writer(config_add1, '/home/ipan/configs/config_add1.csv')
data_dct = features.data_getter('/home/ipan/configs/config_add1.csv', 
                                conn=conn, 
                                unique_identifier='UNI_PART_ID_I',  
                                impute='fill_mode',
                                interactions=False)

data_getter: there are no continuous values to standardize
data_getter: dataset has dimensions (6457, 22)
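To make the `impute` and `interactions` options more concrete, here is a minimal pure-Python sketch of mode imputation and two-way interaction terms. It assumes that `fill_mode` replaces each missing value with the most common value in its column and that an interaction term is the product of two feature columns. The `fill_mode` and `add_interactions` helpers and the toy column names are illustrative, not the `babysaver` implementation:

```python
from collections import Counter
from itertools import combinations


def fill_mode(columns):
    """Replace None in each column with that column's most common value."""
    filled = {}
    for name, values in columns.items():
        observed = [v for v in values if v is not None]
        # A column with no observed values has no mode, so it is left as-is.
        mode = Counter(observed).most_common(1)[0][0] if observed else None
        filled[name] = [mode if v is None else v for v in values]
    return filled


def add_interactions(columns):
    """Add one product column for every unordered pair of columns."""
    out = dict(columns)
    for a, b in combinations(sorted(columns), 2):
        out[a + "*" + b] = [x * y for x, y in zip(columns[a], columns[b])]
    return out


# Toy answers to two yes/no questions (1 = yes, 0 = no, None = missing).
raw = {"q35": [1, None, 1, 0], "q36": [0, 0, None, 1]}
filled = fill_mode(raw)
with_interactions = add_interactions(filled)
```

With `n` features this adds `n * (n - 1) / 2` columns, which is one reason to leave interactions off (as in the call above) unless they are needed.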


`data_getter()` returns a dictionary that includes the resulting dataframe.

It also contains the path to the config file used to generate the dataset, the list of features, the outcome, the unique identifier column, the holdout dataset (if one was specified), and the date column (deprecated).

In [4]:
data_dct.keys()

['config_file',
 'features',
 'dataframe',
 'date',
 'holdout',
 'outcome',
 'unique_id']

## Training the models

Now that we have our data, we can start training classifiers. We import classifiers from `sklearn` and define one dictionary per classifier family, holding the classifier class (`clf`) and a dictionary of hyperparameter values to try (`param_dict`).

In [5]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB, BernoulliNB
from sklearn.lda import LDA
from sklearn.qda import QDA
from sklearn.svm import SVC

logit_lib = {'clf': LogisticRegression,
             'param_dict': {'C': [1e-4, 1e-3, 0.01, 0.1, 1, 10, 1e3, 1e4, 1e20],
                            'penalty': ['l1', 'l2'],
                            'class_weight': [None, 'auto']
                           }
             }

rf_lib = {'clf': RandomForestClassifier,
          'param_dict': {'n_estimators': [100],
                         'max_depth': [None, 2, 5, 10, 20, 50],
                         'max_features': [None, 'sqrt', 'log2'],
                         'min_samples_split': [2, 5, 10],
                         'n_jobs': [-1],
                         'criterion': ['entropy', 'gini']
                        }
          }

adaboost_lib = {'clf': AdaBoostClassifier,
                'param_dict': {'n_estimators': [100],
                              'learning_rate': [0.1, 0.5, 1, 2, 5]
                              }
                }

gnb_lib = {'clf': GaussianNB,
           'param_dict': {}
          }

bnb_lib = {'clf': BernoulliNB,
           'param_dict': {}
          }

lda_lib = {'clf': LDA,
           'param_dict': {}
          }

qda_lib = {'clf': QDA,
           'param_dict': {}
          }
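Each `param_dict` describes a grid of hyperparameter values. Assuming `machine_learner()` fits one model per combination of values (the `expand_grid` helper below is a hypothetical illustration, not a `babysaver` function), the grid can be expanded like this:

```python
from itertools import product


def expand_grid(param_dict):
    """Return one keyword-argument dict per hyperparameter combination."""
    names = sorted(param_dict)
    return [dict(zip(names, values))
            for values in product(*(param_dict[name] for name in names))]


logit_grid = expand_grid({'C': [1e-4, 1e-3, 0.01, 0.1, 1, 10, 1e3, 1e4, 1e20],
                          'penalty': ['l1', 'l2'],
                          'class_weight': [None, 'auto']})
empty_grid = expand_grid({})

print(len(logit_grid))  # 36 = 9 * 2 * 2
print(empty_grid)       # [{}]: an empty param_dict still yields one default model
```

By the same arithmetic, `rf_lib` describes 1 × 6 × 3 × 3 × 1 × 2 = 108 combinations and `adaboost_lib` describes 5, so grid size is worth checking before starting a long run.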

Now we can pass a list of these libraries, along with the `data_dct` from `data_getter()`, to the `machine_learner()` function to start training our classifiers. Setting `make_evals=True` also prints evaluation sheets, at the cost of a longer runtime. Models are pickled in the directory given by `pkl_folder`, or in `./pickles/` by default. If you do not specify a folder, the function prompts you to confirm the default, so pass an explicit folder name when running from an automated script. Other cross-validation schemes are available, but `kfold_cv` is the only one that has been fully tested. The function returns two dictionaries: one of dataframes, each holding the metrics for one classifier, and one mapping each classifier to its pickle file name. See `help(models.machine_learner)` for more info.

In [6]:
eval_dct, pkl_dct = models.machine_learner(data_dct, clf_library=[logit_lib, rf_lib, adaboost_lib, 
                                                                  gnb_lib, bnb_lib, lda_lib, qda_lib],
                                           cv='kfold_cv', verbose=True, n_folds=10, pkl_folder='yay_pkls',
                                           k=[0.05, 0.1, 0.15, 0.2, 0.25, 0.3])

Running LogisticRegression(C=0.0001, class_weight=None, dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', penalty='l1', random_state=None,
          solver='liblinear', tol=0.0001, verbose=0)
...
Finished in: 0:00:00.320212

Running LogisticRegression(C=0.0001, class_weight='auto', dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', penalty='l1', random_state=None,
          solver='liblinear', tol=0.0001, verbose=0)
...
Finished in: 0:00:00.320910

Running LogisticRegression(C=0.001, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr',
          penalty='l1', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0)
...
Finished in: 0:00:00.328651

Running LogisticRegression(C=0.001, class_weight='auto', dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class=
[... output truncated ...]
  'precision', 'predicted', average, warn_for)
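With `cv='kfold_cv'` and `n_folds=10`, each metric is presumably computed once per held-out fold and then summarised across folds, which would explain why every metric is reported as a mean and a standard deviation. A minimal Python 3 sketch of that idea follows. The `kfold_indices` and `cross_validate` helpers are illustrative, and the use of the population standard deviation is an assumption:

```python
from statistics import mean, pstdev


def kfold_indices(n_samples, n_folds):
    """Yield (train_idx, test_idx) pairs that partition range(n_samples)."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


def cross_validate(score_fold, n_samples, n_folds):
    """Score every fold and summarise the scores as (mean, std)."""
    scores = [score_fold(train, test)
              for train, test in kfold_indices(n_samples, n_folds)]
    return mean(scores), pstdev(scores)


# Toy scorer: the fraction of all samples that land in the test fold.
avg, spread = cross_validate(
    lambda train, test: len(test) / (len(train) + len(test)),
    n_samples=20, n_folds=10)
```

In a real run, `score_fold` would fit a classifier on the training indices and return a metric such as ROC AUC on the test indices. This sketch does not shuffle or stratify the samples, which a production scheme normally would.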


## Evaluating the models

A dictionary of dataframes is awkward to work with when we want to compare all of the classifiers at once. We can use `dict_to_dataframe()` to create a single dataframe containing the metrics and pickle file name for each classifier.

In [7]:
evaluation_df = evaluation.dict_to_dataframe(eval_dct, pkl_dct)
evaluation_df.head()

| classifier | avg_prec_score_mean | avg_prec_score_std | roc_auc_mean | roc_auc_std | avg_prec_0.05 mean | avg_prec_0.05 std | precision at 0.05 mean | precision at 0.05 std | recall at 0.05 mean | recall at 0.05 std | ... | avg_prec_0.3 std | precision at 0.3 mean | precision at 0.3 std | recall at 0.3 mean | recall at 0.3 std | test_count at 0.3 mean | test_count at 0.3 std | test_percent at 0.3 mean | test_percent at 0.3 std | pickle_file |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| `RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini', max_depth=None, max_features=None, max_leaf_nodes=None, min_samples_leaf=1, min_samples_split=10, min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=-1, oob_score=False, random_state=None, verbose=0, warm_start=False)` | 0.227689 | 0.037991 | 0.564351 | 0.035016 | 0.042793 | 0.026549 | 0.280492 | 0.096231 | 0.092683 | 0.033848 | ... | 0.038559 | 0.215661 | 0.030034 | 0.421209 | 0.052806 | 205.5 | 16.641648 | 0.318261 | 0.025804 | yay_pkls/RandomForestClassifier126.pkl |
| `RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini', max_depth=None, max_features='log2', max_leaf_nodes=None, min_samples_leaf=1, min_samples_split=5, min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=-1, oob_score=False, random_state=None, verbose=0, warm_start=False)` | 0.228491 | 0.036778 | 0.56323 | 0.036368 | 0.046269 | 0.030128 | 0.301346 | 0.128485 | 0.08315 | 0.03422 | ... | 0.037046 | 0.21189 | 0.024902 | 0.435476 | 0.069877 | 215.5 | 24.636242 | 0.333738 | 0.038073 | yay_pkls/RandomForestClassifier102.pkl |
| `AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None, learning_rate=5, n_estimators=100, random_state=None)` | 0.349449 | 0.12106 | 0.63461 | 0.085664 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | yay_pkls/AdaBoostClassifier148.pkl |
| `LogisticRegression(C=10, class_weight='auto', dual=False, fit_intercept=True, intercept_scaling=1, max_iter=100, multi_class='ovr', penalty='l2', random_state=None, solver='liblinear', tol=0.0001, verbose=0)` | 0.269039 | 0.036457 | 0.594393 | 0.032791 | 0.071361 | 0.032236 | 0.381378 | 0.097234 | 0.120357 | 0.033885 | ... | 0.04051 | 0.226779 | 0.023373 | 0.430751 | 0.06338 | 198.6 | 16.153431 | 0.307571 | 0.025008 | yay_pkls/LogisticRegression29.pkl |
| `RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini', max_depth=50, max_features='log2', max_leaf_nodes=None, min_samples_leaf=1, min_samples_split=5, min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=-1, oob_score=False, random_state=None, verbose=0, warm_start=False)` | 0.226853 | 0.034828 | 0.563092 | 0.036765 | 0.045904 | 0.027237 | 0.306021 | 0.124561 | 0.085998 | 0.034483 | ... | 0.033072 | 0.211291 | 0.027636 | 0.435485 | 0.078397 | 215.8 | 24.956852 | 0.334204 | 0.038582 | yay_pkls/RandomForestClassifier107.pkl |


Now we can sort the dataframe by a particular metric to see which classifier did the best on that metric. For example, we might be interested in precision at 10%.

In [8]:
print(evaluation_df.columns.values)
# sort_values is the current name; pandas releases before 0.17 use DataFrame.sort
sorted_df = evaluation_df.sort_values('precision at 0.1 mean', ascending=False)
sorted_df[['precision at 0.1 mean', 'precision at 0.1 std', 'roc_auc_mean', 'roc_auc_std', 'test_percent at 0.1 mean', 'pickle_file']].head()

['avg_prec_score_mean' 'avg_prec_score_std' 'roc_auc_mean' 'roc_auc_std'
 'avg_prec_0.05 mean' 'avg_prec_0.05 std' 'precision at 0.05 mean'
 'precision at 0.05 std' 'recall at 0.05 mean' 'recall at 0.05 std'
 'test_count at 0.05 mean' 'test_count at 0.05 std'
 'test_percent at 0.05 mean' 'test_percent at 0.05 std' 'avg_prec_0.1 mean'
 'avg_prec_0.1 std' 'precision at 0.1 mean' 'precision at 0.1 std'
 'recall at 0.1 mean' 'recall at 0.1 std' 'test_count at 0.1 mean'
 'test_count at 0.1 std' 'test_percent at 0.1 mean'
 'test_percent at 0.1 std' 'avg_prec_0.15 mean' 'avg_prec_0.15 std'
 'precision at 0.15 mean' 'precision at 0.15 std' 'recall at 0.15 mean'
 'recall at 0.15 std' 'test_count at 0.15 mean' 'test_count at 0.15 std'
 'test_percent at 0.15 mean' 'test_percent at 0.15 std' 'avg_prec_0.2 mean'
 'avg_prec_0.2 std' 'precision at 0.2 mean' 'precision at 0.2 std'
 'recall at 0.2 mean' 'recall at 0.2 std' 'test_count at 0.2 mean'
 'test_count at 0.2 std' 'test_percent at 0.2 mean'
 't
[... output truncated ...]

| classifier | precision at 0.1 mean | precision at 0.1 std | roc_auc_mean | roc_auc_std | test_percent at 0.1 mean | pickle_file |
|---|---|---|---|---|---|---|
| `LogisticRegression(C=1, class_weight='auto', dual=False, fit_intercept=True, intercept_scaling=1, max_iter=100, multi_class='ovr', penalty='l1', random_state=None, solver='liblinear', tol=0.0001, verbose=0)` | 0.319557 | 0.065447 | 0.594181 | 0.032089 | 0.099116 | yay_pkls/LogisticRegression9.pkl |
| `LogisticRegression(C=1, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, max_iter=100, multi_class='ovr', penalty='l2', random_state=None, solver='liblinear', tol=0.0001, verbose=0)` | 0.318762 | 0.068244 | 0.590139 | 0.032635 | 0.099116 | yay_pkls/LogisticRegression26.pkl |
| `LogisticRegression(C=1, class_weight='auto', dual=False, fit_intercept=True, intercept_scaling=1, max_iter=100, multi_class='ovr', penalty='l2', random_state=None, solver='liblinear', tol=0.0001, verbose=0)` | 0.31787 | 0.065527 | 0.593924 | 0.033104 | 0.099581 | yay_pkls/LogisticRegression27.pkl |
| `LogisticRegression(C=1, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, max_iter=100, multi_class='ovr', penalty='l1', random_state=None, solver='liblinear', tol=0.0001, verbose=0)` | 0.316662 | 0.066351 | 0.592775 | 0.033538 | 0.099735 | yay_pkls/LogisticRegression8.pkl |
| `LogisticRegression(C=10, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, max_iter=100, multi_class='ovr', penalty='l1', random_state=None, solver='liblinear', tol=0.0001, verbose=0)` | 0.316258 | 0.068131 | 0.590239 | 0.032488 | 0.098806 | yay_pkls/LogisticRegression10.pkl |
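The "precision at k" and "recall at k" columns can be read as follows, assuming that `k` is the fraction of cases with the highest predicted scores that get flagged (a reading that fits `test_percent at 0.1 mean` being close to 0.1, but is not confirmed by the module's source here). The `precision_recall_at_k` function is an illustrative Python 3 sketch under that assumption, not the `babysaver` implementation:

```python
def precision_recall_at_k(y_true, scores, k):
    """Precision and recall when the top fraction k of scores is flagged."""
    n_flagged = max(1, int(round(k * len(scores))))
    # Rank cases from highest to lowest score and keep the top n_flagged.
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    flagged_labels = [label for _, label in ranked[:n_flagged]]
    true_positives = sum(flagged_labels)
    total_positives = sum(y_true)
    precision = true_positives / n_flagged
    recall = true_positives / total_positives if total_positives else 0.0
    return precision, recall


# Ten toy cases, three of which have the adverse outcome (label 1).
y_true = [1, 0, 1, 0, 0, 0, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
precision, recall = precision_recall_at_k(y_true, scores, k=0.3)
```

Here the top 30% is three cases, two of which are true positives, so both precision and recall come out at 2/3. This kind of metric suits settings where only a fixed share of the population can receive an intervention.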


To load the best model back into memory, we can pass its pickle file to `joblib`. This is especially useful when a model has to run in a backend. That wraps up the `babysaver` module!

In [9]:
from sklearn.externals import joblib
# the first row of sorted_df is the best model; .iloc selects it by position
best_clf = joblib.load(sorted_df['pickle_file'].iloc[0])
best_clf

LogisticRegression(C=1, class_weight='auto', dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr',
          penalty='l1', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0)