# Models loop automation - impairment tuned models

This notebook takes the outputs of pipeline 2 as input, then assesses model performance and explores the results for predicting the transaction credit event "impairment".  
Every experiment saves data for visualizing the results and can be tracked on MLflow.  
The hyperparameters used at this stage were tuned with randomized search in the notebook "13_enriched_models_imp_tuning.ipynb".
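
For readers without the tuning notebook at hand, the following is a minimal sketch of what a randomized hyperparameter search can look like with scikit-learn's `RandomizedSearchCV`. The synthetic data, the search space, and the number of iterations are assumptions made for illustration; they are not the settings used in "13_enriched_models_imp_tuning.ipynb".

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic, imbalanced stand-in for the preprocessed training data (assumption).
X, y = make_classification(n_samples=400, n_features=25,
                           weights=[0.9, 0.1], random_state=42)

# Hypothetical search space; the real one lives in the tuning notebook.
param_distributions = {
    "n_estimators": [20, 40, 60],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 4],
    "max_features": ["sqrt", 10, 20],
    "max_leaf_nodes": [10, 60],
    "bootstrap": [True, False],
}

# Sample 5 random combinations and score each by cross-validated ROC AUC.
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42, class_weight="balanced"),
    param_distributions=param_distributions,
    n_iter=5,
    scoring="roc_auc",
    cv=3,
    random_state=42,
)
search.fit(X, y)

best_params = search.best_params_
best_score = search.best_score_
print(best_params, round(best_score, 3))
```

The resulting `best_params` dictionary has the same shape as the `rs1` to `rs5` dictionaries defined later in this notebook, so it could be unpacked straight into a `RandomForestClassifier`.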

In [1]:
import pandas as pd
import numpy as np
import pickle

from models_utils import *
from visualization_utils import *

from sklearn.linear_model import SGDClassifier, LogisticRegression
from sklearn.ensemble import RandomForestClassifier

from bokeh.io import show, output_notebook
output_notebook()

In [2]:
# Linear models trained with stochastic gradient descent; hyperparameters come from random search.
# Note: loss='log' (logistic regression) is spelled 'log_loss' from scikit-learn 1.1 onwards.
sgd_rs = SGDClassifier(random_state=42, max_iter=300, loss='log', learning_rate='adaptive', eta0=0.01, tol=0.0001)

# With learning_rate='optimal', scikit-learn does not use eta0.
sgd_rs_2 = SGDClassifier(random_state=42, max_iter=350, loss='log', learning_rate='optimal', eta0=1e-05, tol=0.0001)

# Same settings as sgd_rs except for max_iter.
sgd_rs_3 = SGDClassifier(random_state=42, max_iter=250, loss='log', learning_rate='adaptive', eta0=0.01, tol=0.0001)

In [3]:
#random forest models parameters from randomized grid search

#first randomized search
rs1 = {'n_estimators': 280,
 'min_samples_split': 5,
 'min_samples_leaf': 1,
 'max_features': 'sqrt',
 'max_leaf_nodes': 60,
 'max_depth': 100,
 'bootstrap': True}

#second randomized search 
rs2 = {'n_estimators': 150,
 'min_samples_split': 2,
 'min_samples_leaf': 4,
 'max_leaf_nodes': 60,
 'max_features': 20,
 'max_depth': 100,
 'bootstrap': False}

#third randomized search 
rs3 = {'n_estimators': 280,
 'min_samples_split': 10,
 'min_samples_leaf': 1,
 'max_leaf_nodes': 60,
 'max_features': 20,
 'max_depth': 200,
 'bootstrap': True}

#fourth randomized search_imp_time
rs4 = {'n_estimators': 200,
 'min_samples_split': 2,
 'min_samples_leaf': 4,
 'max_leaf_nodes': 10,
 'max_features': 'sqrt',
 'max_depth': 100,
 'bootstrap': True}

#fifth randomized search_imp_time bgt
rs5 = {'n_estimators': 280,
 'min_samples_split': 5,
 'min_samples_leaf': 1,
 'max_leaf_nodes': 10,
 'max_features': 10,
 'max_depth': 100,
 'bootstrap': False}

# Settings shared by every random forest; each rsN dictionary supplies the tuned parameters.
rf_common = {'random_state': 42, 'class_weight': "balanced", 'n_jobs': 7}

rf_rs_1 = RandomForestClassifier(**rf_common, **rs1)
rf_rs_2 = RandomForestClassifier(**rf_common, **rs2)
rf_rs_3 = RandomForestClassifier(**rf_common, **rs3)
rf_rs_4 = RandomForestClassifier(**rf_common, **rs4)
rf_rs_5 = RandomForestClassifier(**rf_common, **rs5)

In [4]:
models = [sgd_rs, sgd_rs_2, sgd_rs_3, rf_rs_1, rf_rs_2, rf_rs_3, rf_rs_4, rf_rs_5]
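
The list above is handed to `models_loop`, which is defined in `models_utils` and not shown in this notebook. As a rough guide to the train, validate, and test steps that its log reports, here is a minimal sketch. The function `evaluate_models`, the synthetic data, and the choice of `normalize="true"` for the confusion matrix are all assumptions; the real helper also handles file loading, model saving, and MLflow tracking, and may normalize differently.

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split


def evaluate_models(models, X_train, y_train, X_test, y_test, val_size=0.2, seed=42):
    """Hypothetical stand-in for models_loop: fit each model, score validation and test sets."""
    # Hold out part of the training data for validation.
    X_fit, X_val, y_fit, y_val = train_test_split(
        X_train, y_train, test_size=val_size, stratify=y_train, random_state=seed)
    results = []
    for model in models:
        fitted = clone(model).fit(X_fit, y_fit)
        val_auc = roc_auc_score(y_val, fitted.predict_proba(X_val)[:, 1])
        test_auc = roc_auc_score(y_test, fitted.predict_proba(X_test)[:, 1])
        # Row-normalized confusion matrix (an assumption about the normalization).
        cm = confusion_matrix(y_test, fitted.predict(X_test), normalize="true")
        results.append({"model": type(fitted).__name__, "val_auc": val_auc,
                        "test_auc": test_auc, "confusion_matrix": cm})
    return results


# Synthetic, imbalanced data in place of the preprocessed train/test files.
X, y = make_classification(n_samples=600, n_features=20,
                           weights=[0.9, 0.1], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          stratify=y, random_state=42)

demo_models = [LogisticRegression(max_iter=1000),
               RandomForestClassifier(n_estimators=50, random_state=42)]
results = evaluate_models(demo_models, X_tr, y_tr, X_te, y_te)
for r in results:
    print(r["model"], round(r["val_auc"], 3), round(r["test_auc"], 3))
```

Cloning each estimator before fitting keeps the objects in the input list untouched, which matters when the same list is reused across experiments, as happens with `models` in this notebook.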

In [5]:
prefixes_shuffle = ['shuffle_imp_bg_']
postfixes_shuffle = ['_190721_1655']
preproc_folder = "enriched_shuffle" #folder with the correct preprocessing data
expname = preproc_folder+"_opt_imp" #experiment name

In [6]:
datafolder = "../data/preproc_traintest/"+preproc_folder+'/'
output_path = "../data/models/"
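
The log printed by the shuffle run further down suggests how the input file names are assembled from `datafolder`, a prefix, and a postfix. The sketch below reproduces that pattern; `traintest_paths` is a hypothetical helper written from the logged paths, not the code inside `models_utils`.

```python
def traintest_paths(datafolder, prefix, postfix):
    """Build the train/test pickle paths following the pattern seen in the run log (assumption)."""
    train_path = datafolder + prefix + "_traindata" + postfix + ".pkl"
    test_path = datafolder + prefix + "_testdata" + postfix + ".pkl"
    return train_path, test_path


train_path, test_path = traintest_paths(
    "../data/preproc_traintest/enriched_shuffle/", "shuffle_imp_bg_", "_190721_1655")
print(train_path)
print(test_path)
```

Because the prefix already ends with an underscore, the resulting names contain a double underscore before `traindata` and `testdata`, which matches the paths in the log.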

## Experiment in Shuffle Mode

In [7]:
experiment_shuffle = models_loop(models, datafolder, prefixes_shuffle, postfixes_shuffle, mlf_tracking=True, save_model=True,
                                experiment_name=expname, save_results_for_viz=True)

Training, validation and testing of experiment with prefix shuffle_imp_bg_ and postfix _190721_1655 using SGDClassifier
-Loading preprocessed data...
training files: ../data/preproc_traintest/enriched_shuffle/shuffle_imp_bg__traindata_190721_1655.pkl
testing files: ../data/preproc_traintest/enriched_shuffle/shuffle_imp_bg__testdata_190721_1655.pkl
- Training...
- Validation...
AUC 0.821
- Testing...
Confusion matrix: 
[[0.99867 0.00133]
 [0.01788 0.00221]]
AUC 0.813
- Saving the model to ../data/models/enriched_shuffle_opt_imp/...
Saving model to ../data/models/enriched_shuffle_opt_imp/shuffle_imp_bg__SGDClassifier_190727_158.pkl
Saving dictionary to ../data/viz_data/enriched_shuffle_opt_imp/shuffle_imp_bg__SGDClassifier_190727_158_viz.pkl
- Creating the new experiment 'enriched_shuffle_opt_imp',  the following results will be saved in it...
- Tracking the experiment on mlflow...

Training, validation and testing of experiment with prefix shuffle_imp_bg_ and postfix _190721_1655 using 

## Experiment in Time Mode

In [8]:
preproc_folder = "enriched_time" #folder with the correct preprocessing data
expname = preproc_folder+"_opt_imp_bg_" #experiment name
datafolder = "../data/preproc_traintest/"+preproc_folder+'/'
prefixes_time = ['time_2018-04-30_imp_bg_']
postfixes_time = ['_190721_170']
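
The preprocessing that produced the "shuffle" and "time" datasets is not part of this notebook. As an illustration of the difference between the two modes, the sketch below contrasts a random split with a split at a cutoff date. Reading `2018-04-30` in the prefix as that cutoff is an assumption, and the `date` and `amount` columns are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy transactions, one per month; column names are hypothetical.
df = pd.DataFrame({
    "date": pd.date_range("2018-01-01", periods=10, freq="MS"),
    "amount": range(10),
})

# Time mode: train on everything up to the cutoff, test on what comes after.
cutoff = pd.Timestamp("2018-04-30")  # assumed meaning of the 'time_2018-04-30' prefix
time_train = df[df["date"] <= cutoff]
time_test = df[df["date"] > cutoff]

# Shuffle mode: rows are assigned to train and test at random, ignoring dates.
shuffle_train, shuffle_test = train_test_split(df, test_size=0.3, random_state=42)

print(len(time_train), len(time_test), len(shuffle_train), len(shuffle_test))
```

A time-based split never lets the model see transactions dated after the test period begins, so it is usually the stricter check of how a model would behave on future data.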

In [9]:
experiment_time = models_loop(models, datafolder, prefixes_time, postfixes_time, mlf_tracking=True, save_model=True,
                                experiment_name=expname, save_results_for_viz=True)

Training, validation and testing of experiment with prefix time_2018-04-30_imp_bg_ and postfix _190721_170 using SGDClassifier
-Loading preprocessed data...
training files: ../data/preproc_traintest/enriched_time/time_2018-04-30_imp_bg__traindata_190721_170.pkl
testing files: ../data/preproc_traintest/enriched_time/time_2018-04-30_imp_bg__testdata_190721_170.pkl
- Training...
- Validation...
AUC 0.783
- Testing...
Confusion matrix: 
[[0.99665 0.00335]
 [0.03725 0.00671]]
AUC 0.805
- Saving the model to ../data/models/enriched_time_opt_imp_bg_/...
Saving model to ../data/models/enriched_time_opt_imp_bg_/time_2018-04-30_imp_bg__SGDClassifier_190727_1513.pkl
Saving dictionary to ../data/viz_data/enriched_time_opt_imp_bg_/time_2018-04-30_imp_bg__SGDClassifier_190727_1513_viz.pkl
- Creating the new experiment 'enriched_time_opt_imp_bg_',  the following results will be saved in it...
- Tracking the experiment on mlflow...

Training, validation and testing of experiment with prefix time_2018-