# Models loop automation

This notebook takes the outputs of the preprocessing pipeline (the '05_preproc_pipeline_1.ipynb' notebook) as its inputs and performs model calibration and general exploration for predicting transaction credit events.  
It uses the same models as the notebook "model_loop_benchmarks", with the same hyperparameters, so that the results can be compared.

In [1]:
import pandas as pd
import numpy as np
import pickle

from models_utils import *
from visualization_utils import *

from sklearn.linear_model import SGDClassifier, LogisticRegression
from sklearn.ensemble import RandomForestClassifier

from bokeh.io import show, output_notebook
output_notebook()

In [2]:
datafolder = "../data/preproc_traintest/"
output_path = "../data/models/"

prefixes_shuffle = ['shuffle_imp_bgt', 'shuffle_p90_bgt', 'shuffle_p180_bgt']
postfixes_shuffle = ['_190710_1539']*len(prefixes_shuffle)
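The logged output of `models_loop` suggests that each prefix/postfix pair is turned into one training file and one testing file. The sketch below reproduces that naming pattern as inferred from the log lines; `build_data_paths` is a hypothetical helper, not a function from `models_utils`, and the prefix and postfix values are the ones that appear in the logged output.

```python
def build_data_paths(datafolder, prefix, postfix):
    """Hypothetical helper: compose the train/test pickle paths.

    The pattern (prefix + '_traindata' + postfix + '.pkl') is inferred
    from the file names printed in the log, not from the source code.
    """
    train_path = f"{datafolder}{prefix}_traindata{postfix}.pkl"
    test_path = f"{datafolder}{prefix}_testdata{postfix}.pkl"
    return train_path, test_path


# Values copied from the logged output of the shuffle experiment.
train_path, test_path = build_data_paths(
    "../data/preproc_traintest/", "shuffle_imp_", "_19072_750"
)
print(train_path)
print(test_path)
```

Because the prefix already ends with an underscore, the resulting file names contain a double underscore, exactly as in the log.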

In [3]:
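# Linear model trained with stochastic gradient descent (logistic loss)
# Note: scikit-learn 1.1 renamed loss='log' to 'log_loss'; 'log' was removed in 1.3.
sgd = SGDClassifier(random_state=42, max_iter=250, loss='log', tol=0.0001)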

rf = RandomForestClassifier(random_state=42,
                            n_estimators=200,
                            max_leaf_nodes=40,
                            class_weight="balanced",
                            n_jobs=7)

models = [sgd, rf]
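For orientation, here is a minimal sketch of how a validation AUC and a normalized confusion matrix can be obtained for one such estimator with plain scikit-learn. It is not the implementation of `models_loop`: the synthetic data, the 5-fold stratified split, the 0.5 decision threshold, and the row-wise normalization are all assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic, imbalanced stand-in for the preprocessed training data (assumption).
X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.95, 0.05], random_state=42)

# Fewer trees than in the notebook, only to keep the sketch fast.
clf = RandomForestClassifier(random_state=42, n_estimators=50,
                             max_leaf_nodes=40, class_weight="balanced")

# Out-of-fold probabilities from 5-fold stratified cross-validation.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
proba = cross_val_predict(clf, X, y, cv=cv, method="predict_proba")[:, 1]
pred = (proba >= 0.5).astype(int)  # assumed decision threshold

auc = roc_auc_score(y, proba)
cm = confusion_matrix(y, pred, normalize="true")  # each row sums to 1

print(f"AUC {auc:.3f}")
print("Confusion matrix:")
print(cm)
```

Row normalization makes the diagonal read as per-class recall, which is easier to interpret than raw counts when the positive class is rare.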

## Experiment in Shuffle Mode

In [4]:
experiment_shuffle = models_loop(models, datafolder, prefixes_shuffle, postfixes_shuffle, mlf_tracking=True, save_model=True)

Training, validation and testing of experiment with prefix shuffle_imp_ and postfix _19072_750 using SGDClassifier
-Loading preprocessed data...
training files: ../data/preproc_traintest/shuffle_imp__traindata_19072_750.pkl
testing files: ../data/preproc_traintest/shuffle_imp__testdata_19072_750.pkl
- Training...
- Validation...
AUC 0.793
Confusion matrix: 
[[0.99883 0.00117]
 [0.01901 0.00244]]
- Testing...
Confusion matrix: 
[[0.99743 0.00257]
 [0.01735 0.00274]]
AUC 0.791
Tracking the experiment on mlflow...

Training, validation and testing of experiment with prefix shuffle_imp_ and postfix _19072_750 using RandomForestClassifier
-Loading preprocessed data...
training files: ../data/preproc_traintest/shuffle_imp__traindata_19072_750.pkl
testing files: ../data/preproc_traintest/shuffle_imp__testdata_19072_750.pkl
- Training...
- Validation...
AUC 0.921
Confusion matrix: 
[[0.89408 0.10592]
 [0.00439 0.01706]]
- Testing...
Confusion matrix: 
[[0.88747 0.11253]
 [0.00469 0.01541]]
AUC

In [5]:
experiment_shuffle.keys()

dict_keys(['SGDClassifier_shuffle_imp_validation', 'SGDClassifier_shuffle_imp_testing', 'RandomForestClassifier_shuffle_imp_validation', 'RandomForestClassifier_shuffle_imp_testing', 'SGDClassifier_shuffle_p90_validation', 'SGDClassifier_shuffle_p90_testing', 'RandomForestClassifier_shuffle_p90_validation', 'RandomForestClassifier_shuffle_p90_testing', 'SGDClassifier_shuffle_p180_validation', 'SGDClassifier_shuffle_p180_testing', 'RandomForestClassifier_shuffle_p180_validation', 'RandomForestClassifier_shuffle_p180_testing'])
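The keys above follow a regular pattern: model class name, then the experiment prefix, then the stage. As a sketch, assuming this pattern holds for every entry, the same list can be rebuilt as follows (the variable names are illustrative and not part of `models_utils`):

```python
# Components read off the dict_keys output above.
model_names = ["SGDClassifier", "RandomForestClassifier"]
prefixes = ["shuffle_imp_", "shuffle_p90_", "shuffle_p180_"]
stages = ["validation", "testing"]

# Prefix is the outer loop, model the middle one, stage the inner one,
# which matches the order in which the results were stored.
keys = [f"{model}_{prefix}{stage}"
        for prefix in prefixes
        for model in model_names
        for stage in stages]

for key in keys:
    print(key)
```

Building keys this way avoids typing the long strings by hand when selecting the validation and testing results of one model for plotting.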

In [6]:
rf_imp = plot_rocs([experiment_shuffle['RandomForestClassifier_shuffle_imp_validation'], experiment_shuffle['RandomForestClassifier_shuffle_imp_testing']],
                   p_width=600, p_height=600, model_appendix=['RF - 5folds','RF - test'], title_lab='Random Forest performance for Impairment')
show(rf_imp)

## Experiment in Time Mode

In [8]:
prefixes_time = ['time_2018-04-30_imp_', 'time_2018-04-30_p90_', 'time_2018-02-20_p180_']
postfixes_time = ['_190710_745']*len(prefixes_time)
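Each time-mode prefix embeds a date. Assuming that date is the cutoff separating training from testing data (the split itself is done upstream in the preprocessing notebook and is not shown here), a time-based split can be sketched as below. The column names, the toy values, and the choice to include the cutoff day in the training set are all hypothetical.

```python
import pandas as pd

# Toy transactions with a hypothetical date column.
df = pd.DataFrame({
    "invoice_date": pd.to_datetime(
        ["2018-03-15", "2018-04-30", "2018-05-01", "2018-06-10"]
    ),
    "amount": [100.0, 250.0, 80.0, 40.0],
})

cutoff = pd.Timestamp("2018-04-30")

# Everything up to and including the cutoff goes to training (assumption);
# everything after it is held out for testing.
train = df[df["invoice_date"] <= cutoff]
test = df[df["invoice_date"] > cutoff]

print(len(train), "training rows,", len(test), "testing rows")
```

Unlike a shuffled split, this keeps all test observations strictly later than the training ones, so the test metrics reflect performance on future data.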

In [9]:
experiment_time = models_loop(models, datafolder, prefixes_time, postfixes_time, mlf_tracking=True)

Training, validation and testing of experiment with prefix time_2018-04-30_imp_ and postfix _190710_745 using SGDClassifier
-Loading preprocessed data...
training files: ../data/preproc_traintest/time_2018-04-30_imp__traindata_190710_745.pkl
testing files: ../data/preproc_traintest/time_2018-04-30_imp__testdata_190710_745.pkl
- Training...
- Validation...
AUC 0.746
Confusion matrix: 
[[0.99711 0.00289]
 [0.01289 0.00275]]
- Testing...
Confusion matrix: 
[[0.99828 0.00172]
 [0.04342 0.00054]]
AUC 0.710
Tracking the experiment on mlflow...

Training, validation and testing of experiment with prefix time_2018-04-30_imp_ and postfix _190710_745 using RandomForestClassifier
-Loading preprocessed data...
training files: ../data/preproc_traintest/time_2018-04-30_imp__traindata_190710_745.pkl
testing files: ../data/preproc_traintest/time_2018-04-30_imp__testdata_190710_745.pkl
- Training...
- Validation...
AUC 0.797
Confusion matrix: 
[[0.87987 0.12013]
 [0.00769 0.00795]]
- Testing...
Confusi

In [26]:
rf_imp = plot_rocs([experiment_time['RandomForestClassifier_time_2018-04-30_imp_validation'], experiment_time['RandomForestClassifier_time_2018-04-30_imp_testing']],
                   p_width=600, p_height=600, model_appendix=['RF - 5folds','RF - test'], title_lab='Random Forest performance for Impairment')
show(rf_imp)

In [27]:
rf_p90 = plot_rocs([experiment_time['RandomForestClassifier_time_2018-04-30_p90_validation'], experiment_time['RandomForestClassifier_time_2018-04-30_p90_testing']],
                   p_width=600, p_height=600, model_appendix=['RF - 5folds','RF - test'], title_lab='Random Forest performance for Pastdue90')
show(rf_p90)

In [28]:
rf_p180 = plot_rocs([experiment_time['RandomForestClassifier_time_2018-02-20_p180_validation'], experiment_time['RandomForestClassifier_time_2018-02-20_p180_testing']],
                   p_width=600, p_height=600, model_appendix=['RF - 5folds','RF - test'], title_lab='Random Forest performance for Pastdue180')
show(rf_p180)

In [29]:
# Experiment with the first RF benchmark model
n_estimators = 200
max_leaf_nodes = 40
rf_clf = RandomForestClassifier(random_state=42,
                                n_estimators=n_estimators,
                                max_leaf_nodes=max_leaf_nodes,
                                class_weight="balanced",
                                n_jobs=7)

experiment_time = models_loop([rf_clf], datafolder, prefixes_time, postfixes_time)

Training, validation and testing of experiment with prefix time_2018-04-30_imp_ and postfix _190710_745 using RandomForestClassifier
-Loading preprocessed data...
training files: ../data/preproc_traintest/time_2018-04-30_imp__traindata_190710_745.pkl
testing files: ../data/preproc_traintest/time_2018-04-30_imp__testdata_190710_745.pkl
- Training...
- Validation...
AUC 0.797
Confusion matrix: 
[[0.87987 0.12013]
 [0.00769 0.00795]]
- Testing...
[[9463 1570]
 [  94  391]]
AUC 0.862

Training, validation and testing of experiment with prefix time_2018-04-30_p90_ and postfix _190710_745 using RandomForestClassifier
-Loading preprocessed data...
training files: ../data/preproc_traintest/time_2018-04-30_p90__traindata_190710_745.pkl
testing files: ../data/preproc_traintest/time_2018-04-30_p90__testdata_190710_745.pkl
- Training...
- Validation...
AUC 0.714
Confusion matrix: 
[[0.75318 0.24682]
 [0.03447 0.05984]]
- Testing...
[[8864 2432]
 [  53  169]]
AUC 0.844

Training, validation and tes