# Models loop automation - impairment

This notebook takes the outputs of pipeline 2 (the '11_preproc_pipeline_1.ipynb' notebook) as input and assesses model performance, along with some general exploration, for predicting the transaction credit event "impairment".  
All experiments save data for visualizing the results and can be tracked on MLflow.

In [1]:
import pandas as pd
import numpy as np
import pickle

from scripts_ml.models_utils import *
from scripts_viz.visualization_utils import *

from sklearn.linear_model import SGDClassifier, LogisticRegression
from sklearn.ensemble import RandomForestClassifier

from bokeh.io import show, output_notebook
output_notebook()
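
The runs further down print a GitPython warning ("The git executable must be specified...") around the point where MLflow tracking starts. The warning text itself suggests `GIT_PYTHON_REFRESH=quiet`. A minimal sketch of setting it from Python, assuming this runs before anything in the session imports GitPython (for example at the top of the notebook):

```python
import os

# Ask GitPython to stay quiet when it cannot find a git executable.
# Assumption: this line runs before GitPython is imported for the first time.
os.environ["GIT_PYTHON_REFRESH"] = "quiet"
```

This only hides the message; git commands will still fail until a git executable is available on the `PATH`.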

In [2]:
preproc_folder = "enriched_shuffle"
datafolder = "../data/preproc_traintest/"+preproc_folder+'/'
output_path = "../data/models/"

prefixes_shuffle = ['shuffle_imp_bg_']
postfixes_shuffle = ['_190812_1315']
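
The file names printed in the logs below suggest how the folder, prefix and postfix are combined into the train and test paths. A minimal sketch of that composition follows. `build_data_paths` is a hypothetical helper written for illustration, and the pattern is inferred from the log lines, not taken from `models_utils`:

```python
def build_data_paths(datafolder, prefix, postfix):
    """Hypothetical helper: compose train/test pickle paths as they appear in the logs."""
    train_path = f"{datafolder}{prefix}_traindata{postfix}.pkl"
    test_path = f"{datafolder}{prefix}_testdata{postfix}.pkl"
    return train_path, test_path


datafolder = "../data/preproc_traintest/enriched_shuffle/"
train_path, test_path = build_data_paths(datafolder, "shuffle_imp_bg_", "_190812_1315")
print(train_path)
print(test_path)
```

Because the prefix already ends with an underscore, the resulting names contain a double underscore (`shuffle_imp_bg__traindata...`), which matches the logged paths.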

In [3]:
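# Linear model trained with stochastic gradient descent (logistic loss).
# Note: scikit-learn 1.1+ names this loss 'log_loss'; 'log' was removed in 1.3.
sgd = SGDClassifier(random_state=42, max_iter=250, loss='log', tol=0.0001)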

rf = RandomForestClassifier(random_state=42,
                            n_estimators=200,
                            max_leaf_nodes=40,
                            class_weight="balanced",
                            n_jobs=7)

models = [sgd, rf]
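
Both classifiers are compared below by AUC (area under the ROC curve). As a reminder of what that number measures, here is a minimal pure-Python sketch. `auc_from_scores` is a hypothetical helper written for illustration; it is not the function used inside `models_loop`, whose source is not shown in this notebook:

```python
def auc_from_scores(y_true, scores):
    """AUC as the probability that a random positive is scored above a random negative.

    Ties count as half a win. Quadratic in the number of samples, so for illustration only.
    """
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = auc_from_scores(y_true, scores)
print(auc)
```

An AUC of 0.5 corresponds to a random ranking, which is a useful baseline when reading the per-fold values later in the notebook.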

## Experiment in Shuffle mode

In [8]:
expname = preproc_folder+"_imp"

In [9]:
experiment_shuffle = models_loop(models, datafolder, prefixes_shuffle, postfixes_shuffle,
                                 mlf_tracking=True, save_model=True,
                                 experiment_name=expname, save_results_for_viz=True)

----Loop 1 of 2 for credit event shuffle_imp_bg_----
Training, validation and testing of experiment with prefix shuffle_imp_bg_ and postfix _190812_1315 using SGDClassifier
-Loading preprocessed data...
training files: ../data/preproc_traintest/enriched_shuffle/shuffle_imp_bg__traindata_190812_1315.pkl
testing files: ../data/preproc_traintest/enriched_shuffle/shuffle_imp_bg__testdata_190812_1315.pkl
- Training/Validation...
AUC 0.815
- Training for test...
- Testing...
Confusion matrix: 
[[0.99947 0.00053]
 [0.0193  0.0008 ]]
AUC 0.800
- Saving the model to ../data/models/enriched_shuffle_imp/...
- Saving model to ../data/models/enriched_shuffle_imp/shuffle_imp_bg__SGDClassifier_190813_11910.pkl
- Saving dictionary to ../data/viz_data/enriched_shuffle_imp/shuffle_imp_bg__SGDClassifier_190813_11910_viz.pkl
- Activating existing experiment 'enriched_shuffle_imp', the following results will be saved in it...


The git executable must be specified in one of the following ways:
    - be included in your $PATH
    - be set via $GIT_PYTHON_GIT_EXECUTABLE
    - explicitly set via git.refresh()

All git commands will error until this is rectified.

$GIT_PYTHON_REFRESH environment variable. Use one of the following values:
    - error|e|raise|r|2: for a raised exception

Example:
    export GIT_PYTHON_REFRESH=quiet



- Tracking the experiment on mlflow...
- Experiment tracked.

----Loop 2 of 2 for credit event shuffle_imp_bg_----
Training, validation and testing of experiment with prefix shuffle_imp_bg_ and postfix _190812_1315 using RandomForestClassifier
-Loading preprocessed data...
training files: ../data/preproc_traintest/enriched_shuffle/shuffle_imp_bg__traindata_190812_1315.pkl
testing files: ../data/preproc_traintest/enriched_shuffle/shuffle_imp_bg__testdata_190812_1315.pkl
- Training/Validation...
AUC 0.969
- Training for test...
- Testing...
Confusion matrix: 
[[0.9313  0.0687 ]
 [0.00248 0.01762]]
AUC 0.968
- Saving the model to ../data/models/enriched_shuffle_imp/...
- Saving model to ../data/models/enriched_shuffle_imp/shuffle_imp_bg__RandomForestClassifier_190813_11936.pkl
- Saving dictionary to ../data/viz_data/enriched_shuffle_imp/shuffle_imp_bg__RandomForestClassifier_190813_11936_viz.pkl
- Activating existing experiment 'enriched_shuffle_imp', the following results will be saved i
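
In both confusion matrices printed above, the first row sums to 1 while the second sums to about 0.02, so the entries appear to share a common scale rather than being normalized per row. Dividing each row by its own sum gives per-class rates that are easier to compare. The sketch below assumes scikit-learn's convention (rows are true classes, columns are predicted classes) and that the second row is the positive (impaired) class; neither is confirmed by the logs:

```python
def row_normalize(matrix):
    """Divide each row by its sum so that every row adds up to 1."""
    return [[value / sum(row) for value in row] for row in matrix]


# Matrices copied from the shuffle-mode logs above.
printed = {
    "SGDClassifier": [[0.99947, 0.00053], [0.0193, 0.0008]],
    "RandomForestClassifier": [[0.9313, 0.0687], [0.00248, 0.01762]],
}

for name, matrix in printed.items():
    normalized = row_normalize(matrix)
    # Assumption: row 1 = true positives' row, column 1 = predicted positive.
    positive_recall = normalized[1][1]
    false_positive_rate = normalized[0][1]
    print(f"{name}: recall={positive_recall:.3f}, FPR={false_positive_rate:.4f}")
```

Under those assumptions the SGD model recovers roughly 4% of the positives, while the random forest recovers roughly 88% at the cost of a false positive rate near 7%.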

## Experiment in Time mode

In [11]:
preproc_folder = "enriched_time"
datafolder = "../data/preproc_traintest/"+preproc_folder+'/'
prefixes_time = ['time_2018-04-30_imp_bg_']
postfixes_time = ['_190812_1316']

expname = preproc_folder+"_imp"

In [12]:
experiment_time = models_loop(models, datafolder, prefixes_time, postfixes_time,
                              mlf_tracking=True, save_model=True,
                              experiment_name=expname, save_results_for_viz=True)

----Loop 1 of 2 for credit event time_2018-04-30_imp_bg_----
Training, validation and testing of experiment with prefix time_2018-04-30_imp_bg_ and postfix _190812_1316 using SGDClassifier
-Loading preprocessed data...
training files: ../data/preproc_traintest/enriched_time/time_2018-04-30_imp_bg__traindata_190812_1316.pkl
testing files: ../data/preproc_traintest/enriched_time/time_2018-04-30_imp_bg__testdata_190812_1316.pkl
- Training/Validation...
AUC 0.755
- Training for test...
- Testing...
Confusion matrix: 
[[0.99656 0.00344]
 [0.03725 0.00671]]
AUC 0.792
- Saving the model to ../data/models/enriched_time_imp/...
- Saving model to ../data/models/enriched_time_imp/time_2018-04-30_imp_bg__SGDClassifier_190813_111227.pkl
- Saving dictionary to ../data/viz_data/enriched_time_imp/time_2018-04-30_imp_bg__SGDClassifier_190813_111227_viz.pkl
- Activating existing experiment 'enriched_time_imp', the following results will be saved in it...
- Tracking the experiment on mlflow...
- Experime

## Experiment in Sequential Time mode, preventing time leakage

In [4]:
preproc_folder = "enriched_time_seq"
datafolder = "../data/preproc_traintest/"+preproc_folder+'/'
prefixes_time_seq = ['time_2018-04-30_imp_bg_']
valid_code = '_val_12000_3000_'
postfixes_time_seq_val = ['_190812_1612']
postfixes_time_seq = ['_190812_1547']
indexfile = '_fold_indexes'
expname = preproc_folder+"_imp"
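
The `valid_code` string `'_val_12000_3000_'` and the fold lines printed by the run below are consistent with a sliding-window validation: a training window of 12000 samples that moves forward by 3000, tested on the next 3000. The sketch below reproduces the printed index ranges. It is inferred from the log, not taken from `models_loop_time_leak`; `sliding_window_folds` is a hypothetical helper, and the initial block size of 13101 is read off the first fold line:

```python
def sliding_window_folds(n_initial, train_window, test_size, n_folds):
    """Yield (train_start, train_end, test_start, test_end), all bounds inclusive."""
    for k in range(n_folds):
        # The training range ends 3000 samples later at every fold.
        train_end = n_initial - 1 + k * test_size
        # In the log, fold 1 uses the whole initial block; later folds keep
        # only the most recent `train_window` samples.
        train_start = 0 if k == 0 else train_end - train_window + 1
        test_start = k * test_size
        test_end = test_start + test_size - 1
        yield train_start, train_end, test_start, test_end


folds = list(sliding_window_folds(n_initial=13101, train_window=12000,
                                  test_size=3000, n_folds=4))
for i, (tr0, tr1, te0, te1) in enumerate(folds, start=1):
    print(f"Fold {i}: train on {tr1 - tr0 + 1} from index {tr0} to {tr1}, "
          f"test on {te1 - te0 + 1} from {te0} to {te1}")
```

The train and test ranges are printed with separate index origins in the log, so the two sets of numbers should not be compared directly.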

In [5]:
experiment_time_seq = models_loop_time_leak(models, datafolder, prefixes_time_seq, postfixes_time_seq,
                                            valid_code, postfixes_time_seq_val, indexfile,
                                            experiment_name=expname, mlf_tracking=True,
                                            save_model=True, save_results_for_viz=True)

----Loop 1 of 2 for credit event time_2018-04-30_imp_bg_----
Training, validation and testing of experiment with prefix time_2018-04-30_imp_bg_ and postfix _190812_1547 using SGDClassifier
--------------Loading VALIDATION preprocessed data...
training files: ../data/preproc_traintest/enriched_time_seq/time_2018-04-30_imp_bg__val_12000_3000__traindata_190812_1612.pkl
testing files: ../data/preproc_traintest/enriched_time_seq/time_2018-04-30_imp_bg__val_12000_3000__testdata_190812_1612.pkl
- Training/Validation...
Fold 1: train  on 13101 from index 0 to 13100, test on 3000 from 0 to 2999
Fold 1 AUC: 0.4836247414927618
Fold 2: train  on 12000 from index 4101 to 16100, test on 3000 from 3000 to 5999
Fold 2 AUC: 0.4860205582339108
Fold 3: train  on 12000 from index 7101 to 19100, test on 3000 from 6000 to 8999
Fold 3 AUC: 0.5376261768791479
Fold 4: train  on 12000 from index 10101 to 22100, test on 3000 from 9000 to 11999
Fold 4 AUC: 0.6016965898825655
Fold 5: train  on 12000 from index 131

- Tracking the experiment on mlflow...
- Experiment tracked.

----Loop 2 of 2 for credit event time_2018-04-30_imp_bg_----
Training, validation and testing of experiment with prefix time_2018-04-30_imp_bg_ and postfix _190812_1547 using RandomForestClassifier
--------------Loading VALIDATION preprocessed data...
training files: ../data/preproc_traintest/enriched_time_seq/time_2018-04-30_imp_bg__val_12000_3000__traindata_190812_1612.pkl
testing files: ../data/preproc_traintest/enriched_time_seq/time_2018-04-30_imp_bg__val_12000_3000__testdata_190812_1612.pkl
- Training/Validation...
Fold 1: train  on 13101 from index 0 to 13100, test on 3000 from 0 to 2999
Fold 1 AUC: 0.908734724572288
Fold 2: train  on 12000 from index 4101 to 16100, test on 3000 from 3000 to 5999
Fold 2 AUC: 0.7906072307728607
Fold 3: train  on 12000 from index 7101 to 19100, test on 3000 from 6000 to 8999
Fold 3 AUC: 0.5623367803673406
Fold 4: train  on 12000 from index 10101 to 22100, test on 3000 from 9000 to 11999