# Models loop automation - pastdue180

This notebook takes as input the outputs of pipeline 2 (the '11_preproc_pipeline_1.ipynb' notebook) and assesses model performance, with some general exploration, for predicting the transaction credit event "pastdue180".  
Every experiment produces data that can be used to visualize the results, and each run can be tracked on MLflow.

In [1]:
import pandas as pd
import numpy as np
import pickle

from scripts_ml.models_utils import *
from scripts_viz.visualization_utils import *

from sklearn.linear_model import SGDClassifier, LogisticRegression
from sklearn.ensemble import RandomForestClassifier

from bokeh.io import show, output_notebook
output_notebook()

In [2]:
preproc_folder = "enriched_shuffle"
datafolder = "../data/preproc_traintest/"+preproc_folder+'/'
output_path = "../data/models/"

prefixes_shuffle = ['shuffle_p180_bg_']
postfixes_shuffle = ['_190721_1655']
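The prefixes and postfixes above are combined with `datafolder` to locate the preprocessed files. The internals of `models_loop` are not shown here, so the following is only a minimal sketch of the naming pattern, inferred from the file paths printed in the log further down; the helper name `build_data_paths` is hypothetical.

```python
def build_data_paths(datafolder, prefix, postfix):
    """Return (train_path, test_path) following the pattern seen in the logged paths.

    Hypothetical helper: the real models_loop may assemble paths differently.
    """
    train_path = f"{datafolder}{prefix}_traindata{postfix}.pkl"
    test_path = f"{datafolder}{prefix}_testdata{postfix}.pkl"
    return train_path, test_path


train_path, test_path = build_data_paths(
    "../data/preproc_traintest/enriched_shuffle/",
    "shuffle_p180_bg_",
    "_190721_1655",
)
print(train_path)
print(test_path)
```

Because the prefix ends with an underscore and the `_traindata` / `_testdata` tags start with one, the resulting file names contain a double underscore.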

In [3]:
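# Linear model trained with stochastic gradient descent; loss='log' makes it a logistic regression.
# Note: scikit-learn 1.1 renamed this loss to 'log_loss', and 1.3 removed the 'log' alias.
sgd = SGDClassifier(random_state=42, max_iter=250, loss='log', tol=0.0001)

# Random forest; class_weight="balanced" weights classes inversely to their frequency.
rf = RandomForestClassifier(random_state=42,
                            n_estimators=200,
                            max_leaf_nodes=40,
                            class_weight="balanced",
                            n_jobs=7)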

models = [sgd, rf]

## Experiment in Shuffle Mode

In [4]:
expname = preproc_folder+"_p180"

In [5]:
experiment_shuffle = models_loop(models, datafolder, prefixes_shuffle, postfixes_shuffle, mlf_tracking=True, save_model=True,
                                experiment_name=expname, save_results_for_viz=False)

----Loop 1 of 2 for credit event shuffle_p180_bg_----
Training, validation and testing of experiment with prefix shuffle_p180_bg_ and postfix _190721_1655 using SGDClassifier
-Loading preprocessed data...
training files: ../data/preproc_traintest/enriched_shuffle/shuffle_p180_bg__traindata_190721_1655.pkl
testing files: ../data/preproc_traintest/enriched_shuffle/shuffle_p180_bg__testdata_190721_1655.pkl
- Training/Validation...
AUC 0.845
- Training for test...
- Testing...
Confusion matrix: 
[[0.99305 0.00695]
 [0.05979 0.00834]]
AUC 0.849
- Saving the model to ../data/models/enriched_shuffle_p180/...
- Saving model to ../data/models/enriched_shuffle_p180/shuffle_p180_bg__SGDClassifier_190817_112820.pkl
- Creating the new experiment 'enriched_shuffle_p180',  the following results will be saved in it...
- Tracking the experiment on mlflow...
- Experiment tracked.

----Loop 2 of 2 for credit event shuffle_p180_bg_----
Training, validation and testing of experiment with prefix shuffle_p18
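
The AUC values reported in the log above summarize how well a model ranks positive cases above negative ones. How `models_loop` computes the metric is not shown; the dependency-free sketch below only illustrates the definition (the probability that a randomly chosen positive gets a higher score than a randomly chosen negative, with ties counted as one half). The function name `roc_auc` and the toy data are illustrative assumptions; in practice a library routine such as scikit-learn's `roc_auc_score` would be used.

```python
def roc_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (illustrative only, not taken from the experiments)
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(y_true, y_score)
print(f"AUC {auc:.3f}")
```

An AUC of 0.5 corresponds to random ranking, which is a useful baseline when reading the per-fold values later in this notebook.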

## Experiment in Time Mode

In [6]:
preproc_folder = "enriched_time"
datafolder = "../data/preproc_traintest/"+preproc_folder+'/'
prefixes_time = ['time_2018-02-20_p180_bg_']
postfixes_time = ['_190721_170']

expname = preproc_folder+"_p180"

In [7]:
experiment_time = models_loop(models, datafolder, prefixes_time, postfixes_time, mlf_tracking=True, save_model=True,
                                experiment_name=expname, save_results_for_viz=False)

----Loop 1 of 2 for credit event time_2018-02-20_p180_bg_----
Training, validation and testing of experiment with prefix time_2018-02-20_p180_bg_ and postfix _190721_170 using SGDClassifier
-Loading preprocessed data...
training files: ../data/preproc_traintest/enriched_time/time_2018-02-20_p180_bg__traindata_190721_170.pkl
testing files: ../data/preproc_traintest/enriched_time/time_2018-02-20_p180_bg__testdata_190721_170.pkl
- Training/Validation...
AUC 0.796
- Training for test...
- Testing...
Confusion matrix: 
[[0.95867 0.04133]
 [0.00336 0.00013]]
AUC 0.731
- Saving the model to ../data/models/enriched_time_p180/...
- Saving model to ../data/models/enriched_time_p180/time_2018-02-20_p180_bg__SGDClassifier_190817_11296.pkl
- Creating the new experiment 'enriched_time_p180',  the following results will be saved in it...
- Tracking the experiment on mlflow...
- Experiment tracked.

----Loop 2 of 2 for credit event time_2018-02-20_p180_bg_----
Training, validation and testing of exper

## Experiment in Sequential Time Mode, Preventing Time Leakage

In [8]:
preproc_folder = "enriched_time_seq"
datafolder = "../data/preproc_traintest/"+preproc_folder+'/'
prefixes_time_seq = ['time_2018-02-20_p180_bg_']
valid_code = '_val_24000_6000_'
postfixes_time_seq_val = ['_190815_713']  # postfix of the validation files
postfixes_time_seq = ['_190812_1645']
indexfile = '_fold_indexes'
# valid_code.split('_val_')[1][:-1] is '24000_6000', so expname is 'enriched_time_seq24000_6000_p180'
expname = preproc_folder+valid_code.split('_val_')[1][:-1]+"_p180"
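
The `valid_code` string appears to encode a training window of 24000 samples and a test window of 6000 samples, which matches the fold sizes printed in the log below. As a rough illustration of sequential validation without time leakage, here is a minimal sketch of sliding-window folds in which every test window lies strictly after its training window. The function `sliding_window_folds`, the total of 42000 samples, and the step of one test window are all assumptions; the actual fold boundaries come from the precomputed `_fold_indexes` file and may be laid out differently.

```python
def sliding_window_folds(n_samples, train_size, test_size):
    """Yield (train_range, test_range) pairs over time-ordered samples.

    Each test window starts right after its training window, so no
    training sample is later in time than any test sample of the same fold.
    """
    start = 0
    while start + train_size + test_size <= n_samples:
        train = range(start, start + train_size)
        test = range(start + train_size, start + train_size + test_size)
        yield train, test
        start += test_size  # slide forward by one test window


valid_code = '_val_24000_6000_'
# Parse the two window sizes out of the code string (assumed meaning).
train_size, test_size = (int(x) for x in valid_code.strip('_').split('_')[1:])

# 42000 is a hypothetical dataset size chosen only for the illustration.
folds = list(sliding_window_folds(42000, train_size, test_size))
for i, (train, test) in enumerate(folds, start=1):
    print(f"Fold {i}: train {train.start}-{train.stop - 1}, "
          f"test {test.start}-{test.stop - 1}")
```

Keeping the windows ordered in time is what prevents information from the future leaking into training, at the cost of fewer usable folds than a shuffled split.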

In [9]:
experiment_time_seq = models_loop_time_leak(models, datafolder, prefixes_time_seq, postfixes_time_seq, valid_code, postfixes_time_seq_val,
                                            indexfile, experiment_name=expname, mlf_tracking=True, save_model=True,
                                            save_results_for_viz=False)

----Loop 1 of 2 for credit event time_2018-02-20_p180_bg_----
Training, validation and testing of experiment with prefix time_2018-02-20_p180_bg_ and postfix _190812_1645 using SGDClassifier
--------------Loading VALIDATION preprocessed data...
training files: ../data/preproc_traintest/enriched_time_seq/time_2018-02-20_p180_bg__val_24000_6000__traindata_190815_713.pkl
testing files: ../data/preproc_traintest/enriched_time_seq/time_2018-02-20_p180_bg__val_24000_6000__testdata_190815_713.pkl
- Training/Validation...
Fold 1: train  on 24079 from index 0 to 24078, test on 6000 from 0 to 5999
Fold 1 AUC: 0.5097505422150359
Fold 2: train  on 24000 from index 6079 to 30078, test on 6000 from 6000 to 11999
Fold 2 AUC: 0.4991854018966892
Fold 3: train  on 24000 from index 12079 to 36078, test on 6000 from 12000 to 17999
Fold 3 AUC: 0.3994729685791138
Validation AUC 0.446
- Training for test...
---------------Loading TEST preprocessed data...
training files: ../data/preproc_traintest/enriched_ti