# Models loop automation - Pastdue180 tuned models

This notebook takes the outputs of the preprocessing pipeline (the '05_preproc_pipeline_1.ipynb' notebook) as its inputs. It assesses model performance and explores results for predicting the transaction credit event "is_pastdue180".  
Every experiment produces data that can be used to visualize results, and runs can be tracked on MLflow.  

In [1]:
import pandas as pd
import numpy as np
import pickle

from scripts_ml.models_utils import *
from scripts_viz.visualization_utils import *

from sklearn.linear_model import SGDClassifier, LogisticRegression
from sklearn.ensemble import RandomForestClassifier

from bokeh.io import show, output_notebook
output_notebook()



In [2]:
# Linear models trained with stochastic gradient descent (logistic loss);
# hyperparameters were semi-optimized with random search in the tuning notebooks named on each line.
# Note: loss='log' was renamed to 'log_loss' in scikit-learn 1.1 and the old name was removed in 1.3.
# Note: eta0 is not used by the 'optimal' learning-rate schedule, so it has no effect on sgd_rs_1.
sgd_rs_1 = SGDClassifier(random_state=42, max_iter=350, loss='log', learning_rate='optimal', eta0=0.001, tol=0.0001) # from benchmark_imp_shuffle_tuning
sgd_rs_2 = SGDClassifier(random_state=42, max_iter=200, loss='log', learning_rate='constant', eta0=0.001, tol=0.0001) # from benchmark_imp_time_tuning
sgd_rs_3 = SGDClassifier(random_state=42, max_iter=180, loss='log', learning_rate='adaptive', eta0=0.01, tol=0.0001) # from benchmark_p90_shuffle_tuning
sgd_rs_4 = SGDClassifier(random_state=42, max_iter=320, loss='log', learning_rate='constant', eta0=0.01, tol=0.0001) # from benchmark_p90_time_tuning
sgd_rs_5 = SGDClassifier(random_state=42, max_iter=320, loss='log', learning_rate='adaptive', eta0=0.01, tol=0.0001) # from both benchmark_p180_shuffle_tuning and benchmark_p180_time_tuning

sgd_models = [sgd_rs_1, sgd_rs_2, sgd_rs_3, sgd_rs_4, sgd_rs_5]

In [3]:
# Random forest hyperparameters found by randomized search, one set per tuning notebook.

# Settings shared by every random forest configuration
rf_common = {'bootstrap': True,
             'random_state': 42,
             'class_weight': 'balanced',
             'n_jobs': 7}

# from benchmark_imp_shuffle_tuning
rf_opt_1 = {'n_estimators': 350, 'min_samples_split': 10, 'min_samples_leaf': 1,
            'max_leaf_nodes': 80, 'max_features': 15, 'max_depth': 80, **rf_common}

# from benchmark_imp_time_tuning
rf_opt_2 = {'n_estimators': 180, 'min_samples_split': 10, 'min_samples_leaf': 1,
            'max_leaf_nodes': 25, 'max_features': 10, 'max_depth': None, **rf_common}

# from benchmark_p90_shuffle_tuning
rf_opt_3 = {'n_estimators': 280, 'min_samples_split': 10, 'min_samples_leaf': 1,
            'max_leaf_nodes': 100, 'max_features': 15, 'max_depth': 280, **rf_common}

# from benchmark_p90_time_tuning
rf_opt_4 = {'n_estimators': 300, 'min_samples_split': 5, 'min_samples_leaf': 2,
            'max_leaf_nodes': 80, 'max_features': 'sqrt', 'max_depth': None, **rf_common}

# from benchmark_p180_shuffle_tuning
# Note: max_features='auto' was deprecated in scikit-learn 1.1 and removed in 1.3;
# for classifiers it was equivalent to 'sqrt'.
rf_opt_5 = {'n_estimators': 250, 'min_samples_split': 15, 'min_samples_leaf': 1,
            'max_leaf_nodes': 100, 'max_features': 'auto', 'max_depth': 180, **rf_common}

# from benchmark_p180_time_tuning
rf_opt_6 = {'n_estimators': 180, 'min_samples_split': 5, 'min_samples_leaf': 2,
            'max_leaf_nodes': 80, 'max_features': 'sqrt', 'max_depth': 80, **rf_common}

rf_parameters = [rf_opt_1, rf_opt_2, rf_opt_3, rf_opt_4, rf_opt_5, rf_opt_6]

# One classifier per hyperparameter set
rf_models = [RandomForestClassifier(**params) for params in rf_parameters]

In [4]:
models = sgd_models+rf_models

In [5]:
prefixes_shuffle = ['shuffle_p180_']
postfixes_shuffle = ['_19072_750']
preproc_folder = "benchmarks_shuffle" # folder holding the shuffle-mode preprocessed data
expname = preproc_folder+"_opt_p180" # experiment name, also used for the model output folder

In [6]:
datafolder = "../data/preproc_traintest/"+preproc_folder+'/'
output_path = "../data/models/"
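The data files that `models_loop` loads appear to be named by joining the data folder, the prefix, a `_traindata` or `_testdata` tag, the postfix and a `.pkl` extension. This is inferred from the paths the loop prints, not from the source of `models_loop`. The sketch below reproduces that naming under this assumption; the helper names `build_data_paths` and `build_model_folder` are hypothetical and are not part of `scripts_ml.models_utils`.

```python
def build_data_paths(datafolder, prefix, postfix):
    """Return (train_path, test_path) following the naming seen in the loop output.

    Assumption: the pattern is <datafolder><prefix>_traindata<postfix>.pkl
    (and _testdata for the test set), as suggested by the printed paths.
    """
    train_path = f"{datafolder}{prefix}_traindata{postfix}.pkl"
    test_path = f"{datafolder}{prefix}_testdata{postfix}.pkl"
    return train_path, test_path


def build_model_folder(output_path, experiment_name):
    """Return the folder where models for one experiment seem to be saved (assumed layout)."""
    return f"{output_path}{experiment_name}/"


# Values taken from the shuffle-mode cells of this notebook
datafolder = "../data/preproc_traintest/benchmarks_shuffle/"
train_path, test_path = build_data_paths(datafolder, "shuffle_p180_", "_19072_750")
print(train_path)
print(test_path)
print(build_model_folder("../data/models/", "benchmarks_shuffle_opt_p180"))
```

Note the double underscore in the resulting file names: the prefix already ends with `_` and the tag starts with one.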

## Experiment in Shuffle Mode

In [7]:
experiment_shuffle = models_loop(models, datafolder, prefixes_shuffle, postfixes_shuffle,
                                 mlf_tracking=True, save_model=True,
                                 experiment_name=expname, save_results_for_viz=False)

----Loop 1 of 11 for credit event shuffle_p180_----
Training, validation and testing of experiment with prefix shuffle_p180_ and postfix _19072_750 using SGDClassifier
-Loading preprocessed data...
training files: ../data/preproc_traintest/benchmarks_shuffle/shuffle_p180__traindata_19072_750.pkl
testing files: ../data/preproc_traintest/benchmarks_shuffle/shuffle_p180__testdata_19072_750.pkl
- Training/Validation...
AUC 0.772
- Training for test...
- Testing...
Confusion matrix: 
[[0.9987  0.0013 ]
 [0.0648  0.00334]]
AUC 0.781
- Saving the model to ../data/models/benchmarks_shuffle_opt_p180/...
- Saving model to ../data/models/benchmarks_shuffle_opt_p180/shuffle_p180__SGDClassifier_190817_10186.pkl
- Creating the new experiment 'benchmarks_shuffle_opt_p180',  the following results will be saved in it...
- Tracking the experiment on mlflow...
- Experiment tracked.

----Loop 2 of 11 for credit event shuffle_p180_----
Training, validation and testing of experiment with prefix shuffle_p180

## Experiment in Time Mode

In [8]:
preproc_folder = "benchmarks_time" # folder holding the time-mode preprocessed data
expname = preproc_folder+"_opt_p180" # experiment name, also used for the model output folder
datafolder = "../data/preproc_traintest/"+preproc_folder+'/'
prefixes_time = ['time_2018-02-20_p180_']
postfixes_time = ['_190710_745']
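The time-mode run uses `timeSeqValid=True` with `train_window=12000` and `test_window=3000`. The fold boundaries printed during validation suggest a rolling window in which the first fold absorbs any leftover observations and each later fold slides forward by one test window. The sketch below reproduces that index pattern. It is a reconstruction from the log, not the actual `models_loop` code; the function name `rolling_window_folds` and the total of 21079 observations are assumptions chosen so that the first folds match the printed indices.

```python
def rolling_window_folds(n_obs, train_window, test_window):
    """Return a list of (train_start, train_end, test_start, test_end), indices inclusive.

    Assumed scheme: observations are time-ordered; leftover observations that do not
    fill a whole test window are added to the training set of the first fold.
    """
    if n_obs < train_window + test_window:
        raise ValueError("not enough observations for a single fold")

    n_folds = (n_obs - train_window) // test_window
    remainder = (n_obs - train_window) % test_window

    folds = []
    for k in range(n_folds):
        test_start = train_window + remainder + k * test_window
        test_end = test_start + test_window - 1
        # The first fold starts at index 0 and is longer by `remainder` observations;
        # later folds train on exactly `train_window` observations before the test block.
        train_start = 0 if k == 0 else test_start - train_window
        train_end = test_start - 1
        folds.append((train_start, train_end, test_start, test_end))
    return folds


# Hypothetical dataset size; only the window sizes come from the notebook
for i, (tr0, tr1, te0, te1) in enumerate(rolling_window_folds(21079, 12000, 3000)):
    print(f"Fold {i}: train on {tr1 - tr0 + 1} from index {tr0} to {tr1}, "
          f"test on {te1 - te0 + 1} from {te0} to {te1}")
```

Because every test block lies strictly after its training block, no fold is evaluated on observations older than the ones it was trained on, which is the point of validating in time order rather than shuffling.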

In [9]:
experiment_time = models_loop(models, datafolder, prefixes_time, postfixes_time,
                              timeSeqValid=True, train_window=12000, test_window=3000,
                              mlf_tracking=True, save_model=True,
                              experiment_name=expname, save_results_for_viz=False)

----Loop 1 of 11 for credit event time_2018-02-20_p180_----
Training, validation and testing of experiment with prefix time_2018-02-20_p180_ and postfix _190710_745 using SGDClassifier
-Loading preprocessed data...
training files: ../data/preproc_traintest/benchmarks_time/time_2018-02-20_p180__traindata_190710_745.pkl
testing files: ../data/preproc_traintest/benchmarks_time/time_2018-02-20_p180__testdata_190710_745.pkl
- Training/Validation...
Preparing fold 0 with 12079 train observations and 3000 test observations, starti=3079...
Fold 0: train  on 12079 from index 0 to 12078, test on 3000 from 12079 to 15078
Fold 0 AUC: 0.3604058169498856
Preparing fold 1 with 12000 train observations and 3000 test observations, starti=6079...
Fold 1: train  on 12000 from index 3079 to 15078, test on 3000 from 15079 to 18078
Fold 1 AUC: 0.6474719999999999
Preparing fold 2 with 12000 train observations and 3000 test observations, starti=9079...
Fold 2: train  on 12000 from index 6079 to 18078, test on