In [1]:
# Imports
import classes.utils as utils
from classes.splitter import Splitter
from classes.classifier_trainer import ClassifierTrainer
from classes.drift_detector import DriftDetector

from sklearn.tree import DecisionTreeClassifier

import numpy as np


utils.set_parent_directory_as_working_directory()

# TODO: Move this to a config file
# Data paths
DATA_FOLDER = "./data"

FE_DATA_PATH = DATA_FOLDER + '/fe_data.csv'
DATES_DATA_PATH = DATA_FOLDER + '/dates_data.csv'

SEED = 47


# 0 Introduction
In this notebook we develop our first model. We assume that one year of finished loans is available (finished_d = issue date + total length of the loan), which places us at 2011-05-01.

We need to use this variable rather than the issue date because, when a new loan is issued, we do not yet know whether it will be fully paid. We have to wait until the loan has finished to know its outcome.
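As a minimal sketch of how such a date could be derived, the example below adds the loan term to the issue date and keeps only the loans whose outcome is known at the observation date. The column names `issue_d` and `term_months` and the toy rows are assumptions for illustration; the real derivation lives in the preprocessing notebook.

```python
import pandas as pd

# Hypothetical toy loans; `issue_d` and `term_months` are assumed column names.
loans = pd.DataFrame({
    "issue_d": pd.to_datetime(["2007-06-01", "2008-01-01", "2008-03-01"]),
    "term_months": [36, 36, 60],
})

# finished_d = issue date + total length of the loan
loans["finished_d"] = [
    issued + pd.DateOffset(months=int(term))
    for issued, term in zip(loans["issue_d"], loans["term_months"])
]

# Only loans that finished on or before the observation date have a known outcome.
observation_date = pd.Timestamp("2011-05-01")
known_outcome = loans[loans["finished_d"] <= observation_date]
print(known_outcome)
```

The third loan is still running at the observation date, so it cannot be used for training yet.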


# 1 Splitting data
Before we get hands-on with the modelling, we need to split the data into train and test sets. As mentioned in the preprocessing notebook, we will use the created variable 'finished_d' to filter the data by date before splitting.

We will use the train set to train the model and the test set to evaluate it. The split is done with the train_test_split function from sklearn, using 70% of the data for training and 30% for testing (test_size = 0.3).
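The Splitter class is not shown in this notebook, so the following is only a sketch of the idea under stated assumptions: filter the loans that finished within the first N months, drop the date column, and then apply a 70/30 split. The function name `split_first_months` and the synthetic data are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature-engineered data (assumption).
rng = np.random.default_rng(47)
n = 200
data = pd.DataFrame({
    "feature": rng.normal(size=n),
    "loan_status": rng.integers(0, 2, size=n),
    "finished_d": pd.to_datetime("2010-06-01")
    + pd.to_timedelta(rng.integers(0, 720, size=n), unit="D"),
})


def split_first_months(df, number_of_months, test_size=0.3, random_state=47):
    """Keep loans finished in the first `number_of_months`, then split 70/30."""
    start = df["finished_d"].min()
    end = start + pd.DateOffset(months=number_of_months)
    window = df[(df["finished_d"] >= start) & (df["finished_d"] < end)]
    X = window.drop(columns=["loan_status", "finished_d"])
    y = window["loan_status"]
    return train_test_split(X, y, test_size=test_size, random_state=random_state)


X_train, X_test, y_train, y_test = split_first_months(data, number_of_months=12)
print(len(X_train), len(X_test))
```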

In [2]:
splitter_name = "splitter"

splitter = Splitter(
    name = splitter_name
    , data_path = FE_DATA_PATH
    , date_cols = []
    , target_variable = 'loan_status'
    , destination_directory = DATA_FOLDER
    , dates_data_path = DATES_DATA_PATH
    , column_to_split_by = 'finished_d'
    , test_size = 0.3
    , random_state = SEED
)

splitter.execute()


-------------- Executing splitter --------------
Data loaded from ./data/fe_data.csv
Dates data loaded from ./data/dates_data.csv
Test and train attributes defined 0.3.
        Test size: 678201
        Train size: 1582467
--------------- splitter finished ---------------


The splitter object now contains x_train, x_test, y_train and y_test for the whole series. Later we can restrict them to a time window by changing "number_of_months".

# 2 Modelling with MLOps methodology

This is where we dig deeper into the MLOps methodology and simulate what the process could look like, imagining that we start modelling once we have 1 year of finished loans (counted from the first loan that finished). The techniques we are going to use are:

- Concept drift detection with online evaluation, by training a challenger model
- Input drift and target drift detection: univariate statistical tests, and a domain classifier for multivariate detection

## 2.2 First year of data

In [3]:
# Draw 5 random candidate values for max_depth from [1, 50)
random_max_depth = np.random.randint(1, 50, 5)
experiment_name = 'first_year_model_baseline'
splitter.set_train_test_filtered(number_of_months=12)


trainer_first_year = ClassifierTrainer(
    name = 'trainer_first_year'
    , model_class = DecisionTreeClassifier()
    , random_state=SEED
    , splitter = splitter
    , objective_metric = 'roc_auc'
)

trainer_first_year.train_grid_search(
    param_distributions = {'max_depth': random_max_depth}  
)

trainer_first_year.predict()

trainer_first_year.results = trainer_first_year.evaluate(splitter.y_test, trainer_first_year.y_pred)


Date column finished_d added to the data
Data filtered by 0 and 12 months
Test and train attributes defined 0.3.
        Test size: 597
        Train size: 1393
Fitting grid search with 5 splits and 5 repeats
Best parameters: {'max_depth': 2}
Best cross-validation score: 0.59
Model trainer_first_year has made the predictions
Model trainer_first_year has made the predictions


Once the model has been trained and predictions have been made on the test set, we can inspect the metrics:

In [4]:
trainer_first_year.results

{'accuracy': 0.7688442211055276,
 'precision': 0.7828371278458844,
 'recall': 0.9696312364425163,
 'f1': 0.8662790697674418,
 'roc_auc': 0.5289332652800817}
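One thing worth checking when reading the roc_auc value above: when ROC AUC is computed from hard class predictions (as the standalone evaluate function further down in this notebook does), it reduces to balanced accuracy, whereas ROC AUC is normally computed from predicted probabilities. Whether the trainer's own evaluate method does this is not shown here. The sketch below illustrates the difference on synthetic, imbalanced data from make_classification (an assumption, not the loan data).

```python
from sklearn.datasets import make_classification
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced binary problem (roughly 20% / 80%).
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.2, 0.8], random_state=47
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=47, stratify=y
)

model = DecisionTreeClassifier(max_depth=2, random_state=47).fit(X_train, y_train)

hard_preds = model.predict(X_test)
scores = model.predict_proba(X_test)[:, 1]  # probability of class 1

auc_from_labels = roc_auc_score(y_test, hard_preds)
auc_from_scores = roc_auc_score(y_test, scores)
balanced_acc = balanced_accuracy_score(y_test, hard_preds)

print(f"ROC AUC from hard labels:   {auc_from_labels:.3f}")
print(f"Balanced accuracy:          {balanced_acc:.3f}")
print(f"ROC AUC from probabilities: {auc_from_scores:.3f}")
```

The first two numbers coincide because a hard-label ROC curve has a single operating point; the third uses the full ranking of the test samples and can differ.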

If we are happy with the results, we can now log the experiment and the model in MLflow. We will treat this model as the one in production.

First we need to start the MLflow tracking server:
```bash
mlflow server --backend-store-uri sqlite:///mydb.sqlite
```

In this case I am running it locally, [using a SQLite database as the backend store](https://mlflow.org/docs/latest/tracking.html#scenario-3-mlflow-on-localhost-with-tracking-server).

URL: http://127.0.0.1:5000

In [5]:
#trainer_first_year.run_experiment_mlflow(experiment_name = 'dt_first_year_test', log_models=True)

We have now trained a model on the first year of data. As mentioned, we want to track whether this model keeps performing well on new incoming data. In a real scenario we could check daily whether the input data is drifting and retrain a new model when it does, as well as set a time window for periodic retraining.

In this case, since we already have the whole data series, we will analyze the drift and retrain a new model every year.

## 2.3 Second year of data

Imagine that a year has passed since we trained our first model and new data is available. We want to know whether the model still performs well on it.

In [6]:
first_year_X, first_year_y = splitter.x_y_filter_by_month(from_month=0, to_month=12)
second_year_X, second_year_y = splitter.x_y_filter_by_month(from_month=12, to_month=24)


drift_detector = DriftDetector(
    name = 'drift_detector'
    , random_state=SEED
)

drift_detector.univariate_input_drift(first_year_X, second_year_X)

Date column finished_d added to the data
Data filtered by 0 and 12 months
Date column finished_d added to the data
Data filtered by 12 and 24 months
Kolmogorov-Smirnov test
Chi square test


['annual_inc',
 'fico_range_high',
 'installment',
 'int_rate',
 'revol_bal',
 'revol_util',
 'issue_d_month',
 'issue_d_year',
 'addr_state_CA',
 'purpose_debt_consolidation',
 'sub_grade_A5',
 'verification_status_Not Verified',
 'verification_status_Source Verified',
 'verification_status_Verified']
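The DriftDetector implementation is not shown here. As a rough sketch of what a univariate input drift check can look like, the function below runs a two-sample Kolmogorov-Smirnov test on numeric columns and a chi-squared test on binary (one-hot) columns, and returns the columns whose p-value falls below a threshold. The function name `univariate_drift`, the `alpha` threshold, the rule for deciding which test to use, and the synthetic data are all assumptions.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency, ks_2samp


def univariate_drift(reference, current, alpha=0.05):
    """Return the columns whose distribution differs between the two periods."""
    drifted = []
    for column in reference.columns:
        ref = reference[column].dropna()
        cur = current[column].dropna()
        if ref.nunique() <= 2 and cur.nunique() <= 2:
            # Binary column: chi-squared test on the period x value table.
            period = np.r_[np.zeros(len(ref)), np.ones(len(cur))]
            values = np.r_[ref.to_numpy(), cur.to_numpy()]
            p_value = chi2_contingency(pd.crosstab(period, values))[1]
        else:
            # Numeric column: two-sample Kolmogorov-Smirnov test.
            p_value = ks_2samp(ref, cur).pvalue
        if p_value < alpha:
            drifted.append(column)
    return drifted


# Synthetic example: two columns shift between periods, one does not.
rng = np.random.default_rng(47)
first_year = pd.DataFrame({
    "int_rate": rng.normal(12.0, 2.0, 1000),
    "stable": rng.normal(0.0, 1.0, 1000),
    "addr_state_CA": rng.binomial(1, 0.15, 1000),
})
second_year = pd.DataFrame({
    "int_rate": rng.normal(14.0, 2.0, 1000),       # mean shifted
    "stable": rng.normal(0.0, 1.0, 1000),          # same distribution
    "addr_state_CA": rng.binomial(1, 0.30, 1000),  # proportion shifted
})

drifted = univariate_drift(first_year, second_year)
print(drifted)
```

Because one test is run per column, a few columns can be flagged by chance when there are many features, so the flagged list is best read as a set of candidates to inspect.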

# Continue here!

In [7]:
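# Best parameters found by the grid search on the first year of data
best_params = {'max_depth': 2}

splitter.split_data_filtered(number_of_months=24)

# NOTE: `Trainer` is not imported in this notebook. ClassifierTrainer (imported
# in the first cell) is assumed here to expose the same interface.
trainer_second_year = ClassifierTrainer(
    name = 'trainer'
    , model_class = DecisionTreeClassifier()
    , random_state=SEED
    , splitter = splitter
)


trainer_second_year.set_model_params(best_params)

trainer_second_year.run_experiment_mlflow(
    experiment_name = 'second_year_model_baseline'
)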



Date column finished_d added to the data
Data filtered by 24 months


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  filtered_data.drop(columns=self.column_to_split_by, inplace=True)
2023/05/04 17:56:35 INFO mlflow.tracking.fluent: Experiment with name 'second_year_model_baseline' does not exist. Creating a new experiment.


Test and train attributes defined 0.3.
        Test size: 1375
        Train size: 3208
Experiment run_id=e77c94e828f341e9bd908e7e51db212d created in tracking URI=http://localhost:5000
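
The SettingWithCopyWarning in the output above points at an in-place drop on filtered_data inside the splitter. The splitter's source is not shown, so the following is only a sketch of the usual way to avoid this warning, assuming filtered_data is created by boolean-mask filtering of a larger DataFrame. The toy data and the `cutoff` variable are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the splitter's internal data.
data = pd.DataFrame({
    "finished_d": pd.to_datetime(["2010-06-01", "2011-02-01", "2012-09-01"]),
    "int_rate": [10.5, 12.1, 15.3],
})

cutoff = data["finished_d"].min() + pd.DateOffset(months=12)

# .copy() makes the filtered frame independent of `data`, so later
# modifications are unambiguous and pandas has nothing to warn about.
filtered_data = data[data["finished_d"] < cutoff].copy()

# Reassigning instead of using inplace=True keeps the intent explicit.
filtered_data = filtered_data.drop(columns="finished_d")
print(filtered_data)
```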


In [11]:
import matplotlib.pyplot as plt
from sklearn.metrics import (
    ConfusionMatrixDisplay,
    classification_report,
    confusion_matrix,
)


def plot_confusion_matrix(y, y_pred):
    """Plot the confusion matrix of the predictions."""
    cm_pred = confusion_matrix(y, y_pred)
    ConfusionMatrixDisplay(confusion_matrix=cm_pred).plot()


def get_classification_metrics(y, y_pred):
    """Plot the confusion matrix and print the classification report."""
    plot_confusion_matrix(y, y_pred)
    print(classification_report(y, y_pred))


# Predictions of the production (first-year) and challenger (second-year)
# models on the same test set
prod_model_preds = trainer_first_year.model_class.predict(splitter.X_test)
challenger_model_preds = trainer_second_year.model_class.predict(splitter.X_test)



In [12]:
get_classification_metrics(splitter.y_test, prod_model_preds)

              precision    recall  f1-score   support

           0       0.41      0.60      0.49       287
           1       0.88      0.78      0.83      1088

    accuracy                           0.74      1375
   macro avg       0.65      0.69      0.66      1375
weighted avg       0.78      0.74      0.76      1375



In [13]:
get_classification_metrics(splitter.y_test, challenger_model_preds)

              precision    recall  f1-score   support

           0       0.29      0.30      0.30       287
           1       0.81      0.81      0.81      1088

    accuracy                           0.70      1375
   macro avg       0.55      0.55      0.55      1375
weighted avg       0.71      0.70      0.71      1375



In [30]:
import pandas as pd
# Aliased so that a later variable named `metrics` does not shadow the module
from sklearn import metrics as sk_metrics


def evaluate(y_test, y_preds):
    """Evaluate the predictions and return a dictionary with the results.

    Note: roc_auc is computed from hard class predictions here,
    not from predicted probabilities.
    """
    results = {}
    results['accuracy'] = sk_metrics.accuracy_score(y_test, y_preds)
    results['precision'] = sk_metrics.precision_score(y_test, y_preds)
    results['recall'] = sk_metrics.recall_score(y_test, y_preds)
    results['f1'] = sk_metrics.f1_score(y_test, y_preds)
    results['roc_auc'] = sk_metrics.roc_auc_score(y_test, y_preds)
    return results
        

def choose_prod_challenger_model(prod, challenger, objective_metric='roc_auc'):
    """Compare the production and challenger models on the current test set.

    Returns the metrics of the better model and a flag that is 1 if the
    challenger won and 0 otherwise. Ties go to the challenger.
    Relies on the global `splitter` for the test data.
    """
    prod_model_preds = prod.predict(splitter.X_test)
    challenger_model_preds = challenger.predict(splitter.X_test)
    prod_model_metrics = evaluate(splitter.y_test, prod_model_preds)
    challenger_model_metrics = evaluate(splitter.y_test, challenger_model_preds)

    if prod_model_metrics[objective_metric] > challenger_model_metrics[objective_metric]:
        print('Prod model is better')
        return prod_model_metrics, 0
    else:
        print('Challenger model is better')
        return challenger_model_metrics, 1

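def run_whole_timeseries(params, splitter, experiment_prefix, step, random_state=47):
    """Retrain every `step` months over the whole series.

    The first trained model becomes the production model. Each later model is
    a challenger that replaces it only if it scores better on the current
    test set. Returns the recorded metrics and the list of production models.
    """
    start_date = splitter.dates_data['finished_d'].min()
    end_date = splitter.dates_data['finished_d'].max()
    metrics_by_date = {}
    models = []
    months = step

    while start_date < end_date:
        # Use all loans that finished within the first `months` months
        splitter.split_data_filtered(number_of_months=months)

        # NOTE: `Trainer` is not imported in this notebook. ClassifierTrainer
        # (imported in the first cell) is assumed to expose the same interface.
        trainer = ClassifierTrainer(
            name = 'trainer'
            , model_class = DecisionTreeClassifier()
            , random_state=random_state
            , splitter = splitter
        )

        trainer.set_model_params(params)

        trainer.train()
        if len(models) > 0:
            # Compare the current production model with the new challenger
            prod = models[-1]
            challenger = trainer.model_class
            challenger_metrics, is_challenger = choose_prod_challenger_model(prod, challenger)
            if is_challenger:
                models.append(trainer.model_class)
                metrics_by_date[experiment_prefix + str(start_date)] = challenger_metrics
        else:
            # First iteration: the first model becomes the production model
            trainer.predict()
            metrics_by_date[experiment_prefix + str(start_date)] = trainer.evaluate()
            models.append(trainer.model_class)

        months += step
        start_date = start_date + pd.DateOffset(months=step)

    return metrics_by_date, models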




In [31]:
metrics, models = run_whole_timeseries(params=best_params
                                          , splitter=splitter
                                          , experiment_prefix='test_'
                                          , step=12
                                          , random_state=47
                                          )


Date column finished_d added to the data
Data filtered by 12 months
Test and train attributes defined 0.3.
        Test size: 597
        Train size: 1393




Model trainer trained
Model trainer predicted
Date column finished_d added to the data
Data filtered by 24 months
Test and train attributes defined 0.3.
        Test size: 1375
        Train size: 3208
Model trainer trained
Prod model is better
Date column finished_d added to the data
Data filtered by 36 months
Test and train attributes defined 0.3.
        Test size: 3651
        Train size: 8517
Model trainer trained
Prod model is better
Date column finished_d added to the data
Data filtered by 48 months
Test and train attributes defined 0.3.
        Test size: 6587
        Train size: 15367
Model trainer trained
Prod model is better
Date column finished_d added to the data
Data filtered by 60 months
Test and train attributes defined 0.3.
        Test size: 13091
        Train size: 30544




Model trainer trained
Prod model is better
Date column finished_d added to the data
Data filtered by 72 months
Test and train attributes defined 0.3.
        Test size: 34277
        Train size: 79977




Model trainer trained
Prod model is better
Date column finished_d added to the data
Data filtered by 84 months
Test and train attributes defined 0.3.
        Test size: 74899
        Train size: 174762




Model trainer trained
Prod model is better
Date column finished_d added to the data
Data filtered by 96 months
Test and train attributes defined 0.3.
        Test size: 141030
        Train size: 329069




Model trainer trained
Prod model is better
Date column finished_d added to the data
Data filtered by 108 months
Test and train attributes defined 0.3.
        Test size: 254295
        Train size: 593354




Model trainer trained
Prod model is better
Date column finished_d added to the data
Data filtered by 120 months
Test and train attributes defined 0.3.
        Test size: 373390
        Train size: 871243




Model trainer trained
Prod model is better
Date column finished_d added to the data
Data filtered by 132 months
Test and train attributes defined 0.3.
        Test size: 516563
        Train size: 1205312




Model trainer trained
Prod model is better
Date column finished_d added to the data
Data filtered by 144 months
Test and train attributes defined 0.3.
        Test size: 608805
        Train size: 1420545




Model trainer trained
Prod model is better
Date column finished_d added to the data
Data filtered by 156 months
Test and train attributes defined 0.3.
        Test size: 650804
        Train size: 1518540




Model trainer trained
Prod model is better
Date column finished_d added to the data
Data filtered by 168 months
Test and train attributes defined 0.3.
        Test size: 678201
        Train size: 1582467




Model trainer trained
Prod model is better


In [35]:
metrics

{'test_2010-06-01 00:00:00': {'accuracy': 0.6968174204355109,
  'precision': 0.782258064516129,
  'recall': 0.841648590021692,
  'f1': 0.8108672936259143,
  'roc_auc': 0.5237654714814343}}

# Input drift
- Univariate tests (Kolmogorov-Smirnov and chi-squared)
- Multivariate: domain classifier
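
A domain classifier can be sketched as follows: label every row with the period it comes from, train a classifier to tell the periods apart, and read its cross-validated ROC AUC. An AUC near 0.5 suggests the two periods are hard to distinguish, while a high AUC suggests multivariate drift. The function name `domain_classifier_auc`, the choice of a random forest, and the synthetic data are assumptions, not the DriftDetector's actual implementation.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score


def domain_classifier_auc(reference, current, random_state=47):
    """Cross-validated ROC AUC of a classifier separating the two periods."""
    X = pd.concat([reference, current], ignore_index=True)
    # Domain label: 0 = reference period, 1 = current period
    domain = np.r_[
        np.zeros(len(reference), dtype=int),
        np.ones(len(current), dtype=int),
    ]
    clf = RandomForestClassifier(
        n_estimators=50, max_depth=5, random_state=random_state
    )
    scores = cross_val_score(clf, X, domain, cv=5, scoring="roc_auc")
    return scores.mean()


rng = np.random.default_rng(47)
columns = ["a", "b", "c"]
reference = pd.DataFrame(rng.normal(size=(500, 3)), columns=columns)
same = pd.DataFrame(rng.normal(size=(500, 3)), columns=columns)
shifted = pd.DataFrame(rng.normal(size=(500, 3)), columns=columns)
shifted["a"] += 2.0  # inject drift in one feature

auc_same = domain_classifier_auc(reference, same)
auc_shifted = domain_classifier_auc(reference, shifted)
print(f"No drift: AUC = {auc_same:.2f}")
print(f"Drift:    AUC = {auc_shifted:.2f}")
```

The feature importances of the fitted classifier could then hint at which columns drive the drift, which complements the univariate tests.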
