# Predicting Loan Default 3 - Models

In this notebook we perform model fitting, evaluation and selection. 

## Packages

In [1]:
## data handling 
import numpy as np 
import pandas as pd
import polars as pl 
import polars.selectors as cs

## tuning
import optuna 

## visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly
import plotly.express as px

## sklearn models
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler

## pickle 
from pickle import dump, load

## get file path of the data
from private import FINAL_FILE_PATH

## Data

In [2]:
## load file
loans_df = pl.read_csv(FINAL_FILE_PATH, ignore_errors=True)

## Models

We now split the data into training and test sets and fit our ML models.

### Splitting

In [3]:
X = loans_df.to_pandas()
y = X.pop("default")

X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    shuffle=True,
                                                    stratify=y,
                                                    test_size=0.2)
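
The split above does not fix a seed, so each run produces a different partition. The following is a minimal sketch, on synthetic data rather than the loans file, of how passing `random_state` makes the split repeatable and how `stratify` keeps the class rate similar in both parts. The seed value `42`, the `split` helper and the demo column names are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the loans data: 1,000 rows, roughly 15% positives.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({"feature": rng.normal(size=1000)})
y_demo = pd.Series((rng.random(1000) < 0.15).astype(int), name="default")


def split(seed):
    """Stratified 80/20 split with a fixed seed (hypothetical helper)."""
    return train_test_split(X_demo, y_demo,
                            shuffle=True,
                            stratify=y_demo,
                            test_size=0.2,
                            random_state=seed)


X_tr, X_te, y_tr, y_te = split(42)
X_tr2, X_te2, y_tr2, y_te2 = split(42)

# Same seed, same rows in the test set.
same_split = X_te.index.equals(X_te2.index)
# Stratification keeps the positive rate close in train and test.
rate_gap = abs(y_tr.mean() - y_te.mean())

print(f"identical test rows: {same_split}")
print(f"positive-rate gap between train and test: {rate_gap:.4f}")
```

A fixed seed mainly helps when comparing later runs of the notebook, since changes in scores can then be attributed to the models rather than to a different partition.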

### Normalization 

We now standardize the numeric columns: each scaler is fit on the training set only and then applied, unchanged, to the test set.

In [4]:
scaling_cols = ["loan_amnt", "int_rate", "installment", "dti",
                "fico_range_low", "fico_range_high", "open_acc",
                "pub_rec", "revol_util", "total_acc", "last_pymnt_amnt",
                "acc_open_past_24mths", "avg_cur_bal", "bc_open_to_buy",
                "bc_util", "mo_sin_old_rev_tl_op", "num_actv_rev_tl",
                "log_annual_inc", "log_revol_bal"]

scalers = {}
for col in scaling_cols:
    scaler = StandardScaler()
    X_train[col] = scaler.fit_transform(X_train[[col]])
    X_test[col] = scaler.transform(X_test[[col]])
    scalers[col] = scaler
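
One caveat: the scalers above are fit on the whole training set, so in any later cross-validation each validation fold has already contributed to the scaling statistics. For standardization the effect is usually small, but one way to avoid it is to place the scaler inside a `Pipeline`, so that it is refit on the training portion of every fold. The sketch below assumes synthetic data with hypothetical columns `a`, `b` and `flag`; it is not the notebook's actual preprocessing.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: two numeric columns to scale and one binary flag to leave alone.
rng = np.random.default_rng(0)
n = 600
X_demo = pd.DataFrame({
    "a": rng.normal(50, 10, n),
    "b": rng.normal(0, 100, n),
    "flag": rng.integers(0, 2, n),
})
y_demo = (X_demo["a"] + rng.normal(0, 10, n) > 55).astype(int)

# Scale only the listed columns; pass the remaining columns through untouched.
preprocess = ColumnTransformer(
    [("scale", StandardScaler(), ["a", "b"])],
    remainder="passthrough",
)
pipe = Pipeline([
    ("prep", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

# cross_val_score clones and refits the whole pipeline on each training fold,
# so the scaler never sees the corresponding validation fold.
scores = cross_val_score(pipe, X_demo, y_demo,
                         cv=StratifiedKFold(n_splits=3),
                         scoring="accuracy")
print(scores.round(4))
```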

### Model Evaluation 

Because the `default` target is imbalanced, we must choose our metric carefully. False negatives (borrowers who default but were predicted not to) are much more costly to investors than false positives (borrowers who repay but were predicted to default), so we will use recall as a metric, along with the area under the ROC curve.
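
To illustrate why accuracy alone can mislead on imbalanced data, the toy example below scores a classifier that never predicts a default. The 90/10 class split is an illustrative assumption, not the default rate of the loans data.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score, roc_auc_score

# Toy labels: 90 repaid loans (0) and 10 defaults (1).
y_true = np.array([0] * 90 + [1] * 10)

# A "model" that always predicts no default, with a constant score.
y_pred = np.zeros_like(y_true)
y_score = np.zeros(len(y_true), dtype=float)

acc = accuracy_score(y_true, y_pred)   # 0.9: looks good
rec = recall_score(y_true, y_pred)     # 0.0: every default is missed
auc = roc_auc_score(y_true, y_score)   # 0.5: no ranking ability

print(f"accuracy={acc}, recall={rec}, roc_auc={auc}")
```

Recall is the share of actual defaults that the model catches, TP / (TP + FN), so it penalizes exactly the false negatives described above, while ROC AUC measures how well the predicted scores rank defaults above non-defaults across all thresholds.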

### Model Selection 

We run 3-fold stratified cross-validation on a range of models with mostly default parameters, scored by accuracy as a first pass, to gauge which model families perform best. We then select the best-performing model for hyperparameter tuning.

In [5]:
models = []

## baseline models 
models.append(('LR', LogisticRegression(max_iter=10000)))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
## ensemble models 
models.append(('RF', RandomForestClassifier()))
## boosting models 
models.append(('GBM', GradientBoostingClassifier()))
models.append(('AB', AdaBoostClassifier(algorithm="SAMME")))
## neural networks
models.append(('NN', MLPClassifier()))

In [6]:
results = []
names = []

## the same stratified 3-fold splitter is reused for every model
kfold = StratifiedKFold(n_splits=3)

for name, model in models:
    cv_results = cross_val_score(model, X_train, y_train, cv=kfold, scoring='accuracy')
    results.append(cv_results)
    names.append(name)
    print(f"{name}: {round(cv_results.mean(), 4)} ({round(cv_results.std(), 6)})")

LR: 0.8733 (0.000579)
KNN: 0.8296 (0.000339)
CART: 0.8751 (0.000733)
NB: 0.8225 (0.000555)
RF: 0.8955 (0.000301)
GBM: 0.8955 (0.000942)
AB: 0.8855 (0.000184)
NN: 0.8946 (0.000805)
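
The loop above scores each model by accuracy, whereas the evaluation section singles out recall and ROC AUC. A sketch of how both could be collected in a single pass with `cross_validate` follows. It uses synthetic data from `make_classification` and only two models, so the numbers it prints say nothing about the loan data; the names `demo_models` and `metric_scores` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced problem: roughly 15% positives.
X_demo, y_demo = make_classification(n_samples=1000, n_features=10,
                                     weights=[0.85, 0.15], random_state=0)

demo_models = [
    ("LR", LogisticRegression(max_iter=1000)),
    ("CART", DecisionTreeClassifier(random_state=0)),
]

kfold = StratifiedKFold(n_splits=3)
metric_scores = {}
for name, model in demo_models:
    # A list of scorer names yields one "test_<metric>" array per metric.
    cv = cross_validate(model, X_demo, y_demo, cv=kfold,
                        scoring=["recall", "roc_auc"])
    metric_scores[name] = {"recall": cv["test_recall"],
                           "roc_auc": cv["test_roc_auc"]}
    print(f"{name}: recall={cv['test_recall'].mean():.4f}, "
          f"roc_auc={cv['test_roc_auc'].mean():.4f}")
```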


Save the results so they can be loaded later. 

In [7]:
cv_scores = dict(zip(names, results))

## use a context manager so the file handle is closed after writing
with open("../models/cv_scores.p", "wb") as f:
    dump(cv_scores, f)

In [8]:
## use a context manager so the file handle is closed after reading
with open("../models/cv_scores.p", "rb") as f:
    cv_scores = load(f)

We plot the results for each model below: 

In [11]:
cv_scores

{'LR': array([0.87297719, 0.8741417 , 0.87285998]),
 'KNN': array([0.83000486, 0.82963812, 0.8291772 ]),
 'CART': array([0.87453672, 0.87614637, 0.87465313]),
 'NB': array([0.82330898, 0.82216876, 0.82209615]),
 'RF': array([0.89517998, 0.89591837, 0.89553954]),
 'GBM': array([0.89456437, 0.89681179, 0.89523331]),
 'AB': array([0.88553551, 0.88576561, 0.88531416]),
 'NN': array([0.89514841, 0.89348752, 0.89523647])}
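
A minimal sketch of one way to plot these scores, as one box per model, is shown below. The `cv_scores` literal repeats the values printed above so that the block runs on its own; the ordering by mean accuracy and the figure styling are assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt

# Per-fold accuracies copied from the output above.
cv_scores = {
    "LR": np.array([0.87297719, 0.8741417, 0.87285998]),
    "KNN": np.array([0.83000486, 0.82963812, 0.8291772]),
    "CART": np.array([0.87453672, 0.87614637, 0.87465313]),
    "NB": np.array([0.82330898, 0.82216876, 0.82209615]),
    "RF": np.array([0.89517998, 0.89591837, 0.89553954]),
    "GBM": np.array([0.89456437, 0.89681179, 0.89523331]),
    "AB": np.array([0.88553551, 0.88576561, 0.88531416]),
    "NN": np.array([0.89514841, 0.89348752, 0.89523647]),
}

# Sort models by mean accuracy so the strongest appear on the right.
order = sorted(cv_scores, key=lambda name: cv_scores[name].mean())

fig, ax = plt.subplots(figsize=(8, 4))
ax.boxplot([cv_scores[name] for name in order])
ax.set_xticks(range(1, len(order) + 1))
ax.set_xticklabels(order)
ax.set_xlabel("Model")
ax.set_ylabel("Accuracy (3-fold CV)")
ax.set_title("Cross-validation accuracy by model")
fig.tight_layout()
```

With only three folds each box summarizes three points, so the plot is best read as a rough ranking: RF, GBM and NN are nearly tied at the top, and the gaps between them are of the same order as their fold-to-fold spread.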