# Lazy Prediction of Loan Grade Model

First, all the necessary libraries are imported. 

In [1]:
import lazypredict
import pandas as pd
from imblearn.metrics import macro_averaged_mean_absolute_error
from helper_functions.custom_model import (
    reg_macro_averaged_mean_absolute_error,
)
from lazypredict.Supervised import LazyClassifier, LazyRegressor
from helper_functions.ml_data_prep import (
    stratified_sample,
    X_y_spilt,
)
from sklearn.preprocessing import OrdinalEncoder

Classifiers and regressors that are computationally expensive or that fail to run on this dataset are removed from lazypredict's model lists.

In [2]:
classifiers = lazypredict.Supervised.CLASSIFIERS
classifiers_to_remove = [
    "StackingClassifier",
    "CategoricalNB",
    "LabelPropagation",
    "LabelSpreading",
    "NuSVC",
    "SVC",
    "LinearSVC",
]
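# Iterate over a copy so that removing items does not skip entries;
# the module-level list itself is modified in place.
for name, estimator in classifiers[:]:
    if name in classifiers_to_remove:
        classifiers.remove((name, estimator))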

In [3]:
regressors = lazypredict.Supervised.REGRESSORS
regressors_to_remove = [
    "GammaRegressor",
    "QuantileRegressor",
    "GaussianProcessRegressor",
    "KernelRidge",
    "NuSVR",
    "SVR",
    "RandomForestRegressor",
    "ExtraTreesRegressor",
]
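# Iterate over a copy so that removing items does not skip entries;
# the module-level list itself is modified in place.
for name, estimator in regressors[:]:
    if name in regressors_to_remove:
        regressors.remove((name, estimator))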
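The same in-place filtering can be written with slice assignment. The sketch below does not import lazypredict; it uses a stand-in list of `(name, estimator)` pairs, and the placeholder classes `FastModel` and `SlowModel` are hypothetical names introduced only for illustration.

```python
# Stand-in for a lazypredict model list: (name, estimator class) pairs.
# FastModel and SlowModel are hypothetical placeholders for this sketch.
class FastModel: ...
class SlowModel: ...

registry = [
    ("LogisticRegression", FastModel),
    ("SVC", SlowModel),
    ("NuSVC", SlowModel),
]
alias = registry  # a second reference to the same list object

to_remove = {"SVC", "NuSVC"}

# Slice assignment replaces the contents of the existing list object,
# so every reference to that list sees the filtered result.
registry[:] = [(name, est) for name, est in registry if name not in to_remove]

remaining = [name for name, _ in alias]
print(remaining)
```

Mutating the existing list, rather than rebinding the name to a new list, matters whenever other code holds a reference to the original object.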

The data are loaded and split into features and target. Training is performed on the balanced training set, and only a stratified 25% sample of the validation data is used.

In [4]:
drop_cols = ["sub_grade", "int_rate", "sub_grade_enc", "grade"]
X_train, y_train = (
    pd.read_pickle("./data/data_train_balanced_mod2.pkl")
    .drop(columns=drop_cols)
    .pipe(X_y_spilt, target="grade_enc")
)
X_val, y_val = (
    pd.read_pickle("./data/data_val_mod2.pkl")
    .pipe(stratified_sample, frac=0.25, col="sub_grade")
    .drop(columns=drop_cols)
    .pipe(X_y_spilt, target="grade_enc")
)
print(f"Number of training instances {X_train.shape[0]}")
print(f"Number of validation instances {X_val.shape[0]}")
print("Target counts for validation:")
print(y_val.value_counts())

Number of training instances 53935
Number of validation instances 59660
Target counts for validation:
grade_enc
1    17682
2    16066
0    14946
3     8480
4     2010
5      402
6       74
Name: count, dtype: int64
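The helpers `stratified_sample` and `X_y_spilt` come from the project's own `helper_functions` package and are not shown in this notebook. As a rough guess at their behavior, the sketch below uses the hypothetical stand-ins `stratified_sample_sketch` and `x_y_split_sketch` on a toy frame; the `random_state` parameter and the toy column values are assumptions.

```python
import pandas as pd


def stratified_sample_sketch(df, frac, col, random_state=0):
    """Hypothetical stand-in: sample the same fraction of rows from each group of `col`."""
    return df.groupby(col).sample(frac=frac, random_state=random_state)


def x_y_split_sketch(df, target):
    """Hypothetical stand-in: separate the feature columns from the target column."""
    return df.drop(columns=[target]), df[target]


# Toy data: 8 rows of sub-grade "A1" and 4 rows of sub-grade "B1".
demo = pd.DataFrame(
    {
        "sub_grade": ["A1"] * 8 + ["B1"] * 4,
        "feature": range(12),
        "grade_enc": [0] * 8 + [1] * 4,
    }
)

X_demo, y_demo = (
    demo.pipe(stratified_sample_sketch, frac=0.25, col="sub_grade")
    .drop(columns=["sub_grade"])
    .pipe(x_y_split_sketch, target="grade_enc")
)
print(y_demo.value_counts())
```

Sampling within each `sub_grade` group keeps the class proportions of the reduced validation set close to those of the full set.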


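A range of classifiers and regressors are trained and evaluated. Both groups are compared using the macro-averaged mean absolute error: the absolute error is averaged within each true class and then across classes, so every grade carries equal weight regardless of its frequency, and predictions further from the true grade are penalized more. The metric therefore accounts for class order.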
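To make the metric concrete, here is a minimal NumPy sketch of the definition. `macro_mae_sketch` mirrors the per-class averaging described above. `reg_macro_mae_sketch` is only a guess at how a regressor's continuous output could be scored (round to the nearest grade and clip to the valid range 0 to 6); the notebook's actual `reg_macro_averaged_mean_absolute_error` helper is not shown and may differ.

```python
import numpy as np


def macro_mae_sketch(y_true, y_pred):
    """Mean absolute error computed per true class, then averaged over classes."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    per_class = [
        np.abs(y_true[y_true == c] - y_pred[y_true == c]).mean()
        for c in np.unique(y_true)
    ]
    return float(np.mean(per_class))


def reg_macro_mae_sketch(y_true, y_pred, n_classes=7):
    """Assumed regressor variant: round and clip predictions to valid grades first."""
    y_round = np.clip(np.rint(y_pred), 0, n_classes - 1).astype(int)
    return macro_mae_sketch(y_true, y_round)


y_true = [0, 0, 0, 0, 3]
near_miss = [0, 0, 0, 0, 2]  # the minority-class sample is off by one grade
far_miss = [0, 0, 0, 0, 0]   # the minority-class sample is off by three grades

# Both predictions have the same accuracy (4 of 5 correct),
# but the metric separates them by distance from the true grade.
near_score = macro_mae_sketch(y_true, near_miss)
far_score = macro_mae_sketch(y_true, far_miss)
reg_score = reg_macro_mae_sketch(y_true, [0.2, -0.4, 0.1, 0.3, 7.9])
print(near_score, far_score, reg_score)
```

Because each class contributes equally, a single badly misplaced sample from a rare grade moves the score as much as errors on an entire common grade.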

In [6]:
clf = LazyClassifier(
    random_state=42, custom_metric=macro_averaged_mean_absolute_error
)
clf_models, _ = clf.fit(X_train, X_val, y_train, y_val)
clf_models

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001995 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 2751
[LightGBM] [Info] Number of data points in the train set: 53935, number of used features: 31
[LightGBM] [Info] Start training from score -1.945910
[LightGBM] [Info] Start training from score -1.945910
[LightGBM] [Info] Start training from score -1.945910
[LightGBM] [Info] Start training from score -1.945910
[LightGBM] [Info] Start training from score -1.945910
[LightGBM] [Info] Start training from score -1.945910
[LightGBM] [Info] Start training from score -1.945910



| Model | Accuracy | Balanced Accuracy | ROC AUC | F1 Score | macro_averaged_mean_absolute_error | Time Taken |
|---|---|---|---|---|---|---|
| XGBClassifier | 0.42 | 0.36 | | 0.43 | 1.11 | 2.6 |
| LGBMClassifier | 0.43 | 0.36 | | 0.44 | 1.09 | 2.37 |
| LogisticRegression | 0.42 | 0.36 | | 0.43 | 1.13 | 2.02 |
| LinearDiscriminantAnalysis | 0.39 | 0.35 | | 0.4 | 1.09 | 0.57 |
| AdaBoostClassifier | 0.4 | 0.35 | | 0.41 | 1.16 | 7.36 |
| CalibratedClassifierCV | 0.38 | 0.35 | | 0.37 | 1.18 | 12.11 |
| ExtraTreesClassifier | 0.41 | 0.35 | | 0.41 | 1.16 | 16.56 |
| BernoulliNB | 0.38 | 0.35 | | 0.39 | 1.21 | 0.38 |
| RandomForestClassifier | 0.42 | 0.34 | | 0.43 | 1.16 | 25.1 |
| RidgeClassifier | 0.34 | 0.33 | | 0.29 | 1.34 | 0.38 |


In [7]:
reg = LazyRegressor(
    random_state=42, custom_metric=reg_macro_averaged_mean_absolute_error
)
reg_models, predictions = reg.fit(X_train, X_val, y_train, y_val)
reg_models

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.003036 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 2751
[LightGBM] [Info] Number of data points in the train set: 53935, number of used features: 31
[LightGBM] [Info] Start training from score 3.000000



| Model | Adjusted R-Squared | R-Squared | RMSE | Time Taken | reg_macro_averaged_mean_absolute_error |
|---|---|---|---|---|---|
| PoissonRegressor | -0.09 | -0.09 | 1.21 | 0.37 | 1.3 |
| LGBMRegressor | -0.11 | -0.11 | 1.22 | 0.69 | 1.33 |
| HistGradientBoostingRegressor | -0.12 | -0.12 | 1.22 | 1.51 | 1.35 |
| GradientBoostingRegressor | -0.12 | -0.12 | 1.22 | 21.77 | 1.33 |
| HuberRegressor | -0.17 | -0.17 | 1.25 | 1.54 | 1.35 |
| XGBRegressor | -0.18 | -0.18 | 1.25 | 0.7 | 1.38 |
| LinearSVR | -0.18 | -0.18 | 1.26 | 9.32 | 1.36 |
| TransformedTargetRegressor | -0.19 | -0.19 | 1.26 | 0.32 | |
| LinearRegression | -0.19 | -0.19 | 1.26 | 0.3 | |
| Ridge | -0.19 | -0.19 | 1.26 | 0.26 | |


## Outcome

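For this ordered classification problem, a plain multi-class classification approach is more suitable than regression: the best regressor listed (PoissonRegressor) reaches a macro-averaged mean absolute error of 1.30 and every regressor listed has a negative R-squared, whereas the best classifiers reach 1.09. LGBMClassifier attains that lowest error (tied with LinearDiscriminantAnalysis at two decimal places) together with the highest accuracy (0.43) and F1 score (0.44), so it is chosen for further tuning to predict loan grades.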
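The classifier table is not ordered by the custom metric, so the ranking is easier to read after re-sorting. The sketch below rebuilds four rows from the classifier results by hand and sorts them by macro-averaged mean absolute error. Breaking ties by accuracy is an assumption made for illustration, not necessarily the rule used to pick the model.

```python
import pandas as pd

# Four rows copied from the classifier results table above.
results = pd.DataFrame(
    {
        "Accuracy": [0.42, 0.43, 0.42, 0.39],
        "macro_averaged_mean_absolute_error": [1.11, 1.09, 1.13, 1.09],
    },
    index=pd.Index(
        [
            "XGBClassifier",
            "LGBMClassifier",
            "LogisticRegression",
            "LinearDiscriminantAnalysis",
        ],
        name="Model",
    ),
)

# Lower error is better; ties are broken by higher accuracy (an assumed rule).
ranked = results.sort_values(
    ["macro_averaged_mean_absolute_error", "Accuracy"],
    ascending=[True, False],
)
best_model = ranked.index[0]
print(ranked)
```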