# Lazy Prediction of Loan Subgrade Model

First, all the necessary libraries are imported. 

In [1]:
import lazypredict
import pandas as pd
from imblearn.metrics import macro_averaged_mean_absolute_error
from helper_functions.custom_model import (
    reg_macro_averaged_mean_absolute_error,
)
from lazypredict.Supervised import LazyClassifier, LazyRegressor
from helper_functions.ml_data_prep import (
    stratified_sample,
    X_y_spilt,
)

Classifiers and regressors that are computationally expensive or that fail to run on this data are removed from lazypredict's model lists.

In [2]:
classifiers = lazypredict.Supervised.CLASSIFIERS
classifiers_to_remove = [
    "StackingClassifier",
    "CategoricalNB",
    "LabelPropagation",
    "LabelSpreading",
    "NuSVC",
    "SVC",
    "LinearSVC",
]
# Iterate over a copy so entries can be removed from the original list.
for name, estimator in classifiers[:]:
    if name in classifiers_to_remove:
        classifiers.remove((name, estimator))

In [3]:
regressors = lazypredict.Supervised.REGRESSORS
regressors_to_remove = [
    "GammaRegressor",
    "QuantileRegressor",
    "GaussianProcessRegressor",
    "KernelRidge",
    "NuSVR",
    "SVR",
    "RandomForestRegressor",
    "ExtraTreesRegressor",
]
# Iterate over a copy so entries can be removed from the original list.
for name, estimator in regressors[:]:
    if name in regressors_to_remove:
        regressors.remove((name, estimator))
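
The removal loops above iterate over a copy (`[:]`) while deleting from the original list, presumably so that the module-level lists lazypredict reads are changed in place. The same effect can be sketched with slice assignment. The list below is a toy stand-in with placeholder classes, not lazypredict's real registry.

```python
# Toy stand-in for a registry of (name, estimator_class) tuples.
# `object` is a placeholder, not a real estimator.
registry = [
    ("LogisticRegression", object),
    ("SVC", object),
    ("RandomForestClassifier", object),
    ("NuSVC", object),
]
to_remove = {"SVC", "NuSVC"}

# A second reference to the same list, like a module attribute would be.
alias = registry

# Slice assignment replaces the contents of the existing list object,
# so every reference to it sees the filtered result.
registry[:] = [(name, est) for name, est in registry if name not in to_remove]

remaining = [name for name, _ in registry]
print(remaining)
```

Rebinding the name instead (`registry = [...]`) would create a new list and leave the original untouched, which is why an in-place operation is needed here.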

The data is loaded and split into features and target. Training is performed on the balanced data set, and only 25% of the validation data is used.

In [2]:
drop_cols = ["int_rate", "grade_enc", "sub_grade"]
X_train, y_train = (
    pd.read_pickle("./data/data_train_balanced_mod2.pkl")
    .drop(columns=drop_cols)
    .pipe(X_y_spilt, target="sub_grade_enc")
)
X_val, y_val = (
    pd.read_pickle("./data/data_val_mod2.pkl")
    .pipe(stratified_sample, frac=0.25, col="sub_grade")
    .drop(columns=drop_cols)
    .pipe(X_y_spilt, target="sub_grade_enc")
)
print(f"Number of training instances {X_train.shape[0]}")
print(f"Number of validation instances {X_val.shape[0]}")
print("Target counts for validation:")
print(y_val.value_counts())

Number of training instances 53935
Number of validation instances 59660
Target counts for validation:
sub_grade_enc
9     3794
5     3682
6     3668
8     3617
10    3593
3     3478
11    3236
12    3220
2     3136
13    3132
7     2921
4     2886
14    2885
1     2760
0     2686
15    1954
16    1850
17    1704
18    1532
19    1440
24     530
22     457
23     396
21     343
20     284
25     220
30      64
26      54
27      48
29      42
28      38
31       4
32       3
33       2
34       1
Name: count, dtype: int64
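
The `stratified_sample` helper used for the validation split is not shown in this notebook. A minimal sketch of what such a helper might look like, assuming it draws the same fraction from every subgrade with pandas' `groupby(...).sample(...)`, is given below. The function name, the `random_state` default, and the toy data are illustrative assumptions.

```python
import pandas as pd


def stratified_sample_sketch(df, frac, col, random_state=42):
    """Sample the same fraction of rows from every group of `col`.

    Hypothetical stand-in for the notebook's `stratified_sample` helper.
    """
    return df.groupby(col, group_keys=False).sample(
        frac=frac, random_state=random_state
    )


# Toy frame: 8 rows of subgrade "A1" and 4 rows of subgrade "B1".
toy = pd.DataFrame({"sub_grade": ["A1"] * 8 + ["B1"] * 4, "x": range(12)})
sampled = toy.pipe(stratified_sample_sketch, frac=0.25, col="sub_grade")
counts = sampled["sub_grade"].value_counts().to_dict()
print(counts)
```

Because the fraction is applied per group, the class proportions of the full validation set are approximately preserved in the sample.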


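A range of classifiers and regressors is trained and evaluated. Both groups are compared on the macro-averaged mean absolute error: the mean absolute error is computed separately for each class and then averaged across classes. This metric accounts for class order and gives every subgrade equal weight regardless of its frequency.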
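
To make the metric concrete, here is a small pure-Python sketch of a macro-averaged mean absolute error. It is written for illustration and is not imblearn's implementation; the labels are toy values.

```python
from collections import defaultdict


def macro_mae(y_true, y_pred):
    """Average the per-class mean absolute errors (classes come from y_true)."""
    errors_by_class = defaultdict(list)
    for t, p in zip(y_true, y_pred):
        errors_by_class[t].append(abs(t - p))
    per_class = [sum(errs) / len(errs) for errs in errors_by_class.values()]
    return sum(per_class) / len(per_class)


# Toy labels: class 0 is frequent, class 4 is rare.
y_true = [0, 0, 0, 0, 4]
y_pred = [0, 0, 0, 0, 0]  # a model that always predicts the majority class

plain_mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
print(plain_mae)                  # 0.8
print(macro_mae(y_true, y_pred))  # 2.0
```

The plain mean absolute error hides the miss on the rare class, while the macro average weights both classes equally. The size of the miss (4 grades) also matters, which is how class order enters the score.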

In [6]:
clf = LazyClassifier(
    random_state=42, custom_metric=macro_averaged_mean_absolute_error
)
clf_models, _ = clf.fit(X_train, X_val, y_train, y_val)
clf_models

  0%|          | 0/22 [00:00<?, ?it/s]

 95%|█████████▌| 21/22 [02:38<00:07,  7.40s/it]

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.003214 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 2765
[LightGBM] [Info] Number of data points in the train set: 53935, number of used features: 38
[LightGBM] [Info] Start training from score -3.555348

100%|██████████| 22/22 [02:48<00:00,  7.66s/it]


| Model | Accuracy | Balanced Accuracy | ROC AUC | F1 Score | macro_averaged_mean_absolute_error | Time Taken |
|---|---|---|---|---|---|---|
| LGBMClassifier | 0.24 | 0.25 | | 0.24 | 1.43 | 9.96 |
| LogisticRegression | 0.24 | 0.25 | | 0.23 | 1.49 | 5.23 |
| RidgeClassifier | 0.24 | 0.25 | | 0.21 | 1.56 | 0.43 |
| RidgeClassifierCV | 0.24 | 0.25 | | 0.21 | 1.56 | 0.71 |
| CalibratedClassifierCV | 0.24 | 0.25 | | 0.23 | 1.45 | 33.52 |
| BernoulliNB | 0.23 | 0.25 | | 0.21 | 1.54 | 0.48 |
| ExtraTreeClassifier | 0.21 | 0.24 | | 0.21 | 1.61 | 0.57 |
| RandomForestClassifier | 0.24 | 0.24 | | 0.23 | 1.42 | 32.12 |
| PassiveAggressiveClassifier | 0.21 | 0.23 | | 0.18 | 1.54 | 3.29 |
| XGBClassifier | 0.24 | 0.23 | | 0.24 | 1.51 | 9.34 |


In [7]:
reg = LazyRegressor(
    random_state=42, custom_metric=reg_macro_averaged_mean_absolute_error
)
reg_models, predictions = reg.fit(X_train, X_val, y_train, y_val)
reg_models

  0%|          | 0/34 [00:00<?, ?it/s]

 97%|█████████▋| 33/34 [01:54<00:00,  1.41it/s]

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.004082 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 2765
[LightGBM] [Info] Number of data points in the train set: 53935, number of used features: 38
[LightGBM] [Info] Start training from score 17.000000


100%|██████████| 34/34 [01:54<00:00,  3.37s/it]


| Model | Adjusted R-Squared | R-Squared | RMSE | Time Taken | reg_macro_averaged_mean_absolute_error |
|---|---|---|---|---|---|
| LassoCV | 0.95 | 0.95 | 1.38 | 3.35 | |
| LassoLarsIC | 0.95 | 0.95 | 1.38 | 0.48 | |
| RidgeCV | 0.94 | 0.95 | 1.38 | 0.47 | |
| Ridge | 0.94 | 0.95 | 1.38 | 0.3 | |
| BayesianRidge | 0.94 | 0.95 | 1.38 | 0.39 | |
| Lars | 0.94 | 0.95 | 1.38 | 0.33 | |
| LinearRegression | 0.94 | 0.95 | 1.38 | 0.41 | |
| TransformedTargetRegressor | 0.94 | 0.95 | 1.38 | 0.37 | |
| RANSACRegressor | 0.94 | 0.94 | 1.38 | 0.41 | |
| HistGradientBoostingRegressor | 0.94 | 0.94 | 1.38 | 1.28 | |


## Outcome

For this ordered classification problem, a plain multiclass classifier is more suitable than regression. LGBMClassifier has one of the lowest macro-averaged mean absolute errors among the listed classifiers (1.43, against 1.42 for RandomForestClassifier, whose Time Taken is roughly three times higher) and is chosen for further tuning to predict loan subgrades. With this many classes, regression models often fail to predict the minority classes at all.
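
The point about minority classes can be illustrated with a toy example. The sketch below assumes that continuous regression outputs are turned into class labels by rounding and clipping. Whether the notebook's `reg_macro_averaged_mean_absolute_error` helper does exactly this is not shown, so treat the conversion as an assumption, and the numbers are made up for illustration.

```python
def to_class_labels(continuous_preds, n_classes):
    """Round regression outputs to the nearest class index, clipped to 0..n_classes-1."""
    return [min(max(round(p), 0), n_classes - 1) for p in continuous_preds]


# Toy targets on a 0..4 scale; the extreme classes 0 and 4 are rare.
y_true = [0, 1, 2, 2, 2, 3, 4]

# Toy regression outputs that are pulled toward the mean of the targets.
raw_preds = [1.2, 1.6, 2.0, 2.1, 1.9, 2.4, 2.8]

labels = to_class_labels(raw_preds, n_classes=5)
never_predicted = sorted(set(y_true) - set(labels))

print(labels)           # [1, 2, 2, 2, 2, 2, 3]
print(never_predicted)  # [0, 4]
```

Predictions shrunk toward the center never reach the extreme labels, so those classes contribute a large per-class error to a macro-averaged metric.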