# Lazy Prediction of Loan Acceptance Model

First, all the necessary libraries are imported.

In [19]:
import lazypredict
import pandas as pd
from lazypredict.Supervised import LazyClassifier
from helper_functions.ml_data_prep import (
    stratified_sample,
    X_y_spilt,
)
from sklearn.preprocessing import TargetEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

Classifiers that are computationally expensive or that fail to run on this data are removed from the candidate list.

In [20]:
classifiers = lazypredict.Supervised.CLASSIFIERS
models_to_remove = [
    "StackingClassifier",
    "CategoricalNB",
    "LabelPropagation",
    "LabelSpreading",
    "KNeighborsClassifier",
    "NuSVC",
    "SVC",
    "LinearSVC",
    "RandomForestClassifier",
    "ExtraTreesClassifier",
]
# Filter the shared list in place (slice assignment keeps the same list object).
classifiers[:] = [
    (name, estimator)
    for name, estimator in classifiers
    if name not in models_to_remove
]

The data are loaded and split into features and target. Training is performed on only a 20% stratified sample of the balanced training data.

In [21]:
X_train, y_train = (
    pd.read_pickle("./data/data_train_balanced_mod1.pkl")
    .pipe(stratified_sample, frac=0.2)
    .pipe(X_y_spilt)
)
X_val, y_val = pd.read_pickle("./data/data_val_mod1.pkl").pipe(X_y_spilt)
print(f"Number of training instances {X_train.shape[0]}")
print(f"Number of validation instances {X_val.shape[0]}")

Number of training instances 871424
Number of validation instances 885469
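`stratified_sample` and `X_y_spilt` are project helpers whose implementations are not shown in this notebook. As a rough illustration of what such helpers might do, here is a minimal, dependency-free sketch. The function names, the `accepted` target column, and the toy data are all hypothetical and are not taken from the project.

```python
import random
from collections import defaultdict


def stratified_sample_rows(rows, label_key, frac, seed=42):
    """Sample the same fraction of rows from every class (hypothetical helper)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[label_key]].append(row)

    sampled = []
    for label in sorted(by_class):
        group = by_class[label]
        k = round(len(group) * frac)  # per-class sample size
        sampled.extend(rng.sample(group, k))
    return sampled


def split_features_target(rows, label_key):
    """Separate the target column from the remaining feature columns."""
    X = [{k: v for k, v in row.items() if k != label_key} for row in rows]
    y = [row[label_key] for row in rows]
    return X, y


# Toy balanced data: 50 accepted (1) and 50 rejected (0) applications.
rows = [{"amount": i, "accepted": i % 2} for i in range(100)]
sample = stratified_sample_rows(rows, "accepted", frac=0.2)
X, y = split_features_target(sample, "accepted")
print(len(sample), sum(y))
```

Because each class is sampled at the same rate, a balanced input stays balanced after sampling, which is consistent with the even positive/negative counts that LightGBM reports later in the notebook.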


Basic feature engineering is performed: the high-cardinality categorical feature `state` is target-encoded, and a preprocessing pipeline is created.

In [22]:
cat_transformer = Pipeline([("enc", TargetEncoder(random_state=42))])
preprocessor = ColumnTransformer(
    transformers=[
        ("cat", cat_transformer, ["state"]),
    ],
    remainder="passthrough",
    verbose_feature_names_out=False,
).set_output(transform="pandas")

The training and validation data are preprocessed. The encoder is fitted on the training data only and then applied to the validation data.

In [23]:
X_train = preprocessor.fit_transform(X_train, y_train)
X_val = preprocessor.transform(X_val)
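To make the encoding step more concrete, the sketch below shows the basic idea of target encoding in plain Python: each category is replaced by a smoothed mean of the target. It is a simplified illustration, not scikit-learn's implementation. By default `TargetEncoder` also chooses the amount of smoothing automatically and cross-fits inside `fit_transform`, both of which are omitted here. The fixed `smooth` value, the function names, and the toy data are assumptions made for the example.

```python
from collections import defaultdict


def fit_target_encoding(categories, targets, smooth=10.0):
    """Map each category to a smoothed mean of the binary target."""
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1
    # Shrink each category mean toward the global mean; rare categories
    # are pulled more strongly than frequent ones.
    encoding = {
        cat: (sums[cat] + smooth * global_mean) / (counts[cat] + smooth)
        for cat in counts
    }
    return encoding, global_mean


def transform(categories, encoding, global_mean):
    """Encode categories; unseen ones fall back to the global mean."""
    return [encoding.get(cat, global_mean) for cat in categories]


# Hypothetical toy data: `state` values with accept (1) / reject (0) labels.
states = ["CA", "CA", "CA", "NY", "NY", "TX"]
labels = [1, 1, 0, 0, 0, 1]
encoding, global_mean = fit_target_encoding(states, labels, smooth=2.0)
print(encoding)
print(transform(["CA", "WA"], encoding, global_mean))
```

Fitting the mapping on the training labels only, as the notebook does with `fit_transform` followed by `transform`, keeps validation labels from leaking into the encoded feature.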

A number of different classifiers are trained and evaluated.

In [24]:
clf = LazyClassifier(random_state=42, ignore_warnings=False)
models, predictions = clf.fit(X_train, X_val, y_train, y_val)
models

  0%|          | 0/19 [00:00<?, ?it/s]

 95%|█████████▍| 18/19 [02:31<00:03,  3.50s/it]

[LightGBM] [Info] Number of positive: 435712, number of negative: 435712
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.003937 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 773
[LightGBM] [Info] Number of data points in the train set: 871424, number of used features: 11
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500000 -> initscore=0.000000


100%|██████████| 19/19 [02:34<00:00,  8.15s/it]


| Model | Accuracy | Balanced Accuracy | ROC AUC | F1 Score | Time Taken (s) |
|---|---|---|---|---|---|
| GaussianNB | 0.98 | 0.99 | 0.99 | 0.99 | 1.92 |
| Perceptron | 0.99 | 0.99 | 0.99 | 0.99 | 3.03 |
| XGBClassifier | 1.0 | 0.99 | 0.99 | 1.0 | 5.11 |
| LGBMClassifier | 1.0 | 0.98 | 0.98 | 1.0 | 3.51 |
| AdaBoostClassifier | 1.0 | 0.98 | 0.98 | 1.0 | 45.59 |
| LogisticRegression | 0.99 | 0.98 | 0.98 | 0.99 | 5.03 |
| SGDClassifier | 0.99 | 0.98 | 0.98 | 0.99 | 3.19 |
| CalibratedClassifierCV | 0.99 | 0.98 | 0.98 | 0.99 | 15.96 |
| BaggingClassifier | 1.0 | 0.98 | 0.98 | 1.0 | 45.38 |
| LinearDiscriminantAnalysis | 0.99 | 0.97 | 0.97 | 0.99 | 2.64 |
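Several models report an accuracy of 1.0 alongside a slightly lower balanced accuracy. One way such a gap can arise is class imbalance in the validation set. That is an assumption here, since the class distribution of the validation data is not shown. The sketch below uses made-up counts to show how the two metrics diverge when the minority class is predicted less well.

```python
def recall_per_class(y_true, y_pred):
    """Return {label: fraction of that label's samples predicted correctly}."""
    totals, hits = {}, {}
    for t, p in zip(y_true, y_pred):
        totals[t] = totals.get(t, 0) + 1
        hits[t] = hits.get(t, 0) + (t == p)
    return {label: hits[label] / totals[label] for label in totals}


def accuracy(y_true, y_pred):
    """Fraction of all samples predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Unweighted mean of the per-class recalls."""
    recalls = recall_per_class(y_true, y_pred)
    return sum(recalls.values()) / len(recalls)


# Hypothetical imbalanced validation set: 990 negatives, 10 positives.
y_true = [0] * 990 + [1] * 10
# The model gets every negative right but misses 2 of the 10 positives.
y_pred = [0] * 990 + [1] * 8 + [0] * 2

print(f"accuracy          = {accuracy(y_true, y_pred):.3f}")
print(f"balanced accuracy = {balanced_accuracy(y_true, y_pred):.3f}")
```

In this toy case accuracy is 0.998 while balanced accuracy is 0.900, because balanced accuracy weights each class equally regardless of its size. This is why the table is easier to interpret through the Balanced Accuracy column than through Accuracy alone.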


## Outcome

LGBMClassifier is faster than XGBClassifier (3.51 vs. 5.11 in the Time Taken column), but their overall performance is comparable. XGBClassifier is chosen for tuning because its level-wise tree growth may make it somewhat more robust than LGBMClassifier's leaf-wise growth. However, the near-perfect scores of even the simplest models make it clear that this classification task is rather trivial. The aim is therefore to improve on the predictions that could be obtained with simple heuristics.