In Step-2, we performed the data preprocessing steps needed to make the dataset ready for ingestion by machine learning models.

In this step, we'll:
* Experiment with training multiple models on the preprocessed data using the pipeline. The idea is to train quick-and-dirty models with standard parameters to gauge which models perform well, which features are important for these models, etc. The models we'll be experimenting with are: Logistic regression, Random Forest, XgBoost, LightGBM, SVM
* Experiment with and validate choices in the preprocessing pipeline such as encoding type, scaling, outlier-capping etc.
* Finally, shortlist the top-3 performing models

In [2]:
# Add src path
import sys
sys.path.append("../")

In [3]:
# Imports 
from src.utils import TRAIN_DATA_PATH
from src.data_preprocessing.preprocessor import build_pipeline

import os
import mlflow
import numpy as np
import pandas as pd
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.linear_model import LogisticRegression, LinearRegression

In [4]:
# Load the data
train_df = pd.read_csv(TRAIN_DATA_PATH, index_col=0)
train_df.reset_index(drop=True, inplace=True)
train_df.head()

Unnamed: 0,loan_amnt,term,int_rate,installment,grade,sub_grade,emp_title,emp_length,home_ownership,annual_inc,...,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,application_type,mort_acc,pub_rec_bankruptcies,address,loan_status
0,18500.0,60 months,10.65,340.24,B,B2,CNMI Government,10+ years,OWN,40000.0,...,0.0,8.0,0.1,27.0,f,INDIVIDUAL,,0.0,"7530 Barnes Flat Apt. 584\r\nWhitetown, NV 30723",Fully Paid
1,13175.0,36 months,16.55,466.78,D,D2,customer service / account rep,4 years,RENT,30000.0,...,1.0,1046.0,15.8,8.0,f,INDIVIDUAL,0.0,0.0,"443 Rice Views Apt. 282\r\nNorth Jameshaven, A...",Fully Paid
2,35000.0,60 months,17.86,886.11,D,D5,Branch Manager,10+ years,MORTGAGE,80000.0,...,0.0,20239.0,57.5,36.0,w,INDIVIDUAL,2.0,0.0,3857 Christopher Courts Suite 005\r\nEast Chri...,Charged Off
3,20400.0,36 months,12.12,678.75,B,B3,California Dept of transportation,10+ years,RENT,65000.0,...,0.0,12717.0,49.4,31.0,f,INDIVIDUAL,0.0,0.0,"840 Parks Viaduct\r\nLake Brittanyside, MT 48052",Fully Paid
4,35000.0,60 months,17.57,880.61,D,D4,Air Traffic Control Specialist,10+ years,RENT,200000.0,...,0.0,14572.0,63.1,8.0,w,INDIVIDUAL,0.0,0.0,"042 Jamie Grove\r\nEast Maryshire, LA 70466",Charged Off


In [5]:
# Create dataframe copy to work on
train_df_copy = train_df.copy()

# Feature types
target_feat = "loan_status"
independent_feat = [col for col in train_df_copy.columns if col!=target_feat and col not in {"title", "emp_title"}] # excluding title & emp_title columns from beginning
date_feat = ["issue_d", "earliest_cr_line"]
num_feat = [col for col in independent_feat if train_df_copy[col].dtype=="float"]
cat_feat = [col for col in independent_feat if col not in set(num_feat).union(date_feat)]
eng_feat = ["emi_ratio", "credit_age_years", "closed_acc", "credit_util_ratio", "mortgage_ratio"]

# Types of categorical features
ordinal_feat = ["grade", "sub_grade", "emp_length"]
supervised_feat = ["purpose", "address"] # features to receive supervised categorical encoding i.e. Target encoding
ohe_feat = ["verification_status", "application_type", "initial_list_status", "home_ownership"]

In [33]:
# Fetch X & y
X, y = train_df_copy[independent_feat], train_df_copy[target_feat]

# Encoding the target feature
y = y.map({"Charged Off": 1, "Fully Paid": 0})

# Experiment setup
We'll be using:
* Cross-validation i.e. `StratifiedKFold` to maintain the same target label distribution across the train & validation splits. The metrics we'll be using are: F1-score & PR-AUC
* MLFlow to track the experiments
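
To make the stratification point concrete, here is a minimal sketch on synthetic labels (not the loan data; `y_demo` and `X_demo` are illustrative names). It checks that each validation fold keeps the same positive-class share as the full label vector.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic imbalanced labels: 20% positives (illustrative only)
y_demo = np.array([1] * 200 + [0] * 800)
X_demo = np.zeros((len(y_demo), 1))  # features do not influence the split

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Positive-class share inside each validation fold
fold_pos_rates = [y_demo[val_idx].mean() for _, val_idx in skf.split(X_demo, y_demo)]
print(fold_pos_rates)
```

With a plain `KFold`, the positive share could drift from fold to fold, which makes F1 and PR-AUC noisier on imbalanced targets.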

In [7]:
# Cross-validation setup
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Mlflow experiment tracking
tracking_uri_path = os.path.abspath('../mlflow_tracking/mlruns').replace('\\', '/')
mlflow.set_tracking_uri(f"file:///{tracking_uri_path}") # Set path for experiment tracking artifacts 
mlflow.set_experiment("loantap-experiments")



<Experiment: artifact_location='file:///c:/Users/raon1/Desktop/Python/Projects/LoanTap/mlflow_tracking/mlruns/529194635109779319', creation_time=1765452200241, experiment_id='529194635109779319', last_update_time=1765452200241, lifecycle_stage='active', name='loantap-experiments', tags={'mlflow.experimentKind': 'custom_model_development'}>

In [8]:
# Function to run cross-validation experiments & log them
def run_cross_validation(run_name, pipeline_config, model, cv, X, y, metrics=["precision", "recall", "f1", "average_precision"]):
    # Fetch preprocessing pipeline
    pipeline = build_pipeline(num_feat, eng_feat, cat_feat, supervised_feat, ohe_feat, ordinal_feat, **pipeline_config)
    # Append model to end of pipeline steps
    pipeline.steps.append(("model", model))
    
    # Mlflow run
    with mlflow.start_run(run_name=run_name):
        scores = cross_validate(
            estimator=pipeline,
            X=X,
            y=y,
            cv=cv,
            scoring=metrics,
            return_train_score=True,
            n_jobs=-1
        )

        # Fetch and log train & test metrics
        metrics_dict = {"Train": [], "Test": []}
        for metric in metrics:
            train_metric_mean, train_metric_std = scores[f"train_{metric}"].mean(), scores[f"train_{metric}"].std()
            test_metric_mean, test_metric_std = scores[f"test_{metric}"].mean(), scores[f"test_{metric}"].std()
            metrics_dict["Train"].append(f"{train_metric_mean:.3f} ± {train_metric_std:.3f}")
            metrics_dict["Test"].append(f"{test_metric_mean:.3f} ± {test_metric_std:.3f}")
        
        # Log the metrics_df
        metrics_df = pd.DataFrame(metrics_dict, index=metrics,)
        mlflow.log_table(metrics_df.reset_index().rename(columns={"index": "Metric"}), artifact_file="metrics.json")
        print(f"----------{run_name}----------")
        print(metrics_df)
        print("="*100)

        # Log the parameters
        for k, v in pipeline_config.items():
            mlflow.log_param(k, v)

# Experiment-1: Logistic regression

We'll start with the simplest binary-classification model i.e. Logistic regression. 

We'll execute separate runs for the following sub-experiments:
* Keep or drop features as per VIF analysis
* Try different categorical encoding types: One-hot + Target or One-hot only
* Keep or cap outliers
* Use SMOTE vs. the `class_weight` parameter to address class imbalance

## Dropping columns

We drop the following features:
* `grade`: This information is already captured by `sub_grade`, with more granularity
* `total_acc`: This signal is already captured by `closed_acc` (engineered feature) & `open_acc`. It was also used to engineer `mortgage_ratio`, so dropping it avoids multicollinearity

## VIF analysis
But before the experiments, we need to perform multicollinearity analysis using VIF (Variance Inflation Factor). For linear models such as Logistic regression, multicollinearity makes the coefficients (and hence the feature importances) unreliable. So, it's important to handle it first.
$$VIF=\frac{1}{1-R^2}$$
Here $R^2$ is obtained by regressing a feature on all the other features. We keep only features with VIF<5 and drop features with VIF>=5. 
Process:
* Calculate VIF for each feature in the dataset
* Drop only the feature with highest VIF 
* Repeat the process until the max VIF is <5 i.e. all features have VIF<5
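
The process above can be sketched as a loop. This is a minimal illustration on synthetic data, not the exact code used to pick the dropped features; the helper names `vif_table` and `drop_high_vif` are hypothetical, and the `R^2 >= 0.999` guard simply mirrors the one used later in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def vif_table(df):
    """VIF per column: regress each column on the others, then 1 / (1 - R^2)."""
    rows = []
    for col in df.columns:
        others = df.drop(columns=col)
        r2 = LinearRegression().fit(others, df[col]).score(others, df[col])
        rows.append((col, np.inf if r2 >= 0.999 else 1.0 / (1.0 - r2)))
    table = pd.DataFrame(rows, columns=["Feature", "VIF"])
    return table.sort_values("VIF", ascending=False, ignore_index=True)

def drop_high_vif(df, threshold=5.0):
    """Drop the single highest-VIF column, recompute, repeat until max VIF < threshold."""
    df = df.copy()
    dropped = []
    while df.shape[1] > 1:
        table = vif_table(df)
        worst_feature, worst_vif = table.loc[0, "Feature"], table.loc[0, "VIF"]
        if worst_vif < threshold:
            break
        df = df.drop(columns=worst_feature)
        dropped.append(worst_feature)
    return df, dropped

# Synthetic check: c is almost exactly a + b, so one of the three has to go
rng = np.random.default_rng(0)
demo = pd.DataFrame({"a": rng.normal(size=500), "b": rng.normal(size=500)})
demo["c"] = demo["a"] + demo["b"] + rng.normal(scale=0.01, size=500)
demo["d"] = rng.normal(size=500)

kept, dropped = drop_high_vif(demo)
print(dropped)
print(vif_table(kept))
```

Dropping one feature at a time matters because removing a single collinear column usually lowers the VIF of the columns it was correlated with.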

After the analysis, it turns out that these features had large VIF values and need to be dropped in the experiments: `loan_amnt, sub_grade, negative_rec, mort_acc, revol_bal`. The VIF values of the remaining features after dropping can be seen below.

**Note:** VIF asks us to drop `sub_grade`, which can be an important signal to identify defaulters. So, instead of relying blindly on VIF analysis, we'll experiment with both dropping and keeping the features flagged by VIF.

In [16]:
# Transformed dataset
logical_features_drop = ["grade", "total_acc"]
features_to_drop = date_feat + logical_features_drop
pipe = build_pipeline(num_feat, eng_feat, cat_feat, supervised_feat, ohe_feat, ordinal_feat, features_to_drop=features_to_drop)
X_trans = pipe.fit_transform(X, y)
X_trans.head()

Unnamed: 0,loan_amnt,term,int_rate,installment,sub_grade,emp_length,annual_inc,purpose,dti,open_acc,...,closed_acc,negative_rec,credit_util_ratio,mortgage_ratio,verification_status_Verified,application_type_NON_INDIVIDUAL,initial_list_status_w,home_ownership_OTHER,home_ownership_OWN,home_ownership_RENT
0,0.541667,1,-0.442244,-0.110755,-0.444444,0.428571,-0.533333,0.20702,0.805983,-0.333333,...,0.5,0,-0.9,-0.090909,1.0,0.0,0.0,0.0,1.0,0.0
1,0.097917,0,0.531353,0.287057,0.666667,-0.428571,-0.755556,0.20702,-1.252137,-0.666667,...,-0.916667,1,-0.75,-0.454545,1.0,0.0,0.0,0.0,0.0,1.0
2,1.916667,1,0.747525,1.605332,1.0,0.428571,0.355556,0.167457,1.491453,0.333333,...,0.916667,0,0.35,0.090909,1.0,0.0,1.0,0.0,0.0,0.0
3,0.7,0,-0.19967,0.953441,-0.333333,0.428571,0.022222,0.20702,0.375214,1.333333,...,0.0,0,0.1,-0.454545,1.0,0.0,0.0,0.0,0.0,1.0
4,1.916667,1,0.69967,1.588041,0.888889,0.428571,3.022222,0.20702,-1.011111,-0.833333,...,-0.833333,0,-0.55,-0.454545,1.0,0.0,1.0,0.0,0.0,1.0


In [22]:
def calculate_vif(df: pd.DataFrame):
    # Work on copy of dataframe
    df = df.copy()

    # Drop features
    df.drop(columns=["loan_amnt", "sub_grade", "negative_rec", "mort_acc", "revol_bal"], inplace=True)

    vifs = []
    for col in df:
        lr = LinearRegression()
        y_vif = df[col]
        x_vif = df.drop(col, axis=1)
        model = lr.fit(x_vif, y_vif)
        r2 = model.score(x_vif, y_vif)

        if r2 >= 0.999:
            vif = np.inf
        else:
            vif = round(1 / (1 - r2), 2)
        vifs.append([col, vif])

    vifs = pd.DataFrame(vifs, columns=["Feature", "VIF"]).sort_values("VIF", axis=0, ascending=False).reset_index(drop=True)
    return vifs
    
calculate_vif(df=X_trans)

Unnamed: 0,Feature,VIF
0,emi_ratio,2.97
1,installment,2.94
2,pub_rec,2.81
3,pub_rec_bankruptcies,2.8
4,annual_inc,2.61
5,credit_util_ratio,1.9
6,int_rate,1.64
7,home_ownership_RENT,1.51
8,open_acc,1.51
9,revol_util,1.46


## Keeping vs Dropping features as per VIF

This experiment is run with:
* Imputing missing values
* Capping the outliers
* Target-encoding (for slightly higher cardinality features) + Onehot-encoding for categorical features
* Scaling the numerical features using `RobustScaler`

In [None]:
# Instantiating the model
logical_feat_drop = date_feat + ["grade", "total_acc"]
vif_feat_drop = ["loan_amnt", "sub_grade", "negative_rec", "mort_acc", "revol_bal"]
lr_model = LogisticRegression(max_iter=500)

# Pipeline config
config = {
    "use_imputation": True,
    "use_outlier_capping": True,
    "use_encoding": True, 
    "use_scaling": True, 
    "use_smote": False
}

# Experiment run
for drop_vif_features in [True, False]:
    if drop_vif_features:
        run_name = f"logreg_w_vif_dropping"
        config["features_to_drop"] = logical_feat_drop + vif_feat_drop
    else:
        run_name = f"logreg_wo_vif_dropping"
        config["features_to_drop"] = logical_feat_drop
    run_cross_validation(
        run_name=run_name,
        pipeline_config=config,
        model=lr_model,
        cv=cv,
        X=X,
        y=y
    )

----------logreg_w_vif_dropping----------
                           Train           Test
precision          0.932 ± 0.002  0.932 ± 0.003
recall             0.469 ± 0.001  0.469 ± 0.003
f1                 0.624 ± 0.001  0.624 ± 0.003
average_precision  0.775 ± 0.001  0.775 ± 0.003
----------logreg_wo_vif_dropping----------
                           Train           Test
precision          0.918 ± 0.001  0.918 ± 0.005
recall             0.477 ± 0.001  0.477 ± 0.004
f1                 0.628 ± 0.001  0.628 ± 0.004
average_precision  0.778 ± 0.001  0.778 ± 0.003


## Missing values experiment run

From the previous run, we know that keeping the features & ignoring the VIF analysis yields slightly higher F1 & PR-AUC. Hence we'll ignore VIF & keep the features for future runs.

For this run we'll try:
* Dropping the missing values
* Imputing the missing values

and check if there is improvement in performance by dropping them

In [36]:
# Pipeline config
config = {
    "use_outlier_capping": True,
    "use_encoding": True, 
    "use_scaling": True, 
    "use_smote": False,
    "features_to_drop": logical_feat_drop
}

# Experiment run
for impute in [True, False]:
    config["use_imputation"] = impute
    if impute:
        run_name = f"logreg_wo_vif_impute_nan"
    else:
        run_name = f"logreg_wo_vif_drop_nan"
        
        # Dropping missing values
        nan_rows = X.isna().any(axis=1)
        X, y = X.loc[~nan_rows], y.loc[~nan_rows]
    run_cross_validation(
        run_name=run_name,
        pipeline_config=config,
        model=lr_model,
        cv=cv,
        X=X,
        y=y
    )

----------logreg_wo_vif_impute_nan----------
                           Train           Test
precision          0.918 ± 0.001  0.918 ± 0.005
recall             0.477 ± 0.001  0.477 ± 0.004
f1                 0.628 ± 0.001  0.628 ± 0.004
average_precision  0.778 ± 0.001  0.778 ± 0.003
----------logreg_wo_vif_drop_nan----------
                           Train           Test
precision          0.909 ± 0.001  0.909 ± 0.002
recall             0.484 ± 0.001  0.484 ± 0.004
f1                 0.632 ± 0.001  0.632 ± 0.004
average_precision  0.783 ± 0.001  0.782 ± 0.002


## Encoding experiment run
From the previous run, we know that dropping rows with missing values improved the F1 & PR-AUC metrics (by ~0.4 percentage points). This may suggest that the values are not missing completely at random, i.e. they are of either type: MAR (Missing at Random) or MNAR (Missing Not at Random). For now, we'll drop the missing values for future runs.
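
A more direct way to probe whether missingness is related to the target is to compare the default rate of rows with and without missing values. The sketch below uses a tiny hand-made frame; the helper name `default_rate_by_missingness` and the demo values are hypothetical, and a gap between the two rates is only a hint, not proof of the missingness mechanism.

```python
import numpy as np
import pandas as pd

def default_rate_by_missingness(X, y):
    """Positive-class rate for rows with vs. without any missing value."""
    has_nan = X.isna().any(axis=1)
    summary = y.groupby(has_nan).agg(rows="size", default_rate="mean")
    return summary.rename_axis("has_missing")

# Hand-made demo data (not the loan dataset)
X_demo = pd.DataFrame({
    "mort_acc": [1.0, np.nan, 0.0, np.nan, 2.0, 3.0],
    "revol_util": [10.0, 20.0, np.nan, 40.0, 50.0, 60.0],
})
y_demo = pd.Series([0, 1, 0, 1, 0, 0])

summary = default_rate_by_missingness(X_demo, y_demo)
print(summary)
```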

For this run we'll try using:
* Onehot + Target encoding
* Only Onehot encoding all the categorical features 
* Only target encoding all the categorical features 

**Note:** These are excluding the ordinal features which are encoded separately in the encoding step
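
For readers unfamiliar with target encoding, here is a minimal sketch of the idea: each category is replaced by a smoothed mean of the target, fitted on the training fold only so the validation fold does not leak into the encoding. The encoder inside `build_pipeline` is not shown in this notebook, so the function `target_encode` and its `smoothing` parameter are assumptions made for illustration.

```python
import pandas as pd

def target_encode(train_cat, train_y, valid_cat, smoothing=10.0):
    """Smoothed mean-target encoding fitted on the training fold only."""
    global_mean = train_y.mean()
    stats = train_y.groupby(train_cat).agg(["mean", "count"])
    # Shrink each category mean toward the global mean; rare categories shrink more
    encoding = (stats["mean"] * stats["count"] + global_mean * smoothing) / (stats["count"] + smoothing)
    # Categories unseen during fitting fall back to the global mean
    return valid_cat.map(encoding).fillna(global_mean)

# Hand-made demo data (not the loan dataset)
train_cat = pd.Series(["car", "car", "house", "house", "house", "medical"])
train_y = pd.Series([1, 1, 0, 0, 1, 0])
valid_cat = pd.Series(["car", "vacation"])

encoded = target_encode(train_cat, train_y, valid_cat, smoothing=2.0)
print(encoded.tolist())
```

One-hot encoding creates one column per category, while target encoding keeps a single numeric column, which is why it is usually preferred for higher-cardinality features.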

In [39]:
# Using X & y directly, as they don't contain missing values due to previous run
# Pipeline config
config = {
    "use_imputation": False,
    "use_outlier_capping": True,
    "use_encoding": True, 
    "use_scaling": True, 
    "use_smote": False,
    "features_to_drop": logical_feat_drop
}

# Experiment run
for encoding_option in ["ohe_only", "ohe_target", "target_only"]:
    run_name = f"logreg_wo_vif_drop_nan_{encoding_option}_encoding"
    if encoding_option == "ohe_only":
        supervised_feat = None
        ohe_feat = ["verification_status", "application_type", "initial_list_status", "home_ownership", "purpose", "address"]
    elif encoding_option == "ohe_target":
        supervised_feat = ["purpose", "address"]
        ohe_feat = ["verification_status", "application_type", "initial_list_status", "home_ownership"]
    else:
        supervised_feat = ["verification_status", "application_type", "initial_list_status", "home_ownership", "purpose", "address"]
        ohe_feat = None
    run_cross_validation(
        run_name=run_name,
        pipeline_config=config,
        model=lr_model,
        cv=cv,
        X=X,
        y=y
    )

----------logreg_wo_vif_drop_nan_ohe_only_encoding----------
                           Train           Test
precision          0.908 ± 0.002  0.908 ± 0.003
recall             0.485 ± 0.001  0.485 ± 0.005
f1                 0.633 ± 0.001  0.632 ± 0.004
average_precision  0.783 ± 0.001  0.783 ± 0.002
----------logreg_wo_vif_drop_nan_ohe_target_encoding----------
                           Train           Test
precision          0.908 ± 0.001  0.908 ± 0.002
recall             0.485 ± 0.001  0.485 ± 0.005
f1                 0.633 ± 0.000  0.632 ± 0.004
average_precision  0.783 ± 0.001  0.782 ± 0.002
----------logreg_wo_vif_drop_nan_target_only_encoding----------
                           Train           Test
precision          0.909 ± 0.001  0.909 ± 0.002
recall             0.484 ± 0.001  0.484 ± 0.004
f1                 0.632 ± 0.001  0.632 ± 0.004
average_precision  0.783 ± 0.001  0.782 ± 0.002


## Keeping vs Capping Outliers run
From the previous run, we know that using One-hot only encoding results in a negligible improvement in PR-AUC (~0.1%), hence we choose One-hot only encoding for future runs.

Since Logistic regression uses the sigmoid function, which squashes all outputs into the range 0-1, the effect of outliers is reduced. That doesn't mean outliers can't affect Logistic Regression, which is why we test it in this run. 

Also, to remove the influence of outliers if present, scaling is done using `RobustScaler`.
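
The capping rule used inside `build_pipeline` is not shown in this notebook. As an illustration only, one common choice is IQR-based capping, with the bounds learned on the training data and then applied to any split. The helper names and the factor `k=1.5` below are assumptions.

```python
import pandas as pd

def fit_iqr_bounds(train_col, k=1.5):
    """Learn lower/upper caps from the training data: Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = train_col.quantile([0.25, 0.75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def cap_outliers(col, lower, upper):
    """Clip values to the learned bounds instead of removing the rows."""
    return col.clip(lower=lower, upper=upper)

# Hand-made demo column with one extreme value
income = pd.Series([30.0, 40.0, 50.0, 60.0, 70.0, 1000.0])
lower, upper = fit_iqr_bounds(income)
capped = cap_outliers(income, lower, upper)
print(lower, upper)
print(capped.tolist())
```

Capping keeps every row, so the sample size matches the uncapped run and only the extreme magnitudes change.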

In [41]:
# Pipeline config
supervised_feat = None
ohe_feat = ["verification_status", "application_type", "initial_list_status", "home_ownership", "purpose", "address"]
config = {
    "use_imputation": False,
    "use_encoding": True, 
    "use_scaling": True, 
    "use_smote": False,
    "features_to_drop": logical_feat_drop
}

# Experiment run
for cap_outliers in [True, False]:
    config["use_outlier_capping"] = cap_outliers
    if cap_outliers:
        run_name = f"logreg_wo_vif_drop_nan_ohe_only_w_outlier_capping"  
    else:
        run_name = f"logreg_wo_vif_drop_nan_ohe_only_wo_outlier_capping"
    run_cross_validation(
        run_name=run_name,
        pipeline_config=config,
        model=lr_model,
        cv=cv,
        X=X,
        y=y
    )

----------logreg_wo_vif_drop_nan_ohe_only_w_outlier_capping----------
                           Train           Test
precision          0.908 ± 0.002  0.908 ± 0.003
recall             0.485 ± 0.001  0.485 ± 0.005
f1                 0.633 ± 0.001  0.632 ± 0.004
average_precision  0.783 ± 0.001  0.783 ± 0.002
----------logreg_wo_vif_drop_nan_ohe_only_wo_outlier_capping----------
                           Train           Test
precision          0.904 ± 0.002  0.904 ± 0.003
recall             0.489 ± 0.001  0.488 ± 0.005
f1                 0.634 ± 0.001  0.634 ± 0.004
average_precision  0.784 ± 0.001  0.783 ± 0.002


## SMOTE vs. class_weights run
From the previous run, we know that skipping outlier capping improved Recall & F1 slightly (~0.2-0.3 percentage points), while test PR-AUC stayed at 0.783. Hence, we keep the outliers as they are for future runs.

For this run, we'll experiment with methods to handle class imbalance, i.e. SMOTE vs. the model's `class_weight` parameter.
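
As a quick illustration of what `class_weight="balanced"` does, the sketch below computes the weights on synthetic labels (not the loan data). scikit-learn sets each class weight to `n_samples / (n_classes * class_count)`, so the minority class gets a proportionally larger weight in the loss.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Synthetic labels: 80 negatives, 20 positives (illustrative only)
y_demo = np.array([0] * 80 + [1] * 20)

weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y_demo)

# Same formula written out by hand: n_samples / (n_classes * class_count)
manual = len(y_demo) / (2 * np.bincount(y_demo))
print(weights, manual)
```

Unlike SMOTE, this approach does not create synthetic rows, so the training set size (and the fit time) stays the same.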

In [44]:
# Pipeline config
supervised_feat = None
ohe_feat = ["verification_status", "application_type", "initial_list_status", "home_ownership", "purpose", "address"]
config = {
    "use_imputation": False,
    "use_outlier_capping": False,
    "use_encoding": True,
    "use_scaling": True, 
    "features_to_drop": logical_feat_drop
}

# Experiment run
for smote_use in [True, False]:
    config["use_smote"] = smote_use
    if smote_use:
        run_name = f"logreg_wo_vif_drop_nan_ohe_only_wo_outlier_capping_w_smote"
    else:
        run_name = f"logreg_wo_vif_drop_nan_ohe_only_wo_outlier_capping_w_class_weights"
        lr_model = LogisticRegression(max_iter=500, class_weight="balanced")
    run_cross_validation(
        run_name=run_name,
        pipeline_config=config,
        model=lr_model,
        cv=cv,
        X=X,
        y=y
    )

----------logreg_wo_vif_drop_nan_ohe_only_wo_outlier_capping_w_smote----------
                           Train           Test
precision          0.530 ± 0.001  0.529 ± 0.002
recall             0.773 ± 0.001  0.772 ± 0.005
f1                 0.628 ± 0.001  0.628 ± 0.002
average_precision  0.779 ± 0.001  0.779 ± 0.002
----------logreg_wo_vif_drop_nan_ohe_only_wo_outlier_capping_w_class_weights----------
                           Train           Test
precision          0.522 ± 0.001  0.522 ± 0.003
recall             0.792 ± 0.001  0.791 ± 0.004
f1                 0.629 ± 0.001  0.629 ± 0.003
average_precision  0.784 ± 0.001  0.783 ± 0.002


## Insights

From the experiments with Logistic regression, we observed:
* Keeping the features that VIF suggested dropping gives better metrics than dropping them. My intuition is that, since the model uses `penalty="l2"` by default, the regularization takes care of the multicollinearity and prevents the feature coefficients from becoming unstable. Also, dropping important features such as `sub_grade` & `loan_amnt` leads to a loss of predictive signal, making it harder for the model to learn.
* Using only One-hot encoding is slightly better than Target encoding the features with comparatively higher cardinality (7-10)
* Skipping outlier capping provides slightly better metrics than capping
* Handling class imbalance using SMOTE or class weights significantly improved Recall, but Precision dropped sharply and F1 dropped slightly, while PR-AUC remained about the same. 
Is it worth having higher Recall at the cost of Precision? I think so, because the cost of a False Negative (failing to identify a defaulter) outweighs that of a False Positive (incorrectly flagging an eligible applicant as a defaulter).
So, we'll use the model's `class_weight` parameter to handle class imbalance, as it's more efficient than SMOTE both computationally & performance-wise.

# Experiment-2: Random Forest

In this section we'll experiment with a Bagging model i.e. Random Forest, which is more complex than Logistic Regression.

Decision-tree-based algorithms are more robust and don't require:
* VIF analysis, as they aren't affected by multicollinearity. Multicollinearity comes into the picture when a linear combination of all features is used for prediction, but in DT-based algorithms every condition evaluated at a node considers only a single feature at a time. So, we won't be dropping any features and will let the algorithm figure out which features to consider for node splitting.
* Encoding the categorical features, at least in newer implementations that handle them natively. But since sklearn's implementation requires us to encode the categorical features, we'll try out different encoding strategies i.e. one-hot only, one-hot + target & target only
* Outlier handling, since they are robust to outliers. But we'll still experiment with & without outlier handling
* Missing value imputation, in recent versions of the implementation. The model figures out how to route samples with missing values to child nodes during both training & inference. But we'll still experiment with imputing the NaNs vs. leaving them as they are vs. dropping them.
* Scaling, because a condition evaluated at a tree node depends on the relative ordering of samples rather than their magnitudes. So, we won't experiment with scaling
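
The scaling point can be checked with a small experiment. The sketch below uses synthetic integer-valued features (not the loan data) and fits the same decision tree on raw and on rescaled inputs. The rescaling `x * 10 + 3` is just a stand-in for any order-preserving scaler.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X_raw = rng.integers(0, 100, size=(400, 3)).astype(float)
y_demo = (X_raw[:, 0] + X_raw[:, 1] > 100).astype(int)

# Order-preserving rescaling: the ranking of samples within each feature is unchanged
X_scaled = X_raw * 10.0 + 3.0

tree_raw = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_raw, y_demo)
tree_scaled = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_scaled, y_demo)

pred_raw = tree_raw.predict(X_raw)
pred_scaled = tree_scaled.predict(X_scaled)
same_predictions = np.array_equal(pred_raw, pred_scaled)
print(same_predictions)
```

Only the split thresholds move with the rescaling; the way samples are partitioned at each node stays the same, which is why the scaling step is skipped for the tree-based models.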

From the previous experiment, it was pretty clear that using the `class_weight` parameter was beneficial. So, we'll set `class_weight="balanced_subsample"` for `RandomForestClassifier`, which computes the class weights from the bootstrapped sample used to grow each tree.

## Keep vs Impute Missing values run

In this experiment, we'll modify certain parameters to prevent the Random-Forest model from overfitting and to keep the experiment time manageable:
* increase the number of estimators (default=100) to 200
* set a lower `max_depth` (default=None) i.e. 10
* set a higher `min_samples_leaf` (default=1) i.e. 50 

We'll keep this RandomForest config the same for all the experiment runs

In [47]:
# Resetting X & y to contain missing values (i.e. original data)
X, y = train_df_copy[independent_feat], train_df_copy[target_feat]

# Encoding the target feature
y = y.map({"Charged Off": 1, "Fully Paid": 0})

In [49]:
# Instantiating the model
logical_feat_drop = date_feat
rf_model = RandomForestClassifier(
    n_estimators=200,
    max_depth=10, 
    min_samples_leaf=50,
    class_weight="balanced_subsample",
    n_jobs=-1
)

# Pipeline config
config = {
    "use_outlier_capping": False,
    "use_encoding": True,
    "use_scaling": False,
    "use_smote": False, 
    "features_to_drop": logical_feat_drop
}

# Experiment run
for option in ["impute", "keep", "drop"]:
    if option == "impute":
        run_name = f"rf_impute_nan"
        config["use_imputation"] = True
    elif option == "keep":
        # Let the model handle the missing values
        run_name = f"rf_keep_nan"
        config["use_imputation"] = False
    else:
        run_name = f"rf_drop_nan"
        config["use_imputation"] = False

        # Dropping missing values
        nan_rows = X.isna().any(axis=1)
        X, y = X.loc[~nan_rows], y.loc[~nan_rows]
    run_cross_validation(
        run_name=run_name,
        pipeline_config=config,
        model=rf_model,
        cv=cv,
        X=X,
        y=y
    )

----------rf_impute_nan----------
                           Train           Test
precision          0.463 ± 0.001  0.460 ± 0.005
recall             0.791 ± 0.002  0.787 ± 0.002
f1                 0.584 ± 0.000  0.581 ± 0.004
average_precision  0.757 ± 0.001  0.753 ± 0.004
----------rf_keep_nan----------
                           Train           Test
precision          0.466 ± 0.001  0.463 ± 0.003
recall             0.789 ± 0.001  0.785 ± 0.003
f1                 0.586 ± 0.001  0.583 ± 0.003
average_precision  0.758 ± 0.001  0.754 ± 0.003
----------rf_drop_nan----------
                           Train           Test
precision          0.467 ± 0.003  0.463 ± 0.003
recall             0.796 ± 0.003  0.790 ± 0.004
f1                 0.588 ± 0.002  0.584 ± 0.003
average_precision  0.760 ± 0.001  0.755 ± 0.003


## Encoding experiment run
From the previous run, we know that dropping missing values gave slightly better performance. Hence we'll drop missing values for future runs. However, Random-Forest's performance is worse than that of Logistic-regression. Let's see if it improves in the following runs.

For this run we'll try using:
* Onehot + Target encoding
* Only Onehot encoding all the categorical features 
* Only target encoding all the categorical features 

**Note:** These are excluding the ordinal features which are encoded separately in the encoding step

In [51]:
# Using X & y directly, as they don't contain missing values due to previous run
# Pipeline config
config = {
    "use_imputation": False,
    "use_outlier_capping": False,
    "use_encoding": True, 
    "use_scaling": False, 
    "use_smote": False,
    "features_to_drop": logical_feat_drop
}

# Experiment run
for encoding_option in ["ohe_only", "ohe_target", "target_only"]:
    run_name = f"rf_drop_nan_{encoding_option}_encoding"
    if encoding_option == "ohe_only":
        supervised_feat = None
        ohe_feat = ["verification_status", "application_type", "initial_list_status", "home_ownership", "purpose", "address"]
    elif encoding_option == "ohe_target":
        supervised_feat = ["purpose", "address"]
        ohe_feat = ["verification_status", "application_type", "initial_list_status", "home_ownership"]
    else:
        supervised_feat = ["verification_status", "application_type", "initial_list_status", "home_ownership", "purpose", "address"]
        ohe_feat = None
    run_cross_validation(
        run_name=run_name,
        pipeline_config=config,
        model=rf_model,
        cv=cv,
        X=X,
        y=y
    )

----------rf_drop_nan_ohe_only_encoding----------
                           Train           Test
precision          0.465 ± 0.003  0.462 ± 0.003
recall             0.797 ± 0.003  0.793 ± 0.004
f1                 0.587 ± 0.001  0.583 ± 0.002
average_precision  0.759 ± 0.001  0.754 ± 0.003
----------rf_drop_nan_ohe_target_encoding----------
                           Train           Test
precision          0.493 ± 0.001  0.487 ± 0.001
recall             0.832 ± 0.001  0.823 ± 0.003
f1                 0.619 ± 0.001  0.612 ± 0.002
average_precision  0.790 ± 0.001  0.779 ± 0.003
----------rf_drop_nan_target_only_encoding----------
                           Train           Test
precision          0.494 ± 0.002  0.488 ± 0.001
recall             0.831 ± 0.001  0.823 ± 0.004
f1                 0.620 ± 0.002  0.613 ± 0.001
average_precision  0.791 ± 0.001  0.780 ± 0.002


## Keeping vs Capping Outliers run
From the previous run, we know that One-hot only encoding gave the worst results of the 3 runs, while the One-hot + Target & Target only strategies gave similar results. Given the comparatively poor performance whenever One-hot encoding is involved, we'll choose Target only encoding for future runs. [This](https://community.deeplearning.ai/t/isnt-it-a-bad-idea-to-use-one-hot-encode-for-decision-tree-models/165559/3) might be the reason for the bad results with One-hot encoding

Since the Random-Forest algorithm is mostly robust to outliers, their effect is reduced. That doesn't mean outliers can't affect Random-Forest (we see the effect when the model is overfitting), which is why we test it in this run. 

Note that, unlike the Logistic regression runs, no scaling is applied here (`use_scaling` is `False`), so `RobustScaler` plays no role in this run.

In [53]:
# Pipeline config
supervised_feat = ["verification_status", "application_type", "initial_list_status", "home_ownership", "purpose", "address"]
ohe_feat = None
config = {
    "use_imputation": False,
    "use_encoding": True, 
    "use_scaling": False, 
    "use_smote": False,
    "features_to_drop": logical_feat_drop
}

# Experiment run
for cap_outliers in [True, False]:
    config["use_outlier_capping"] = cap_outliers
    if cap_outliers:
        run_name = f"rf_drop_nan_target_encoding_w_outlier_capping"  
    else:
        run_name = f"rf_drop_nan_target_encoding_wo_outlier_capping"
    run_cross_validation(
        run_name=run_name,
        pipeline_config=config,
        model=rf_model,
        cv=cv,
        X=X,
        y=y
    )

----------rf_drop_nan_target_encoding_w_outlier_capping----------
                           Train           Test
precision          0.494 ± 0.003  0.489 ± 0.002
recall             0.831 ± 0.002  0.822 ± 0.004
f1                 0.620 ± 0.002  0.613 ± 0.001
average_precision  0.791 ± 0.000  0.781 ± 0.003
----------rf_drop_nan_target_encoding_wo_outlier_capping----------
                           Train           Test
precision          0.495 ± 0.003  0.489 ± 0.002
recall             0.831 ± 0.002  0.822 ± 0.004
f1                 0.620 ± 0.002  0.613 ± 0.001
average_precision  0.790 ± 0.001  0.780 ± 0.002


## Insights

From the experiments with Random Forest, we observed:
* Dropping the missing values performed slightly better than imputing them or keeping them as is.
* Using only Target encoding for all categorical features (except ordinal) was better than One-hot only encoding and on par with One-hot + Target encoding.
* Outlier capping gave a marginally better PR-AUC (0.781 vs 0.780); the other metrics were unchanged.
* Random Forest with the chosen parameters didn't overfit, but surprisingly its performance didn't even match that of Logistic-regression. It improved only the Recall, by 3.1 percentage points, but at a very high cost in Precision (<50%), while F1 & PR-AUC dropped by 1.6 & roughly 0.2-0.3 percentage points respectively

# Experiment-3: XgBoost

In this section we'll experiment with a Gradient Boosting model i.e. XgBoost (Extreme Gradient Boosting)

For the XgBoost model we'll be performing the same sub-experiment runs as for Random-Forest i.e. Impute or not, Encoding strategies, Handle outliers or not.

The XgBoost model is robust enough to handle missing values, categorical features without encoding & outliers, but we want to check whether handling these steps ourselves, rather than leaving them to the model, improves or degrades performance

## Keep vs Impute Missing values run

For all the upcoming experiments with XGBClassifier, we'll use the following parameters to prevent the model from overfitting the data & hence gauge its actual performance on unseen data:
* set a higher value of `n_estimators` (default=100) i.e. 200
* set a smaller value of `learning_rate` (default=0.3) i.e. 0.1
* set a smaller value of `max_depth` (default=6) i.e. 4
* set `subsample` & `colsample_bytree` (both default=1, the maximum) to 0.8, i.e. use row-sampling & column-sampling respectively to introduce randomness

In [54]:
# Resetting X & y to contain missing values (i.e. original data)
X, y = train_df_copy[independent_feat], train_df_copy[target_feat]

# Encoding the target feature
y = y.map({"Charged Off": 1, "Fully Paid": 0})

In [56]:
# Instantiating the model
logical_feat_drop = date_feat
neg_pos_sample_ratio = (y==0).sum() / (y==1).sum()

xgb_model = XGBClassifier(
    n_estimators = 200,
    max_depth = 4,
    learning_rate = 0.1,
    subsample = 0.8, 
    colsample_bytree = 0.8,
    scale_pos_weight = neg_pos_sample_ratio,
    random_state = 42,
    n_jobs = -1,
    enable_categorical = True
)

# Pipeline config
config = {
    "use_outlier_capping": False,
    "use_encoding": False, 
    "use_scaling": False, 
    "use_smote": False,
    "features_to_drop": logical_feat_drop,
    "convert_cat_dtype": True
}

# Experiment run
for option in ["impute", "keep", "drop"]:
    if option == "impute":
        run_name = f"xgb_impute_nan"  
        config["use_imputation"] = True
    elif option == "keep":
        run_name = f"xgb_keep_nan"
        config["use_imputation"] = False
    else:
        run_name = f"xgb_drop_nan"
        config["use_imputation"] = False
        
        # Dropping missing values
        nan_rows = X.isna().any(axis=1)
        X, y = X.loc[~nan_rows], y.loc[~nan_rows]
    run_cross_validation(
        run_name=run_name,
        pipeline_config=config,
        model=xgb_model,
        cv=cv,
        X=X,
        y=y
    )

----------xgb_impute_nan----------
                           Train           Test
precision          0.513 ± 0.001  0.504 ± 0.003
recall             0.821 ± 0.001  0.808 ± 0.002
f1                 0.631 ± 0.001  0.621 ± 0.003
average_precision  0.795 ± 0.001  0.784 ± 0.003
----------xgb_keep_nan----------
                           Train           Test
precision          0.514 ± 0.001  0.505 ± 0.003
recall             0.822 ± 0.001  0.809 ± 0.003
f1                 0.632 ± 0.001  0.622 ± 0.003
average_precision  0.796 ± 0.001  0.785 ± 0.003
----------xgb_drop_nan----------
                           Train           Test
precision          0.516 ± 0.002  0.507 ± 0.001
recall             0.826 ± 0.001  0.812 ± 0.003
f1                 0.635 ± 0.001  0.624 ± 0.001
average_precision  0.799 ± 0.001  0.786 ± 0.002


## Encoding experiment run
From the previous run, we know that dropping missing values gave slightly better performance. Hence we'll drop missing values for future runs.
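Because the gaps between runs are small, it helps to compare them against the fold-to-fold spread. A rough sketch using the test F1 values reported for the XGBoost `drop_nan` and `keep_nan` runs; the helper `gap_in_stds` is hypothetical, and this is a sanity check rather than a formal significance test:

```python
import math

def gap_in_stds(mean_a, std_a, mean_b, std_b):
    """Difference of two means expressed in units of their combined std."""
    combined = math.sqrt(std_a ** 2 + std_b ** 2)
    return (mean_a - mean_b) / combined

# Test F1: xgb_drop_nan (0.624 ± 0.001) vs xgb_keep_nan (0.622 ± 0.003)
ratio = gap_in_stds(0.624, 0.001, 0.622, 0.003)
print(round(ratio, 2))  # roughly 0.63
```

A ratio below 1 means the gap is smaller than the combined spread across folds, so the choice to drop missing values should be read as a mild preference rather than a clear win.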

For this run we'll try using:
* Only target encoding for all the categorical features
* No encoding of any categorical feature, letting the model handle them

The reasoning for this choice is that One-hot encoding doesn't usually perform well with Decision trees (as also seen in the Random Forest runs), so we choose encoding strategies that avoid it.
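To illustrate what target encoding does, here is a simplified sketch on toy data. It is not the encoder used inside the preprocessing pipeline (which isn't shown here); the function name `fit_target_encoding` and the data are made up for illustration:

```python
from collections import defaultdict

def fit_target_encoding(categories, targets):
    """Map each category to the mean of the binary target observed for it."""
    sums, counts = defaultdict(float), defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1
    return {cat: sums[cat] / counts[cat] for cat in counts}

# Toy data (made up): home_ownership values and default labels
home_ownership = ["RENT", "OWN", "RENT", "MORTGAGE", "RENT", "OWN"]
charged_off = [1, 0, 0, 0, 1, 0]

mapping = fit_target_encoding(home_ownership, charged_off)

# Each category becomes a single number, so the feature stays one column
encoded = [mapping[cat] for cat in home_ownership]
print(mapping)
```

The feature stays a single numeric column, whereas one-hot encoding would create one sparse column per category, which trees tend to split on inefficiently. In practice the mapping should be fitted on the training fold only, to avoid leaking the target into validation data.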

In [58]:
# Using X & y directly, as they don't contain missing values due to previous run
# Pipeline config
config = {
    "use_imputation": False,
    "use_outlier_capping": False,
    "use_scaling": False, 
    "use_smote": False,
    "features_to_drop": logical_feat_drop,
}

# Experiment run
for encode in [True, False]:
    config["use_encoding"] = encode
    if encode:
        run_name = f"xgb_drop_nan_target_encoding"
        config["convert_cat_dtype"] = False
        supervised_feat = ["verification_status", "application_type", "initial_list_status", "home_ownership", "purpose", "address"]
        ohe_feat = None
    else:
        run_name = f"xgb_drop_nan_no_encoding"
        config["convert_cat_dtype"] = True
    run_cross_validation(
        run_name=run_name,
        pipeline_config=config,
        model=xgb_model,
        cv=cv,
        X=X,
        y=y
    )

----------xgb_drop_nan_target_encoding----------
                           Train           Test
precision          0.514 ± 0.002  0.506 ± 0.001
recall             0.825 ± 0.001  0.813 ± 0.003
f1                 0.633 ± 0.001  0.624 ± 0.002
average_precision  0.796 ± 0.001  0.787 ± 0.003
----------xgb_drop_nan_no_encoding----------
                           Train           Test
precision          0.516 ± 0.002  0.507 ± 0.001
recall             0.826 ± 0.001  0.812 ± 0.003
f1                 0.635 ± 0.001  0.624 ± 0.001
average_precision  0.799 ± 0.001  0.786 ± 0.002


## Keeping vs Capping Outliers run
From the previous run, we know that Target-only encoding gave marginally better test Recall and PR-AUC, so we'll use it as the encoding strategy.

XGBoost is mostly robust to outliers, so their effect is reduced. That doesn't mean outliers never affect XGBoost (the effect shows up when the model is overfitting), which is why we test it in this run.
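For reference, a minimal sketch of one common way to cap outliers: clipping to the IQR fences. This is an assumption for illustration; the capping rule the pipeline applies when `use_outlier_capping` is enabled isn't shown in this section and may differ. The function name and data are hypothetical:

```python
import statistics

def cap_outliers_iqr(values, k=1.5):
    """Clip values to the range [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [min(max(v, lower), upper) for v in values]

# Made-up annual incomes with one extreme value
incomes = [30000, 40000, 45000, 52000, 60000, 80000, 1_000_000]
capped = cap_outliers_iqr(incomes)
print(max(capped) < max(incomes))  # True
```

Only the extreme value is pulled in to the upper fence; the remaining values are unchanged. Because tree splits depend on the ordering of values rather than their magnitude, capping like this often changes little for boosted trees.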

In [60]:
# Pipeline config
supervised_feat = ["verification_status", "application_type", "initial_list_status", "home_ownership", "purpose", "address"]
ohe_feat = None
config = {
    "use_imputation": False,
    "use_encoding": True,
    "use_scaling": False, 
    "use_smote": False,
    "features_to_drop": logical_feat_drop,
    "convert_cat_dtype": False
}

# Experiment run
for cap_outliers in [True, False]:
    config["use_outlier_capping"] = cap_outliers
    if cap_outliers:
        run_name = f"xgb_drop_nan_target_encoding_w_outlier_capping"  
    else:
        run_name = f"xgb_drop_nan_target_encoding_wo_outlier_capping"
    run_cross_validation(
        run_name=run_name,
        pipeline_config=config,
        model=xgb_model,
        cv=cv,
        X=X,
        y=y
    )

----------xgb_drop_nan_target_encoding_w_outlier_capping----------
                           Train           Test
precision          0.513 ± 0.002  0.506 ± 0.002
recall             0.825 ± 0.002  0.814 ± 0.003
f1                 0.633 ± 0.001  0.624 ± 0.002
average_precision  0.796 ± 0.001  0.787 ± 0.003
----------xgb_drop_nan_target_encoding_wo_outlier_capping----------
                           Train           Test
precision          0.514 ± 0.002  0.506 ± 0.001
recall             0.825 ± 0.001  0.813 ± 0.003
f1                 0.633 ± 0.001  0.624 ± 0.002
average_precision  0.796 ± 0.001  0.787 ± 0.003


## Insights

From the experiments with XGBoost, we observed:
* Dropping rows with missing values performed better than imputing them or keeping them as is.
* Using only Target encoding for all categorical features (except ordinal) was slightly better than letting the model handle them without encoding.
* Outlier capping gives marginally better metrics than not capping.
* XGBoost with the chosen parameters didn't overfit and, compared to Logistic regression, improved Recall & PR-AUC by 2.3 & 0.4 percentage points respectively. The increase in Recall came at the cost of Precision & F1-score, which dropped by 1.6 & 0.5 percentage points respectively. Considering the high Recall value, it's still in contention with Logistic regression.

# Experiment-4: LightGBM

In this section we'll experiment with another Gradient Boosting model, LightGBM (Light Gradient Boosting Machine).

For the LightGBM model we'll perform the same sub-experiment runs as for XGBoost, i.e. impute or not, encoding strategies, and handle outliers or not.

LightGBM can handle missing values natively, can handle categorical features without encoding, and is fairly robust to outliers, but we want to test whether handling these steps on our end, rather than leaving them to the model, improves or degrades performance.


## Keep vs Impute Missing values run

For all the upcoming experiments with LGBMClassifier, we'll use the following parameters to keep the model from overfitting the data and hence gauge its actual performance on unseen data:
* set a higher value of `n_estimators` (default=100), i.e. 200
* set a smaller value of `max_depth` (default=-1, i.e. no limit), i.e. 5
* set a smaller value of `max_leaf_nodes` (an alias of `num_leaves`, default=31), i.e. 20
* set a higher value of `min_samples_leaf` (an alias of `min_child_samples`, default=20), i.e. 30
* set a smaller value of `learning_rate` (default=0.1), i.e. 0.05
* set a value slightly below 1 (the maximum), i.e. 0.8, for `subsample` & `colsample_bytree` (both default=1), which enable row-sampling & column-sampling respectively to introduce randomness
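The depth limit and the leaf limit interact: a binary tree of depth `d` can have at most `2**d` leaves, so whichever limit is tighter determines the tree size. A quick arithmetic check for the values chosen above:

```python
max_depth = 5
leaf_cap = 20  # value passed as max_leaf_nodes

# A binary tree of depth d has at most 2**d leaves
max_leaves_from_depth = 2 ** max_depth

# The tighter of the two limits is the one that actually constrains each tree
effective_leaves = min(max_leaves_from_depth, leaf_cap)
print(max_leaves_from_depth, effective_leaves)  # 32 20
```

Here the leaf cap (20) is tighter than what the depth alone would allow (32), so it is the leaf cap that limits each tree's complexity.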

Similar to XGBoost, if we want LightGBM to handle the categorical features itself, we just convert their datatype to __category__. It then automatically treats such features as categorical, as per this [Kaggle post](https://www.kaggle.com/discussions/getting-started/203471).
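A minimal sketch of that dtype conversion on a toy frame. The column list is written out by hand here; how the pipeline's `convert_cat_dtype` option selects columns isn't shown in this section, so treat this as an assumption about the general idea only:

```python
import pandas as pd

# Toy frame (made up) with two string columns and one numeric column
df = pd.DataFrame({
    "grade": ["A", "B", "A"],
    "home_ownership": ["RENT", "OWN", "RENT"],
    "loan_amnt": [1000.0, 2500.0, 4000.0],
})

# Convert the categorical columns to pandas' category dtype
cat_cols = ["grade", "home_ownership"]
df[cat_cols] = df[cat_cols].astype("category")
print(df.dtypes)
```

Numeric columns are left untouched; only the listed columns change dtype.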

In [63]:
# Resetting X & y to contain missing values (i.e. original data)
X, y = train_df_copy[independent_feat], train_df_copy[target_feat]

# Encoding the target feature
y = y.map({"Charged Off": 1, "Fully Paid": 0})

In [66]:
# Instantiating the model
logical_feat_drop = date_feat

lgbm_model = LGBMClassifier(
    n_estimators = 200,
    max_depth = 5,
    max_leaf_nodes=20,
    min_samples_leaf=30,
    learning_rate = 0.05,
    subsample = 0.8, 
    colsample_bytree = 0.8,
    subsample_freq = 1, # perform bagging (row-sampling) at every iteration
    is_unbalance = True, # for imbalanced dataset
    random_state = 42,
    n_jobs = -1,
)

# Pipeline config
config = {
    "use_outlier_capping": False,
    "use_encoding": False, 
    "use_scaling": False, 
    "use_smote": False,
    "features_to_drop": logical_feat_drop,
    "convert_cat_dtype": True
}

# Experiment run
for option in ["impute", "keep", "drop"]:
    if option == "impute":
        run_name = f"lgbm_impute_nan"  
        config["use_imputation"] = True
    elif option == "keep":
        run_name = f"lgbm_keep_nan"  
        config["use_imputation"] = False
    else:
        run_name = f"lgbm_drop_nan"
        config["use_imputation"] = False

        # Dropping missing values
        nan_rows = X.isna().any(axis=1)
        X, y = X.loc[~nan_rows], y.loc[~nan_rows]
    run_cross_validation(
        run_name=run_name,
        pipeline_config=config,
        model=lgbm_model,
        cv=cv,
        X=X,
        y=y
    )

----------lgbm_impute_nan----------
                           Train           Test
precision          0.504 ± 0.001  0.499 ± 0.002
recall             0.820 ± 0.000  0.811 ± 0.002
f1                 0.625 ± 0.001  0.618 ± 0.002
average_precision  0.791 ± 0.001  0.783 ± 0.003
----------lgbm_keep_nan----------
                           Train           Test
precision          0.506 ± 0.002  0.500 ± 0.002
recall             0.820 ± 0.001  0.811 ± 0.002
f1                 0.626 ± 0.001  0.619 ± 0.002
average_precision  0.792 ± 0.001  0.784 ± 0.003
----------lgbm_drop_nan----------
                           Train           Test
precision          0.509 ± 0.001  0.503 ± 0.001
recall             0.822 ± 0.001  0.812 ± 0.003
f1                 0.629 ± 0.001  0.622 ± 0.001
average_precision  0.794 ± 0.001  0.785 ± 0.002


## Encoding experiment run
From the previous run, we know that dropping missing values provided better performance (slightly). Hence we'll drop missing values for future runs.

For this run we'll try using:
* Only target encoding for all the categorical features
* No encoding of any categorical feature, letting the model handle them

The reasoning for this choice is that One-hot encoding doesn't usually perform well with Decision trees (as also seen in the Random Forest runs), so we choose encoding strategies that avoid it.

In [67]:
# Using X & y directly, as they don't contain missing values due to previous run
# Pipeline config
config = {
    "use_imputation": False,
    "use_outlier_capping": False,
    "use_scaling": False, 
    "use_smote": False,
    "features_to_drop": logical_feat_drop,
}

# Experiment run
for encode in [True, False]:
    config["use_encoding"] = encode
    if encode:
        run_name = f"lgbm_drop_nan_target_encoding"
        config["convert_cat_dtype"] = False
        supervised_feat = ["verification_status", "application_type", "initial_list_status", "home_ownership", "purpose", "address"]
        ohe_feat = None
    else:
        run_name = f"lgbm_drop_nan_no_encoding"
        config["convert_cat_dtype"] = True
    run_cross_validation(
        run_name=run_name,
        pipeline_config=config,
        model=lgbm_model,
        cv=cv,
        X=X,
        y=y
    )

----------lgbm_drop_nan_target_encoding----------
                           Train           Test
precision          0.507 ± 0.001  0.503 ± 0.001
recall             0.821 ± 0.001  0.815 ± 0.003
f1                 0.627 ± 0.001  0.622 ± 0.002
average_precision  0.792 ± 0.001  0.786 ± 0.003
----------lgbm_drop_nan_no_encoding----------
                           Train           Test
precision          0.509 ± 0.001  0.503 ± 0.001
recall             0.822 ± 0.001  0.812 ± 0.003
f1                 0.629 ± 0.001  0.622 ± 0.001
average_precision  0.794 ± 0.001  0.785 ± 0.002


## Keeping vs Capping Outliers run
From the previous run, we know that target encoding gave marginally better test Recall and PR-AUC, so we'll use it as the encoding strategy.

LightGBM is mostly robust to outliers, so their effect is reduced. That doesn't mean outliers never affect LightGBM (the effect shows up when the model is overfitting), which is why we test it in this run.

In [69]:
# Pipeline config
supervised_feat = ["verification_status", "application_type", "initial_list_status", "home_ownership", "purpose", "address"]
ohe_feat = None
config = {
    "use_imputation": False,
    "use_encoding": True,
    "use_scaling": False, 
    "use_smote": False,
    "features_to_drop": logical_feat_drop,
    "convert_cat_dtype": False
}

# Experiment run
for cap_outliers in [True, False]:
    config["use_outlier_capping"] = cap_outliers
    if cap_outliers:
        run_name = f"lgbm_drop_nan_target_encoding_w_outlier_capping"  
    else:
        run_name = f"lgbm_drop_nan_target_encoding_wo_outlier_capping"
    run_cross_validation(
        run_name=run_name,
        pipeline_config=config,
        model=lgbm_model,
        cv=cv,
        X=X,
        y=y
    )

----------lgbm_drop_nan_target_encoding_w_outlier_capping----------
                           Train           Test
precision          0.507 ± 0.001  0.503 ± 0.001
recall             0.821 ± 0.001  0.815 ± 0.003
f1                 0.627 ± 0.001  0.622 ± 0.001
average_precision  0.792 ± 0.000  0.786 ± 0.003
----------lgbm_drop_nan_target_encoding_wo_outlier_capping----------
                           Train           Test
precision          0.507 ± 0.001  0.503 ± 0.001
recall             0.821 ± 0.001  0.815 ± 0.003
f1                 0.627 ± 0.001  0.622 ± 0.002
average_precision  0.792 ± 0.001  0.786 ± 0.003


## Insights

From the experiments with LightGBM, we observed:
* Dropping rows with missing values performed better than imputing them or keeping them as is.
* Using target encoding for all categorical features (except ordinal) was slightly better than letting the model handle them without encoding.
* Outlier capping doesn't provide any improvement over not capping.
* LightGBM with the chosen parameters didn't overfit and, compared to Logistic regression, improved Recall & PR-AUC by 2.4 & 0.3 percentage points respectively. The increase in Recall came at the cost of Precision & F1-score, which dropped by 1.9 & 0.7 percentage points respectively. Its performance is very similar to XGBoost's, but XGBoost has a slightly higher F1-score (though not by much).

# Final verdict

Below are the best runs (validation/test set performances) for the models trained until now i.e. `Logistic regression, Random Forest, Xgboost & LightGBM`:

| Model | Precision | Recall | F1-score | PR-AUC (Average Precision) |
|:---------:|:--------:|:---------:|:---------:|:---------:|
|  Logistic regression   |  0.522 ± 0.003   |  0.791 ± 0.004   |  0.629 ± 0.003   |  0.783 ± 0.002   |
|  Random forest   |  0.489 ± 0.002   |  0.822 ± 0.004   |  0.613 ± 0.001   |  0.781 ± 0.003   |
|  XGBoost   |  0.506 ± 0.002   |  0.814 ± 0.003   |  0.624 ± 0.002   |  0.787 ± 0.003   |
|  LightGBM   |  0.503 ± 0.001   |  0.815 ± 0.003   |  0.622 ± 0.002   |  0.786 ± 0.003   |
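The comparisons quoted in the Insights sections are differences against Logistic regression, in percentage points. A small sketch that recomputes them from the mean values in the table above (standard deviations omitted):

```python
# Mean validation metrics copied from the table above
results = {
    "Logistic regression": {"precision": 0.522, "recall": 0.791, "f1": 0.629, "pr_auc": 0.783},
    "Random forest": {"precision": 0.489, "recall": 0.822, "f1": 0.613, "pr_auc": 0.781},
    "XGBoost": {"precision": 0.506, "recall": 0.814, "f1": 0.624, "pr_auc": 0.787},
    "LightGBM": {"precision": 0.503, "recall": 0.815, "f1": 0.622, "pr_auc": 0.786},
}

baseline = results["Logistic regression"]

# Difference to the baseline in percentage points, rounded to one decimal
deltas = {
    model: {m: round((v - baseline[m]) * 100, 1) for m, v in metrics.items()}
    for model, metrics in results.items()
    if model != "Logistic regression"
}

for model, d in deltas.items():
    print(model, d)
```

Every tree-based model trades Precision for Recall relative to the baseline, and only the two boosting models come out ahead on PR-AUC.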

__Occam's razor: The simplest model with comparable performance is the best__

Following Occam's razor, ___Logistic regression___ and ___XGBoost___ are the two best models if we have to select a top-2. So next we'll perform hyperparameter tuning only for these two models, optimize them, and then perform ensemble prediction using them.

The fact that Logistic regression performs on par with tree-based models like XGBoost suggests that this financial data might have largely linear/monotonic relationships between the features and the target, which Logistic regression captures well. XGBoost has better Recall, but its lower Precision brings down the F1-score, so it is unable to improve the metrics drastically.
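To see why lower Precision pulls the F1-score down despite higher Recall, recall that F1 is the harmonic mean of the two. A short check using the mean Precision and Recall values from the table (the helper name `f1_from` is made up here; F1 averaged over folds need not exactly equal F1 of the averaged Precision and Recall, but it is close for these runs):

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

lr_f1 = f1_from(0.522, 0.791)   # Logistic regression
xgb_f1 = f1_from(0.506, 0.814)  # XGBoost

print(round(lr_f1, 3))   # 0.629
print(round(xgb_f1, 3))  # 0.624
```

The harmonic mean is dominated by the smaller of the two values, so XGBoost's gain in Recall does not compensate for its drop in Precision.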
