<a href="https://www.kaggle.com/code/aniruddhapa/loan-default-prediction-model?scriptVersionId=188538567" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Developing a Stable and Reliable Loan Default Prediction Model for Enhanced Financial Inclusion

# Rationale

Consumer finance providers face significant challenges in accurately predicting loan default risk, especially for clients with little or no credit history. Traditional methods often fail to address the dynamic nature of client behavior, leading to unstable and frequently updated scorecards. A stable and reliable model is essential for making informed lending decisions that balance risk and accessibility.

# Introduction

This notebook aims to develop a predictive model to determine which clients are more likely to default on their loans. The goal is to provide a stable and reliable solution that remains effective over time, thus aiding consumer finance providers in making better lending decisions. This project is part of a competition hosted by Home Credit, an international consumer finance provider known for responsible lending practices and financial inclusion efforts.

The competition runs from February 5, 2024, to May 28, 2024, and focuses on using data science to improve the prediction of loan default risk. 

A key evaluation criterion is the Gini stability metric, which measures both the model's predictive performance and how stable that performance remains over time.
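To make the metric concrete, here is a minimal sketch of how a Gini stability score could be computed. It assumes the definition recalled from the competition description: a weekly Gini of `2 * AUC - 1`, a linear fit through the weekly values, and a final score of `mean(gini) + 88.0 * min(0, slope) - 0.5 * std(residuals)`. The weights, the function name `gini_stability`, and the column names are assumptions and should be checked against the official evaluation page.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score


def gini_stability(df, w_fallingrate=88.0, w_resstd=-0.5):
    """Sketch of a Gini stability score; weights are assumed, not official."""
    # Weekly Gini = 2 * AUC - 1 (each week needs both classes present).
    ginis = []
    for _, grp in df.sort_values("WEEK_NUM").groupby("WEEK_NUM"):
        ginis.append(2.0 * roc_auc_score(grp["target"], grp["score"]) - 1.0)
    ginis = np.asarray(ginis)

    # Fit a straight line through the weekly Gini values.
    x = np.arange(len(ginis))
    slope, intercept = np.polyfit(x, ginis, 1)
    residuals = ginis - (slope * x + intercept)

    # Penalize a falling trend (negative slope) and week-to-week variability.
    return ginis.mean() + w_fallingrate * min(0.0, slope) + w_resstd * residuals.std()


# Toy data: three weeks, each perfectly separated by the score.
demo = pd.DataFrame({
    "WEEK_NUM": [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2],
    "target":   [0, 0, 1, 1] * 3,
    "score":    [0.1, 0.2, 0.8, 0.9] * 3,
})
stability = gini_stability(demo)
print(round(stability, 6))
```

With a perfect and constant weekly Gini, both penalty terms vanish and the score equals the mean Gini.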

# Objective

The primary objective of this project is to build a predictive model that accurately assesses the likelihood of loan default while maintaining stability over time. The model should:

1. Utilize a robust data preprocessing and feature engineering pipeline.
2. Employ advanced machine learning techniques to maximize predictive accuracy.
3. Ensure stability in predictions across different time periods to minimize the need for frequent model updates.

# Deliverables

**1. Data Preprocessing and Feature Engineering:**

* Read and concatenate multiple datasets.
* Set appropriate data types for columns.
* Handle date features and filter out irrelevant or low-quality columns.
* Engineer additional features to enrich the dataset.

**2. Model Training and Evaluation:**

* Implement a cross-validation strategy using StratifiedGroupKFold to ensure stable performance evaluation.
* Train a LightGBM classifier and utilize early stopping and logging callbacks.
* Aggregate predictions from multiple cross-validation folds using a custom VotingModel.

**3. Prediction and Submission:**

* Prepare the test dataset by setting the correct data types and indices.
* Use the trained model to predict probabilities for the test set.
* Create a submission file with case_id and predicted scores.

**4. Model Stability Assessment:**

* Evaluate the model using the Gini stability metric.
* Ensure the model's predictions are stable over different weeks to avoid performance drop-offs.


By achieving these deliverables, the notebook will contribute to developing a reliable and stable predictive model that can help consumer finance providers make better lending decisions and potentially improve financial inclusion for individuals with limited credit history.

# Import Libraries

In [1]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
import os
import gc
import numpy as np
import pandas as pd
import polars as pl
print(pl.__version__)
from glob import glob
from pathlib import Path
from datetime import datetime
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import TimeSeriesSplit, GroupKFold, StratifiedGroupKFold
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.metrics import roc_auc_score
import lightgbm as lgb



0.20.5


In [2]:
ROOT            = Path("/kaggle/input/home-credit-credit-risk-model-stability")
# ROOT            = Path("./input")

TRAIN_DIR       = ROOT / "csv_files" / "train"
TEST_DIR        = ROOT / "csv_files" / "test"

# Data Pre-Processing

# Pipeline Class for Changing the Data Types

In [3]:
import polars as pl
from glob import glob

class Pipeline:
    
    '''The @staticmethod decorator indicates that a method does not depend on an instance of the 
    class and can be called on the class itself.
    set_table_dtypes standardizes the data types of the columns in a Polars DataFrame (df). 
    Each column is cast according to its name: identifier columns to Int64, date_decision and 
    columns ending in "D" to Date, columns ending in "P" or "A" to Float64, and columns ending 
    in "M" to String. All other columns are left unchanged.'''
    
    @staticmethod 
    def set_table_dtypes(df):
        for col in df.columns:
            if col in ["case_id", "WEEK_NUM", "num_group1", "num_group2"]:
                df = df.with_columns(pl.col(col).cast(pl.Int64))
            elif col in ["date_decision"]:
                df = df.with_columns(pl.col(col).cast(pl.Date))
            elif col[-1] in ("P", "A"):
                df = df.with_columns(pl.col(col).cast(pl.Float64))
            elif col[-1] in ("M",):
                df = df.with_columns(pl.col(col).cast(pl.String))
            elif col[-1] in ("D",):
                df = df.with_columns(pl.col(col).cast(pl.Date))
        return df
    
    '''The handle_dates function processes date columns in a Polars DataFrame (df). It replaces 
    each column whose name ends with "D" with the number of days between its dates and the 
    reference date column (date_decision), then drops the date_decision and MONTH columns.'''
       
    @staticmethod
    def handle_dates(df):
        for col in df.columns:
            if col[-1] in ("D",):
                df = df.with_columns(pl.col(col) - pl.col("date_decision"))
                df = df.with_columns(pl.col(col).dt.total_days())
        df = df.drop("date_decision", "MONTH")
        return df

    '''The filter_cols function cleans and filters the columns of a Polars DataFrame (df). 
    It performs two main tasks: removing columns in which more than 95% of the values are missing, 
    and removing string (categorical) columns that have a single unique value or more than 200 
    unique values.'''
    
    @staticmethod
    def filter_cols(df):
        for col in df.columns:
            if col not in ["target", "case_id", "WEEK_NUM"]:
                isnull = df[col].is_null().mean()

                if isnull > 0.95:
                    df = df.drop(col)

        for col in df.columns:
            if (col not in ["target", "case_id", "WEEK_NUM"]) and (df[col].dtype == pl.String):
                freq = df[col].n_unique()

                if (freq == 1) or (freq > 200):
                    df = df.drop(col)

        return df


# Aggregator Class to summarize/transform the data by grouping and aggregating the values

In [4]:
class Aggregator:
    
    '''Generates aggregation expressions for numeric columns (identified by suffixes "P" or "A"). 
    It creates expressions to find the maximum value of these columns.'''
    
    @staticmethod
    def num_expr(df):
        cols = [col for col in df.columns if col[-1] in ("P", "A")]
        expr_max = [pl.max(col).alias(f"max_{col}") for col in cols]
        return expr_max

    '''Generates aggregation expressions for date columns (identified by suffix "D"). 
    It creates expressions to find the maximum value of these columns.'''
    
    @staticmethod
    def date_expr(df):
        cols = [col for col in df.columns if col[-1] in ("D",)]
        expr_max = [pl.max(col).alias(f"max_{col}") for col in cols]
        return expr_max

    '''Generates aggregation expressions for string columns (identified by suffix "M"). 
    It creates expressions to find the maximum value of these columns, which for strings is 
    the lexicographically greatest value.'''
    
    @staticmethod
    def str_expr(df):
        cols = [col for col in df.columns if col[-1] in ("M",)]
        expr_max = [pl.max(col).alias(f"max_{col}") for col in cols]
        return expr_max

    '''Generates aggregation expressions for other types of columns (identified by suffixes "T" or "L"). 
    It creates expressions to find the maximum value of these columns.'''
    
    @staticmethod
    def other_expr(df):
        cols = [col for col in df.columns if col[-1] in ("T", "L")]
        expr_max = [pl.max(col).alias(f"max_{col}") for col in cols]
        return expr_max

    '''Generates aggregation expressions for columns related to grouping 
    (columns containing "num_group"). It creates expressions to find the maximum value of these columns.'''
    
    @staticmethod
    def count_expr(df):
        cols = [col for col in df.columns if "num_group" in col]
        expr_max = [pl.max(col).alias(f"max_{col}") for col in cols]
        return expr_max

    '''Combines all the aggregation expressions from the other methods into a single list of expressions.'''
    
    @staticmethod
    def get_exprs(df):
        exprs = Aggregator.num_expr(df) + \
                Aggregator.date_expr(df) + \
                Aggregator.str_expr(df) + \
                Aggregator.other_expr(df) + \
                Aggregator.count_expr(df)
        return exprs


# read_file function to read a single .csv file

In [5]:
def read_file(path, depth=None):
    df = pl.read_csv(path)
    df = df.pipe(Pipeline.set_table_dtypes)

    if depth in [1, 2]:
        df = df.group_by("case_id").agg(Aggregator.get_exprs(df))
    return df

# read_files function to read multiple .csv files

In [6]:
def read_files(regex_path, depth=None):
    chunks = []
    for path in glob(str(regex_path)):
        chunks.append(pl.read_csv(path).pipe(Pipeline.set_table_dtypes))
    df = pl.concat(chunks, how="vertical_relaxed")
    if depth in [1, 2]:
        df = df.group_by("case_id").agg(Aggregator.get_exprs(df))
    return df

# Feature Engineering

### Feature Engineering methods for creating new features and merging dataframes

In [7]:
'''This feature_eng function is designed to perform feature engineering on a base DataFrame (df_base) along with additional DataFrames (depth_0, depth_1, and depth_2). Here's a breakdown of what it does:

Date Features Addition:

It adds two new columns to df_base: month_decision and weekday_decision.
These columns are derived from the date_decision column, capturing the month and weekday information.

Joining Additional DataFrames:

It iterates over the lists depth_0, depth_1, and depth_2, which contain additional DataFrames to be joined with df_base.
For each DataFrame in these lists, it performs a left join with df_base on the case_id column.
It adds a suffix to the column names of the joined DataFrames to differentiate them from existing columns in df_base.

Date Handling:

After all joins are performed, it applies the static method Pipeline.handle_dates, defined above,
to handle date-related columns in the resulting DataFrame.

Return Resulting DataFrame:

The resulting DataFrame, after all feature engineering steps, is returned.'''


def feature_eng(df_base, depth_0, depth_1, depth_2):
    df_base = (
        df_base
        .with_columns(
            month_decision = pl.col("date_decision").dt.month(),
            weekday_decision = pl.col("date_decision").dt.weekday(),
        )
    )
    for i, df in enumerate(depth_0 + depth_1 + depth_2):
        df_base = df_base.join(df, how="left", on="case_id", suffix=f"_{i}")
    df_base = df_base.pipe(Pipeline.handle_dates)
    return df_base


'''The to_pandas function converts a Polars DataFrame to a Pandas DataFrame. If cat_cols is not 
provided, it treats all columns of type object as categorical. It converts these columns to the 
category data type in Pandas. Finally, it returns the converted DataFrame and the list of categorical columns.'''

def to_pandas(df_data, cat_cols=None):
    df_data = df_data.to_pandas()
    if cat_cols is None:
        cat_cols = list(df_data.select_dtypes("object").columns)
    df_data[cat_cols] = df_data[cat_cols].astype("category")
    return df_data, cat_cols

# Organizing and Loading Multiple Train and Test Datasets into data_store

In [8]:
data_store = {
    "df_base": read_file(TRAIN_DIR / "train_base.csv"),
    "depth_0": [
        read_file(TRAIN_DIR / "train_static_cb_0.csv"),
        read_files(TRAIN_DIR / "train_static_0_*.csv"),
    ],
    "depth_1": [
        read_files(TRAIN_DIR / "train_applprev_1_*.csv", 1),
        read_file(TRAIN_DIR / "train_tax_registry_a_1.csv", 1),
        read_file(TRAIN_DIR / "train_tax_registry_b_1.csv", 1),
        read_file(TRAIN_DIR / "train_tax_registry_c_1.csv", 1),
        read_file(TRAIN_DIR / "train_credit_bureau_b_1.csv", 1),
        read_file(TRAIN_DIR / "train_other_1.csv", 1),
        read_file(TRAIN_DIR / "train_person_1.csv", 1),
        read_file(TRAIN_DIR / "train_deposit_1.csv", 1),
        read_file(TRAIN_DIR / "train_debitcard_1.csv", 1),
    ],
    "depth_2": [
        read_file(TRAIN_DIR / "train_credit_bureau_b_2.csv", 2),
    ]
}

In [9]:
df_train = feature_eng(**data_store) # Unpacking the contents of the dictionary using ** Operator
print("train data shape:\t", df_train.shape)

train data shape:	 (1526659, 376)


In [10]:
data_store = {
    "df_base": read_file(TEST_DIR / "test_base.csv"),
    "depth_0": [
        read_file(TEST_DIR / "test_static_cb_0.csv"),
        read_files(TEST_DIR / "test_static_0_*.csv"),
    ],
    "depth_1": [
        read_files(TEST_DIR / "test_applprev_1_*.csv", 1),
        read_file(TEST_DIR / "test_tax_registry_a_1.csv", 1),
        read_file(TEST_DIR / "test_tax_registry_b_1.csv", 1),
        read_file(TEST_DIR / "test_tax_registry_c_1.csv", 1),
        read_file(TEST_DIR / "test_credit_bureau_b_1.csv", 1),
        read_file(TEST_DIR / "test_other_1.csv", 1),
        read_file(TEST_DIR / "test_person_1.csv", 1),
        read_file(TEST_DIR / "test_deposit_1.csv", 1),
        read_file(TEST_DIR / "test_debitcard_1.csv", 1),
    ],
    "depth_2": [
        read_file(TEST_DIR / "test_credit_bureau_b_2.csv", 2),
    ]
}

In [11]:
df_test = feature_eng(**data_store) # Unpacking the contents of the dictionary using ** Operator
print("test data shape:\t", df_test.shape)

test data shape:	 (10, 375)


# Train and Test Dataset

In [12]:
df_train = df_train.pipe(Pipeline.filter_cols)
df_test = df_test.select([col for col in df_train.columns if col != "target"])

print("train data shape:\t", df_train.shape)
print("test data shape:\t", df_test.shape)

train data shape:	 (1526659, 260)
test data shape:	 (10, 259)


In [13]:
df_train, cat_cols = to_pandas(df_train)
df_test, cat_cols = to_pandas(df_test, cat_cols)

## del data_store

gc.collect()
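One detail worth knowing about the conversion above: `astype("category")` derives the category set from whichever values are present in each frame, so the train and test frames can end up with different integer codes for the same label. Whether this matters depends on how the downstream model handles pandas categoricals. The sketch below shows one possible safeguard, which is not part of the original notebook: reuse the training dtype for the test column. The column name `education` is hypothetical.

```python
import pandas as pd

train = pd.DataFrame({"education": ["a", "b", "c"]})
test = pd.DataFrame({"education": ["c", "a", "d"]})

# Categories learned from the training data only.
train["education"] = train["education"].astype("category")

# Converting independently yields a different category set and different codes.
independent_codes = list(test["education"].astype("category").cat.codes)

# Reusing the training dtype keeps codes consistent; unseen labels become NaN.
test["education"] = test["education"].astype(train["education"].dtype)
aligned_codes = list(test["education"].cat.codes)

print(independent_codes, aligned_codes)
```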

In [14]:
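class VotingModel(BaseEstimator, RegressorMixin):
    '''Ensemble that averages the outputs of a list of already-fitted estimators.'''

    def __init__(self, estimators):
        super().__init__()
        self.estimators = estimators

    def fit(self, X, y=None):
        # The estimators are fitted beforehand (one per CV fold), so nothing is trained here.
        return self

    def predict(self, X):
        # Element-wise mean of each estimator's predict() output.
        y_preds = [estimator.predict(X) for estimator in self.estimators]
        return np.mean(y_preds, axis=0)

    def predict_proba(self, X):
        # Element-wise mean of the class-probability matrices, one per estimator.
        y_preds = [estimator.predict_proba(X) for estimator in self.estimators]
        return np.mean(y_preds, axis=0)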
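The averaging in `predict_proba` can be checked in isolation. The sketch below uses a stand-in estimator, `ConstantProba`, that is purely hypothetical and always returns the same positive-class probability. The helper `average_proba` mirrors the averaging step rather than reproducing the class itself.

```python
import numpy as np


class ConstantProba:
    """Hypothetical stand-in for a fitted classifier."""

    def __init__(self, p):
        self.p = p

    def predict_proba(self, X):
        # One row per sample: [P(class 0), P(class 1)].
        return np.tile([1.0 - self.p, self.p], (len(X), 1))


def average_proba(estimators, X):
    # Stack the per-estimator matrices and average over the estimator axis.
    return np.mean([est.predict_proba(X) for est in estimators], axis=0)


X_demo = np.zeros((3, 2))
avg = average_proba([ConstantProba(0.2), ConstantProba(0.4)], X_demo)
print(avg[:, 1])
```

The result keeps the shape `(n_samples, 2)`, so selecting `[:, 1]` afterwards gives the averaged positive-class probability.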

In [15]:
X = df_train.drop(columns=["target", "case_id","WEEK_NUM"])
y = df_train["target"]
weeks = df_train["WEEK_NUM"]

In [16]:
'''import optuna
from sklearn.model_selection import cross_validate
from lightgbm import LGBMClassifier

def objective(trial):
    max_depth = trial.suggest_int('max_depth', 3, 30)
    n_estimators = trial.suggest_int('n_estimators', 1, 1000)
    gamma = trial.suggest_float('gamma', 0, 1)
    reg_alpha = trial.suggest_float('reg_alpha', 0, 1)
    reg_lambda = trial.suggest_float('reg_lambda', 0, 1)
    min_child_weight = trial.suggest_int('min_child_weight', 0, 10)
    subsample = trial.suggest_float('subsample', 0, 1)
    colsample_bytree = trial.suggest_float('colsample_bytree', 0, 1)
    learning_rate = trial.suggest_float('learning_rate', 0, 1)
    
#     print('Training the model with', X.shape[1], 'features')
    
#       LightGBM
    params = {'learning_rate': learning_rate,
              'n_estimators': n_estimators,
              'max_depth': max_depth,
              'lambda_l1': reg_alpha,
              'lambda_l2': reg_lambda,
              'colsample_bytree': colsample_bytree, 
              'subsample': subsample,    
              'min_child_samples': min_child_weight,
              'class_weight': 'balanced'}
    
    clf = LGBMClassifier(**params, verbose = -1, verbosity = -1)
    
    cv_results = cross_validate(clf,X,y, cv=5, scoring='accuracy')
    
    validation_score = np.mean(cv_results['test_score'])
    
    return validation_score'''


In [17]:
'''study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials= 2)'''


# Cross Validation to assess the performance of a LightGBM classifier

The following code performs cross-validation using StratifiedGroupKFold. Here's a breakdown of what it does:

1. **StratifiedGroupKFold(n_splits=5, shuffle=False):** This initializes a StratifiedGroupKFold object with 5 splits and no shuffling. Each fold approximately preserves the overall class distribution, and all rows from the same group (here, the same WEEK_NUM) are kept in the same fold, so a week never appears in both the training and validation sets.

2. **fitted_models = [] and cv_scores = []:** These lists will store the trained models and the corresponding cross-validation scores, respectively.

3. The loop iterates over each fold generated by the cross-validator **(cv.split(X, y, groups=weeks))**. Within each iteration, it:

    * Splits the data into training and validation sets **(X_train, y_train, X_valid, y_valid)** based on the current fold indices.
    * Trains a LightGBM classifier **(lgb.LGBMClassifier())** on the training data, with early stopping monitored on the validation set.
    * Evaluates the model on the validation set using the AUC score.
    * Appends the trained model to **fitted_models** and the AUC score to **cv_scores**.


4. Finally, it creates a VotingModel instance with the fitted models and prints the cross-validation AUC scores.

In [18]:
cv = StratifiedGroupKFold(n_splits=5, shuffle=False)


fitted_models = []
cv_scores = []


for idx_train, idx_valid in cv.split(X, y, groups=weeks):
    X_train, y_train = X.iloc[idx_train], y.iloc[idx_train]
    X_valid, y_valid = X.iloc[idx_valid], y.iloc[idx_valid]

    print("Valid week range: ", (weeks.iloc[idx_valid].min(), weeks.iloc[idx_valid].max()))

    model = lgb.LGBMClassifier()
    model.fit(
        X_train, y_train,
        eval_set=[(X_valid, y_valid)],
        callbacks=[lgb.log_evaluation(50), lgb.early_stopping(50)]
    )

    fitted_models.append(model)


    y_pred_valid = model.predict_proba(X_valid)[:, 1]
    auc_score = roc_auc_score(y_valid, y_pred_valid)
    cv_scores.append(auc_score)

model = VotingModel(fitted_models)
print("CV AUC scores: ", cv_scores)

Valid week range:  (3, 90)
[LightGBM] [Info] Number of positive: 37755, number of negative: 1183301
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 1.390948 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 30534
[LightGBM] [Info] Number of data points in the train set: 1221056, number of used features: 256
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.030920 -> initscore=-3.444945
[LightGBM] [Info] Start training from score -3.444945
Training until validation scores don't improve for 50 rounds
[50]	valid_0's binary_logloss: 0.124105
[100]	valid_0's binary_logloss: 0.122318
Did not meet early stopping. Best iteration is:
[100]	valid_0's binary_logloss: 0.122318
Valid week range:  (0, 89)
[LightGBM] [Info] Number of positive: 38743, number of negative: 1182495
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of test

In [19]:
X_test = df_test.drop(columns=["WEEK_NUM"])
X_test = X_test.set_index("case_id")

# Cast these columns to float (presumably so their dtypes match the training data).
float_cols = ['pmtcount_693L', 'pmtscount_423L', 'deferredmnthsnum_166L', 'max_credacc_transactions_402L']
X_test[float_cols] = X_test[float_cols].astype(float)

lgb_pred = pd.Series(model.predict_proba(X_test)[:, 1], index=X_test.index)

In [20]:
df_subm = pd.read_csv(ROOT / "sample_submission.csv")
df_subm = df_subm.set_index("case_id")

df_subm["score"] = lgb_pred

In [21]:
print("Check null: ", df_subm["score"].isnull().any())


Check null:  False
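The null check is meaningful because assigning a Series to a DataFrame column aligns on the index rather than on position. A minimal sketch with made-up case IDs shows that a `case_id` missing from the predictions ends up as NaN instead of raising an error.

```python
import pandas as pd

# Made-up submission template indexed by case_id.
subm_demo = pd.DataFrame(
    {"case_id": [1, 2, 3], "score": [0.5, 0.5, 0.5]}
).set_index("case_id")

# Predictions in a different order and missing case 2.
pred_demo = pd.Series([0.1, 0.9], index=pd.Index([3, 1], name="case_id"))

# Assignment aligns by index label, not by row position.
subm_demo["score"] = pred_demo

n_missing = int(subm_demo["score"].isnull().sum())
print(subm_demo, n_missing)
```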


In [22]:
df_subm.head()

| case_id | score |
|---|---|
| 57543 | 0.01716 |
| 57549 | 0.027295 |
| 57551 | 0.006489 |
| 57552 | 0.012122 |
| 57569 | 0.067675 |


In [23]:
df_subm.to_csv("submission.csv")