## ***Intro***

Ensemble methods such as random forests and gradient boosting are widely used in machine learning for their performance, versatility, and robustness. They combine the predictions of many base learners, typically decision trees, to generalize better than a single estimator and to reduce the overfitting that individual decision trees are prone to. Tree ensembles are also often considered easier to inspect than neural networks (for example through feature importances), which makes them a popular choice for classification on tabular data.

In machine learning competitions (e.g., Kaggle), gradient boosting models such as ***XGBoost, LightGBM, and CatBoost*** are frequently used and often appear in winning solutions.
In this notebook, we tackle a binary classification task (**predicting loan default**) using the Histogram-based Gradient Boosting Classifier from scikit-learn.

## ***Imports***

In [2]:
# For data manipulation and linear algebra
import pandas as pd
import numpy as np

# Utilities from scikit-learn
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_class_weight
from sklearn.metrics import precision_score, recall_score, roc_auc_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# The Histogram-based Gradient Boosting Classification Tree
from sklearn.ensemble import HistGradientBoostingClassifier

## ***Functions***

In [3]:
def convert_to_tensor(*args, target=False) -> tuple:
    """Converts one or more DataFrames to tensors (np.ndarray).

    Args:
        *args: A variable number of DataFrames (or Series) to convert.
        target: Optional flag controlling the dtype of the output arrays.
            - If True, arrays keep the original dtypes of the inputs (use this for labels).
            - If False (default), arrays are cast to float.

    Returns:
        A tuple of np.ndarray, one for each input.
    """
    # Build a real tuple (not a generator) so the return value matches the annotation
    if target:
        return tuple(np.array(df) for df in args)
    return tuple(np.array(df).astype(float) for df in args)

## ***Data preparation***

**We will use a base table of 100 features, selected from a larger set generated with Deep Feature Synthesis on the Home Credit - Credit Risk Model Stability competition dataset. The table contains over 1,500,000 examples.**

In [4]:
dataset = pd.read_parquet("/kaggle/input/deep-feature-synthesis-home-credit-stability/base_100features.parquet")
dataset

case_id,WEEK_NUM,COUNT(person_1),MODE(person_1.contaddr_district_15M),MODE(person_1.contaddr_zipcode_807M),MODE(person_1.education_927M),MODE(person_1.empladdr_district_926M),MODE(person_1.empladdr_zipcode_114M),MODE(person_1.incometype_1044T),MODE(person_1.language1_981M),MODE(person_1.role_1084L),...,MODE(static_cb_0.education_88M),MODE(static_cb_0.maritalst_385M),MODE(static_cb_0.requesttype_4525192L),SUM(static_cb_0.days120_123L),SUM(static_cb_0.days30_165L),MAX_MIN_DELTA(applprev_2.num_group2),MONTH(date_decision),SEASON(date_decision),WEEKDAY(date_decision),target
0,0,4,259,2828,5,221,2093,6,2,2,...,2,2,0,0.0,0.0,0.0,1,3,3,0
1,0,5,259,2828,5,221,2093,6,2,0,...,2,2,0,0.0,0.0,0.0,1,3,3,0
2,0,5,259,2828,5,221,2093,8,2,0,...,2,2,0,0.0,0.0,1.0,1,3,4,0
3,0,3,259,2828,5,221,2093,8,2,0,...,2,2,0,0.0,0.0,2.0,1,3,3,0
4,0,4,259,2828,5,221,2093,8,2,2,...,2,2,0,0.0,0.0,1.0,1,3,4,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2703450,91,1,143,356,5,221,2093,0,1,0,...,4,3,0,0.0,0.0,2.0,10,0,0,0
2703451,91,2,403,2825,5,221,2093,0,1,0,...,4,3,0,0.0,0.0,3.0,10,0,0,0
2703452,91,1,273,3202,5,221,2093,1,1,0,...,4,3,0,2.0,0.0,1.0,10,0,0,0
2703453,91,2,148,1959,5,221,2093,0,1,0,...,4,0,0,2.0,1.0,3.0,10,0,0,0


In [5]:
# Note that the positive class is heavily underrepresented (about 3% of the dataset).
# This complicates training: a model can reach about 97% accuracy by always predicting class 0.
# We therefore need to address the imbalance first.
dataset[dataset["target"] == 1]

case_id,WEEK_NUM,COUNT(person_1),MODE(person_1.contaddr_district_15M),MODE(person_1.contaddr_zipcode_807M),MODE(person_1.education_927M),MODE(person_1.empladdr_district_926M),MODE(person_1.empladdr_zipcode_114M),MODE(person_1.incometype_1044T),MODE(person_1.language1_981M),MODE(person_1.role_1084L),...,MODE(static_cb_0.education_88M),MODE(static_cb_0.maritalst_385M),MODE(static_cb_0.requesttype_4525192L),SUM(static_cb_0.days120_123L),SUM(static_cb_0.days30_165L),MAX_MIN_DELTA(applprev_2.num_group2),MONTH(date_decision),SEASON(date_decision),WEEKDAY(date_decision),target
4,0,4,259,2828,5,221,2093,8,2,2,...,2,2,0,0.0,0.0,1.0,1,3,4,1
101,0,4,259,2828,5,221,2093,1,2,2,...,2,2,0,0.0,0.0,1.0,1,3,3,1
118,0,4,259,2828,5,221,2093,1,2,2,...,2,2,0,0.0,0.0,0.0,1,3,3,1
129,0,4,259,2828,5,221,2093,8,2,2,...,2,2,0,0.0,0.0,0.0,1,3,4,1
148,0,4,259,2828,5,221,2093,1,2,2,...,2,2,0,0.0,0.0,0.0,1,3,4,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2702884,91,1,515,2177,5,221,2093,6,0,0,...,4,4,0,0.0,0.0,1.0,10,0,4,1
2702904,91,1,297,1304,5,221,2093,8,1,0,...,4,0,0,1.0,1.0,2.0,10,0,4,1
2703005,91,1,232,233,5,221,2093,6,0,0,...,4,0,0,3.0,3.0,2.0,10,0,5,1
2703400,91,1,798,1536,5,221,2093,8,1,0,...,4,0,0,8.0,4.0,2.0,10,0,0,1


In [6]:
# Drop WEEK_NUM and target to form the feature matrix X (WEEK_NUM could also be kept as a feature)
week_num = dataset["WEEK_NUM"]
target = dataset["target"]
dataset.drop(columns=["WEEK_NUM", "target"], inplace=True)
dataset

case_id,COUNT(person_1),MODE(person_1.contaddr_district_15M),MODE(person_1.contaddr_zipcode_807M),MODE(person_1.education_927M),MODE(person_1.empladdr_district_926M),MODE(person_1.empladdr_zipcode_114M),MODE(person_1.incometype_1044T),MODE(person_1.language1_981M),MODE(person_1.role_1084L),MODE(person_1.sex_738L),...,MODE(static_cb_0.education_1103M),MODE(static_cb_0.education_88M),MODE(static_cb_0.maritalst_385M),MODE(static_cb_0.requesttype_4525192L),SUM(static_cb_0.days120_123L),SUM(static_cb_0.days30_165L),MAX_MIN_DELTA(applprev_2.num_group2),MONTH(date_decision),SEASON(date_decision),WEEKDAY(date_decision)
0,4,259,2828,5,221,2093,6,2,2,0,...,3,2,2,0,0.0,0.0,0.0,1,3,3
1,5,259,2828,5,221,2093,6,2,0,1,...,3,2,2,0,0.0,0.0,0.0,1,3,3
2,5,259,2828,5,221,2093,8,2,0,0,...,3,2,2,0,0.0,0.0,1.0,1,3,4
3,3,259,2828,5,221,2093,8,2,0,0,...,3,2,2,0,0.0,0.0,2.0,1,3,3
4,4,259,2828,5,221,2093,8,2,2,0,...,3,2,2,0,0.0,0.0,1.0,1,3,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2703450,1,143,356,5,221,2093,0,1,0,0,...,4,4,3,0,0.0,0.0,2.0,10,0,0
2703451,2,403,2825,5,221,2093,0,1,0,0,...,4,4,3,0,0.0,0.0,3.0,10,0,0
2703452,1,273,3202,5,221,2093,1,1,0,1,...,4,4,3,0,2.0,0.0,1.0,10,0,0
2703453,2,148,1959,5,221,2093,0,1,0,0,...,1,4,0,0,2.0,1.0,3.0,10,0,0


In [7]:
# Split the dataset into training and test sets (randomly shuffling the data)
# Only 5% of the data is held out as the test set (with a dataset this large, 5% is ample)
X_train, X_test, Y_train, Y_test = train_test_split(dataset, target, test_size=0.05, shuffle=True, random_state=123)

In [8]:
# Keep the true test labels in a DataFrame before converting to tensors
true_label_test = pd.DataFrame(Y_test)
true_label_test

case_id,target
1017805,0
2543910,0
225134,0
955887,0
23690,0
...,...
2540270,0
2553269,0
947602,0
755641,0


In [9]:
# Convert Xs and Ys to tensor (np.ndarray)
X_train, X_test = convert_to_tensor(X_train, X_test)
Y_train, Y_test = convert_to_tensor(Y_train, Y_test, target=True)

In [10]:
X_train.shape, X_test.shape

((1450326, 100), (76333, 100))

In [11]:
Y_train.dtype, X_train.dtype

(dtype('int64'), dtype('float64'))

## ***StandardScaler for normalization***

**This estimator acts as a preprocessing layer that shifts and scales each input feature so that it is centered around 0 with a standard deviation of 1. It computes the per-feature mean and variance when .fit is called, and transforms the input as follows when .transform is called: input = (input - mean) / sqrt(var).**
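As a minimal sketch of the formula above, the snippet below applies the same shift and scale by hand on toy data and compares it with StandardScaler. The arrays `X_fit` and `X_new` are hypothetical stand-ins, not the notebook's data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_fit = rng.normal(loc=50.0, scale=10.0, size=(1000, 3))  # toy "training" data
X_new = rng.normal(loc=50.0, scale=10.0, size=(5, 3))     # toy "test" data

# Statistics are estimated on the training data only
mean = X_fit.mean(axis=0)
var = X_fit.var(axis=0)  # population variance (ddof=0), which StandardScaler also uses

# Apply the transformation by hand
manual = (X_new - mean) / np.sqrt(var)

# Same computation through scikit-learn
scaler = StandardScaler().fit(X_fit)
from_sklearn = scaler.transform(X_new)

print(np.allclose(manual, from_sklearn))
```

The key point is that the mean and variance come from the data passed to .fit, and are then reused unchanged on any later input.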

In [12]:
normalizer = StandardScaler()
normalizer.fit(X_train)
normalizer.transform(X_test[:1])

array([[ 0.04624721, -0.28207035, -1.98300306, -4.6552798 ,  0.24187498,
         0.19741905,  0.78313462,  1.19394783, -0.37999895, -0.77626192,
        -1.37151   , -0.24940889, -2.18956401,  0.65681083, -1.01999347,
        -0.92357804, -1.00080705,  0.49003232, -1.18272814,  0.65047138,
         0.52157697, -0.02339314, -0.68215513, -0.16142495, -0.99749769,
        -0.70858183,  0.12254756,  0.11174086, -0.62869268,  0.75663826,
        -0.53296634, -1.04587252, -0.71103933, -0.11861729, -0.02432021,
        -0.88033272, -0.38889713, -2.00131617, -0.47380036, -1.32754398,
        -1.39047098, -1.5428801 , -0.87488219, -0.9901183 , -0.85514003,
        -1.60920835, -1.28708186, -1.05573749, -0.81731618, -1.3617547 ,
        -0.79203177, -1.00521888, -0.14500467,  0.01480454,  0.42456864,
        -0.78026421, -0.57582792,  0.37803943, -0.74890573, -0.62719456,
        -1.02840467, -0.9170443 , -0.64738589, -0.5287307 , -0.40075795,
        -0.68189228, -0.66453196, -0.51602374, -0.3

## ***Dealing with imbalance***

As we observed earlier, the dataset is highly imbalanced, so the model needs a strategy to learn to distinguish between the two classes. Without one, it is likely to predict class 0 for almost every example and still achieve a high classification accuracy. Accuracy alone is therefore not informative for this kind of dataset. To evaluate the model, we need metrics that assess its ability to **recognize the minority class**. During training, we weight the classes so that the model is penalized more heavily for misclassifying the minority class than the majority class.
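To make the accuracy problem concrete, here is a small sketch on synthetic labels with 3% positives (an assumed ratio that roughly matches the dataset). A "model" that always predicts class 0 gets 97% accuracy while never finding a single positive.

```python
import numpy as np

n = 100_000
y_true = np.zeros(n, dtype=int)
y_true[:3_000] = 1  # 3% positives (assumed ratio, for illustration)

# A trivial baseline that always predicts the majority class
y_always_zero = np.zeros(n, dtype=int)

accuracy = (y_always_zero == y_true).mean()
# Share of actual positives that the baseline flags as positive
recall = (y_always_zero[y_true == 1] == 1).mean()

print(accuracy, recall)
```

Here the accuracy is 0.97 and the recall is 0.0, which is why accuracy cannot be the main metric.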

Additionally, we will use precision and recall to evaluate the model's performance. Both metrics provide insights into the model's ability to accurately predict the positive class. Precision measures the percentage of true positive predictions among all examples predicted as positive, while recall measures the percentage of actual positive cases that the model correctly identified as positive.
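The two definitions can be checked by hand on a tiny example. The labels and predictions below are made up for illustration only.

```python
import numpy as np

y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 1, 1, 1, 0, 0, 0, 0])

tp = int(np.sum((y_pred == 1) & (y_true == 1)))  # positives correctly flagged: 2
fp = int(np.sum((y_pred == 1) & (y_true == 0)))  # negatives wrongly flagged: 3
fn = int(np.sum((y_pred == 0) & (y_true == 1)))  # positives that were missed: 1

precision = tp / (tp + fp)  # 2 / 5
recall = tp / (tp + fn)     # 2 / 3

print(precision, recall)
```

In this toy case the model finds two of the three positives (recall of about 0.67), but only two of its five positive predictions are correct (precision of 0.4).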

In [13]:
# Compute class weights
# (used when training on the whole training set, i.e. when the dataset has not been resampled)
class_weights = compute_class_weight(class_weight='balanced', classes=np.unique(Y_train), y=Y_train)
# Divide the 'balanced' minority weight by 1.5 to reduce the weight placed on class 1
class_weights = {0: class_weights[0], 1: class_weights[1]/1.5}
class_weights

{0: 0.5162114467827323, 1: 10.614134849715677}
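For reference, the 'balanced' option computes each weight as n_samples / (n_classes * class_count). The sketch below reproduces this on toy labels (97 negatives and 3 positives, an assumed ratio) and then applies the same division by 1.5 to the minority weight.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y = np.array([0] * 97 + [1] * 3)  # toy labels, not the notebook's Y_train

counts = np.bincount(y)                   # [97, 3]
manual = len(y) / (len(counts) * counts)  # n_samples / (n_classes * class_count)

from_sklearn = compute_class_weight(class_weight="balanced", classes=np.unique(y), y=y)

# Same adjustment as in the cell above: shrink the minority weight by a factor of 1.5
weights = {0: manual[0], 1: manual[1] / 1.5}

print(manual, from_sklearn, weights)
```

The rarer a class is, the larger its weight, so errors on the minority class contribute more to the loss.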

## ***Make pipeline***

In [14]:
pipe = make_pipeline(StandardScaler(),
                     HistGradientBoostingClassifier(random_state=987, max_iter=200, class_weight=class_weights))

In [15]:
# Start training
pipe.fit(X=X_train, y=Y_train)

In [16]:
# This is the accuracy score of the model
pipe.score(X=X_test, y=Y_test)

0.8180472403809623

In [17]:
y_pred = pipe.predict(X=X_test)
y_pred

array([0, 0, 0, ..., 0, 0, 0])

In [18]:
y_pred_probs = pipe.predict_proba(X=X_test)
y_pred_probs

array([[0.93199912, 0.06800088],
       [0.90105091, 0.09894909],
       [0.94569141, 0.05430859],
       ...,
       [0.78018459, 0.21981541],
       [0.95034137, 0.04965863],
       [0.9015088 , 0.0984912 ]])

In [19]:
# Recall is computed as tp / (tp + fn), where tp = true positives (class 1 predicted as class 1) and fn = false negatives (class 1 predicted as class 0)
recall_score(Y_test, y_pred)

0.6608091540662036

In [20]:
# Precision is computed as tp / (tp + fp), where tp = true positives and fp = false positives (class 0 predicted as class 1)
precision_score(Y_test, y_pred)

0.11017988552739166

In [21]:
# The area under the ROC curve, computed from the predicted probability of class 1
roc_auc_score(Y_test, y_pred_probs[:, 1])

0.8343678613075837
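One way to read this number: the ROC AUC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below checks this on made-up labels and scores.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 0, 1, 0, 1, 0])
scores = np.array([0.10, 0.40, 0.35, 0.20, 0.80, 0.05, 0.60, 0.70])

pos = scores[y_true == 1]
neg = scores[y_true == 0]

# Compare every positive score with every negative score; ties count as half
diff = pos[:, None] - neg[None, :]
pairwise_auc = (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size

sklearn_auc = roc_auc_score(y_true, scores)
print(pairwise_auc, sklearn_auc)
```

In this toy case 12 of the 15 positive/negative pairs are ranked correctly, so both values are 0.8. Unlike precision and recall, this metric does not depend on a decision threshold.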

In [22]:
true_label_test["predicted probs"] = y_pred_probs[:, 1]
true_label_test["predicted class"] = y_pred
true_label_test

case_id,target,predicted probs,predicted class
1017805,0,0.068001,0
2543910,0,0.098949,0
225134,0,0.054309,0
955887,0,0.195352,0
23690,0,0.249033,0
...,...,...,...
2540270,0,0.164224,0
2553269,0,0.341511,0
947602,0,0.219815,0
755641,0,0.049659,0
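In the table above, the predicted class is 1 only when the predicted probability of class 1 exceeds 0.5. As a sketch of how that cutoff trades precision against recall, the snippet below applies two thresholds to made-up probabilities. The helper `precision_recall_at` is hypothetical and not part of the notebook.

```python
import numpy as np

y_true = np.array([0, 0, 0, 1, 0, 1, 0, 1])
probs = np.array([0.07, 0.10, 0.25, 0.34, 0.45, 0.55, 0.60, 0.90])

def precision_recall_at(threshold):
    """Return (precision, recall) when predicting class 1 for probs > threshold."""
    y_hat = (probs > threshold).astype(int)
    tp = np.sum((y_hat == 1) & (y_true == 1))
    fp = np.sum((y_hat == 1) & (y_true == 0))
    fn = np.sum((y_hat == 0) & (y_true == 1))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return float(precision), float(recall)

p_default, r_default = precision_recall_at(0.5)  # default cutoff
p_low, r_low = precision_recall_at(0.3)          # lower cutoff flags more positives

print(p_default, r_default, p_low, r_low)
```

On this toy data, lowering the threshold from 0.5 to 0.3 raises recall from about 0.67 to 1.0 while precision drops from about 0.67 to 0.6.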


## ***Conclusion***

**Feel free to tune the hyperparameters of the gradient boosting classifier or to train with a different set of features. Thanks for reading.**