# Example Notebook

Welcome to the example notebook for the Home Credit Kaggle competition. The goal of this competition is to determine how likely a customer is going to default on an issued loan. The main difference between the [first](https://www.kaggle.com/c/home-credit-default-risk) and this competition is that now your submission will be scored with a custom metric that will take into account how well the model performs in future. A decline in performance will be penalized. The goal is to create a model that is stable and performs well in the future.

In this notebook you will see how to:
* Load the data
* Join tables with Polars - a DataFrame library implemented in Rust language, designed to be blazingy fast and memory efficient.  
* Create simple aggregation features
* Train a LightGBM model
* Create a submission table

## Load the data

In [1]:
import polars as pl
import numpy as np
import pandas as pd
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score 

dataPath = "/kaggle/input/home-credit-credit-risk-model-stability/"

In [2]:
def set_table_dtypes(df: pl.DataFrame) -> pl.DataFrame:
    for col in df.columns:
        if col[-1] in ("P", "A"):
            df = df.with_columns(pl.col(col).cast(pl.Float64).alias(col))

    return df

def convert_strings(df: pd.DataFrame) -> pd.DataFrame:
    for col in df.columns:  
        if df[col].dtype.name in ['object', 'string']:
            df[col] = df[col].astype("string").astype('category')
            current_categories = df[col].cat.categories
            new_categories = current_categories.to_list() + ["Unknown"]
            new_dtype = pd.CategoricalDtype(categories=new_categories, ordered=True)
            df[col] = df[col].astype(new_dtype)
    return df

In [3]:
train_basetable = pl.read_csv(dataPath + "csv_files/train/train_base.csv")
train_static = pl.concat(
    [
        pl.read_csv(dataPath + "csv_files/train/train_static_0_0.csv").pipe(set_table_dtypes),
        pl.read_csv(dataPath + "csv_files/train/train_static_0_1.csv").pipe(set_table_dtypes),
    ],
    how="vertical_relaxed",
)
train_static_cb = pl.read_csv(dataPath + "csv_files/train/train_static_cb_0.csv").pipe(set_table_dtypes)
train_person_1 = pl.read_csv(dataPath + "csv_files/train/train_person_1.csv").pipe(set_table_dtypes) 
train_credit_bureau_b_2 = pl.read_csv(dataPath + "csv_files/train/train_credit_bureau_b_2.csv").pipe(set_table_dtypes) 

In [4]:
test_basetable = pl.read_csv(dataPath + "csv_files/test/test_base.csv")
test_static = pl.concat(
    [
        pl.read_csv(dataPath + "csv_files/test/test_static_0_0.csv").pipe(set_table_dtypes),
        pl.read_csv(dataPath + "csv_files/test/test_static_0_1.csv").pipe(set_table_dtypes),
        pl.read_csv(dataPath + "csv_files/test/test_static_0_2.csv").pipe(set_table_dtypes),
    ],
    how="vertical_relaxed",
)
test_static_cb = pl.read_csv(dataPath + "csv_files/test/test_static_cb_0.csv").pipe(set_table_dtypes)
test_person_1 = pl.read_csv(dataPath + "csv_files/test/test_person_1.csv").pipe(set_table_dtypes) 
test_credit_bureau_b_2 = pl.read_csv(dataPath + "csv_files/test/test_credit_bureau_b_2.csv").pipe(set_table_dtypes) 

## Feature engineering

In this part, we can see a simple example of joining tables via `case_id`. Here the loading and joining is done with polars library. Polars library is blazingly fast and has much smaller memory footprint than pandas. 

In [5]:
# We need to use aggregation functions in tables with depth > 1, so tables that contain num_group1 column or 
# also num_group2 column.
train_person_1_feats_1 = train_person_1.group_by("case_id").agg(
    pl.col("mainoccupationinc_384A").max().alias("mainoccupationinc_384A_max"),
    (pl.col("incometype_1044T") == "SELFEMPLOYED").max().alias("mainoccupationinc_384A_any_selfemployed")
)

# Here num_group1=0 has special meaning, it is the person who applied for the loan.
train_person_1_feats_2 = train_person_1.select(["case_id", "num_group1", "housetype_905L"]).filter(
    pl.col("num_group1") == 0
).drop("num_group1").rename({"housetype_905L": "person_housetype"})

# Here we have num_goup1 and num_group2, so we need to aggregate again.
train_credit_bureau_b_2_feats = train_credit_bureau_b_2.group_by("case_id").agg(
    pl.col("pmts_pmtsoverdue_635A").max().alias("pmts_pmtsoverdue_635A_max"),
    (pl.col("pmts_dpdvalue_108P") > 31).max().alias("pmts_dpdvalue_108P_over31")
)

# We will process in this examples only A-type and M-type columns, so we need to select them.
selected_static_cols = []
for col in train_static.columns:
    if col[-1] in ("A", "M"):
        selected_static_cols.append(col)
print(selected_static_cols)

selected_static_cb_cols = []
for col in train_static_cb.columns:
    if col[-1] in ("A", "M"):
        selected_static_cb_cols.append(col)
print(selected_static_cb_cols)

# Join all tables together.
data = train_basetable.join(
    train_static.select(["case_id"]+selected_static_cols), how="left", on="case_id"
).join(
    train_static_cb.select(["case_id"]+selected_static_cb_cols), how="left", on="case_id"
).join(
    train_person_1_feats_1, how="left", on="case_id"
).join(
    train_person_1_feats_2, how="left", on="case_id"
).join(
    train_credit_bureau_b_2_feats, how="left", on="case_id"
)

['amtinstpaidbefduel24m_4187115A', 'annuity_780A', 'annuitynextmonth_57A', 'avginstallast24m_3658937A', 'avglnamtstart24m_4525187A', 'avgoutstandbalancel6m_4187114A', 'avgpmtlast12m_4525200A', 'credamount_770A', 'currdebt_22A', 'currdebtcredtyperange_828A', 'disbursedcredamount_1113A', 'downpmt_116A', 'inittransactionamount_650A', 'lastapprcommoditycat_1041M', 'lastapprcommoditytypec_5251766M', 'lastapprcredamount_781A', 'lastcancelreason_561M', 'lastotherinc_902A', 'lastotherlnsexpense_631A', 'lastrejectcommoditycat_161M', 'lastrejectcommodtypec_5251769M', 'lastrejectcredamount_222A', 'lastrejectreason_759M', 'lastrejectreasonclient_4145040M', 'maininc_215A', 'maxannuity_159A', 'maxannuity_4075009A', 'maxdebt4_972A', 'maxinstallast24m_3658928A', 'maxlnamtstart6m_4525199A', 'maxoutstandbalancel12m_4187113A', 'maxpmtlast3m_4525190A', 'previouscontdistrict_112M', 'price_1097A', 'sumoutstandtotal_3546847A', 'sumoutstandtotalest_4493215A', 'totaldebt_9A', 'totalsettled_863A', 'totinstallas

In [6]:
test_person_1_feats_1 = test_person_1.group_by("case_id").agg(
    pl.col("mainoccupationinc_384A").max().alias("mainoccupationinc_384A_max"),
    (pl.col("incometype_1044T") == "SELFEMPLOYED").max().alias("mainoccupationinc_384A_any_selfemployed")
)

test_person_1_feats_2 = test_person_1.select(["case_id", "num_group1", "housetype_905L"]).filter(
    pl.col("num_group1") == 0
).drop("num_group1").rename({"housetype_905L": "person_housetype"})

test_credit_bureau_b_2_feats = test_credit_bureau_b_2.group_by("case_id").agg(
    pl.col("pmts_pmtsoverdue_635A").max().alias("pmts_pmtsoverdue_635A_max"),
    (pl.col("pmts_dpdvalue_108P") > 31).max().alias("pmts_dpdvalue_108P_over31")
)

data_submission = test_basetable.join(
    test_static.select(["case_id"]+selected_static_cols), how="left", on="case_id"
).join(
    test_static_cb.select(["case_id"]+selected_static_cb_cols), how="left", on="case_id"
).join(
    test_person_1_feats_1, how="left", on="case_id"
).join(
    test_person_1_feats_2, how="left", on="case_id"
).join(
    test_credit_bureau_b_2_feats, how="left", on="case_id"
)

In [7]:
case_ids = data["case_id"].unique().shuffle(seed=1)
case_ids_train, case_ids_test = train_test_split(case_ids, train_size=0.6, random_state=1)
case_ids_valid, case_ids_test = train_test_split(case_ids_test, train_size=0.5, random_state=1)

cols_pred = []
for col in data.columns:
    if col[-1].isupper() and col[:-1].islower():
        cols_pred.append(col)

print(cols_pred)

def from_polars_to_pandas(case_ids: pl.DataFrame) -> pl.DataFrame:
    return (
        data.filter(pl.col("case_id").is_in(case_ids))[["case_id", "WEEK_NUM", "target"]].to_pandas(),
        data.filter(pl.col("case_id").is_in(case_ids))[cols_pred].to_pandas(),
        data.filter(pl.col("case_id").is_in(case_ids))["target"].to_pandas()
    )

base_train, X_train, y_train = from_polars_to_pandas(case_ids_train)
base_valid, X_valid, y_valid = from_polars_to_pandas(case_ids_valid)
base_test, X_test, y_test = from_polars_to_pandas(case_ids_test)

for df in [X_train, X_valid, X_test]:
    df = convert_strings(df)

['amtinstpaidbefduel24m_4187115A', 'annuity_780A', 'annuitynextmonth_57A', 'avginstallast24m_3658937A', 'avglnamtstart24m_4525187A', 'avgoutstandbalancel6m_4187114A', 'avgpmtlast12m_4525200A', 'credamount_770A', 'currdebt_22A', 'currdebtcredtyperange_828A', 'disbursedcredamount_1113A', 'downpmt_116A', 'inittransactionamount_650A', 'lastapprcommoditycat_1041M', 'lastapprcommoditytypec_5251766M', 'lastapprcredamount_781A', 'lastcancelreason_561M', 'lastotherinc_902A', 'lastotherlnsexpense_631A', 'lastrejectcommoditycat_161M', 'lastrejectcommodtypec_5251769M', 'lastrejectcredamount_222A', 'lastrejectreason_759M', 'lastrejectreasonclient_4145040M', 'maininc_215A', 'maxannuity_159A', 'maxannuity_4075009A', 'maxdebt4_972A', 'maxinstallast24m_3658928A', 'maxlnamtstart6m_4525199A', 'maxoutstandbalancel12m_4187113A', 'maxpmtlast3m_4525190A', 'previouscontdistrict_112M', 'price_1097A', 'sumoutstandtotal_3546847A', 'sumoutstandtotalest_4493215A', 'totaldebt_9A', 'totalsettled_863A', 'totinstallas

In [8]:
print(f"Train: {X_train.shape}")
print(f"Valid: {X_valid.shape}")
print(f"Test: {X_test.shape}")

Train: (915995, 48)
Valid: (305332, 48)
Test: (305332, 48)


In [9]:
print("Train set class distribution:")
print(y_train.value_counts())

print("\nValidation set class distribution:")
print(y_valid.value_counts())

print("\nTest set class distribution:")
print(y_test.value_counts())

Train set class distribution:
target
0    887123
1     28872
Name: count, dtype: int64

Validation set class distribution:
target
0    295890
1      9442
Name: count, dtype: int64

Test set class distribution:
target
0    295652
1      9680
Name: count, dtype: int64


##  Class Weights in the LightGBM training

In [10]:
# Calculate class weights
total = y_train.shape[0]
class_counts = y_train.value_counts()
class_weights = {0: total / class_counts[0], 1: total / class_counts[1]}
print("Class Weights:", class_weights)

Class Weights: {0: 1.0325456560138786, 1: 31.726066777500694}


In [11]:
# Define a custom evaluation metric with class weights
def weighted_log_loss(y_true, y_pred):
    sample_weight = np.array([class_weights[label] for label in y_true])
    return 'weighted_log_loss', log_loss(y_true, y_pred, sample_weight=sample_weight), True

In [12]:
from sklearn.metrics import log_loss

In [13]:
# Train Light GBM with class weights
params = {
    "boosting_type": "gbdt",
    "objective": "binary",
    "metric": ["auc", weighted_log_loss],
    "max_depth": 3,
    "num_leaves": 31,
    "learning_rate": 0.05,
    "feature_fraction": 0.9,
    "bagging_fraction":0.8,
    "bagging_freq": 5,
    "n_estimators": 1000,
    "verbose": -1
}

lgb_train = lgb.Dataset(X_train, label=y_train)
lgb_valid = lgb.Dataset(X_valid, label=y_valid, reference=lgb_train)

gbm = lgb.train(
    params,
    lgb_train,
    valid_sets=lgb_valid,
    callbacks=[lgb.log_evaluation(50), lgb.early_stopping(10)]
)



Training until validation scores don't improve for 10 rounds
[50]	valid_0's auc: 0.705963
[100]	valid_0's auc: 0.724362
[150]	valid_0's auc: 0.731423
[200]	valid_0's auc: 0.735874
[250]	valid_0's auc: 0.739009
[300]	valid_0's auc: 0.740965
[350]	valid_0's auc: 0.742924
[400]	valid_0's auc: 0.744582
[450]	valid_0's auc: 0.745977
[500]	valid_0's auc: 0.747033
[550]	valid_0's auc: 0.747877
[600]	valid_0's auc: 0.749039
[650]	valid_0's auc: 0.750087
[700]	valid_0's auc: 0.750863
Early stopping, best iteration is:
[739]	valid_0's auc: 0.751216


In [14]:
params = {
    "boosting_type": "gbdt",
    "objective": "binary",
    "metric": ["auc", "binary_cross_entropy"],
    "max_depth": 3,
    "num_leaves": 31,
    "learning_rate": 0.05,
    "feature_fraction": 0.9,
    "bagging_fraction": 0.8,
    "bagging_freq": 5,
    "n_estimators": 1000,
    "verbose": -1,
    "scale_pos_weight": class_weights[1] / class_weights[0] 
}
lgb_train = lgb.Dataset(X_train, label=y_train)
lgb_valid = lgb.Dataset(X_valid, label=y_valid, reference=lgb_train)

gbm = lgb.train(
    params,
    lgb_train,
    valid_sets=lgb_valid,
    callbacks=[lgb.log_evaluation(50), lgb.early_stopping(10)]
)



Training until validation scores don't improve for 10 rounds
[50]	valid_0's auc: 0.715839
[100]	valid_0's auc: 0.728213
[150]	valid_0's auc: 0.733725
[200]	valid_0's auc: 0.737732
[250]	valid_0's auc: 0.740725
[300]	valid_0's auc: 0.742567
[350]	valid_0's auc: 0.744519
[400]	valid_0's auc: 0.746128
[450]	valid_0's auc: 0.747383
[500]	valid_0's auc: 0.748473
[550]	valid_0's auc: 0.749145
[600]	valid_0's auc: 0.749833
[650]	valid_0's auc: 0.750607
[700]	valid_0's auc: 0.751741
[750]	valid_0's auc: 0.752269
[800]	valid_0's auc: 0.752785
[850]	valid_0's auc: 0.753432
Early stopping, best iteration is:
[875]	valid_0's auc: 0.753813


## Hyperparameter Tuning

In [15]:
from skopt import BayesSearchCV
from lightgbm import LGBMClassifier
import optuna
from sklearn.metrics import roc_auc_score

In [16]:
def objective(trial):
    # Define the hyperparameter search space
    max_depth = trial.suggest_int('max_depth', 2, 10)
    num_leaves = trial.suggest_int('num_leaves', 10, 100)
    learning_rate = trial.suggest_float('learning_rate', 1e-3, 1e-1, log=True)
    feature_fraction = trial.suggest_float('feature_fraction', 0.5, 1.0)
    bagging_fraction = trial.suggest_float('bagging_fraction', 0.5, 1.0)
    bagging_freq = trial.suggest_int('bagging_freq', 1, 10)
    
    # Create the parameter dictionary
    params = {
        'boosting_type': 'gbdt',
        'objective': 'binary',
        'metric': 'auc',
        'max_depth': max_depth,
        'num_leaves': num_leaves,
        'learning_rate': learning_rate,
        'feature_fraction': feature_fraction,
        'bagging_fraction': bagging_fraction,
        'bagging_freq': bagging_freq,
        'verbose': -1
    }
    
    # Train the LightGBM model
    lgb_train = lgb.Dataset(X_train, label=y_train)
    lgb_valid = lgb.Dataset(X_valid, label=y_valid, reference=lgb_train)
    gbm = lgb.train(params, lgb_train, valid_sets=lgb_valid, early_stopping_rounds=10, verbose_eval=False)
    
    # Evaluate the model on the validation set
    y_pred = gbm.predict(X_valid)
    auc = roc_auc_score(y_valid, y_pred)
    
    # Return the negative validation AUC (since we want to minimize)
    return -auc

In [17]:
study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=62)
best_params = study.best_trial.params
print("Best hyperparameters: ", best_params)

[I 2024-04-29 20:13:33,650] A new study created in memory with name: no-name-4a01a59c-902b-4a6e-beb2-486a26fc6478
[I 2024-04-29 20:13:58,488] Trial 0 finished with value: -0.7372902191857869 and parameters: {'max_depth': 8, 'num_leaves': 33, 'learning_rate': 0.028499319863939006, 'feature_fraction': 0.9944268602989839, 'bagging_fraction': 0.6699897332635285, 'bagging_freq': 4}. Best is trial 0 with value: -0.7372902191857869.
[I 2024-04-29 20:14:23,231] Trial 1 finished with value: -0.7159625401503386 and parameters: {'max_depth': 8, 'num_leaves': 68, 'learning_rate': 0.002522574660542469, 'feature_fraction': 0.6019235903896689, 'bagging_fraction': 0.5600993150714382, 'bagging_freq': 10}. Best is trial 0 with value: -0.7372902191857869.
[I 2024-04-29 20:14:45,812] Trial 2 finished with value: -0.7088587089786862 and parameters: {'max_depth': 5, 'num_leaves': 64, 'learning_rate': 0.00983015666307963, 'feature_fraction': 0.7251861331238726, 'bagging_fraction': 0.9277361271791191, 'baggin

Best hyperparameters:  {'max_depth': 10, 'num_leaves': 49, 'learning_rate': 0.07904172100033731, 'feature_fraction': 0.6375244689005137, 'bagging_fraction': 0.9501474351354173, 'bagging_freq': 7}


In [18]:
# Train the model with the best hyperparameters
best_params['metric'] = 'auc'
lgb_train = lgb.Dataset(X_train, label=y_train, free_raw_data=False)
lgb_valid = lgb.Dataset(X_valid, label=y_valid, reference=lgb_train, free_raw_data=False)

gbm = lgb.train(best_params, lgb_train, valid_sets=lgb_valid, early_stopping_rounds=10)

# Generate predictions for the test set
y_pred = gbm.predict(X_test)



You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 8980
[LightGBM] [Info] Number of data points in the train set: 915995, number of used features: 48
[LightGBM] [Info] Start training from score 0.031520
[1]	valid_0's auc: 0.674208
Training until validation scores don't improve for 10 rounds
[2]	valid_0's auc: 0.685696
[3]	valid_0's auc: 0.692202
[4]	valid_0's auc: 0.694528
[5]	valid_0's auc: 0.698409
[6]	valid_0's auc: 0.702197
[7]	valid_0's auc: 0.704532
[8]	valid_0's auc: 0.705057
[9]	valid_0's auc: 0.706823
[10]	valid_0's auc: 0.708467
[11]	valid_0's auc: 0.711876
[12]	valid_0's auc: 0.713804
[13]	valid_0's auc: 0.714788
[14]	valid_0's auc: 0.715969
[15]	valid_0's auc: 0.717081
[16]	valid_0's auc: 0.718543
[17]	valid_0's auc: 0.719747
[18]	valid_0's auc: 0.72031
[19]	valid_0's auc: 0.721102
[20]	valid_0's auc: 0.722464
[21]	valid_0's auc: 0.72331
[22]	valid_0's auc: 0.724877
[23]	vali

In [19]:
# Load the test data
X_test = data_submission[cols_pred].to_pandas()
X_test = convert_strings(X_test)

In [20]:
# Handle new categories in categorical features
cat_cols = X_train.select_dtypes(include=['category']).columns
for col in cat_cols:
    train_categories = set(X_train[col].cat.categories)
    test_categories = set(X_test[col].cat.categories)
    new_categories = test_categories - train_categories
    X_test.loc[X_test[col].isin(new_categories), col] = "Unknown"
    new_dtype = pd.CategoricalDtype(categories=train_categories, ordered=True)
    X_train[col] = X_train[col].astype(new_dtype)
    X_test[col] = X_test[col].astype(new_dtype)

In [21]:
# Make predictions
y_test_pred = gbm.predict(X_test, num_iteration=gbm.best_iteration)

In [22]:
submission = pd.DataFrame({
    "case_id": data_submission["case_id"].to_numpy(),
    "score": y_test_pred
}).set_index('case_id')

submission.to_csv("submission.csv")