# <center> LGD models </center>

### WOE Logistic Regression LGD model

**WoE transformation in LGD (Loss Given Default) model development**

[![Python 3.11](https://img.shields.io/badge/Python-3.11-3776AB?logo=python&logoColor=white)](https://www.python.org/downloads/release/python-3110/)


Author: https://github.com/deburky

Implementation of the approach by A. van Berkel and N. Siddiqi: [Building Loss Given Default Scorecard Using Weight of Evidence Bins](https://support.sas.com/resources/papers/proceedings12/141-2012.pdf).

Data is from B. Baesens' book, Credit Risk Analytics.

The data set has been kindly provided by a European bank and has been slightly
modified and anonymized. It includes 2,545 observations on loans and LGDs. Key
variables are:
* LTV: Loan-to-value ratio, in %
* Recovery_rate: Recovery rate, in %
* lgd_time: Loss rate given default (LGD), in %
* y_logistic: Logistic transformation of the LGD
* lnrr: Natural logarithm of the recovery rate
* Y_probit: Probit transformation of the LGD
* purpose1: Indicator variable for the purpose of the loan; 1 = renting purpose, 0 = other
* event: Indicator variable for a default or cure event; 1 = event, 0 = no event

Source: <https://www.sas.com/storefront/aux/en/spcriskmdl/68835_excerpt.pdf>

The WoE transformation algorithm, designed for a binary response, can be applied directly to the new dataset once the LGD has been converted to a binary target.
<p>We use LightGBM to perform granular monotonic binning aimed at minimizing logistic loss.</p>
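
As a minimal sketch of the WoE calculation for a binary target, the snippet below computes WoE per bin as the log of the share of goods over the share of bads, starting from aggregated counts. The bin labels and counts are made-up illustration values, and `woe_from_counts` is a hypothetical helper that is not part of this notebook.

```python
import math


def woe_from_counts(counts):
    """Return {bin: WoE} from {bin: (goods, bads)}; hypothetical helper.

    WoE = ln((goods_bin / goods_total) / (bads_bin / bads_total)).
    Bins with a zero count get WoE 0, mirroring the guard in calculate_woe.
    """
    goods_total = sum(g for g, _ in counts.values())
    bads_total = sum(b for _, b in counts.values())
    woe = {}
    for name, (goods, bads) in counts.items():
        if 0 in (goods, bads, goods_total, bads_total):
            woe[name] = 0.0
        else:
            woe[name] = math.log((goods / goods_total) / (bads / bads_total))
    return woe


# Made-up (goods, bads) counts for three LTV bins, for illustration only
toy_counts = {"low": (80, 10), "mid": (60, 30), "high": (20, 60)}
toy_woe = woe_from_counts(toy_counts)
for name, value in toy_woe.items():
    print(f"{name}: {value:+.3f}")
```

With this sign convention, bins dominated by goods get a positive WoE and bins dominated by bads get a negative one.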

In [None]:
import flatdict
import lightgbm as lgb
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

In [45]:
path_to_data = (
    "https://raw.githubusercontent.com/deburky/calibration/"
    "refs/heads/main/logistic-regression-inference/datasets/lgd_woe.csv"
)
lgd = pd.read_csv(path_to_data)

In [23]:
def duplicate_dataset(df, id_col):
    """Expand each row into `bads` rows flagged is_default=1 and `goods` rows flagged is_default=0."""
    lgd_woe = []
    for i in df[id_col].unique():
        mask = df[id_col] == i
        df_bad = df[mask].loc[df[mask].index.repeat(df[mask].bads)]
        df_bad["is_default"] = 1
        df_good = df[mask].loc[df[mask].index.repeat(df[mask].goods)]
        df_good["is_default"] = 0
        df_all = pd.concat([df_bad, df_good], axis=0)
        lgd_woe.append(df_all)
    return pd.concat(lgd_woe)


def calculate_woe(df, col, target_col):
    """
    Calculate the Weight of Evidence (WOE) for a categorical variable.

    Parameters:
    df (pandas DataFrame): The dataframe containing the data.
    col (str): The name of the categorical variable for which to calculate WOE.
    target_col (str): The name of the target variable.

    Returns:
    pandas Series: A series containing the WOE values for each category of the variable.
    """
    # total counts (computed once, outside the loop)
    good_count = (df[target_col] == 0).sum()
    bad_count = (df[target_col] == 1).sum()

    woe_values = {}
    for category in df[col].unique():
        in_bin = df[col] == category

        # bin counts
        good_count_bin = (in_bin & (df[target_col] == 0)).sum()
        bad_count_bin = (in_bin & (df[target_col] == 1)).sum()

        # WOE is undefined when any count is zero; fall back to 0
        if 0 in (good_count, bad_count, good_count_bin, bad_count_bin):
            woe_values[category] = 0
        else:
            goods = good_count_bin / good_count
            bads = bad_count_bin / bad_count
            woe_values[category] = np.log(goods / bads)

    return pd.Series(df[col].map(woe_values))

### Dataset duplication

In [24]:
lgd_woe = duplicate_dataset(lgd, "id")
print(f"Original sample: {len(lgd):,.0f}")
print(f"Duplicated sample: {len(lgd_woe):,.0f}")

Original sample: 254,500
Duplicated sample: 25,450,000
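
The duplication step can be illustrated on a toy table without pandas. The sketch below assumes each row carries integer `goods` and `bads` counts, as `duplicate_dataset` expects, and expands it into that many binary rows. All values are made up and `expand` is a hypothetical helper.

```python
# Toy rows: each carries integer counts of "goods" and "bads" (made-up values)
rows = [
    {"id": 1, "ltv": 0.60, "goods": 8, "bads": 2},
    {"id": 2, "ltv": 1.10, "goods": 3, "bads": 7},
]


def expand(rows):
    """Expand each row into `bads` rows with is_default=1 and `goods` rows with is_default=0."""
    expanded = []
    for row in rows:
        base = {"id": row["id"], "ltv": row["ltv"]}
        expanded += [{**base, "is_default": 1} for _ in range(row["bads"])]
        expanded += [{**base, "is_default": 0} for _ in range(row["goods"])]
    return expanded


expanded = expand(rows)
default_rate = sum(r["is_default"] for r in expanded) / len(expanded)
print(len(expanded), default_rate)
```

The mean of `is_default` in the expanded data equals total bads divided by total goods plus bads, so the binary target carries the same average as the original weighted rows.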


### Logistic Regression

In [None]:
# mirroring statsmodels logistic regression
lr_params = {
    "fit_intercept": True,
    "penalty": None,
    "random_state": 72,
    "solver": "newton-cg",
}

X = lgd_woe[["ltv", "purpose1"]]
y = lgd_woe["is_default"].values

# splitting dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=62
)

categorical_features = ["purpose1"]

# one-hot encoding categorical features
transformer = ColumnTransformer(
    transformers=[
        (
            "OneHotEncoder",
            OneHotEncoder(drop="first", handle_unknown="ignore"),
            categorical_features,
        )
    ],
    remainder="passthrough",
    verbose_feature_names_out=False,
)

# creating a pipeline
sk_lr_model_ohe = make_pipeline(transformer, LogisticRegression(**lr_params))

# training the model
sk_lr_model_ohe.fit(X_train, y_train)
print(sk_lr_model_ohe[1].intercept_, sk_lr_model_ohe[1].coef_)
print(sk_lr_model_ohe[0].get_feature_names_out())

# discrimination ability
y_pred = sk_lr_model_ohe.predict_proba(X_test)[:, 1]
gini = roc_auc_score(y_test, y_pred) * 2 - 1
print(f"{gini:.2%}")


[-2.98809619] [[0.78799966 2.27144465]]
['purpose1_1' 'ltv']
46.46%


### Univariate Logistic Regression

In [28]:
X_train_ltv = X_train.loc[:, "ltv"].values.reshape(-1, 1)
X_test_ltv = X_test.loc[:, "ltv"].values.reshape(-1, 1)

# initializing LR class
sk_lr_model = LogisticRegression(**lr_params)

# training the model
sk_lr_model.fit(X_train_ltv, y_train)
print(sk_lr_model.intercept_, sk_lr_model.coef_)

y_pred = sk_lr_model.predict_proba(X_test_ltv)[:, 1]

gini = roc_auc_score(y_test, y_pred) * 2 - 1
print(f"Gini : {gini:.2%}")

[-2.92685801] [[2.28102999]]
Gini : 44.87%


### Binning with LightGBM
Linearizing a tree into decision rules (see [LightGBM issue #638](https://github.com/microsoft/LightGBM/issues/638)).
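
The idea of reading bin edges off a tree can be sketched on a hand-written nested dict. The key names (`threshold`, `left_child`, `right_child`, `leaf_value`) are assumed to mirror the nested layout returned by `dump_model()`, and the toy tree is not real model output. `collect_thresholds` is a hypothetical helper.

```python
import math


def collect_thresholds(node):
    """Recursively gather split thresholds from a nested tree dict."""
    if "threshold" not in node:  # leaf node: nothing to collect
        return []
    return (
        [node["threshold"]]
        + collect_thresholds(node["left_child"])
        + collect_thresholds(node["right_child"])
    )


# Hand-written stand-in for one tree's nested structure (not real model output)
toy_tree = {
    "threshold": 0.8,
    "left_child": {
        "threshold": 0.4,
        "left_child": {"leaf_value": -1.2},
        "right_child": {"leaf_value": -0.6},
    },
    "right_child": {
        "threshold": 1.1,
        "left_child": {"leaf_value": 0.1},
        "right_child": {"leaf_value": 0.9},
    },
}

# Sorted unique thresholds, padded with infinities, become the bin edges
edges = [-math.inf, *sorted(set(collect_thresholds(toy_tree))), math.inf]
print(edges)
print(len(edges) - 1)  # number of bins is one less than the number of edges
```

An explicit traversal like this only picks up split thresholds, whereas matching on key names in a flattened dict relies on no other key containing the word "threshold".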

In [None]:
X = lgd_woe["ltv"].values.reshape(-1, 1)
y = lgd_woe["is_default"].values

# splitting dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=62
)

# create datasets for LightGBM
lgb_train = lgb.Dataset(X_train, y_train)  # params={"max_bin": 20}
lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)  # params={"max_bin": 20}

tree_params = {
    "boosting_type": "rf",
    "max_depth": 5,
    "objective": "binary",
    "bagging_freq": 1,
    "bagging_fraction": 0.999,
    "feature_fraction": 0.999,
    "bagging_seed": 323,
    "verbosity": -1,
    "monotone_constraints": [1],
}

gbm = lgb.train(
    params=tree_params, train_set=lgb_train, num_boost_round=5, valid_sets=lgb_eval
)

# gbm.save_model("gbm_model.txt")

tree_info = gbm.dump_model()["tree_info"][0]
d_tree_prop = flatdict.FlatDict(tree_info, delimiter=".")
thresholds = [round(d_tree_prop[key], 3) for key in d_tree_prop if "threshold" in key]
bin_edges = [-np.inf, *sorted(thresholds), np.inf]
bin_edges = np.unique(bin_edges)

print(f"Bin edges: {bin_edges}")
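# note: bin_edges.size counts the edges (including -inf and inf); the number of intervals is size - 1
print(f"# of bins: {bin_edges.size}")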

y_pred = gbm.predict(X_test, num_iteration=gbm.best_iteration)
gini = roc_auc_score(y_test, y_pred) * 2 - 1
print(f"Gini: {gini:.2%}")

Bin edges: [ -inf 0.046 0.195 0.349 0.432 0.604 0.61  0.687 0.69  0.749 0.792 0.795
 0.845 0.917 0.933 0.94  0.965 1.069 1.096 1.124 1.313 1.495 1.964   inf]
# of bins: 24
Gini: 45.69%


In [None]:
# convert to bins
lgd_woe["ltv_woe_bins"] = np.digitize(lgd_woe["ltv"], bin_edges)
# create WOE features
lgd_woe["ltv_woe"] = calculate_woe(lgd_woe, "ltv_woe_bins", "is_default")
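
With default arguments, `np.digitize` assigns each value the index `i` such that `bin_edges[i-1] <= x < bin_edges[i]`. A dependency-free sketch of the same assignment uses `bisect.bisect_right` from the standard library. The edges and LTV values below are made up.

```python
import bisect
import math

toy_edges = [-math.inf, 0.4, 0.8, 1.1, math.inf]
toy_ltv = [0.10, 0.40, 0.79, 0.95, 2.50]

# bisect_right counts the edges that are <= x, which matches the index
# returned by np.digitize(x, edges) for increasing edges and right=False
toy_bins = [bisect.bisect_right(toy_edges, x) for x in toy_ltv]
print(toy_bins)
```

Because the first edge is `-inf`, every finite value lands in bin 1 or higher, and a value equal to an edge falls into the bin to the right of that edge.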

### Inference

In [None]:
# note: gbm was fitted on raw "ltv" values above, so it scores "ltv_woe" as if it were LTV
lgd_woe["lgd_proba"] = gbm.predict(lgd_woe[["ltv_woe"]])

### Downturn LGD calibration
The goal is to incorporate downturn (DT) LGD into our calibrated predictions. This is achieved by applying an adjusted weight to non-defaulted observations, so that the average LGD in our calibration sample matches the DT LGD.

**Instead of repeating samples, re-weight the loss function.** This has the same effect as over-sampling (though not random), but it is cheaper because the dataset size stays the same.

> One way we can make the resampling more efficient is by using class weights instead of actually resampling. So we can change our loss function to do the same thing as if you would resample but under sampling case, we don’t actually throw away any data and in the oversampling case, we don’t actually make our computational problem harder by repeating some of the samples. This works for most models and it’s pretty simple to do in scikit-learn. Basically, it’s the same as oversampling in a sense because you’re not throwing away any data.

Source: A. Mueller, <https://amueller.github.io/aml/05-advanced-topics/11-imbalanced-datasets.html#class-weights>

> Already fitted classifiers can be calibrated via the parameter `cv="prefit"`. In this case, no cross-validation is used and all provided data is used for calibration. The user has to take care manually that data for model fitting and calibration are disjoint.

Source: sklearn.
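
To see why a single weight on non-defaulted observations is enough, consider a binary target with sample mean `p` and a target mean `t`. With weight `w = p(1 - t) / (t(1 - p))` on class 0 and weight 1 on class 1, the weighted mean equals `t`. The sketch below checks this numerically. The rates are illustrative and both helpers are hypothetical.

```python
def non_default_weight(sample_rate, target_rate):
    """Weight for class 0 so that the weighted event rate equals target_rate."""
    return (sample_rate * (1 - target_rate)) / (target_rate * (1 - sample_rate))


def weighted_rate(sample_rate, weight_0):
    """Event rate after weighting class 0 by weight_0 and class 1 by 1."""
    return sample_rate / (sample_rate + weight_0 * (1 - sample_rate))


p = 0.20      # illustrative sample event rate
t = p * 1.08  # target rate after an 8% uplift
w = non_default_weight(p, t)
print(f"weight: {w:.4f}, weighted rate: {weighted_rate(p, w):.4f}")
```

A weight below 1 shrinks the influence of non-defaulted observations, which raises the weighted average towards the downturn target.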

In [40]:
# using weights for non-defaulted observations
LGD_SAMPLE = np.mean(lgd_woe["is_default"].values)
DT_MULTIPLICATOR = 1.08  # downturn LGD multiplier
LGD_TARGET = LGD_SAMPLE * DT_MULTIPLICATOR
WEIGHT = (LGD_SAMPLE * (1 - LGD_TARGET)) / (LGD_TARGET * (1 - LGD_SAMPLE))
print(
    f"LGD sample: {LGD_SAMPLE:.2%}\nLGD target: {LGD_TARGET:.2%}\nSample weight: {WEIGHT:.2%}"
)

LGD sample: 22.80%
LGD target: 24.63%
Sample weight: 90.40%


In [41]:
# ((1-LGD_TARGET) / LGD_TARGET) / ((1-LGD_SAMPLE) / LGD_SAMPLE)

In [42]:
# King and Zeng (2001) - Weighting
w1 = LGD_TARGET / LGD_SAMPLE
w0 = (1 - LGD_TARGET) / (1 - LGD_SAMPLE)
print(LGD_SAMPLE, LGD_TARGET)
print(w1, w0)

# weight for non-defaulted observations
# w0 / w1 = 0.9764 / 1.08 ≈ 0.904
print(w0 / w1)

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=72)

# weighted exogenous sampling maximum-likelihood estimator
sk_lr_model_cw = LogisticRegressionCV(
    fit_intercept=True,
    cv=skf,
    refit=False,
    Cs=[1e-4],
    intercept_scaling=0.1,
    #     class_weight={0: w0, 1: w1}
    class_weight={0: w0 / w1, 1: 1},
)

sk_lr_model_cw.fit(X_train, y_train)
print(sk_lr_model_cw.intercept_, sk_lr_model_cw.coef_)

y_pred = sk_lr_model_cw.predict_proba(X_test)[:, 1]
print(f"Average predicted probability: {y_pred.mean():.2%}")

0.22801178781925344 0.24625273084479374
1.08 0.9763714746705621
0.9040476617320019
[-1.11724229] [[-0.99454606]]
Average predicted probability: 24.39%


In [43]:
# King and Zeng (2001) - Prior correction
sk_lr_model_cw = LogisticRegressionCV(cv=skf, refit=False)
sk_lr_model_cw.fit(X_train, y_train)
print(sk_lr_model_cw.intercept_, sk_lr_model_cw.coef_)

y_pred = sk_lr_model_cw.predict_proba(X_test)[:, 1]
print(f"Average predicted probability: {y_pred.mean():.2%}")

# correcting the estimates based on prior information
# MLE of β1 need not be changed, but the constant term should
# be corrected by subtracting out the bias factor
prior = np.log(((1 - LGD_TARGET) / LGD_TARGET) * (LGD_SAMPLE / (1 - LGD_SAMPLE)))
# intercept_corr = -3.6522636 - prior # b0
y_pred_logit = (
    sk_lr_model_cw.intercept_ - prior + (X_train * sk_lr_model_cw.coef_).sum(axis=1)
)
# y_pred_logit_pc = y_pred_logit - prior # as in King's formula
y_pred_pc = 1 / (1 + np.exp(-y_pred_logit))
print(f"Average predicted probability with prior: {y_pred_pc.mean():.2%}")

[-1.21801046] [[-0.9946171]]
Average predicted probability: 22.81%
Average predicted probability with prior: 24.39%
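
The prior correction amounts to a constant shift of the logit. As a minimal sketch using the same offset formula as above, the hypothetical helper `prior_correct` applies the shift to a single probability. The rates are illustrative values, not taken from the dataset.

```python
import math


def prior_correct(prob, sample_rate, target_rate):
    """Shift a predicted probability's logit by the prior-correction term."""
    offset = math.log(
        ((1 - target_rate) / target_rate) * (sample_rate / (1 - sample_rate))
    )
    logit = math.log(prob / (1 - prob)) - offset
    return 1 / (1 + math.exp(-logit))


sample_rate, target_rate = 0.20, 0.216  # illustrative values
# A prediction equal to the sample rate maps onto the target rate
print(f"{prior_correct(sample_rate, sample_rate, target_rate):.4f}")
# Other predictions move in the same direction, by a constant logit shift
print(f"{prior_correct(0.50, sample_rate, target_rate):.4f}")
```

Only the intercept changes, so the ranking of observations, and therefore the Gini, is unaffected by the correction.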
