**Tutorial: how to use CustomGBM for imbalanced classification**  

Let's import the relevant packages and add the directory containing the example code to the Python path.

In [6]:
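# make the local modules in ../code importable
import sys
sys.path.append("../code")

# general packages
import pickle as pkl
import time
import lightgbm as lgb
import numpy as np
# import only the metric used below instead of a wildcard import
from sklearn.metrics import average_precision_score

# local modules
from custom_gbm import CustomGBM
from loss_function import loss_function
from focal import Focal_loss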

Let's load the data we will use for this tutorial.  

We are going to build a Quantitative Structure-Activity Relationship (QSAR) model for identifying CYP2C9 substrates. The raw data was downloaded from Therapeutics Data Commons (https://tdcommons.ai/single_pred_tasks/adme/#cyp2c9-substrate-carbon-mangels-et-al).  

Compounds have already been converted to 208 2D molecular descriptors computed with RDKit. The training set contains 468 compounds (90 active, about 19%), while the test set contains 135 (38 active, about 28%). The train and test sets were obtained via a scaffold split using the TDC API.

In [2]:
with open('../data/train.pkl', 'rb') as handle:
    train = pkl.load(handle)
with open('../data/test.pkl', 'rb') as handle:
    test = pkl.load(handle) 

x_train, y_train = train
x_test, y_test = test
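
To make the imbalance concrete, here is a minimal sketch that computes the class prevalence and the "balanced" class weights from the counts quoted above. It uses the heuristic documented by scikit-learn, n_samples / (n_classes * class_count). The label lists are stand-ins rebuilt from those counts (not the real y_train and y_test), the helper name balanced_class_weights is made up for this illustration, and whether CustomGBM applies the same formula for "class_weight": "balanced" is an assumption.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {class: n_samples / (n_classes * class_count)} (hypothetical helper)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Stand-in labels rebuilt from the counts in the text, not the real data
y_train_demo = [1] * 90 + [0] * (468 - 90)
y_test_demo = [1] * 38 + [0] * (135 - 38)

# Fraction of active compounds in each split
train_prevalence = sum(y_train_demo) / len(y_train_demo)
test_prevalence = sum(y_test_demo) / len(y_test_demo)

weights = balanced_class_weights(y_train_demo)

print(f"train prevalence: {train_prevalence:.3f}")
print(f"test prevalence:  {test_prevalence:.3f}")
print(f"balanced weights: {weights}")
```

With these counts the active class receives a weight of 2.6 and the inactive class a weight of about 0.62, so each class contributes the same total weight to the loss.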

Let's build a baseline with the default LightGBM implementation, trained with class-weighted cross-entropy.

In [4]:
gbm = lgb.LGBMClassifier(n_estimators=100, class_weight="balanced")
gbm.fit(x_train, y_train)

predictions_default = gbm.predict_proba(x_test)[:,1]
pr_auc_default = average_precision_score(y_test, predictions_default)

print(f"Default PR-AUC: {pr_auc_default}")

Default PR-AUC: 0.3628408002927406
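
For context on this number, average precision summarizes the precision-recall curve as the sum over thresholds of (R_n - R_{n-1}) * P_n, and a model that gives every compound the same score obtains the positive-class prevalence, about 0.28 on this test set (38/135). The pure-Python sketch below illustrates that computation. The function average_precision is written for this illustration and is not the scikit-learn implementation, and the labels are stand-ins rebuilt from the test-set counts.

```python
def average_precision(y_true, scores):
    """Sum of (R_n - R_{n-1}) * P_n over distinct score thresholds."""
    n_pos = sum(y_true)
    # Visit samples from the highest score to the lowest
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp = fp = 0
    prev_recall = 0.0
    ap = 0.0
    i = 0
    while i < len(order):
        # Consume every sample tied at this score so ties form one threshold
        threshold = scores[order[i]]
        while i < len(order) and scores[order[i]] == threshold:
            if y_true[order[i]] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        recall = tp / n_pos
        precision = tp / (tp + fp)
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap


# Stand-in test labels rebuilt from the counts in the text (38 active of 135)
y_demo = [1] * 38 + [0] * 97

# A constant score carries no ranking information: AP equals the prevalence
no_skill = average_precision(y_demo, [0.5] * len(y_demo))

# A perfect ranking puts every active compound above every inactive one
perfect = average_precision(y_demo, [1.0] * 38 + [0.0] * 97)

print(f"no-skill AP: {no_skill:.3f}")
print(f"perfect AP:  {perfect:.3f}")
```

Read against a no-skill level of about 0.28, a PR-AUC of about 0.36 indicates a modest but real ranking signal.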


Let's train a new model with CustomGBM, using focal loss combined with class weighting.

In [18]:
booster_params = {"num_boost_round":100, "verbose":-100}
loss_fn = Focal_loss
loss_params = {"gamma": 2, "class_weight": "balanced"}
gbm_1 = CustomGBM(loss_fn, loss_params, booster_params)

gbm_1.fit(x_train, y_train)
predictions_1 = gbm_1.predict(x_test)
pr_auc_1 = average_precision_score(y_test, predictions_1)

print(f"Focal loss PR-AUC: {pr_auc_1}")

Focal loss PR-AUC: 0.3755544120238292
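
To see what the gamma parameter does, here is a minimal per-sample sketch of binary focal loss, FL = -w * (1 - p_t)^gamma * log(p_t), where p_t is the predicted probability of the true class, together with its derivative with respect to the raw score. The function names and the weight argument are assumptions made for this illustration; this is not the code in focal.py, and a real LightGBM custom objective would also need the second derivative (hessian), which is omitted here.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def true_class_prob(z, y):
    """Probability assigned to the true class, clipped to avoid log(0)."""
    p = sigmoid(z)
    p_t = p if y == 1 else 1.0 - p
    return min(max(p_t, 1e-12), 1.0 - 1e-12)


def focal_loss(z, y, gamma=2.0, weight=1.0):
    """Focal loss for one sample with raw score z and label y in {0, 1}."""
    p_t = true_class_prob(z, y)
    return -weight * (1.0 - p_t) ** gamma * math.log(p_t)


def focal_grad(z, y, gamma=2.0, weight=1.0):
    """Derivative of focal_loss with respect to the raw score z."""
    p_t = true_class_prob(z, y)
    # p_t increases with z for positives and decreases with z for negatives
    sign = 1.0 if y == 1 else -1.0
    d_ds = (gamma * p_t * (1.0 - p_t) ** gamma * math.log(p_t)
            - (1.0 - p_t) ** (gamma + 1.0))
    return weight * sign * d_ds


# Easy positive (p_t = 0.9) versus hard positive (p_t = 0.1)
z_easy, z_hard = math.log(9.0), -math.log(9.0)

# gamma = 0 reduces focal loss to plain cross-entropy
ratio_easy = focal_loss(z_easy, 1, gamma=2.0) / focal_loss(z_easy, 1, gamma=0.0)
ratio_hard = focal_loss(z_hard, 1, gamma=2.0) / focal_loss(z_hard, 1, gamma=0.0)

print(f"focal / cross-entropy, easy example: {ratio_easy:.4f}")
print(f"focal / cross-entropy, hard example: {ratio_hard:.4f}")
```

With gamma = 2, the easy example keeps (1 - 0.9)^2 = 1% of its cross-entropy loss while the hard example keeps (1 - 0.1)^2 = 81%, so training concentrates on the compounds the model currently gets wrong.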




Focal loss with class weighting improves PR-AUC over the baseline from about 0.363 to 0.376. We can push it further with PolyLoss (https://arxiv.org/abs/2204.12511), which is based on a polynomial expansion of cross-entropy. It is already implemented both as a stand-alone loss function and as an additional parameter (epsilon) for any of the custom losses. Let's try it in combination with focal loss; note that this run sets epsilon but no class_weight:

In [24]:
booster_params = {"num_boost_round":100, "verbose":-100}
loss_fn = Focal_loss
loss_params = {"gamma": 2, "epsilon": 0.4}
gbm_2 = CustomGBM(loss_fn, loss_params, booster_params)

gbm_2.fit(x_train, y_train)
predictions_2 = gbm_2.predict(x_test)
pr_auc_2 = average_precision_score(y_test, predictions_2)

print(f"Focal+Poly loss PR-AUC: {pr_auc_2}")

Focal+Poly loss PR-AUC: 0.3908879690946251
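
As a rough illustration of what epsilon adds, the Poly-1 variant of focal loss described in the PolyLoss paper is FL + epsilon * (1 - p_t)^(gamma + 1), where p_t is the predicted probability of the true class. The sketch below evaluates that formula for a single sample. The function names are assumptions made for this illustration, and it is also an assumption that the "epsilon" entry in loss_params corresponds to this coefficient inside CustomGBM.

```python
import math


def true_class_prob(z, y):
    """Probability of the true class for raw score z and label y in {0, 1}."""
    # Numerically stable logistic function
    if z >= 0:
        p = 1.0 / (1.0 + math.exp(-z))
    else:
        e = math.exp(z)
        p = e / (1.0 + e)
    p_t = p if y == 1 else 1.0 - p
    # Clip to avoid log(0)
    return min(max(p_t, 1e-12), 1.0 - 1e-12)


def poly1_focal_loss(z, y, gamma=2.0, epsilon=0.0):
    """Focal loss plus the first-order polynomial correction term."""
    p_t = true_class_prob(z, y)
    focal = -((1.0 - p_t) ** gamma) * math.log(p_t)
    poly1 = epsilon * (1.0 - p_t) ** (gamma + 1.0)
    return focal + poly1


# Compare plain focal loss (epsilon = 0) with the Poly-1 variant (epsilon = 0.4)
# for a positive sample at several raw scores
for z in (-2.0, 0.0, 2.0):
    plain = poly1_focal_loss(z, 1, gamma=2.0, epsilon=0.0)
    poly = poly1_focal_loss(z, 1, gamma=2.0, epsilon=0.4)
    print(f"z={z:+.1f}  focal={plain:.4f}  focal+poly1={poly:.4f}")
```

Setting epsilon to 0 recovers plain focal loss, and a positive epsilon adds the largest penalty to samples whose true-class probability is low.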


