# LightGBM Method

__Our third method will be a Light Gradient Boosting Machine__

This model was originally created by Microsoft. It is essentially an ensemble of many small decision trees, built one after another, where each tree tries to correct the mistakes of the previous ones.
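To make the "each tree corrects the previous ones" idea concrete, here is a minimal sketch of boosting with one-split trees (stumps) on toy data. It uses plain squared error rather than the Poisson objective, and the helper names (`fit_stump`, `predict_stump`) and the data are made up for illustration; it is not how LightGBM is implemented internally.

```python
import numpy as np

def fit_stump(x, residual):
    """Find the single threshold on x that best splits the residuals (squared error)."""
    best = None
    for thr in np.unique(x)[:-1]:          # skip the max so both sides are non-empty
        left, right = residual[x <= thr], residual[x > thr]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, thr, left.mean(), right.mean())
    return best[1:]                        # (threshold, left value, right value)

def predict_stump(stump, x):
    thr, left_val, right_val = stump
    return np.where(x <= thr, left_val, right_val)

# toy data (assumption, not the claims dataset)
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = np.sin(x) + rng.normal(0, 0.1, size=200)

learning_rate = 0.1
pred = np.full_like(y, y.mean())           # start from a constant prediction
errors = []
for _ in range(50):
    residual = y - pred                    # what the ensemble still gets wrong
    stump = fit_stump(x, residual)         # the new tree is fitted to those mistakes
    pred += learning_rate * predict_stump(stump, x)
    errors.append(np.mean((y - pred) ** 2))

print(f"MSE after 1 tree: {errors[0]:.4f}, after 50 trees: {errors[-1]:.4f}")
```

Each round only nudges the prediction by `learning_rate` times the new stump's output, which is why many small trees are needed.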

We are going to train a LightGBM model (Poisson objective) on the dataset, evaluate it with Poisson deviance and RMSE/MAE, and show calibration.

__First the libraries__

In [18]:
import pandas as pd
import numpy as np
import lightgbm as lgb #the library of our chosen method



from sklearn.model_selection import KFold, train_test_split #we already have two sets, but with this tool we will get a validation set before the test set
from sklearn.preprocessing import OrdinalEncoder
from sklearn.metrics import mean_squared_error, mean_absolute_error
import matplotlib.pyplot as plt
import joblib
import warnings
warnings.filterwarnings("ignore")

__Now we code the Poisson deviance and evaluation metrics__

These are just small utility functions to help us with the evaluation.

Because we train __LightGBM__ with a Poisson objective, Poisson deviance becomes a very useful tool.

Counts are different from continuous values. Poisson deviance measures how well the predicted expected counts explain the observed counts, paying attention to the relative scale: _errors when true counts are large are treated differently than errors when counts are small_.


By computing Poisson deviance we can compare our model to what it actually optimized.
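
For reference, the mean Poisson deviance can be written as below. The symbols are my own notation, not the notebook's: $y_i$ is the observed count for row $i$, $\hat{\mu}_i$ the predicted expected count, and $n$ the number of rows. By convention the term $y_i \log(y_i / \hat{\mu}_i)$ is taken as 0 when $y_i = 0$.

```latex
D(y, \hat{\mu}) = \frac{2}{n} \sum_{i=1}^{n} \left[ y_i \log\frac{y_i}{\hat{\mu}_i} - \left( y_i - \hat{\mu}_i \right) \right]
```

The deviance is 0 only when every prediction equals its observed count, and it grows as predictions move away from the observations.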

In [19]:
def poisson_deviance(y_true, y_pred, eps=1e-9):
    """
    Compute the mean Poisson deviance between true counts and predicted expected counts.
    y_true: observed counts
    y_pred: predicted (expected) counts, must be >= 0
    eps   : small value to avoid log(0) and division by zero
    """
    y_pred = np.maximum(y_pred, eps)
    y_true = np.asarray(y_true)
    term = y_true * np.log((y_true + eps) / y_pred) - (y_true - y_pred)

    return 2.0 * np.mean(term)

def evaluate_counts(y_true, y_pred, exposure = None):
    """
    Evaluate predicted counts vs true counts.
    Both y_true and y_pred should be counts (not rates); `exposure` is accepted but currently unused.
    Returns a dict with RMSE, MAE, Poisson deviance, and total observed vs predicted.
    """
    return {
        "rmse": np.sqrt(mean_squared_error(y_true, y_pred)),
        "mae": mean_absolute_error(y_true, y_pred),
        "poisson_deviance": poisson_deviance(y_true, y_pred),
        "total_true": float(np.sum(y_true)),
        "total_pred": float(np.sum(y_pred))
    }
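
As a quick check of the "relative scale" point, the sketch below compares the same absolute error (one claim) at a low count and at a high count. The function here is a compact stand-alone variant written for this example (the name `mean_poisson_deviance_sketch` and the numbers are assumptions), not a replacement for the helper above.

```python
import numpy as np

def mean_poisson_deviance_sketch(y_true, y_pred):
    """Mean Poisson deviance, using the convention 0 * log(0) = 0."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    safe_y = np.where(y_true > 0, y_true, 1.0)   # avoids log(0); those rows contribute 0 anyway
    log_term = np.where(y_true > 0, y_true * np.log(safe_y / y_pred), 0.0)
    return 2.0 * np.mean(log_term - (y_true - y_pred))

small = mean_poisson_deviance_sketch([1], [2])    # off by one claim at a low count
large = mean_poisson_deviance_sketch([10], [11])  # off by one claim at a high count

print(f"true=1,  pred=2  -> deviance {small:.4f}")
print(f"true=10, pred=11 -> deviance {large:.4f}")
```

RMSE and MAE would score both cases identically (an error of 1), while the deviance penalizes the low-count miss much more heavily.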

__Then we load and clean the data__

In [20]:
# --- LOAD THE DATA --- #

train = pd.read_csv("../data/claims_train.csv")
test = pd.read_csv("../data/claims_test.csv")

print("Train shape:", train.shape)
print("Test shape:", test.shape)
print("\nTrain columns:", train.columns.tolist())

train = train.copy() #just in case


# --- ABOUT THE EXPOSURE FEATURE --- #

#Gabriel told us by email that an exposure significantly greater than 1 can be suspicious (entry mistakes, policies observed for longer periods, or duplicates), so we cap it at 1; this prevents a few large-exposure rows from skewing the model.
train['Exposure_orig'] = train['Exposure']
train['exposure_large'] = (train['Exposure'] > 1).astype(int) #new 0/1 column flagging the rows whose exposure is above 1

num_large = int(train['exposure_large'].sum())
print(f"Rows with Exposure > 1: {num_large} ({num_large/len(train):.2%} of dataset)")
if num_large > 0:
    display(train.loc[train['exposure_large'] == 1, ['Exposure_orig', 'Exposure']].head(10))

train['Exposure'] = train['Exposure'].clip(upper=1.0)

print("Rows with Exposure > 1 (capped):", int(train['exposure_large'].sum()))


# --- SELECTING THE DATA --- #

target = 'ClaimNb'
exposure = 'Exposure'
#we are going to exclude the target and the exposure, as well as the ID columns, from the features

exclude = {target, 'Exposure_orig', exposure, 'IDpol', 'Id', 'PolicyID'}
features = [c for c in train.columns if c not in exclude] #the columns we want the model to learn from

train['log_exposure'] = np.log(train[exposure].replace(0, 1e-6)) #new column: exposure on a log scale, so that a full year maps to 0:
    # when exposure = 1   --> log_exposure = 0
    # when exposure = 0.5 --> log_exposure = -0.693
    # (exposure was capped at 1 above, so log_exposure is never positive)

if 'log_exposure' not in features:
    features.append('log_exposure')
if 'exposure_large' not in features:
    features.append('exposure_large')
#we make sure to append those to the features




#now we identify the column types (object columns, or columns with at most 50 distinct values, count as categorical); this is just to visualize it better
cat_cols = [c for c in features if (train[c].dtype == 'object' or train[c].nunique() <= 50)]
num_cols = [c for c in features if c not in cat_cols]

# print the result
print("Numeric columns:", num_cols)
print("Categorical columns:", cat_cols)
print("Total features used:", len(features))


#the whole goal of this cell was to see which columns of the table the model will look at, and in what form

Train shape: (542410, 12)
Test shape: (135603, 12)

Train columns: ['IDpol', 'ClaimNb', 'Exposure', 'Area', 'VehPower', 'VehAge', 'DrivAge', 'BonusMalus', 'VehBrand', 'VehGas', 'Density', 'Region']
Rows with Exposure > 1: 994 (0.18% of dataset)


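|      | Exposure_orig | Exposure |
|------|---------------|----------|
| 272  | 1.03          | 1.03     |
| 1298 | 1.01          | 1.01     |
| 1767 | 1.06          | 1.06     |
| 1990 | 1.01          | 1.01     |
| 2209 | 1.03          | 1.03     |
| 2667 | 1.03          | 1.03     |
| 3380 | 1.1           | 1.1      |
| 4342 | 1.13          | 1.13     |
| 4555 | 1.05          | 1.05     |
| 4844 | 1.02          | 1.02     |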


Rows with Exposure > 1 (capped): 994
Numeric columns: ['VehAge', 'DrivAge', 'BonusMalus', 'Density', 'log_exposure']
Categorical columns: ['Area', 'VehPower', 'VehBrand', 'VehGas', 'Region', 'exposure_large']
Total features used: 11


__Now, we encode categorical features__

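Basically, we convert the categorical columns into numeric codes our model can use directly. Think of it as a compact alternative to dummy variables: one integer code per category instead of one extra column per category 😌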
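
To show what "numeric codes" means in practice, here is a tiny pure-Python stand-in for an ordinal encoder. The category values and the `encode` helper are made up for illustration; the notebook itself relies on scikit-learn's `OrdinalEncoder`, which this sketch only imitates.

```python
# Toy stand-in for what an ordinal encoder does; the column values are made up.
train_values = ["B12", "B1", "B2", "B12", "B1"]
test_values = ["B2", "B99"]                       # "B99" never appeared in training

# Learn the mapping on the training data only: sorted category -> integer code
mapping = {cat: code for code, cat in enumerate(sorted(set(train_values)))}

def encode(values, mapping, unknown_value=-1):
    """Replace each category by its code; unseen categories get unknown_value."""
    return [mapping.get(v, unknown_value) for v in values]

train_codes = encode(train_values, mapping)
test_codes = encode(test_values, mapping)

print(mapping)
print(train_codes, test_codes)
```

The codes are arbitrary labels, not magnitudes, which is why these columns are later declared as categorical when building the LightGBM datasets.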

In [21]:
enc = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value= -1) #this creates the encoder; these parameters ensure unseen categories get -1 at transform time

if len(cat_cols):
    enc.fit(train[cat_cols].astype(str))
    train[cat_cols] = enc.transform(train[cat_cols].astype(str))

display(train[features].head())
#this is similar in spirit to pandas category codes, but the fitted encoder remembers the mapping, so the same codes can be applied to new data later

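|   | Area | VehPower | VehAge | DrivAge | BonusMalus | VehBrand | VehGas | Density | Region | exposure_large | log_exposure |
|---|------|----------|--------|---------|------------|----------|--------|---------|--------|----------------|--------------|
| 0 | 3.0  | 9.0      | 18     | 36      | 95         | 0.0      | 1.0    | 1054    | 4.0    | 0.0            | -0.84397     |
| 1 | 3.0  | 9.0      | 17     | 80      | 95         | 6.0      | 1.0    | 598     | 5.0    | 0.0            | -2.302585    |
| 2 | 4.0  | 9.0      | 3      | 36      | 76         | 10.0     | 1.0    | 4172    | 17.0   | 0.0            | -1.108663    |
| 3 | 0.0  | 7.0      | 4      | 73      | 52         | 4.0      | 0.0    | 15      | 4.0    | 0.0            | -0.579818    |
| 4 | 4.0  | 10.0     | 0      | 37      | 50         | 2.0      | 0.0    | 3021    | 12.0   | 0.0            | -1.309333    |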


__Then, we fit the data into LightGBM-friendly datasets__

LightGBM uses _lightgbm.Dataset_ as its native data container: it stores the features in a compact, pre-binned form, which makes training faster. It is the most suitable input format for this method, so we convert our data to it.

In [22]:
def make_lgb_dataset(X_df, y = None):
    """
    Convert a pandas df (X_df --> features) and an optional target y into a lightgbm.Dataset.
    It also returns the list of categorical feature indices.
    """
    if not isinstance(X_df, pd.DataFrame): #make sure the provided data is a DataFrame with the expected column names
        X_df = pd.DataFrame(X_df, columns = features)

    #LightGBM accepts categorical features as column indices, we are going to use 'cat_cols' for this
    cat_feature_indices = [X_df.columns.get_loc(c) for c in cat_cols] if len(cat_cols) else []

    #this creates the dataset; if we have y, it is included as the labels (label=None means "no labels")
    #the categorical features are now declared in both cases, not only when y is given
    dset = lgb.Dataset(X_df, label = y, categorical_feature = cat_feature_indices, free_raw_data = False)

    return dset, cat_feature_indices

__And now, we code the actual training cell__

This is the 5-fold cross-validation loop that actually trains LightGBM. It collects out-of-fold (OOF) predictions for honest evaluation and prints per-fold metrics.
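
Before the real loop, the out-of-fold idea can be sketched on toy data with NumPy only. Everything here is an assumption for illustration: the counts are simulated, and the "model" is just the mean of the training part of each fold, standing in for LightGBM.

```python
import numpy as np

rng = np.random.default_rng(42)
y = rng.poisson(lam=0.1, size=1000).astype(float)   # toy claim counts, made up

n_splits = 5
indices = rng.permutation(len(y))                   # shuffle once, then cut into folds
folds = np.array_split(indices, n_splits)

oof_preds = np.full(len(y), np.nan)
for k, val_idx in enumerate(folds, start=1):
    # the training part is every fold except the current validation fold
    tr_idx = np.concatenate([f for j, f in enumerate(folds, start=1) if j != k])
    fold_mean = y[tr_idx].mean()                    # "model" fitted on the training part only
    oof_preds[val_idx] = fold_mean                  # predict rows the "model" never saw

# every row has been predicted exactly once, by a model that did not train on it
print("rows without a prediction:", int(np.isnan(oof_preds).sum()))
print("OOF mean prediction:", oof_preds.mean(), "observed mean:", y.mean())
```

Because each prediction comes from a model that never saw that row, metrics computed on `oof_preds` are a fairer estimate than metrics computed on the data a model was trained on.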



In [None]:
X = train[features].reset_index(drop=True)
y = train[target].reset_index(drop = True)
#we reset the index just in case we get some index alignment issues later

kf = KFold(n_splits= 5, shuffle= True, random_state= 42)#this splits the training data into 5 different validation folds --> every observation is in validation only once


params = {
    "objective": "poisson",     #this tells lightgbm to "optimize a loss appropriate for count data"
    "metric": "poisson",        #means "the model will report the Poisson metric during training"
    "learning_rate": 0.05,      #smaller than the LightGBM default of 0.1
    "num_leaves": 31,           #default
    "min_data_in_leaf": 20,     #default
    "verbosity": -1,            #to reduce the spam of training lines
    "seed": 42                  #this makes the training "deterministic" --> same splits, same initial randomness ... = consistent results
}

oof_preds = np.zeros(len(X))
models = []
fold_scores = []

print("Beginning 5-fold training")

#ok I'm going to comment a lot here because this took me a while to understand

for fold, (tr_idx, val_idx) in enumerate(kf.split(X,y), start = 1):
    #kf.split yields 5 pairs of index arrays --> (train_indices, val_indices); tr_idx (80%) and val_idx (20%) are just arrays of int row indices referring back to the rows of X and y
    print(f'\n --- Fold {fold} ---')
    X_tr, X_val = X.iloc[tr_idx], X.iloc[val_idx]
    y_tr, y_val = y.iloc[tr_idx], y.iloc[val_idx]
    #we use iloc because it selects rows by int position --> that's how we select the rows for each fold
    #--- It is important that we keep the validation set entirely separate from the fold's training set to avoid data leakage ---

    dtrain, cat_idx = make_lgb_dataset(X_tr, y_tr)
    dval, _ = make_lgb_dataset(X_val, y_val) 
    # dtrain / dval --> lightgbm.Dataset objects, very compact and optimized, great for training
    # cat_idx is a list of int column indices that correspond to categorical features --> LightGBM can handle them specifically
    
    
    
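    model = lgb.train(
        params,                         #the hyperparameters defined above
        dtrain,                         #the prepared training dataset for the fold
        num_boost_round=2000,           #max number of trees
        valid_sets=[dtrain, dval],      #datasets on which the metric is reported during training
        valid_names=["train", "valid"],
        #early stopping is disabled here, so all 2000 rounds are trained. In LightGBM >= 4.0 the old
        #early_stopping_rounds / verbose_eval arguments were removed; the callback equivalent would be:
        #callbacks=[lgb.early_stopping(stopping_rounds=50), lgb.log_evaluation(period=100)],
    )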

    preds_val = model.predict(X_val, num_iteration = model.best_iteration)
    oof_preds[val_idx] = preds_val
    models.append(model)

    eval_dict = evaluate_counts(y_val.values, preds_val)
    fold_scores.append(eval_dict)

    print(f"Fold {fold} results: RMSE = {eval_dict['rmse']:.4f}, MAE = {eval_dict['mae']:.4f}, PoissonDeviance = {eval_dict['poisson_deviance']:.4f}")
    print(f"Fold {fold} total observed={eval_dict['total_true']:.1f}, total predicted={eval_dict['total_pred']:.1f}")


df_fold = pd.DataFrame(fold_scores)
print("\nFold metrics summary (mean ± std):")
print(df_fold.agg(['mean','std']).T[['mean','std']])
oof_eval = evaluate_counts(y.values, oof_preds)
print("\nOOF evaluation on full training set:")
print(f" RMSE: {oof_eval['rmse']:.4f}")
print(f" MAE: {oof_eval['mae']:.4f}")
print(f" Poisson deviance: {oof_eval['poisson_deviance']:.4f}")
print(f" Total observed: {oof_eval['total_true']:.1f}, Total predicted: {oof_eval['total_pred']:.1f}")


Beginning 5-fold training

 --- Fold 1 ---
Fold 1 results: RMSE = 0.2348, MAE = 0.0964, PoissonDeviance = 0.2935
Fold 1 total observed=5884.0, total predicted=5678.0

 --- Fold 2 ---
Fold 2 results: RMSE = 0.2294, MAE = 0.0946, PoissonDeviance = 0.2824
Fold 2 total observed=5607.0, total predicted=5745.4

 --- Fold 3 ---
Fold 3 results: RMSE = 0.2371, MAE = 0.0956, PoissonDeviance = 0.2903
Fold 3 total observed=5772.0, total predicted=5677.4

 --- Fold 4 ---
Fold 4 results: RMSE = 0.2326, MAE = 0.0955, PoissonDeviance = 0.2881
Fold 4 total observed=5722.0, total predicted=5713.7

 --- Fold 5 ---
Fold 5 results: RMSE = 0.2348, MAE = 0.0958, PoissonDeviance = 0.2902
Fold 5 total observed=5829.0, total predicted=5670.2

Fold metrics summary (mean ± std):
                         mean         std
rmse                 0.233712    0.002904
mae                  0.095555    0.000634
poisson_deviance     0.288891    0.004113
total_true        5762.800000  106.177681
total_pred        5696.9557