# LightGBM Method

__Our third method will be a Light Gradient Boosting Machine__

This model was originally created by Microsoft. It is essentially an ensemble of many small decision trees, built one after another, where each tree tries to correct the mistakes of the previous ones.
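
To make the "each tree corrects the previous ones" idea concrete, here is a minimal sketch of boosting with one-split trees (stumps) on squared error. It is only an illustration and not how LightGBM is implemented internally; the helpers `fit_stump` and `predict_stump`, the toy data, and the learning rate are all assumptions made up for this example.

```python
import numpy as np

def fit_stump(x, residual):
    """Find the single threshold on x that best splits the residuals (squared error)."""
    best = None
    for thr in np.unique(x)[:-1]:          # candidate split points (max excluded so both sides are non-empty)
        left, right = residual[x <= thr], residual[x > thr]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, thr, left.mean(), right.mean())
    return best[1:]                        # (threshold, left value, right value)

def predict_stump(stump, x):
    thr, left_val, right_val = stump
    return np.where(x <= thr, left_val, right_val)

# Toy data (hypothetical): a noisy sine curve
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = np.sin(x) + rng.normal(scale=0.1, size=200)

learning_rate = 0.3
pred = np.full_like(y, y.mean())           # start from a constant prediction
mse_history = [np.mean((y - pred) ** 2)]

for _ in range(50):
    stump = fit_stump(x, y - pred)         # fit the next tree to the current errors
    pred += learning_rate * predict_stump(stump, x)
    mse_history.append(np.mean((y - pred) ** 2))

print(f"MSE before boosting: {mse_history[0]:.4f}")
print(f"MSE after 50 stumps: {mse_history[-1]:.4f}")
```

Each stump on its own is a very weak model, but because every new one is fitted to what is still unexplained, the training error keeps shrinking as stumps are added.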

We are going to train a LightGBM model with a Poisson objective on the dataset, evaluate it with Poisson deviance and RMSE/MAE, and check its calibration.

__First the libraries__

In [None]:
import pandas as pd
import numpy as np
import lightgbm as lgb #the library of our chosen method



from sklearn.model_selection import KFold, train_test_split #we already have two sets, but with this tool we will get a validation set before the test set
from sklearn.preprocessing import OrdinalEncoder
from sklearn.metrics import mean_squared_error, mean_absolute_error
import matplotlib.pyplot as plt
import joblib
import warnings
warnings.filterwarnings("ignore")

__Now we code the Poisson deviance and evaluation metrics__

These are just small utility functions to help us with the evaluation.

Because we train __LightGBM__ with a Poisson objective, Poisson deviance becomes a very useful tool.

Counts are different from continuous values. Poisson deviance measures how well the predicted expected counts explain the observed counts, paying attention to the relative scale: _errors when true counts are large are treated differently from errors when counts are small_.


By computing Poisson deviance, we evaluate our model on the same kind of quantity it actually optimized during training.
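
To see the "relative scale" point in numbers, here is a small sketch that computes the deviance contribution of a single observation, `2 * (y*log(y/mu) - (y - mu))`, for two cases with the same absolute error. The helper name `unit_poisson_deviance` and the example values are made up for this illustration.

```python
import math

def unit_poisson_deviance(y, mu):
    """Deviance contribution of one observation: 2 * (y*log(y/mu) - (y - mu))."""
    log_term = y * math.log(y / mu) if y > 0 else 0.0   # y*log(y) -> 0 as y -> 0
    return 2.0 * (log_term - (y - mu))

# Same absolute error (the prediction is 2 too high), very different true counts
small = unit_poisson_deviance(y=1, mu=3)
large = unit_poisson_deviance(y=100, mu=102)

print(f"y=1,   mu=3:   deviance = {small:.4f}")
print(f"y=100, mu=102: deviance = {large:.4f}")
```

Being off by 2 when the true count is 1 is penalized far more than being off by 2 when the true count is 100, whereas squared error would treat both cases identically.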

In [7]:
def poisson_deviance(y_true, y_pred, eps=1e-9):
    """
    Compute the mean Poisson deviance between true counts and predicted expected counts.
    y_true: observed counts
    y_pred: predicted expected counts (must be >= 0)
    eps    : small value to avoid log(0)
    """
    y_pred = np.maximum(y_pred, eps)
    y_true = np.asarray(y_true)
    term = y_true * np.log((y_true + eps) / y_pred) - (y_true - y_pred)

    return 2.0 * np.mean(term)

def evaluate_counts(y_true, y_pred, exposure = None):
    """
    Evaluate predicted counts vs true counts.
    Both y_true and y_pred should be counts (not rates); `exposure` is accepted but not used here.
    Return a dict with RMSE, MAE, Poisson deviance, and total observed vs predicted counts.
    
    """
    return {
        "rmse": np.sqrt(mean_squared_error(y_true, y_pred)),
        "mae": mean_absolute_error(y_true, y_pred),
        "poisson_deviance": poisson_deviance(y_true, y_pred),
        "total_true": float(np.sum(y_true)),
        "total_pred": float(np.sum(y_pred))
    }

__Then we load and clean the data__

In [None]:
# --- LOAD THE DATA --- #

train = pd.read_csv("../data/claims_train.csv")
test = pd.read_csv("../data/claims_test.csv")

print("Train shape:", train.shape)
print("Test shape:", test.shape)
print("\nTrain columns:", train.columns.tolist())

train = train.copy() #just in case


# --- ABOUT THE EXPOSURE FEATURE --- #

# We got an email from Gabriel explaining that exposure significantly greater than 1 can be suspicious
# (entry mistakes, policies observed for longer periods, or duplicates), so we cap it at 1.
# This prevents a few large-exposure rows from skewing the model.
train['Exposure_orig'] = train['Exposure']
train['exposure_large'] = (train['Exposure'] > 1).astype(int) # 0/1 flag marking the rows whose exposure is above 1

num_large = int(train['exposure_large'].sum())
print(f"Rows with Exposure > 1: {num_large} ({num_large/len(train):.2%} of dataset)")
if num_large > 0:
    display(train.loc[train['exposure_large'] == 1, ['Exposure_orig', 'Exposure']].head(10))

train['Exposure'] = train['Exposure'].clip(upper=1.0)

print("Rows with Exposure > 1 (capped):", int(train['exposure_large'].sum()))


# --- SELECTING THE DATA --- #

target = 'ClaimNb'
exposure = 'Exposure'
# we exclude the target, the exposure columns and the ID columns from the features

exclude = {target, 'Exposure_orig', exposure, 'IDpol', 'Id', 'PolicyID'}
features = [c for c in train.columns if c not in exclude] # basically, we select the columns we want the model to learn from

train['log_exposure'] = np.log(train[exposure].replace(0, 1e-6)) # log of the (capped) exposure; zeros are replaced by 1e-6 to avoid log(0):
    # when exposure = 1   --> log_exposure = 0
    # when exposure = 0.5 --> log_exposure = -0.693
    # exposure was capped at 1 above, so log_exposure is never positive

if 'log_exposure' not in features:
    features.append('log_exposure')
if 'exposure_large' not in features:
    features.append('exposure_large')
#we make sure to append those to the features




# now we identify the categorical and numeric columns, just to visualize things better
cat_cols = [c for c in features if (train[c].dtype == 'object' or train[c].nunique() <= 50)]
num_cols = [c for c in features if c not in cat_cols]

# print them
print("Numeric columns:", num_cols)
print("Categorical columns:", cat_cols)
print("Total features used:", len(features))


# the whole goal of this cell was to see which columns of the table the model will look at, and in what form

Train shape: (542410, 12)
Test shape: (135603, 12)

Train columns: ['IDpol', 'ClaimNb', 'Exposure', 'Area', 'VehPower', 'VehAge', 'DrivAge', 'BonusMalus', 'VehBrand', 'VehGas', 'Density', 'Region']
Rows with Exposure > 1: 994 (0.18% of dataset)


| | Exposure_orig | Exposure |
|---|---|---|
| 272 | 1.03 | 1.03 |
| 1298 | 1.01 | 1.01 |
| 1767 | 1.06 | 1.06 |
| 1990 | 1.01 | 1.01 |
| 2209 | 1.03 | 1.03 |
| 2667 | 1.03 | 1.03 |
| 3380 | 1.1 | 1.1 |
| 4342 | 1.13 | 1.13 |
| 4555 | 1.05 | 1.05 |
| 4844 | 1.02 | 1.02 |


Rows with Exposure > 1 (capped): 994
Numeric columns: ['VehAge', 'DrivAge', 'BonusMalus', 'Density', 'log_exposure']
Categorical columns: ['Area', 'VehPower', 'VehBrand', 'VehGas', 'Region', 'exposure_large']
Total features used: 11
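
As a quick sanity check of the exposure handling in the cell above, the same three steps (flag, cap, log) can be run on a handful of made-up exposure values. This is only a sketch with NumPy arrays instead of the real DataFrame; the values in `exposure_raw` are invented for the illustration.

```python
import numpy as np

exposure_raw = np.array([0.0, 0.25, 0.5, 1.0, 1.13])    # toy values, not from the dataset

exposure_large = (exposure_raw > 1).astype(int)          # flag rows above 1 before capping
exposure_capped = np.minimum(exposure_raw, 1.0)          # cap at 1, same effect as .clip(upper=1.0)
safe_exposure = np.where(exposure_capped == 0, 1e-6, exposure_capped)  # avoid log(0)
log_exposure = np.log(safe_exposure)

for raw, flag, log_e in zip(exposure_raw, exposure_large, log_exposure):
    print(f"exposure={raw:<5} exposure_large={flag} log_exposure={log_e:.3f}")
```

After capping, `log_exposure` can be at most 0, and the only information left about the original size of a capped row is the `exposure_large` flag.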


__Now we encode the categorical features__

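Basically, we convert the categorical columns into numeric codes that our model can use directly. It plays a similar role to dummy variables, except that each category gets a single integer code instead of its own column 😌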

In [None]:
enc = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1) # this creates the encoder; these parameters ensure unseen categories get -1 at transform time.
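
Here is a small sketch of how an encoder configured with `unknown_value=-1` behaves when it meets a category it never saw during fitting. The arrays `train_demo` and `test_demo` and the "Electric" category are made up for the example and are not taken from the dataset.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

enc_demo = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)

# Toy single-column data (hypothetical fuel types)
train_demo = np.array([["Diesel"], ["Regular"], ["Diesel"]], dtype=object)
test_demo = np.array([["Regular"], ["Electric"]], dtype=object)

enc_demo.fit(train_demo)                 # learns the categories: Diesel -> 0, Regular -> 1
codes = enc_demo.transform(test_demo)    # Regular -> 1.0, Electric (unseen) -> -1.0

print(enc_demo.categories_)
print(codes)
```

Using -1 keeps unseen categories clearly separate from the codes 0, 1, 2, ... that are assigned to the categories seen during fitting.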


