
# **S4E12 - Insurance Premium Prediction Dataset**

## Problem Statement

The goal of this dataset is to facilitate the development and testing of regression models for predicting insurance premiums based on various customer characteristics and policy details. Insurance companies often rely on data-driven approaches to estimate premiums, taking into account factors such as age, income, health status, and claim history. This synthetic dataset simulates real-world scenarios to help practitioners practice feature engineering, data cleaning, and model training.

## Dataset Overview

This dataset contains more than 200,000 rows ("2Lk+", i.e. 2 lakh+) and 20 features, mixing categorical, numerical, and text data. It includes missing values, incorrect data types, and skewed distributions to mimic the complexities of real-world datasets. The target variable for prediction is "Premium Amount".

### Features

1. Age: Age of the insured individual (Numerical)
2. Gender: Gender of the insured individual (Categorical: Male, Female)
3. Annual Income: Annual income of the insured individual (Numerical, skewed)
4. Marital Status: Marital status of the insured individual (Categorical: Single, Married, Divorced)
5. Number of Dependents: Number of dependents (Numerical, with missing values)
6. Education Level: Highest education level attained (Categorical: High School, Bachelor's, Master's, PhD)
7. Occupation: Occupation of the insured individual (Categorical: Employed, Self-Employed, Unemployed)
8. Health Score: A score representing the health status (Numerical, skewed)
9. Location: Type of location (Categorical: Urban, Suburban, Rural)
10. Policy Type: Type of insurance policy (Categorical: Basic, Comprehensive, Premium)
11. Previous Claims: Number of previous claims made (Numerical, with outliers)
12. Vehicle Age: Age of the vehicle insured (Numerical)
13. Credit Score: Credit score of the insured individual (Numerical, with missing values)
14. Insurance Duration: Duration of the insurance policy (Numerical, in years)
15. Premium Amount: Target variable representing the insurance premium amount (Numerical, skewed)
16. Policy Start Date: Start date of the insurance policy (Text, improperly formatted)
17. Customer Feedback: Short feedback comments from customers (Text)
18. Smoking Status: Smoking status of the insured individual (Categorical: Yes, No)
19. Exercise Frequency: Frequency of exercise (Categorical: Daily, Weekly, Monthly, Rarely)
20. Property Type: Type of property owned (Categorical: House, Apartment, Condo)

## Data Characteristics

- Missing Values: Certain features contain missing values to simulate real-world data collection issues.
- Incorrect Data Types: Some fields are intentionally set to incorrect data types to practice data cleaning.
- Skewed Distributions: Numerical features like **Annual Income** and **Premium Amount** have skewed distributions, which can be addressed through transformations.
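
The transformation mentioned in the last bullet can be sketched as follows. This is a minimal illustration, assuming a synthetic log-normal sample stands in for a skewed column such as **Premium Amount**; the numbers are generated here and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed sample (assumption: stands in for "Premium Amount").
rng = np.random.default_rng(0)
premium = pd.Series(rng.lognormal(mean=6.5, sigma=1.0, size=10_000), name="Premium Amount")

# log1p compresses the long right tail; expm1 is its inverse.
premium_log = np.log1p(premium)
premium_back = np.expm1(premium_log)

skew_before = premium.skew()
skew_after = premium_log.skew()
print(f"skew before: {skew_before:.2f}, after log1p: {skew_after:.2f}")
```

If a model is trained on the transformed target, predictions would need to be mapped back with `np.expm1` before scoring on the original scale.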

In [None]:
%%capture
!pip install -qq pytorch_tabnet
!pip install optuna
!pip install catboost
!pip install optuna-integration-pytorch-tabnet
!pip install category-encoders
!pip install optuna-integration
!pip install keras-tuner --upgrade
!pip install keras-nlp
!pip install BorutaShap
!pip install scikit-lego
!pip install skops

from pytorch_tabnet.tab_model import TabNetRegressor

In [None]:
# Setup notebook
from pathlib import Path
import ipywidgets as widgets
import pandas as pd
import numpy as np
from pickle import load, dump
import json
import joblib
#import calplot as cal

# Graphic Libraries:
import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.patches as patches
import matplotlib.image as mpimg
# Set Style
sns.set_style("whitegrid",{"grid.linestyle":"--", 'grid.linewidth':0.2, 'grid.alpha':0.5});
sns.despine(left=True, bottom=True, top=False, right=False);
mpl.rcParams['figure.dpi'] = 120;
mpl.rc('axes', labelsize=12);
plt.rc('xtick',labelsize=10);
plt.rc('ytick',labelsize=10);

mpl.rcParams['axes.spines.top'] = False;
mpl.rcParams['axes.spines.right'] = False;
mpl.rcParams['axes.spines.left'] = True;

# Palette Setup
colors = ['#FB5B68','#FFEB48','#2676A1','#FFBDB0',]
colormap_0 = mpl.colors.LinearSegmentedColormap.from_list("",colors)
palette_1 = sns.color_palette("coolwarm", as_cmap=True)
palette_2 = sns.color_palette("YlOrBr", as_cmap=True)
palette_3 = sns.light_palette("red", as_cmap=True)
palette_4 = sns.color_palette("viridis", as_cmap=True)
palette_5 = sns.color_palette("rocket", as_cmap=True)
palette_6 = sns.color_palette("GnBu", as_cmap=True)
palette_7 = sns.color_palette("tab20c", as_cmap=False)
palette_8 = sns.color_palette("Set2", as_cmap=False)

palette_custom = ['#fbb4ae','#b3cde3','#ccebc5','#decbe4','#fed9a6','#ffffcc','#e5d8bd','#fddaec','#f2f2f2']
palette_9 = sns.color_palette(palette_custom, as_cmap=False)

# tool for Excel:
from openpyxl import load_workbook, Workbook
from openpyxl.drawing.image import Image
from openpyxl.styles import Border, Side, PatternFill, Font, GradientFill, Alignment
from openpyxl.worksheet.cell_range import CellRange

from openpyxl.formatting import Rule
from openpyxl.styles import Font, PatternFill, Border
from openpyxl.styles.differential import DifferentialStyle

# Bloomberg
#from xbbg import blp
from catboost import CatBoostRegressor, Pool, CatBoostClassifier
import xgboost as xgb
from xgboost import XGBRegressor, XGBClassifier
from xgboost.callback import EarlyStopping

import lightgbm as lgb
from lightgbm import (LGBMRegressor,
                      LGBMClassifier,
                      early_stopping,
                      record_evaluation,
                      log_evaluation)

# Time Management
from tqdm import tqdm
from datetime import date
from pandas.tseries.offsets import BMonthEnd, QuarterEnd
import datetime  # module import; a preceding `from datetime import datetime` would be shadowed by it
from pandas.tseries.offsets import BDay # BDay is business day, not birthday...
import datetime as dt
import click
import glob
import os
import gc
import re
import string

from ipywidgets import Dropdown, Layout, HTML, AppLayout, VBox, Label, HBox, BoundedFloatText, interact, Output

#from my_func import *

import optuna
from optuna.integration import TFKerasPruningCallback
from optuna.trial import TrialState
from optuna.visualization import plot_intermediate_values
from optuna.visualization import plot_optimization_history
from optuna.visualization import plot_param_importances
from optuna.visualization import plot_contour

os.environ["KERAS_BACKEND"] = "tensorflow"

import numpy as np
import tensorflow as tf
import keras
from keras import ops
from keras import layers

from keras.layers import Input, LSTM, Dense, Lambda, RepeatVector, Reshape
from keras.models import Model
from keras.losses import MeanSquaredError
from keras.metrics import RootMeanSquaredError

from keras.utils import FeatureSpace, plot_model

# Import libraries for hyperparameter tuning (`kerastuner` is the deprecated module name)
import keras_tuner as kt
from keras_tuner.tuners import RandomSearch, GridSearch, BayesianOptimization

#from my_func import *

# preprocessing modules
from sklearn.model_selection import train_test_split, KFold, StratifiedKFold, RepeatedKFold, cross_val_score, cross_validate, GroupKFold, GridSearchCV, RepeatedStratifiedKFold, cross_val_predict

from sklearn.preprocessing import (LabelEncoder,
                                   StandardScaler,
                                   MinMaxScaler,
                                   OrdinalEncoder,
                                   RobustScaler,
                                   PowerTransformer,
                                   OneHotEncoder,
                                   QuantileTransformer,
                                   PolynomialFeatures)

# metrics
import sklearn
import skops.io as sio
from sklearn.metrics import (mean_squared_error,
                             root_mean_squared_error,
                             root_mean_squared_log_error,
                             r2_score,
                             mean_absolute_error,
                             mean_absolute_percentage_error,
                             classification_report,
                             confusion_matrix,
                             ConfusionMatrixDisplay,
                             multilabel_confusion_matrix,
                             accuracy_score,
                             roc_auc_score,
                             auc,
                             roc_curve,
                             log_loss,
                             make_scorer)
# modeling algos
from sklearn.linear_model import (LogisticRegression,
                                  Lasso,
                                  ridge_regression,
                                  LinearRegression,
                                  Ridge,
                                  RidgeCV,
                                  ElasticNet,
                                  BayesianRidge,
                                  HuberRegressor,
                                  TweedieRegressor,
                                  QuantileRegressor,
                                  ARDRegression,
                                  TheilSenRegressor,
                                  PoissonRegressor,
                                  GammaRegressor)

from sklearn.ensemble import (AdaBoostRegressor,
                              AdaBoostClassifier,
                              RandomForestRegressor,
                              RandomForestClassifier,
                              VotingRegressor,
                              GradientBoostingRegressor,
                              GradientBoostingClassifier,
                              StackingRegressor,
                              StackingClassifier,
                              HistGradientBoostingClassifier,
                              HistGradientBoostingRegressor,
                              ExtraTreesClassifier)

from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectFromModel
from sklearn.preprocessing import FunctionTransformer
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from yellowbrick.cluster import KElbowVisualizer

import warnings
warnings.filterwarnings("ignore")
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", category=FutureWarning)
%matplotlib inline

from statsmodels.tsa.deterministic import CalendarFourier, DeterministicProcess
from sklearn.multioutput import RegressorChain

import itertools

import statsmodels.api as sm
from pylab import rcParams
import scipy.stats as ss

#plt.style.use('fivethirtyeight')


In [None]:
sns.set({"axes.facecolor"       : "#ffffff",
         "figure.facecolor"     : "#ffffff",
         "axes.edgecolor"       : "#000000",
         "grid.color"           : "#ffffff",
         "font.family"          : ['Cambria'],
         "axes.labelcolor"      : "#000000",
         "xtick.color"          : "#000000",
         "ytick.color"          : "#000000",
         "grid.linewidth"       : 0.5,
         'grid.alpha'           :0.5,
         "grid.linestyle"       : "--",
         "axes.titlecolor"      : 'black',
         'axes.titlesize'       : 12,
         'axes.labelweight'     : "bold",
         'legend.fontsize'      : 7.0,
         'legend.title_fontsize': 7.0,
         'font.size'            : 7.5,
         'xtick.labelsize'      : 7.5,
         'ytick.labelsize'      : 7.5,
        });

sns.set_style("whitegrid",{"grid.linestyle":"--", 'grid.linewidth':0.2, 'grid.alpha':0.5})
# Set Style
mpl.rcParams['figure.dpi'] = 120;

# Pandas display limits for wide and long frames:
pd.set_option('display.max_columns', 100);
pd.set_option('display.max_rows', 50);

sns.despine(left=True, bottom=True, top=False, right=False)

mpl.rcParams['axes.spines.left'] = True
mpl.rcParams['axes.spines.right'] = False
mpl.rcParams['axes.spines.top'] = False
mpl.rcParams['axes.spines.bottom'] = True


In [None]:
# Additional imports not covered by the setup cell above
from itertools import product

from sklearn.impute import SimpleImputer
import torch

# Mount Google Drive in Colab
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


<div style="text-align:center; border-radius:15px; padding:15px; margin:0; font-size:100%; font-family:Arial, sans-serif; background-color:#A8DADC; color:#1D3557; overflow:hidden; box-shadow:0 3px 6px rgba(0, 0, 0, 0.2);">
    <h3>Loading and Preprocessing Data for Compatibility</h3>
</div>


In [None]:
df_train = pd.read_csv('/content/drive/MyDrive/Exercises/Studies_Structured_Data/Data/S4E12/train_no_nan.csv')

df_test = pd.read_csv('/content/drive/MyDrive/Exercises/Studies_Structured_Data/Data/S4E12/test_no_nan.csv')

df_test_orig = pd.read_csv(
    '/content/drive/MyDrive/Exercises/Studies_Structured_Data/Data/S4E12/test.csv',
     index_col='id',
     parse_dates=['Policy Start Date'],
)

df_train_orig = pd.read_csv(
    '/content/drive/MyDrive/Exercises/Studies_Structured_Data/Data/S4E12/train.csv',
     index_col='id',
     parse_dates=['Policy Start Date'],
)

df_subm = pd.read_csv(
    "/content/drive/MyDrive/Exercises/Studies_Structured_Data/Data/S4E12/sample_submission.csv",
     index_col='id',
)

# df_orig = pd.read_csv(
#     "/content/drive/MyDrive/Exercises/Studies_Structured_Data/Data/S4E12/Insurance Premium Prediction Dataset.csv",
#      parse_dates=['Policy Start Date'],
#     #     index_col='id',
# )


# Remap the integer codes of `Customer Feedback` (2 -> 0, 0 -> 1, 1 -> 2)
mapping = {2.0:0.0,0.0:1.0,1.0:2.0}
df_train["Customer Feedback"] = df_train["Customer Feedback"].map(mapping)
df_test["Customer Feedback"] = df_test["Customer Feedback"].map(mapping)

In [None]:
# # Convert `Policy Start Date` column to datetime64 format
# df_orig['Policy Start Date'] = pd.to_datetime(df_orig['Policy Start Date'])

# # Calculate the difference in days between today and the `Policy Start Date` column
# today = pd.to_datetime('today')
# difference_in_days = today - df_orig['Policy Start Date']

# # Divide the difference in days by 365 to get the difference in years
# difference_in_years = difference_in_days / pd.Timedelta(days=365)

# # Convert the `Policy Start Date` column to the number of years since the policy start date
# df_orig['Policy Start Date'] = difference_in_years

In [None]:
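# Train: plain assignment aligns on index labels, which works when `df_train` (RangeIndex)
# and `df_train_orig` (index `id`) carry the same labels.
df_train["Start_Year"] = df_train_orig["Policy Start Date"].dt.year
df_train["Start_Month"] = df_train_orig["Policy Start Date"].dt.month
df_train["Start_Day"] = df_train_orig["Policy Start Date"].dt.day

# Test: `.values` assigns by position, presumably because the `id` index of `df_test_orig`
# does not match the RangeIndex of `df_test`.
df_test["Start_Year"] = df_test_orig["Policy Start Date"].dt.year.values
df_test["Start_Month"] = df_test_orig["Policy Start Date"].dt.month.values
df_test["Start_Day"] = df_test_orig["Policy Start Date"].dt.day.values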

In [None]:
# df_orig.dropna(axis=0,how="any").shape
df_train_orig.head()
#df_test["Policy Start Date"].dt.year

id,Age,Gender,Annual Income,Marital Status,Number of Dependents,Education Level,Occupation,Health Score,Location,Policy Type,Previous Claims,Vehicle Age,Credit Score,Insurance Duration,Policy Start Date,Customer Feedback,Smoking Status,Exercise Frequency,Property Type,Premium Amount
0,19.0,Female,10049.0,Married,1.0,Bachelor's,Self-Employed,22.598761,Urban,Premium,2.0,17.0,372.0,5.0,2023-12-23 15:21:39.134960,Poor,No,Weekly,House,2869.0
1,39.0,Female,31678.0,Divorced,3.0,Master's,,15.569731,Rural,Comprehensive,1.0,12.0,694.0,2.0,2023-06-12 15:21:39.111551,Average,Yes,Monthly,House,1483.0
2,23.0,Male,25602.0,Divorced,3.0,High School,Self-Employed,47.177549,Suburban,Premium,1.0,14.0,,3.0,2023-09-30 15:21:39.221386,Good,Yes,Weekly,House,567.0
3,21.0,Male,141855.0,Married,2.0,Bachelor's,,10.938144,Rural,Basic,1.0,0.0,367.0,1.0,2024-06-12 15:21:39.226954,Poor,Yes,Daily,Apartment,765.0
4,21.0,Male,39651.0,Single,1.0,Bachelor's,Self-Employed,20.376094,Rural,Premium,0.0,8.0,598.0,4.0,2021-12-01 15:21:39.252145,Poor,Yes,Weekly,House,2022.0


In [None]:
df_train.shape, df_test.shape, df_subm.shape #, df_orig.shape, df_orig.shape

((1200000, 23), (800000, 23), (800000, 1))

In [None]:
print("Pytorch Version: {}".format(torch.__version__))
print("SKLearn Version: {}".format(sklearn.__version__))

Pytorch Version: 2.5.1+cu121
SKLearn Version: 1.5.2


In [None]:
100*df_train.isnull().sum()/df_train.shape[0]

Column,Missing (%)
Age,0.0
Gender,0.0
Annual Income,0.0
Marital Status,0.0
Number of Dependents,0.0
Education Level,0.0
Occupation,0.0
Health Score,0.0
Location,0.0
Policy Type,0.0
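
The same per-column percentage can also be written with `isna().mean()`. A minimal sketch on a toy frame; the values below are hypothetical and only illustrate the calculation.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with gaps in two columns.
toy = pd.DataFrame({
    "Age": [19.0, 39.0, np.nan, 21.0],
    "Occupation": ["Self-Employed", None, "Employed", None],
    "Policy Type": ["Premium", "Basic", "Premium", "Basic"],
})

# isna().mean() is the fraction of missing entries per column; times 100 gives a percentage.
missing_pct = toy.isna().mean().mul(100).sort_values(ascending=False)
print(missing_pct)
```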


In [None]:
100*df_test.isnull().sum()/df_test.shape[0]

Column,Missing (%)
Age,0.0
Gender,0.0
Annual Income,0.0
Marital Status,0.0
Number of Dependents,0.0
Education Level,0.0
Occupation,0.0
Health Score,0.0
Location,0.0
Policy Type,0.0


In [None]:
df_train.head(3)
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1200000 entries, 0 to 1199999
Data columns (total 23 columns):
 #   Column                Non-Null Count    Dtype  
---  ------                --------------    -----  
 0   Age                   1200000 non-null  float64
 1   Gender                1200000 non-null  object 
 2   Annual Income         1200000 non-null  float64
 3   Marital Status        1200000 non-null  float64
 4   Number of Dependents  1200000 non-null  float64
 5   Education Level       1200000 non-null  object 
 6   Occupation            1200000 non-null  float64
 7   Health Score          1200000 non-null  float64
 8   Location              1200000 non-null  object 
 9   Policy Type           1200000 non-null  object 
 10  Previous Claims       1200000 non-null  float64
 11  Vehicle Age           1200000 non-null  float64
 12  Credit Score          1200000 non-null  float64
 13  Insurance Duration    1200000 non-null  float64
 14  Policy Start Date     1200000 non-

In [None]:
df_train.nunique()

Column,Unique values
Age,47
Gender,2
Annual Income,91633
Marital Status,3
Number of Dependents,5
Education Level,4
Occupation,3
Health Score,532670
Location,3
Policy Type,3


In [None]:
categorical_cols = ["Gender","Marital Status","Number of Dependents","Education Level","Occupation","Location","Policy Type","Previous Claims","Insurance Duration","Customer Feedback","Smoking Status","Exercise Frequency","Property Type",
                    "Start_Year","Start_Month","Start_Day"]
numerical_cols = ["Age","Annual Income","Health Score","Vehicle Age","Credit Score","Policy Start Date"]

target = ['Premium Amount']

len(categorical_cols+numerical_cols+target),len(df_train.columns)

df_train[categorical_cols] = df_train[categorical_cols].astype('category')
df_test[categorical_cols] = df_test[categorical_cols].astype('category')

df_train[numerical_cols] = df_train[numerical_cols].astype(np.float64)
df_test[numerical_cols] = df_test[numerical_cols].astype(np.float64)

## 1.0 Time Features:

This section reviews the following time-related features:

* "Start_Year"
* "Start_Month"
* "Start_Day"
* "Policy Start Date"
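
As a reminder of how the three calendar features relate to the raw timestamp, here is a minimal sketch. It assumes two hypothetical timestamp strings in the same format as the raw `Policy Start Date` column.

```python
import pandas as pd

# Hypothetical timestamps in the format of the raw "Policy Start Date" column.
raw = pd.Series(["2023-12-23 15:21:39.134960", "2021-12-01 15:21:39.252145"])
start = pd.to_datetime(raw)

# The .dt accessor exposes the calendar components of each timestamp.
parts = pd.DataFrame({
    "Start_Year": start.dt.year,
    "Start_Month": start.dt.month,
    "Start_Day": start.dt.day,
})
print(parts)
```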

### **Year**

In [None]:
# ts_monthly = df_train.groupby(["Start_Year"], as_index=False)["Premium Amount"].agg(["mean","std","skew","median","min","max","count"])
# ts_monthly_test = df_test.groupby(["Start_Year"], as_index=False)["Health Score"].agg(["mean","std","skew","min","max","count"])

# fig, axs = plt.subplots(1,2,figsize=(10, 4))
# ts_monthly["mean"].plot(color="#0485d1", ax=axs[0])
# ts_monthly["std"].plot(color="#c875c4", ax=axs[0])
# ts_monthly["median"].plot(color="#fd411e", ax=axs[0])
# ts_monthly["skew"].plot(color="#fd411e", ax=axs[1])

### **Month**

In [None]:
# ts_monthly = df_train.groupby(["Start_Month"], as_index=False)["Premium Amount"].agg(["mean","std","skew","median","min","max","count"])
# ts_monthly_test = df_test.groupby(["Start_Month"], as_index=False)["Health Score"].agg(["mean","std","skew","min","max","count"])

# fig, axs = plt.subplots(1,2,figsize=(10, 4))
# ts_monthly["mean"].plot(color="#0485d1", ax=axs[0])

# ts_monthly["skew"].plot(color="#fd411e", ax=axs[1])

### **Year and Month**

In [None]:
ts_monthly = df_train.groupby(["Start_Year","Start_Month"])["Premium Amount"].agg(["mean","std","skew","median","min","max","count"])
ts_monthly_test = df_test.groupby(["Start_Year","Start_Month"], as_index=False)["Health Score"].agg(["mean","std","skew","min","max","count"])

fig, axs = plt.subplots(1,2,figsize=(10, 4))
ts_monthly["mean"].plot(color="#0485d1", ax=axs[0])
ts_monthly["std"].plot(color="#c875c4", ax=axs[0])
ts_monthly["median"].plot(color="#fd411e", ax=axs[0])
ts_monthly["skew"].plot(color="#fd411e", ax=axs[1])
plt.show()
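
Plotting against a `(Start_Year, Start_Month)` MultiIndex can give hard-to-read tick labels. One possible alternative, sketched on a toy frame with hypothetical values, is to turn the two keys into a single datetime index before plotting.

```python
import pandas as pd

# Toy data (hypothetical values) with the same key columns as the notebook.
toy = pd.DataFrame({
    "Start_Year": [2023, 2023, 2023, 2024],
    "Start_Month": [1, 1, 2, 1],
    "Premium Amount": [100.0, 300.0, 500.0, 700.0],
})

stats = (
    toy.groupby(["Start_Year", "Start_Month"])["Premium Amount"]
       .agg(["mean", "median", "count"])
       .reset_index()
)

# pd.to_datetime assembles dates from columns named year / month / day.
stats["period"] = pd.to_datetime(
    stats[["Start_Year", "Start_Month"]]
    .rename(columns={"Start_Year": "year", "Start_Month": "month"})
    .assign(day=1)
)
stats = stats.set_index("period")
print(stats[["mean", "median", "count"]])
```

With a datetime index, `stats["mean"].plot()` would place the months on a continuous time axis.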

In [None]:
ts_monthly=ts_monthly.dropna(axis=0,how="any")
ts_monthly_test=ts_monthly_test.dropna(axis=0,how="any")
ts_monthly.head()

In [None]:
fig, axs = plt.subplots(1,2,figsize=(13, 4))
ts_monthly["count"].plot(kind="bar",color="#1fa774", ax=axs[0])
ts_monthly_test["count"].plot(kind="bar",color="#fd411e", ax=axs[1])
plt.show()

### **Day**

In [None]:
ts_day = df_train.groupby(["Start_Day"], as_index=False)["Premium Amount"].agg(["mean","std","skew","median","min","max","count"])
ts_day_test = df_test.groupby(["Start_Day"], as_index=False)["Health Score"].agg(["mean","std","skew","min","max","count"])

fig, axs = plt.subplots(1,2,figsize=(10, 4))
ts_day["mean"].plot(color="#0485d1", ax=axs[0])
ts_day["std"].plot(color="#c875c4", ax=axs[0])
ts_day["median"].plot(color="#fd411e", ax=axs[0])
ts_day["skew"].plot(color="#fd411e", ax=axs[1])

In [None]:
fig, axs = plt.subplots(1,2,figsize=(10, 4))
ts_day["count"].plot(kind="bar",color="#1fa774", ax=axs[0])
ts_day_test["count"].plot(kind="bar",color="#fd411e", ax=axs[1])

In [None]:
ts_monthly_test

### **Year and Month + Policy Type**

In [None]:
ts_monthly = df_train.groupby(["Start_Year","Start_Month","Policy Type"], as_index=False)[["Premium Amount"]].agg(["mean"])
# fig, axs = plt.subplots(1,2,figsize=(10, 4))
# ts_monthly[("Premium Amount","mean")].plot(ax=axs[0])
# ts_monthly[("Credit Score","mean")].plot(ax=axs[1])
# ts_monthly["std"].plot(color="#c875c4", ax=axs[0])
# ts_monthly["median"].plot(color="#fd411e", ax=axs[0])
# ts_monthly["skew"].plot(color="#fd411e", ax=axs[1])

In [None]:
ts_monthly[ts_monthly["Policy Type"]=="Basic"][("Premium Amount","mean")].plot(color="#c875c4")
ts_monthly[ts_monthly["Policy Type"]=="Comprehensive"][("Premium Amount","mean")].plot(color="#fd411e")
ts_monthly[ts_monthly["Policy Type"]=="Premium"][("Premium Amount","mean")].plot(color="#0485d1")

In [None]:
ts_monthly

### **Year and Month + Customer Feedback + Policy Type**

In [None]:
ts_monthly = df_train.groupby(["Start_Year","Start_Month","Customer Feedback","Policy Type"], as_index=False)[["Premium Amount"]].agg(["mean"])
# fig, axs = plt.subplots(1,2,figsize=(10, 4))
# ts_monthly[("Premium Amount","mean")].plot(ax=axs[0])
# ts_monthly[("Credit Score","mean")].plot(ax=axs[1])
# ts_monthly["std"].plot(color="#c875c4", ax=axs[0])
# ts_monthly["median"].plot(color="#fd411e", ax=axs[0])
# ts_monthly["skew"].plot(color="#fd411e", ax=axs[1])
ts_monthly.head()

In [None]:
ts_monthly[(ts_monthly["Customer Feedback"]==0)&(ts_monthly["Policy Type"]=="Premium")][("Premium Amount","mean")].plot(color="#c875c4")
ts_monthly[(ts_monthly["Customer Feedback"]==1)&(ts_monthly["Policy Type"]=="Premium")][("Premium Amount","mean")].plot(color="#fd411e")
ts_monthly[(ts_monthly["Customer Feedback"]==2)&(ts_monthly["Policy Type"]=="Premium")][("Premium Amount","mean")].plot(color="#0485d1")

In [None]:
ts_monthly[(ts_monthly["Customer Feedback"]==0)&(ts_monthly["Policy Type"]=="Comprehensive")][("Premium Amount","mean")].plot(color="#c875c4")
ts_monthly[(ts_monthly["Customer Feedback"]==1)&(ts_monthly["Policy Type"]=="Comprehensive")][("Premium Amount","mean")].plot(color="#fd411e")
ts_monthly[(ts_monthly["Customer Feedback"]==2)&(ts_monthly["Policy Type"]=="Comprehensive")][("Premium Amount","mean")].plot(color="#0485d1")

In [None]:
ts_monthly[(ts_monthly["Customer Feedback"]==0)&(ts_monthly["Policy Type"]=="Basic")][("Premium Amount","mean")].plot(color="#c875c4")
ts_monthly[(ts_monthly["Customer Feedback"]==1)&(ts_monthly["Policy Type"]=="Basic")][("Premium Amount","mean")].plot(color="#fd411e")
ts_monthly[(ts_monthly["Customer Feedback"]==2)&(ts_monthly["Policy Type"]=="Basic")][("Premium Amount","mean")].plot(color="#0485d1")

#### Add Mean to Dataset:

In [None]:
ts_monthly.head(3)
ts_monthly = ts_monthly.droplevel(level=1,axis=1).rename(columns={'Premium Amount': 'Premium_time_Mean'})
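
The `droplevel` and `rename` steps are needed because `agg(["mean"])` yields two-level column labels. A possible shortcut, shown here on a toy frame with hypothetical values, is pandas named aggregation, which produces the flat `Premium_time_Mean` column directly.

```python
import pandas as pd

# Toy data (hypothetical values) with a subset of the notebook's grouping keys.
toy = pd.DataFrame({
    "Start_Year": [2023, 2023, 2023, 2024],
    "Start_Month": [1, 1, 2, 1],
    "Policy Type": ["Basic", "Basic", "Premium", "Basic"],
    "Premium Amount": [100.0, 300.0, 500.0, 700.0],
})

keys = ["Start_Year", "Start_Month", "Policy Type"]

# Named aggregation: output_column=(input_column, function) gives flat column names.
group_means = (
    toy.groupby(keys, as_index=False, observed=True)
       .agg(Premium_time_Mean=("Premium Amount", "mean"))
)
print(group_means)
```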

In [None]:
ts_monthly.head()

In [None]:
df_train_new = pd.merge(df_train, ts_monthly, on=["Start_Year","Start_Month","Customer Feedback","Policy Type"], how='left')
df_test_new = pd.merge(df_test, ts_monthly, on=["Start_Year","Start_Month","Customer Feedback","Policy Type"], how='left')
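
Note that the merge above attaches to every training row a group mean that includes that row's own target, which can leak target information into the feature. A common mitigation is to compute the training-side feature out of fold. The sketch below is one way to do that and is not the notebook's method; the helper `oof_group_mean`, the fold assignment, and the toy data are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def oof_group_mean(df, keys, target, n_splits=5, seed=0):
    """For each row, mean of `target` over its group, computed on the other folds only."""
    # Hypothetical fold assignment: shuffle row positions, then split by remainder.
    fold_id = np.random.default_rng(seed).permutation(len(df)) % n_splits
    oof = pd.Series(np.nan, index=df.index, dtype="float64")
    for k in range(n_splits):
        held_out = fold_id == k
        fit_part = df[~held_out]
        means = fit_part.groupby(keys)[target].mean().rename("group_mean").reset_index()
        # A left merge keeps the order of the held-out rows.
        merged = df.loc[held_out, keys].merge(means, on=keys, how="left")
        # Groups unseen in the fitting folds fall back to the fitting-fold global mean.
        oof[held_out] = merged["group_mean"].fillna(fit_part[target].mean()).to_numpy()
    return oof

# Toy data (generated, not from the dataset).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Policy Type": rng.choice(["Basic", "Comprehensive", "Premium"], size=300),
    "Start_Month": rng.integers(1, 13, size=300),
    "Premium Amount": rng.lognormal(6.5, 1.0, size=300),
})
keys = ["Policy Type", "Start_Month"]
toy["Premium_time_Mean"] = oof_group_mean(toy, keys, "Premium Amount")
in_sample = toy.groupby(keys)["Premium Amount"].transform("mean")
print(toy[["Premium Amount", "Premium_time_Mean"]].head())
```

For the test set, the group means fitted on the full training data can still be merged as in the cell above, since test rows carry no target.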

In [None]:
df_train_new.isna().sum(),df_test_new.isna().sum()
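
If the check above reports missing values in `Premium_time_Mean`, some key combinations in one frame have no counterpart in the aggregated table. A minimal sketch of how such rows could be counted and filled, assuming toy frames and a hypothetical fallback value:

```python
import pandas as pd

keys = ["Start_Year", "Policy Type"]

# Toy lookup table and test frame (hypothetical values).
train_means = pd.DataFrame({
    "Start_Year": [2023, 2024],
    "Policy Type": ["Basic", "Basic"],
    "Premium_time_Mean": [200.0, 700.0],
})
test = pd.DataFrame({
    "Start_Year": [2023, 2024, 2025],
    "Policy Type": ["Basic", "Basic", "Premium"],
})

# indicator=True adds a "_merge" column telling where each row found a match.
merged = test.merge(train_means, on=keys, how="left", indicator=True)
n_unmatched = int((merged["_merge"] == "left_only").sum())

# Hypothetical fallback: a global training mean for unseen key combinations.
fallback = 450.0
merged["Premium_time_Mean"] = merged["Premium_time_Mean"].fillna(fallback)
print(n_unmatched)
print(merged.drop(columns="_merge"))
```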

In [None]:
plt.scatter(df_train_new["Premium_time_Mean"],df_train_new["Premium Amount"])
plt.show()

In [None]:
df_train_new.head()

### **Year and Month + Smoking Status + Policy Type**

In [None]:
ts_monthly = df_train.groupby(["Start_Year","Start_Month","Smoking Status","Policy Type"], as_index=False)[["Premium Amount"]].agg(["mean"])
# fig, axs = plt.subplots(1,2,figsize=(10, 4))
# ts_monthly[("Premium Amount","mean")].plot(ax=axs[0])
# ts_monthly[("Credit Score","mean")].plot(ax=axs[1])
# ts_monthly["std"].plot(color="#c875c4", ax=axs[0])
# ts_monthly["median"].plot(color="#fd411e", ax=axs[0])
# ts_monthly["skew"].plot(color="#fd411e", ax=axs[1])
ts_monthly.head()

In [None]:
ts_monthly[(ts_monthly["Smoking Status"]=="No")&(ts_monthly["Policy Type"]=="Comprehensive")][("Premium Amount","mean")].plot(color="#fd411e")
ts_monthly[(ts_monthly["Smoking Status"]=="Yes")&(ts_monthly["Policy Type"]=="Comprehensive")][("Premium Amount","mean")].plot(color="#0485d1")

## Previous Claims and Policy Type:

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(15, 5))

plt_1 = sns.boxplot(data=df_train_new, x='Previous Claims', y='Premium Amount', ax=ax[0], palette=palette_9)
plt_2 = sns.boxplot(data=df_train_new, x='Policy Type', y='Premium Amount', ax=ax[1], palette=palette_9);

In [None]:
df_train_new[df_train_new["Previous Claims"]==9]

In [None]:
df_test_new[df_test_new["Previous Claims"]==9]

In [None]:
df_train_new[df_train_new["Previous Claims"]==8].shape, df_test_new[df_test_new["Previous Claims"]==8].shape
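
Rather than filtering one level at a time, the level counts of both frames can be placed side by side. A minimal sketch with two hypothetical toy series standing in for the `Previous Claims` columns:

```python
import pandas as pd

# Toy stand-ins (hypothetical values) for the train and test "Previous Claims" columns.
train_claims = pd.Series([0, 0, 1, 1, 2, 8, 9], name="Previous Claims")
test_claims = pd.Series([0, 1, 1, 2, 2, 8], name="Previous Claims")

# One row per level, one column per frame; levels absent from a frame get a count of 0.
counts = (
    pd.concat(
        {"train": train_claims.value_counts(), "test": test_claims.value_counts()},
        axis=1,
    )
    .fillna(0)
    .astype(int)
    .sort_index()
)
only_in_train = counts.index[(counts["train"] > 0) & (counts["test"] == 0)].tolist()
print(counts)
print(only_in_train)
```

Levels that appear only in the training data are candidates for grouping with a neighbouring level before treating the column as categorical.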

## Customer Feedback and Insurance Duration:

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(15, 5))

plt_1 = sns.boxplot(data=df_train_new, x='Customer Feedback', y='Premium Amount', ax=ax[0], palette=palette_9)
plt_2 = sns.boxplot(data=df_train_new, x='Insurance Duration', y='Premium Amount', ax=ax[1], palette=palette_9);

## Annual Income and Property Type:

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(15, 5))

plt_1 = sns.scatterplot(data=df_train_new, x='Annual Income', y='Premium Amount', ax=ax[0])  # no `hue`, so no palette is needed
plt_2 = sns.boxplot(data=df_train_new, x='Property Type', y='Premium Amount', ax=ax[1], palette=palette_9);

## Number of Dependents and Marital Status:

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(15, 5))

plt_1 = sns.boxplot(data=df_train_new, x='Number of Dependents', y='Premium Amount',ax=ax[0], palette=palette_9)
plt_2 = sns.boxplot(data=df_train_new, x='Marital Status', y='Premium Amount', ax=ax[1], palette=palette_9);

In [None]:
df_train_new["Property Type"].nunique(),df_train_new["Property Type"].unique()

In [None]:
def replace_entries(df):
  """Encode string categories as integer codes; values missing from a mapping become NaN."""
  df = df.copy()  # leave the caller's frame untouched

  mappings = {
      'Gender': {'Male': 0, 'Female': 1},
      'Education Level': {'High School': 0, "Bachelor's": 1, "Master's": 2, 'PhD': 3},
      'Location': {'Rural': 0, 'Suburban': 1, 'Urban': 2},
      'Policy Type': {'Basic': 0, 'Comprehensive': 1, 'Premium': 2},
      'Smoking Status': {'No': 0, 'Yes': 1},
      'Exercise Frequency': {'Rarely': 0, 'Monthly': 1, 'Weekly': 2, 'Daily': 3},
      'Property Type': {'Condo': 0, 'Apartment': 1, 'House': 2},
  }

  # Apply each mapping to its column
  for col, mapping in mappings.items():
    df[col] = df[col].map(mapping)

  return df

df_train_new = replace_entries(df_train_new)
df_test_new = replace_entries(df_test_new)
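
`Series.map` with a dict returns `NaN` for any value that is not a key, so a misspelled or unexpected label would silently become missing. A minimal sketch of a check for that; the lowercase `"premium"` entry is a hypothetical typo added for illustration.

```python
import pandas as pd

map_policy = {"Basic": 0, "Comprehensive": 1, "Premium": 2}

# Toy column; "premium" (lowercase) is a hypothetical label missing from the mapping.
policy = pd.Series(["Basic", "Premium", "premium", "Comprehensive"])
encoded = policy.map(map_policy)

# Labels that the mapping did not cover show up as NaN after map().
unmapped = sorted(policy[encoded.isna()].unique())
print(unmapped)
```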


In [None]:
df_train_new.head()

In [None]:
df_train_new.describe(include="all").T
#df_train_new.info()

In [None]:
cat_cols = ['Gender', 'Marital Status','Number of Dependents','Occupation','Location','Policy Type','Previous Claims', 'Insurance Duration', 'Customer Feedback',
            'Smoking Status',  'Exercise Frequency', 'Property Type', 'Start_Year', 'Start_Month', 'Start_Day','Education Level']

num_cols = ['Age', 'Annual Income', 'Health Score', 'Vehicle Age', 'Credit Score', 'Policy Start Date', 'Premium Amount', 'Premium_time_Mean']

dtypes_num = {c:"float" for c in num_cols}
dtypes_cat = {c:"category" for c in cat_cols}

dtypes_all = {**dtypes_num, **dtypes_cat}

len(cat_cols+num_cols),len(dtypes_all.keys())#,len(df_train_new.columns)

In [None]:
#df_train_new.to_csv('/content/drive/MyDrive/Exercises/Studies_Structured_Data/Data/S4E12/df_train_v3.csv', index=False)
#df_test_new.to_csv('/content/drive/MyDrive/Exercises/Studies_Structured_Data/Data/S4E12/df_test_v3.csv', index=False)

df_train_new = pd.read_csv('/content/drive/MyDrive/Exercises/Studies_Structured_Data/Data/S4E12/df_train_v3.csv', dtype=dtypes_all)
df_test_new = pd.read_csv('/content/drive/MyDrive/Exercises/Studies_Structured_Data/Data/S4E12/df_test_v3.csv', dtype=dtypes_all)

In [None]:
df_train_new.info()

In [None]:
df_test_new.info()

## **MODELS**

**DATA UPLOAD**

In [None]:
cat_cols = ['Gender', 'Marital Status','Number of Dependents','Occupation','Location','Policy Type','Previous Claims', 'Insurance Duration', 'Customer Feedback',
            'Smoking Status',  'Exercise Frequency', 'Property Type', 'Start_Year', 'Start_Month', 'Start_Day','Education Level']

num_cols = ['Age', 'Annual Income', 'Health Score', 'Vehicle Age', 'Credit Score', 'Policy Start Date', 'Premium Amount', 'Premium_time_Mean']

dtypes_num = {c:"float" for c in num_cols}
dtypes_cat = {c:"category" for c in cat_cols}

dtypes_all = {**dtypes_num, **dtypes_cat}

len(cat_cols+num_cols),len(dtypes_all.keys())#,len(df_train_new.columns)

(24, 24)

In [None]:
#df_train_new.to_csv('/content/drive/MyDrive/Exercises/Studies_Structured_Data/Data/S4E12/df_train_v3.csv', index=False)
#df_test_new.to_csv('/content/drive/MyDrive/Exercises/Studies_Structured_Data/Data/S4E12/df_test_v3.csv', index=False)

df_train_new = pd.read_csv('/content/drive/MyDrive/Exercises/Studies_Structured_Data/Data/S4E12/df_train_v3.csv', dtype=dtypes_all)
df_test_new = pd.read_csv('/content/drive/MyDrive/Exercises/Studies_Structured_Data/Data/S4E12/df_test_v3.csv', dtype=dtypes_all)

In [None]:
df_train_new.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1200000 entries, 0 to 1199999
Data columns (total 24 columns):
 #   Column                Non-Null Count    Dtype   
---  ------                --------------    -----   
 0   Age                   1200000 non-null  float64 
 1   Gender                1200000 non-null  category
 2   Annual Income         1200000 non-null  float64 
 3   Marital Status        1200000 non-null  category
 4   Number of Dependents  1200000 non-null  category
 5   Education Level       1200000 non-null  category
 6   Occupation            1200000 non-null  category
 7   Health Score          1200000 non-null  float64 
 8   Location              1200000 non-null  category
 9   Policy Type           1200000 non-null  category
 10  Previous Claims       1200000 non-null  category
 11  Vehicle Age           1200000 non-null  float64 
 12  Credit Score          1200000 non-null  float64 
 13  Insurance Duration    1200000 non-null  category
 14  Policy Start Date 

**FUNCTIONS**

In [None]:
cat_cols = ['Gender', 'Marital Status','Number of Dependents','Occupation','Location','Policy Type','Previous Claims', 'Insurance Duration', 'Customer Feedback',
            'Smoking Status',  'Exercise Frequency', 'Property Type', 'Start_Year', 'Start_Month', 'Start_Day','Education Level']

num_cols = ['Age', 'Annual Income', 'Health Score', 'Vehicle Age', 'Credit Score', 'Policy Start Date', 'Premium_time_Mean']

params ={'reg_alpha': 0.4932, 'reg_lambda': 0.002076, 'max_depth': 21, 'colsample_bytree': 0.825, 'subsample': 0.75, 'num_leaves': 101, 'n_estimators': 2501, 'learning_rate': 0.01,
         'min_child_samples':35, 'random_state': 42, 'force_col_wise':True, 'verbose':-1}

In [None]:
def rmsle_obj(y_true, y_pred):
    grad = -2 * (np.log1p(y_true) - np.log1p(y_pred)) / (y_pred + 1)
    hess = 2 * (np.log1p(y_true) - np.log1p(y_pred) + 1) / ((y_pred + 1) ** 2)
    return grad, hess

def rmsle_objective(y_true, y_pred):
    # Alternative RMSLE objective (not used below): gradient and Hessian of
    # 0.5 * (log1p(y_pred) - log1p(y_true))**2 with respect to y_pred.
    diff = np.log1p(y_pred) - np.log1p(y_true)
    grad = diff / (y_pred + 1)
    hess = (1 - diff) / ((y_pred + 1) ** 2)
    return grad, hess

def rmsle_metric(y_true, y_pred):
    y_true = np.expm1(y_true)
    y_pred = np.expm1(y_pred)
    rmsle = np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))
    return 'rmsle', rmsle, False

# Model Training
def train_lgbm(params, X, y, cat_loc, use_gpu=True, X_val=None, y_val=None, es=101):

    # Use the GPU when requested, otherwise fall back to the CPU
    if use_gpu:
      params['device'] = 'gpu'
    else:
      params['device'] = 'cpu'

    model = LGBMRegressor(**params, objective="root_mean_squared_error", metric="rmse",boosting_type='gbdt', categorical_feature=cat_loc)
    model.fit(X, y, eval_set=(X_val, y_val), callbacks=[early_stopping(stopping_rounds=es)])

    return model

# Model Training
def train_lgbm_rsmle(params, X, y, cat_loc, use_gpu=True, X_val=None, y_val=None, es=101):

    # Use the GPU when requested, otherwise fall back to the CPU
    if use_gpu:
      params['device'] = 'gpu'
    else:
      params['device'] = 'cpu'

    # Remove the 'metric' key from params or set it to a valid string or list of strings.
    params.pop('metric', None)  # Remove if present
    # Or, set to a valid string:
    # params['metric'] = 'rmse'

    model = LGBMRegressor(**params, objective=rmsle_obj,boosting_type='gbdt', categorical_feature=cat_loc)
    model.fit(X, y, eval_set=(X_val, y_val), callbacks=[early_stopping(stopping_rounds=es)], eval_metric=[rmsle_metric])

    return model
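
A custom objective is easy to get wrong, so it can help to compare the analytic gradient and Hessian with finite differences. The sketch below re-declares `rmsle_obj` with the same formulas as above and checks it against the per-sample squared log error. The sample values, the step `eps`, and the helper `squared_log_error` are assumptions made only for this check.

```python
import numpy as np

def rmsle_obj(y_true, y_pred):
    # Same formulas as the rmsle_obj defined above.
    grad = -2 * (np.log1p(y_true) - np.log1p(y_pred)) / (y_pred + 1)
    hess = 2 * (np.log1p(y_true) - np.log1p(y_pred) + 1) / ((y_pred + 1) ** 2)
    return grad, hess

def squared_log_error(y_true, y_pred):
    # Per-sample loss whose derivatives rmsle_obj is meant to return.
    return (np.log1p(y_pred) - np.log1p(y_true)) ** 2

# Hypothetical values chosen only for the check.
y_true = np.array([100.0, 500.0, 1200.0])
y_pred = np.array([120.0, 450.0, 900.0])
eps = 1e-2

grad, hess = rmsle_obj(y_true, y_pred)

# Central differences approximate the first and second derivatives.
loss_plus = squared_log_error(y_true, y_pred + eps)
loss_minus = squared_log_error(y_true, y_pred - eps)
loss_mid = squared_log_error(y_true, y_pred)
num_grad = (loss_plus - loss_minus) / (2 * eps)
num_hess = (loss_plus - 2 * loss_mid + loss_minus) / eps ** 2

print(np.allclose(grad, num_grad, rtol=1e-4))
print(np.allclose(hess, num_hess, rtol=1e-3))
```

Note that this Hessian turns negative once `log1p(y_pred) - log1p(y_true)` exceeds 1, which may destabilise boosting for badly overestimated rows. Training on `log1p(y)` with a plain RMSE objective, as done in section 1.0, avoids that issue.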

In [None]:
def plot_results(y,y_forecasted):
  fig, axs = plt.subplots(1,2,figsize=(10, 4))

  axs[0].hist(y_forecasted, bins=100, alpha=0.5, color="royalblue")
  axs[1].hist(y, bins=100, color="salmon", alpha=0.5)
  axs[0].set_xlabel("Premium Amount Forecast: Test")
  axs[1].set_xlabel("Premium Amount Forecast: Train")
  axs[0].set_ylabel("Frequency")
  plt.suptitle("Distribution of Predicted Premium Amounts", y=1.01)
  plt.show()

def store_results(for_test,for_train, model="LGBM", experiment=0):
  df_test_for = for_test.copy()
  df_train_for = for_train.copy()

  df_train_for["Average"] = df_train_for.mean(axis=1)

  train_forecast_to_store = df_train_for[["Average"]].copy()  # copy to avoid SettingWithCopyWarning below
  train_forecast_to_store["Average"] = train_forecast_to_store["Average"].astype("float")
  train_forecast_to_store.columns = [f"{model}_{experiment}"]

  plot_results(train_forecast_to_store,df_test_for)

  print(train_forecast_to_store.min(),train_forecast_to_store.max(),train_forecast_to_store.mean(),train_forecast_to_store.median())

  df_test_for.to_csv(f'/content/drive/MyDrive/Exercises/Studies_Structured_Data/Data/S4E12/Submissions/submission_{model}_{experiment}.csv')
  train_forecast_to_store.to_csv(f'/content/drive/MyDrive/Exercises/Studies_Structured_Data/Data/S4E12/Submissions/train_{model}_{experiment}.csv')

  print(f'Results of {model}_{experiment} all saved')

  return (df_test_for, train_forecast_to_store)

def feature_engineering(df:pd.DataFrame):
    df1 = df.copy()

    df1["Annual_Income_Health_Score_Ratio"] = df1["Annual Income"] / df1["Health Score"]
    df1["Annual_Income_Health_Score"] = df1["Annual Income"] * df1["Health Score"]

    df1["Annual_Income_Credit_Score_Ratio"] = df1["Annual Income"] / df1["Credit Score"]
    df1["Annual_Income_Credit_Score"] = df1["Annual Income"] * df1["Credit Score"]

    df1["Vehicle_Age_Insurance_Duration"] = df1["Vehicle Age"] / df1["Insurance Duration"].astype("float")
    df1['Annual_Income_Previous_Claims'] = df1['Previous Claims'].astype("float") * df1['Annual Income']

    df1['Start_Month_Sin'] = np.sin(df1['Start_Month'].astype("float") / 12 * 2 * np.pi)
    df1['Start_Month_Cos'] = np.cos(df1['Start_Month'].astype("float") / 12 * 2 * np.pi)
    df1['Start_Day_Sin'] = np.sin(df1['Start_Day'].astype("float") / 31 * 2 * np.pi)
    df1['Start_Day_Cos'] = np.cos(df1['Start_Day'].astype("float") / 31 * 2 * np.pi)

    df1=df1.drop(columns=["Start_Month","Start_Day"])

    return df1
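
The sine/cosine pairs in `feature_engineering` place each month (or day) on a unit circle, so the end of a cycle sits next to its start. A minimal sketch of that idea follows; the helper name `encode_cyclical` and the three sample months are illustrative assumptions, not part of the notebook.

```python
import numpy as np

def encode_cyclical(values, period):
    # Map values on a cycle of length `period` to a point on the unit circle.
    angle = np.asarray(values, dtype=float) / period * 2 * np.pi
    return np.sin(angle), np.cos(angle)

# January, June and December.
months = np.array([1, 6, 12])
sin_m, cos_m = encode_cyclical(months, period=12)
points = np.column_stack([sin_m, cos_m])

# Euclidean distances between the encoded months.
dist_dec_jan = np.linalg.norm(points[2] - points[0])
dist_jun_jan = np.linalg.norm(points[1] - points[0])
print(dist_dec_jan, dist_jun_jan)
```

With the raw integers, December (12) and January (1) are 11 units apart; on the circle they are neighbours, while June and January stay far apart. For days, a fixed period of 31 is an approximation, since shorter months do not wrap exactly.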

In [None]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1200000 entries, 0 to 1199999
Data columns (total 23 columns):
 #   Column                Non-Null Count    Dtype   
---  ------                --------------    -----   
 0   Age                   1200000 non-null  float64 
 1   Gender                1200000 non-null  category
 2   Annual Income         1200000 non-null  float64 
 3   Marital Status        1200000 non-null  category
 4   Number of Dependents  1200000 non-null  category
 5   Education Level       1200000 non-null  category
 6   Occupation            1200000 non-null  category
 7   Health Score          1200000 non-null  float64 
 8   Location              1200000 non-null  category
 9   Policy Type           1200000 non-null  category
 10  Previous Claims       1200000 non-null  category
 11  Vehicle Age           1200000 non-null  float64 
 12  Credit Score          1200000 non-null  float64 
 13  Insurance Duration    1200000 non-null  category
 14  Policy Start Date 

## **1.0 LGBMRegressor** - RMSE on Log Target

In [None]:
X = df_train_new.drop(columns=["Premium Amount"])
X_test = df_test_new.drop(columns=["Premium Amount"])
y = df_train_new["Premium Amount"]

log_y = np.log1p(y)

In [None]:
cat_loc = [X.columns.get_loc(i) for i in cat_cols]
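
Training on `log_y = np.log1p(y)` with an RMSE loss is equivalent to optimising RMSLE on the original scale, provided predictions are mapped back with `np.expm1`. The sketch below illustrates this with made-up premiums and predictions (all values are assumptions for illustration).

```python
import numpy as np

# Hypothetical premiums and predictions on the original scale.
y_true = np.array([200.0, 850.0, 1500.0, 3200.0])
y_pred = np.array([260.0, 700.0, 1650.0, 2900.0])

# RMSLE computed directly on the original scale.
rmsle = np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

# RMSE computed on log1p-transformed values, as a model trained on log_y sees it.
log_true, log_pred = np.log1p(y_true), np.log1p(y_pred)
rmse_on_log = np.sqrt(np.mean((log_pred - log_true) ** 2))

# expm1 inverts log1p; exp overshoots by exactly one unit.
restored = np.expm1(log_true)
shifted = np.exp(log_true)
print(rmsle, rmse_on_log)
```

The two error values coincide, and `np.exp` returns `y + 1` rather than `y`, which is why the inverse transform should be `np.expm1`.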

### 1.1 Optuna Optimization:

In [None]:
def objective_lgbm(trial, X, y, n_splits, n_repeats, use_gpu=False):

    model_class = LGBMRegressor

    categorical_features = cat_cols.copy()
    tot_cat = categorical_features

    numeric_features = [col for col in X.columns if col not in tot_cat]

    params = {

    'num_leaves':         101, #trial.suggest_int('num_leaves', 31, 111, step=5),
    'n_estimators':       2501,
    'learning_rate':      0.01,
    'min_child_samples':  35, #trial.suggest_int('min_child_samples', 31, 51, step=1),
    #'min_child_weight' :  trial.suggest_float("min_child_weight", 1e-2, 1.0, log=True),
    "reg_alpha" :         trial.suggest_float("reg_alpha", 1e-3, 1.0, log=True),
    "reg_lambda" :        trial.suggest_float("reg_lambda", 1e-3, 1.0, log=True),
    "max_depth" :         trial.suggest_int('max_depth', 8, 21, step=1),
    'colsample_bytree':   trial.suggest_float("colsample_bytree", 0.65, 0.95, step=0.025),
    'subsample':          trial.suggest_float("subsample", 0.65, 0.95, step=0.025),
    'random_state':       42,
    'force_col_wise':     True,
    'verbose':-1
    }

    kf = RepeatedKFold(n_splits=n_splits, n_repeats=n_repeats, random_state=42)
    rmsle_scores = []

    for idx_train, idx_valid in kf.split(X, y):

        # Split the data into training and validation sets for the current fold
        X_train, y_train = X.iloc[idx_train], y.iloc[idx_train].to_numpy().reshape(-1,1)
        X_valid, y_valid = X.iloc[idx_valid], y.iloc[idx_valid].to_numpy().reshape(-1,1)

        scaler = StandardScaler()
        X_train[num_cols] = scaler.fit_transform(X_train[num_cols])
        X_valid[num_cols] = scaler.transform(X_valid[num_cols])

        X_train = X_train.to_numpy()
        X_valid = X_valid.to_numpy()

        # Create the model
        model = model_class(**params, objective="root_mean_squared_error", metric="rmse",boosting_type='gbdt', categorical_feature=cat_loc)
        # Create the early stopping callback
        early_stop = early_stopping(stopping_rounds=101)
        # Fit the model:
        model.fit(X_train, y_train, eval_set=[(X_valid, y_valid)], callbacks=[early_stop])

        # Make predictions on the validation set
        y_pred = model.predict(X_valid)

        y_pred = np.expm1(y_pred)
        y_valid = np.expm1(y_valid)

        # Calculate the RMSLE for the current fold

        rmsle_score = root_mean_squared_log_error(y_valid, y_pred)
        rmsle_scores.append(rmsle_score)

    # Calculate the mean RMSLE score across all folds
    mean_rmsle_score = np.mean(rmsle_scores)

    return mean_rmsle_score

In [None]:
# Step 2: Tuning Hyperparameters with Optuna
def tune_hyperparameters(X, y, model_class, n_trials, n_splits_ ,n_repeats_, use_gpu=False):  #use_gpu
    study = optuna.create_study(direction='minimize', sampler=optuna.samplers.TPESampler())
    study.optimize(lambda trial: objective_lgbm(trial, X, y, n_splits=n_splits_, n_repeats=n_repeats_, use_gpu=use_gpu), n_trials=n_trials)
    return study  # Return the study object

# Step 3: Saving Best Results and Models
def save_results(study, model_class, model_name):
    best_params_file = f"{model_name}_best_params.joblib"
    joblib.dump(study.best_params, best_params_file)
    print(f"Best parameters for {model_name} saved to {best_params_file}")

    verbose_file = f"{model_name}_optuna_verbose.log"
    with open(verbose_file, "w") as f:
        f.write(str(study.trials))
    print(f"Optuna verbose for {model_name} saved to {verbose_file}")

In [None]:
# usage with LGBMRegressor
lgbm_study = tune_hyperparameters(X, log_y, model_class=LGBMRegressor, n_trials=51, n_splits_ = 3 ,n_repeats_=3, use_gpu=False)
save_results(lgbm_study, LGBMRegressor, "LGBMBoost_ext")
lgbm_params = lgbm_study.best_params

In [None]:
print(lgbm_params)

In [None]:
trial = lgbm_study.best_trial
print('RMSLE: {}'.format(trial.value))
print("Best hyperparameters: {}".format(trial.params))

* Experiment 1:

    - RMSLE: 1.0685007914142697
    - Best hyperparameters: {'num_leaves': 111, 'min_child_samples': 35, 'reg_alpha': 0.05448690915044739, 'reg_lambda': 0.016061276667668913, 'max_depth': 12, 'colsample_bytree': 0.8500000000000001, 'subsample': 0.75}

* Experiment 2:

    - RMSLE: 1.068389091315945
    - Best hyperparameters: {'reg_alpha': 0.4932, 'reg_lambda': 0.002076, 'max_depth': 21, 'colsample_bytree': 0.825, 'subsample': 0.75, 'num_leaves': 101, 'n_estimators': 2501, 'learning_rate': 0.01, 'min_child_samples': 35}

In [None]:
fig = optuna.visualization.plot_optimization_history(lgbm_study)
fig.show()

In [None]:
fig = optuna.visualization.plot_param_importances(lgbm_study)
fig.show()

In [None]:
#del xgb_study
gc.collect()

### 1.2 Fit Best Model:

**PARAMETERS**

In [None]:
# Define a common random seed for reproducibility
RANDOM_SEED = 42
N_ESTIMATORS = 3000  # Number of estimators for the ensemble models
n_splits = 3
n_repeats = 3

params ={'reg_alpha': 0.4932, 'reg_lambda': 0.002076, 'max_depth': 21, 'colsample_bytree': 0.825, 'subsample': 0.75, 'num_leaves': 101, 'n_estimators': 2501, 'learning_rate': 0.01,
         'min_child_samples':35, 'random_state': 42, 'force_col_wise':True, 'verbose':0}

cat_cols = ['Gender', 'Marital Status','Number of Dependents','Occupation','Location','Policy Type','Previous Claims', 'Insurance Duration', 'Customer Feedback',
            'Smoking Status',  'Exercise Frequency', 'Property Type', 'Start_Year', 'Start_Month', 'Start_Day','Education Level']

num_cols = ['Age', 'Annual Income', 'Health Score', 'Vehicle Age', 'Credit Score', 'Policy Start Date', 'Premium_time_Mean']

cat_loc = [X.columns.get_loc(i) for i in cat_cols]

**DATA**

In [None]:
df_subm_stack = df_subm.copy()

X = df_train_new.drop(columns=["Premium Amount"])
X_test = df_test_new.drop(columns=["Premium Amount"])
y = df_train_new["Premium Amount"]

log_y = np.log1p(y)

**FIT THE MODEL**

In [None]:
kf = RepeatedKFold(n_splits=n_splits, n_repeats=n_repeats, random_state=42)
rmsle = []

# Initialize the Stack
df_subm_stack['Premium Amount'] = 0.0

i=0

oof_results_stack = pd.DataFrame(columns=list(range(n_splits*n_repeats)), index=X.index)

for idx_train, idx_valid in kf.split(X, log_y):

    print(f"Working on Fold {i}")

    # Split the data into training and validation sets for the current fold
    X_train, y_train = X.iloc[idx_train], log_y.iloc[idx_train].to_numpy().reshape(-1,1)
    X_valid, y_valid = X.iloc[idx_valid], log_y.iloc[idx_valid].to_numpy().reshape(-1,1)
    X_test_ = X_test.copy()

    scaler = StandardScaler()
    X_train[num_cols] = scaler.fit_transform(X_train[num_cols])
    X_valid[num_cols] = scaler.transform(X_valid[num_cols])
    X_test_[num_cols] = scaler.transform(X_test[num_cols])


    X_train = X_train.to_numpy()
    X_valid = X_valid.to_numpy()
    X_test_ = X_test_.to_numpy()


    # Fits with index >= 9 are trained and saved. With n_splits*n_repeats = 9 fits
    # (i = 0..8), every model is loaded from disk; lower the threshold to retrain.
    if i >= 9:
        # Fit the LGBM model and persist it
        LGBM_model = train_lgbm(params, X_train, y_train, cat_loc, use_gpu=False, X_val=X_valid, y_val=y_valid, es=101)
        obj = sio.dump(LGBM_model, f"/content/drive/MyDrive/Exercises/Studies_Structured_Data/Models/S4E12/LGBM_base_{i}.skops")

    else:
        unknown_types = sio.get_untrusted_types(file=f"/content/drive/MyDrive/Exercises/Studies_Structured_Data/Models/S4E12/LGBM_base_{i}.skops")
        LGBM_model = sio.load(f"/content/drive/MyDrive/Exercises/Studies_Structured_Data/Models/S4E12/LGBM_base_{i}.skops", trusted=unknown_types)

    # The target is log1p(y), so invert the predictions with expm1
    stack_preds = np.expm1(LGBM_model.predict(X_valid))

    oof_results_stack.iloc[idx_valid,i] = stack_preds.flatten()
    # Score the fold on the original scale
    error = root_mean_squared_log_error(np.expm1(y_valid), stack_preds)

    rmsle.append(error)
    print(f"RMSLE fold {i}: {error}")

    # Average the test predictions across all n_splits*n_repeats fits
    df_subm_stack['Premium Amount'] += np.expm1(LGBM_model.predict(X_test_)) / (n_splits*n_repeats)
    i += 1

In [None]:
np.mean(rmsle), np.std(rmsle)
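
With repeated K-fold, each row is held out once per repeat, so the out-of-fold frame has one filled cell per repeat in every row and NaN elsewhere. The sketch below mimics that layout with NumPy only. The tiny sizes, the hand-rolled splitter, and the placeholder predictions are assumptions for illustration and do not reproduce `RepeatedKFold` exactly.

```python
import numpy as np

n_rows, n_splits, n_repeats = 12, 3, 3
rng = np.random.default_rng(42)

# One column per fit; NaN marks rows that were in the training part of that fit.
oof = np.full((n_rows, n_splits * n_repeats), np.nan)

col = 0
for _ in range(n_repeats):
    # Each repeat shuffles the rows and cuts them into n_splits validation blocks.
    order = rng.permutation(n_rows)
    for idx_valid in np.array_split(order, n_splits):
        # Placeholder "prediction": the column index stands in for model output.
        oof[idx_valid, col] = float(col)
        col += 1

# Every row is validated exactly once per repeat.
filled_per_row = np.sum(~np.isnan(oof), axis=1)
# Average only over the fits in which the row was held out.
oof_mean = np.nanmean(oof, axis=1)
print(filled_per_row, oof_mean)
```

`DataFrame.mean(axis=1)` skips NaN by default, so the "Average" column computed in `store_results` corresponds to the `np.nanmean` call here.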

### **1.3 Save Results:**

In [None]:
(df_test_for,train_forecast_to_store) = store_results(for_test=df_subm_stack,for_train=oof_results_stack, model="LGBM", experiment=0)

In [None]:
df_test_for.max(),train_forecast_to_store.max()

In [None]:
train_forecast_to_store

## **2.0 CatBoostRegressor**

In [None]:
X = df_train_new.drop(columns=["Premium Amount"])
X_test = df_test_new.drop(columns=["Premium Amount"])
y = df_train_new["Premium Amount"]

log_y = np.log1p(y)

In [None]:
cat_loc = [X.columns.get_loc(i) for i in cat_cols]

### 2.1 Optuna Optimization:

In [None]:
def objective_catboost(trial, X, y, n_splits, n_repeats, model=CatBoostRegressor, use_gpu=True):

    model_class = model

    categorical_features = cat_cols.copy()
    tot_cat = categorical_features

    numeric_features = [col for col in X.columns if col not in tot_cat]

    params = {
        'iterations': 2501,
        'learning_rate': 0.025, #trial.suggest_float('learning_rate', 0.01, 0.3, log=True),
        'depth': trial.suggest_int('depth', 7, 15),
        'l2_leaf_reg': trial.suggest_float('l2_leaf_reg', 1e-4, 1.0, log=True),
        "bootstrap_type": "Bernoulli",
        'subsample': trial.suggest_float('subsample', 0.65, 0.90,step=0.025),
        'random_strength': trial.suggest_float('random_strength', 0.0, 1.5),
        #'border_count': trial.suggest_int('border_count', 32, 255),
        'cat_features': categorical_features,
        'task_type': 'GPU' if use_gpu else 'CPU',
        'verbose': 100
    }

    kf = RepeatedKFold(n_splits=n_splits, n_repeats=n_repeats, random_state=42)
    rmsle_scores = []

    for idx_train, idx_valid in kf.split(X, y):

        # Split the data into training and validation sets for the current fold
        X_train, y_train = X.iloc[idx_train], y.iloc[idx_train].to_numpy().reshape(-1,1)
        X_valid, y_valid = X.iloc[idx_valid], y.iloc[idx_valid].to_numpy().reshape(-1,1)

        scaler = StandardScaler()
        X_train[num_cols] = scaler.fit_transform(X_train[num_cols])
        X_valid[num_cols] = scaler.transform(X_valid[num_cols])

        # Create the Pool objects for CatBoost
        train_pool = Pool(data=X_train, label=y_train, cat_features=categorical_features)
        valid_pool = Pool(data=X_valid, label=y_valid, cat_features=categorical_features)

        # Create the model
        model = model_class(**params, loss_function='RMSE', eval_metric='RMSE')
        # Fit the model:
        model.fit(train_pool, eval_set=valid_pool, early_stopping_rounds=101,
                  #callbacks=[optuna.integration.CatBoostPruningCallback(trial, "RMSE")]
                  )

        # Make predictions on the validation set
        y_pred = model.predict(X_valid)

        y_pred = np.expm1(y_pred)
        y_valid = np.expm1(y_valid)

        # Calculate the RMSLE for the current fold

        rmsle_score = root_mean_squared_log_error(y_valid, y_pred)
        rmsle_scores.append(rmsle_score)

    # Calculate the mean RMSLE score across all folds
    mean_rmsle_score = np.mean(rmsle_scores)

    return mean_rmsle_score

In [None]:
# Step 2: Tuning Hyperparameters with Optuna
def tune_hyperparameters(X, y, model_class, n_trials, n_splits_ ,n_repeats_, use_gpu=True):  #use_gpu
    study = optuna.create_study(direction='minimize', sampler=optuna.samplers.TPESampler(), pruner=optuna.pruners.MedianPruner(n_warmup_steps=50))
    study.optimize(lambda trial: objective_catboost(trial, X, y, n_splits=n_splits_, n_repeats=n_repeats_, model=model_class, use_gpu=use_gpu), n_trials=n_trials)
    return study  # Return the study object

# Step 3: Saving Best Results and Models
def save_results(study, model_class, model_name):
    best_params_file = f"{model_name}_best_params.joblib"
    joblib.dump(study.best_params, best_params_file)
    print(f"Best parameters for {model_name} saved to {best_params_file}")

    verbose_file = f"{model_name}_optuna_verbose.log"
    with open(verbose_file, "w") as f:
        f.write(str(study.trials))
    print(f"Optuna verbose for {model_name} saved to {verbose_file}")

In [None]:
# usage with CatBoostRegressor
cat_study = tune_hyperparameters(X, log_y, model_class=CatBoostRegressor, n_trials=101, n_splits_ = 3 ,n_repeats_=3, use_gpu=True)
save_results(cat_study, CatBoostRegressor, "CatBoost_ext")
cat_params = cat_study.best_params

[I 2024-12-12 19:11:51,789] A new study created in memory with name: no-name-20b61bd5-fc59-49d4-b064-9bcf33e3d6bf


0:	learn: 1.0942937	test: 1.0959597	best: 1.0959597 (0)	total: 55.4ms	remaining: 2m 18s
100:	learn: 1.0732499	test: 1.0751133	best: 1.0751133 (100)	total: 4.18s	remaining: 1m 39s
200:	learn: 1.0718021	test: 1.0741140	best: 1.0741140 (200)	total: 8.2s	remaining: 1m 33s
300:	learn: 1.0709775	test: 1.0737213	best: 1.0737213 (300)	total: 12s	remaining: 1m 27s
400:	learn: 1.0702517	test: 1.0734661	best: 1.0734661 (400)	total: 15.9s	remaining: 1m 23s
500:	learn: 1.0695737	test: 1.0732931	best: 1.0732931 (500)	total: 19.8s	remaining: 1m 18s
600:	learn: 1.0685728	test: 1.0728312	best: 1.0728312 (600)	total: 23.7s	remaining: 1m 14s
700:	learn: 1.0673791	test: 1.0722562	best: 1.0722540 (698)	total: 27.9s	remaining: 1m 11s
800:	learn: 1.0664158	test: 1.0719886	best: 1.0719860 (797)	total: 32s	remaining: 1m 7s
900:	learn: 1.0654992	test: 1.0717719	best: 1.0717711 (896)	total: 36s	remaining: 1m 3s
1000:	learn: 1.0646683	test: 1.0715839	best: 1.0715839 (999)	total: 40.2s	remaining: 1m
1100:	learn: 1

[I 2024-12-12 19:27:52,641] Trial 0 finished with value: 1.0699177857001105 and parameters: {'depth': 8, 'l2_leaf_reg': 0.014594085473136007, 'subsample ': 0.75, 'random_strength': 1.4825387145010973}. Best is trial 0 with value: 1.0699177857001105.


0:	learn: 1.0941498	test: 1.0958334	best: 1.0958334 (0)	total: 156ms	remaining: 6m 29s
100:	learn: 1.0665326	test: 1.0732521	best: 1.0732521 (100)	total: 13.1s	remaining: 5m 12s
200:	learn: 1.0579690	test: 1.0722933	best: 1.0722933 (200)	total: 26.3s	remaining: 5m 1s
300:	learn: 1.0476860	test: 1.0719895	best: 1.0719895 (300)	total: 39.4s	remaining: 4m 47s
400:	learn: 1.0380204	test: 1.0717974	best: 1.0717923 (384)	total: 52.4s	remaining: 4m 34s
500:	learn: 1.0280587	test: 1.0717992	best: 1.0717541 (448)	total: 1m 5s	remaining: 4m 21s
bestTest = 1.071754075
bestIteration = 448
Shrink model to first 449 iterations.
0:	learn: 1.0954400	test: 1.0932569	best: 1.0932569 (0)	total: 147ms	remaining: 6m 7s
100:	learn: 1.0678266	test: 1.0713042	best: 1.0713042 (100)	total: 13.2s	remaining: 5m 13s
200:	learn: 1.0580321	test: 1.0703879	best: 1.0703821 (195)	total: 26.5s	remaining: 5m 3s
300:	learn: 1.0475933	test: 1.0701235	best: 1.0701194 (299)	total: 39.6s	remaining: 4m 49s
400:	learn: 1.037934

[I 2024-12-12 19:41:40,747] Trial 1 finished with value: 1.0708524212754442 and parameters: {'depth': 13, 'l2_leaf_reg': 0.4894207717033368, 'subsample ': 0.675, 'random_strength': 0.18065512282671958}. Best is trial 0 with value: 1.0699177857001105.


0:	learn: 1.0943468	test: 1.0960140	best: 1.0960140 (0)	total: 40.7ms	remaining: 1m 41s
100:	learn: 1.0735099	test: 1.0751905	best: 1.0751905 (100)	total: 3.43s	remaining: 1m 21s
200:	learn: 1.0719134	test: 1.0738273	best: 1.0738273 (200)	total: 6.82s	remaining: 1m 18s
300:	learn: 1.0711266	test: 1.0733439	best: 1.0733439 (300)	total: 10.3s	remaining: 1m 14s
400:	learn: 1.0705044	test: 1.0730026	best: 1.0730026 (400)	total: 13.7s	remaining: 1m 11s
500:	learn: 1.0699073	test: 1.0727043	best: 1.0727043 (500)	total: 17s	remaining: 1m 7s
600:	learn: 1.0693218	test: 1.0724620	best: 1.0724620 (600)	total: 20.4s	remaining: 1m 4s
700:	learn: 1.0686919	test: 1.0721750	best: 1.0721750 (700)	total: 23.9s	remaining: 1m 1s
800:	learn: 1.0681023	test: 1.0719574	best: 1.0719563 (798)	total: 27.3s	remaining: 58s
900:	learn: 1.0675615	test: 1.0717869	best: 1.0717869 (900)	total: 30.8s	remaining: 54.8s
1000:	learn: 1.0670817	test: 1.0716743	best: 1.0716733 (995)	total: 34.3s	remaining: 51.4s
1100:	learn

[I 2024-12-12 19:55:51,277] Trial 2 finished with value: 1.0700264878356094 and parameters: {'depth': 7, 'l2_leaf_reg': 0.0316983898827619, 'subsample ': 0.7250000000000001, 'random_strength': 0.5550904494701409}. Best is trial 0 with value: 1.0699177857001105.


0:	learn: 1.0941969	test: 1.0958646	best: 1.0958646 (0)	total: 81.1ms	remaining: 3m 22s
100:	learn: 1.0709477	test: 1.0742565	best: 1.0742565 (100)	total: 7.13s	remaining: 2m 49s
200:	learn: 1.0675869	test: 1.0734534	best: 1.0734534 (200)	total: 14s	remaining: 2m 39s
300:	learn: 1.0644980	test: 1.0731261	best: 1.0731261 (300)	total: 20.5s	remaining: 2m 30s
400:	learn: 1.0612039	test: 1.0728958	best: 1.0728936 (395)	total: 27.2s	remaining: 2m 22s
500:	learn: 1.0580480	test: 1.0726219	best: 1.0726219 (500)	total: 33.7s	remaining: 2m 14s
600:	learn: 1.0544668	test: 1.0721146	best: 1.0721128 (599)	total: 40.7s	remaining: 2m 8s
700:	learn: 1.0506127	test: 1.0717328	best: 1.0717328 (700)	total: 47.8s	remaining: 2m 2s
800:	learn: 1.0467507	test: 1.0714599	best: 1.0714586 (799)	total: 55s	remaining: 1m 56s
900:	learn: 1.0426628	test: 1.0712765	best: 1.0712764 (892)	total: 1m 2s	remaining: 1m 50s
1000:	learn: 1.0387538	test: 1.0711590	best: 1.0711558 (987)	total: 1m 9s	remaining: 1m 43s
1100:	l

[I 2024-12-12 20:11:02,927] Trial 3 finished with value: 1.0702169887942883 and parameters: {'depth': 11, 'l2_leaf_reg': 0.0024721909778884464, 'subsample ': 0.8500000000000001, 'random_strength': 1.3943672976622465}. Best is trial 0 with value: 1.0699177857001105.


0:	learn: 1.0941184	test: 1.0958319	best: 1.0958319 (0)	total: 226ms	remaining: 9m 23s
100:	learn: 1.0615785	test: 1.0737977	best: 1.0737977 (100)	total: 21.5s	remaining: 8m 30s
200:	learn: 1.0451598	test: 1.0731872	best: 1.0731747 (182)	total: 42.3s	remaining: 8m 3s
300:	learn: 1.0294827	test: 1.0729646	best: 1.0729496 (289)	total: 1m 3s	remaining: 7m 40s
400:	learn: 1.0116788	test: 1.0729352	best: 1.0728463 (348)	total: 1m 23s	remaining: 7m 19s
500:	learn: 0.9924460	test: 1.0729547	best: 1.0728381 (449)	total: 1m 44s	remaining: 6m 58s
bestTest = 1.072838124
bestIteration = 449
Shrink model to first 450 iterations.
0:	learn: 1.0954080	test: 1.0932496	best: 1.0932496 (0)	total: 218ms	remaining: 9m 6s
100:	learn: 1.0636975	test: 1.0718946	best: 1.0718946 (100)	total: 21.4s	remaining: 8m 27s
200:	learn: 1.0476855	test: 1.0712751	best: 1.0712718 (196)	total: 42.1s	remaining: 8m 2s
300:	learn: 1.0301392	test: 1.0711053	best: 1.0710848 (293)	total: 1m 2s	remaining: 7m 39s
400:	learn: 1.0125

[I 2024-12-12 20:28:47,750] Trial 4 finished with value: 1.0719419070887342 and parameters: {'depth': 14, 'l2_leaf_reg': 0.0001657688972744093, 'subsample ': 0.8, 'random_strength': 0.9600716941230956}. Best is trial 0 with value: 1.0699177857001105.


0:	learn: 1.0941969	test: 1.0958646	best: 1.0958646 (0)	total: 81ms	remaining: 3m 22s
100:	learn: 1.0705246	test: 1.0737835	best: 1.0737835 (100)	total: 7.19s	remaining: 2m 50s
200:	learn: 1.0668245	test: 1.0728253	best: 1.0728246 (199)	total: 14.1s	remaining: 2m 41s
300:	learn: 1.0633245	test: 1.0724164	best: 1.0724163 (298)	total: 20.9s	remaining: 2m 32s
400:	learn: 1.0600188	test: 1.0721595	best: 1.0721584 (399)	total: 27.7s	remaining: 2m 25s
500:	learn: 1.0563206	test: 1.0718978	best: 1.0718978 (500)	total: 34.5s	remaining: 2m 17s
600:	learn: 1.0524027	test: 1.0717224	best: 1.0717185 (592)	total: 41.5s	remaining: 2m 11s
700:	learn: 1.0482343	test: 1.0714290	best: 1.0714290 (700)	total: 48.5s	remaining: 2m 4s
800:	learn: 1.0442288	test: 1.0712824	best: 1.0712796 (795)	total: 55.6s	remaining: 1m 57s
900:	learn: 1.0401562	test: 1.0711769	best: 1.0711694 (887)	total: 1m 2s	remaining: 1m 51s
1000:	learn: 1.0360212	test: 1.0711574	best: 1.0711495 (930)	total: 1m 9s	remaining: 1m 44s
1100

[I 2024-12-12 20:42:12,768] Trial 5 finished with value: 1.070418924961286 and parameters: {'depth': 11, 'l2_leaf_reg': 0.0004842019267573844, 'subsample ': 0.7000000000000001, 'random_strength': 0.65233061303638}. Best is trial 0 with value: 1.0699177857001105.


0:	learn: 1.0941462	test: 1.0958322	best: 1.0958322 (0)	total: 155ms	remaining: 6m 26s
100:	learn: 1.0670410	test: 1.0736533	best: 1.0736533 (100)	total: 13.4s	remaining: 5m 17s
200:	learn: 1.0577410	test: 1.0728516	best: 1.0728486 (197)	total: 26.4s	remaining: 5m 1s
300:	learn: 1.0489915	test: 1.0725126	best: 1.0725126 (300)	total: 39.1s	remaining: 4m 46s
400:	learn: 1.0390312	test: 1.0724415	best: 1.0724330 (395)	total: 51.9s	remaining: 4m 31s
500:	learn: 1.0283317	test: 1.0722939	best: 1.0722899 (499)	total: 1m 4s	remaining: 4m 18s
600:	learn: 1.0172173	test: 1.0721923	best: 1.0721923 (600)	total: 1m 17s	remaining: 4m 6s
700:	learn: 1.0068125	test: 1.0719393	best: 1.0719278 (696)	total: 1m 31s	remaining: 3m 53s
800:	learn: 0.9958971	test: 1.0718868	best: 1.0718370 (759)	total: 1m 44s	remaining: 3m 41s
900:	learn: 0.9845715	test: 1.0718480	best: 1.0718265 (862)	total: 1m 57s	remaining: 3m 28s
bestTest = 1.071826493
bestIteration = 862
Shrink model to first 863 iterations.
0:	learn: 1

[I 2024-12-12 21:00:46,159] Trial 6 finished with value: 1.0710234720032414 and parameters: {'depth': 13, 'l2_leaf_reg': 0.11989368038763118, 'subsample ': 0.8, 'random_strength': 0.8165196009634836}. Best is trial 0 with value: 1.0699177857001105.


0:	learn: 1.0942474	test: 1.0959139	best: 1.0959139 (0)	total: 53.9ms	remaining: 2m 14s
100:	learn: 1.0721214	test: 1.0740551	best: 1.0740551 (100)	total: 4.96s	remaining: 1m 57s
200:	learn: 1.0702226	test: 1.0729606	best: 1.0729606 (200)	total: 9.71s	remaining: 1m 51s
300:	learn: 1.0689554	test: 1.0726031	best: 1.0726031 (300)	total: 14.4s	remaining: 1m 45s
400:	learn: 1.0676672	test: 1.0722243	best: 1.0722243 (400)	total: 19.1s	remaining: 1m 39s
500:	learn: 1.0664398	test: 1.0719692	best: 1.0719692 (500)	total: 23.7s	remaining: 1m 34s
600:	learn: 1.0650821	test: 1.0717118	best: 1.0717114 (599)	total: 28.4s	remaining: 1m 29s
700:	learn: 1.0636534	test: 1.0714054	best: 1.0714043 (698)	total: 33.4s	remaining: 1m 25s
800:	learn: 1.0623030	test: 1.0711938	best: 1.0711938 (800)	total: 38.2s	remaining: 1m 20s
900:	learn: 1.0609769	test: 1.0710352	best: 1.0710352 (900)	total: 43s	remaining: 1m 16s
1000:	learn: 1.0597211	test: 1.0709100	best: 1.0709100 (1000)	total: 47.8s	remaining: 1m 11s
11

[I 2024-12-12 21:16:01,322] Trial 7 finished with value: 1.069768965999698 and parameters: {'depth': 9, 'l2_leaf_reg': 0.000311043236374346, 'subsample ': 0.9, 'random_strength': 0.6528523995521294}. Best is trial 7 with value: 1.069768965999698.


0:	learn: 1.0942485	test: 1.0959146	best: 1.0959146 (0)	total: 53.4ms	remaining: 2m 13s
100:	learn: 1.0721885	test: 1.0741154	best: 1.0741154 (100)	total: 4.87s	remaining: 1m 55s
200:	learn: 1.0703195	test: 1.0731883	best: 1.0731883 (200)	total: 9.63s	remaining: 1m 50s
300:	learn: 1.0689758	test: 1.0727173	best: 1.0727171 (299)	total: 14.3s	remaining: 1m 44s
400:	learn: 1.0677226	test: 1.0724378	best: 1.0724378 (400)	total: 19s	remaining: 1m 39s
500:	learn: 1.0665009	test: 1.0722632	best: 1.0722600 (492)	total: 23.8s	remaining: 1m 34s
600:	learn: 1.0650997	test: 1.0720289	best: 1.0720289 (600)	total: 28.6s	remaining: 1m 30s
700:	learn: 1.0635895	test: 1.0717006	best: 1.0717006 (700)	total: 33.4s	remaining: 1m 25s
800:	learn: 1.0621919	test: 1.0714776	best: 1.0714774 (799)	total: 38.2s	remaining: 1m 21s
900:	learn: 1.0608203	test: 1.0713463	best: 1.0713429 (880)	total: 43s	remaining: 1m 16s
1000:	learn: 1.0594572	test: 1.0712448	best: 1.0712437 (995)	total: 47.9s	remaining: 1m 11s
1100:

[I 2024-12-12 21:29:40,364] Trial 8 finished with value: 1.0701854921404137 and parameters: {'depth': 9, 'l2_leaf_reg': 0.046665379515572106, 'subsample ': 0.65, 'random_strength': 0.6013792352106349}. Best is trial 7 with value: 1.069768965999698.


0:	learn: 1.0942937	test: 1.0959597	best: 1.0959597 (0)	total: 51.2ms	remaining: 2m 8s
100:	learn: 1.0727557	test: 1.0745133	best: 1.0745133 (100)	total: 4.22s	remaining: 1m 40s
200:	learn: 1.0711750	test: 1.0733772	best: 1.0733772 (200)	total: 8.33s	remaining: 1m 35s
300:	learn: 1.0703384	test: 1.0729529	best: 1.0729529 (300)	total: 12.3s	remaining: 1m 29s
400:	learn: 1.0695031	test: 1.0726562	best: 1.0726558 (398)	total: 16.3s	remaining: 1m 25s
500:	learn: 1.0687676	test: 1.0724867	best: 1.0724867 (500)	total: 20.2s	remaining: 1m 20s
600:	learn: 1.0678167	test: 1.0721293	best: 1.0721293 (600)	total: 24.3s	remaining: 1m 16s
700:	learn: 1.0668549	test: 1.0717821	best: 1.0717815 (699)	total: 28.5s	remaining: 1m 13s
800:	learn: 1.0659858	test: 1.0715849	best: 1.0715840 (796)	total: 32.6s	remaining: 1m 9s
900:	learn: 1.0650474	test: 1.0714098	best: 1.0714088 (898)	total: 36.7s	remaining: 1m 5s
1000:	learn: 1.0642425	test: 1.0712483	best: 1.0712469 (999)	total: 40.8s	remaining: 1m 1s
1100:

[I 2024-12-12 21:44:39,997] Trial 9 finished with value: 1.069895564363912 and parameters: {'depth': 8, 'l2_leaf_reg': 0.03209938410939924, 'subsample ': 0.775, 'random_strength': 0.649040589121398}. Best is trial 7 with value: 1.069768965999698.


0:	learn: 1.0942128	test: 1.0958823	best: 1.0958823 (0)	total: 65.6ms	remaining: 2m 43s
100:	learn: 1.0712496	test: 1.0734637	best: 1.0734637 (100)	total: 5.78s	remaining: 2m 17s
200:	learn: 1.0684870	test: 1.0721937	best: 1.0721937 (200)	total: 11.5s	remaining: 2m 12s
300:	learn: 1.0662804	test: 1.0718342	best: 1.0718334 (297)	total: 17.4s	remaining: 2m 6s
400:	learn: 1.0643495	test: 1.0715699	best: 1.0715675 (396)	total: 23.1s	remaining: 2m
500:	learn: 1.0621854	test: 1.0713787	best: 1.0713767 (489)	total: 28.7s	remaining: 1m 54s
600:	learn: 1.0599512	test: 1.0711689	best: 1.0711688 (598)	total: 34.3s	remaining: 1m 48s
700:	learn: 1.0577176	test: 1.0710337	best: 1.0710337 (700)	total: 40s	remaining: 1m 42s
800:	learn: 1.0554441	test: 1.0709724	best: 1.0709724 (800)	total: 45.7s	remaining: 1m 37s
900:	learn: 1.0532851	test: 1.0708651	best: 1.0708651 (900)	total: 51.4s	remaining: 1m 31s
1000:	learn: 1.0511015	test: 1.0708035	best: 1.0708030 (999)	total: 57s	remaining: 1m 25s
1100:	lear

[I 2024-12-12 21:57:58,498] Trial 10 finished with value: 1.0698736803411668 and parameters: {'depth': 10, 'l2_leaf_reg': 0.002107616901534704, 'subsample ': 0.9, 'random_strength': 0.05727601756471312}. Best is trial 7 with value: 1.069768965999698.


0:	learn: 1.0942071	test: 1.0958747	best: 1.0958747 (0)	total: 65ms	remaining: 2m 42s
100:	learn: 1.0712169	test: 1.0733762	best: 1.0733762 (100)	total: 5.73s	remaining: 2m 16s
200:	learn: 1.0685384	test: 1.0721843	best: 1.0721843 (200)	total: 11.5s	remaining: 2m 11s
300:	learn: 1.0663594	test: 1.0718038	best: 1.0718038 (300)	total: 17.4s	remaining: 2m 7s
400:	learn: 1.0643008	test: 1.0715330	best: 1.0715316 (398)	total: 23s	remaining: 2m
500:	learn: 1.0621812	test: 1.0713031	best: 1.0712992 (494)	total: 28.7s	remaining: 1m 54s
600:	learn: 1.0599525	test: 1.0710889	best: 1.0710889 (600)	total: 34.4s	remaining: 1m 48s
700:	learn: 1.0576706	test: 1.0709719	best: 1.0709687 (694)	total: 40s	remaining: 1m 42s
800:	learn: 1.0555102	test: 1.0708931	best: 1.0708931 (800)	total: 45.7s	remaining: 1m 36s
900:	learn: 1.0534452	test: 1.0707973	best: 1.0707973 (900)	total: 51.4s	remaining: 1m 31s
1000:	learn: 1.0512695	test: 1.0707504	best: 1.0707467 (975)	total: 57.1s	remaining: 1m 25s
1100:	learn:

[I 2024-12-12 22:11:32,671] Trial 11 finished with value: 1.0698720473171996 and parameters: {'depth': 10, 'l2_leaf_reg': 0.002898145622220784, 'subsample ': 0.9, 'random_strength': 0.02145613418410449}. Best is trial 7 with value: 1.069768965999698.


0:	learn: 1.0942135	test: 1.0958835	best: 1.0958835 (0)	total: 67.1ms	remaining: 2m 47s
100:	learn: 1.0712947	test: 1.0735287	best: 1.0735287 (100)	total: 5.75s	remaining: 2m 16s
200:	learn: 1.0685988	test: 1.0722651	best: 1.0722651 (200)	total: 11.5s	remaining: 2m 11s
300:	learn: 1.0664163	test: 1.0718537	best: 1.0718537 (300)	total: 17.2s	remaining: 2m 6s
400:	learn: 1.0642830	test: 1.0715907	best: 1.0715899 (397)	total: 22.8s	remaining: 1m 59s
500:	learn: 1.0621561	test: 1.0714194	best: 1.0714163 (496)	total: 28.5s	remaining: 1m 53s
600:	learn: 1.0600214	test: 1.0711994	best: 1.0711994 (600)	total: 34.2s	remaining: 1m 48s
700:	learn: 1.0576789	test: 1.0710557	best: 1.0710557 (700)	total: 40s	remaining: 1m 42s
800:	learn: 1.0554480	test: 1.0709632	best: 1.0709632 (800)	total: 45.6s	remaining: 1m 36s
900:	learn: 1.0533629	test: 1.0708668	best: 1.0708668 (900)	total: 51.3s	remaining: 1m 31s
1000:	learn: 1.0511510	test: 1.0707896	best: 1.0707855 (991)	total: 57s	remaining: 1m 25s
1100:	

[I 2024-12-12 22:24:48,936] Trial 12 finished with value: 1.0698615217111656 and parameters: {'depth': 10, 'l2_leaf_reg': 0.0022776338668776173, 'subsample ': 0.9, 'random_strength': 0.2640538758931447}. Best is trial 7 with value: 1.069768965999698.


0:	learn: 1.0942485	test: 1.0959145	best: 1.0959145 (0)	total: 54.3ms	remaining: 2m 15s
100:	learn: 1.0720125	test: 1.0739083	best: 1.0739083 (100)	total: 4.83s	remaining: 1m 54s
200:	learn: 1.0700126	test: 1.0727455	best: 1.0727455 (200)	total: 9.63s	remaining: 1m 50s
300:	learn: 1.0685728	test: 1.0723438	best: 1.0723438 (300)	total: 14.5s	remaining: 1m 45s
400:	learn: 1.0672576	test: 1.0719843	best: 1.0719837 (398)	total: 19.1s	remaining: 1m 40s


In [None]:
print(cat_params)
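
The trial logs above list the sampled parameter as `'subsample '`, with a trailing space, so the `cat_params` from that run would carry the same key. Here is a minimal sketch of one way to normalise the names before reusing the dictionary. `normalize_param_names` is a hypothetical helper, and the values are copied from the Trial 7 log line.

```python
# Parameters as logged for Trial 7; note the trailing space in 'subsample '.
logged_params = {
    "depth": 9,
    "l2_leaf_reg": 0.000311043236374346,
    "subsample ": 0.9,
    "random_strength": 0.6528523995521294,
}


def normalize_param_names(params):
    """Return a copy of params with whitespace stripped from every key (hypothetical helper)."""
    cleaned = {}
    for name, value in params.items():
        key = name.strip()
        if key in cleaned:
            # Two keys that differ only by whitespace would silently overwrite each other
            raise ValueError(f"duplicate parameter after stripping: {key!r}")
        cleaned[key] = value
    return cleaned


clean_params = normalize_param_names(logged_params)
print(sorted(clean_params))
```

The cleaned dictionary can then be unpacked into a model constructor without the stray key being rejected as an unknown argument.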

In [None]:
trial = cat_study.best_trial
print('RMSLE: {}'.format(trial.value))  # the objective returns the mean RMSLE across folds
print("Best hyperparameters: {}".format(trial.params))

* Experiment 1:

    - RMSLE: 1.0685007914142697
    - Best hyperparameters: {'num_leaves': 111, 'min_child_samples': 35, 'reg_alpha': 005448690915044739, 'reg_lambda': 0.016061276667668913, 'max_depth': 12, 'colsample_bytree': 0.8500000000000001, 'subsample': 0.75}

* Experiment 2:

    - RMSLE: 1.068389091315945
    - Best hyperparameters: {'reg_alpha': 0.4932, 'reg_lambda': 0.002076, 'max_depth': 21, 'colsample_bytree': 0.825, 'subsample': 0.75,  'num_leaves': 101, 'n_estimators': 2501,'learning_rate': 0.01, 'min_child_samples':  35}    

In [None]:
fig = optuna.visualization.plot_optimization_history(cat_study)
fig.show()

In [None]:
fig = optuna.visualization.plot_param_importances(cat_study)
fig.show()

In [None]:
#del xgb_study
gc.collect()

### 1.2 Fit Best Model:

**PARAMETERS**

In [None]:
# Define a common random seed for reproducibility
RANDOM_SEED = 42
N_ESTIMATORS = 3000  # Number of estimators for the ensemble models
n_splits = 3
n_repeats = 3

params ={'reg_alpha': 0.4932, 'reg_lambda': 0.002076, 'max_depth': 21, 'colsample_bytree': 0.825, 'subsample': 0.75, 'num_leaves': 101, 'n_estimators': 2501, 'learning_rate': 0.01,
         'min_child_samples':35, 'random_state': 42, 'force_col_wise':True, 'verbose':0}

cat_cols = ['Gender', 'Marital Status','Number of Dependents','Occupation','Location','Policy Type','Previous Claims', 'Insurance Duration', 'Customer Feedback',
            'Smoking Status',  'Exercise Frequency', 'Property Type', 'Start_Year', 'Start_Month', 'Start_Day','Education Level']

num_cols = ['Age', 'Annual Income', 'Health Score', 'Vehicle Age', 'Credit Score', 'Policy Start Date', 'Premium_time_Mean']

cat_loc = [X.columns.get_loc(i) for i in cat_cols]

**DATA**

In [None]:
df_subm_stack = df_subm.copy()

X = df_train_new.drop(columns=["Premium Amount"])
X_test = df_test_new.drop(columns=["Premium Amount"])
y = df_train_new["Premium Amount"]

log_y = np.log1p(y)
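
Because the target is transformed with `log1p`, predictions have to go back through `expm1` to land on the original scale. A plain `exp` inverts `log`, not `log1p`, and leaves every value one unit too high. A small standard-library sketch of the round trip follows; the premium values are made up for illustration.

```python
import math

premiums = [0.0, 20.0, 1090.31, 4999.0]  # illustrative values, not taken from the dataset

for y in premiums:
    z = math.log1p(y)        # forward transform applied to the target
    back_ok = math.expm1(z)  # exact inverse of log1p
    back_off = math.exp(z)   # inverse of log, so it returns y + 1
    print(f"{y:>8.2f} -> expm1: {back_ok:.2f}, exp: {back_off:.2f}")

# Largest round-trip error with the matching inverse
round_trip_error = max(abs(math.expm1(math.log1p(y)) - y) for y in premiums)
# Offset introduced by the mismatched inverse
offset = [math.exp(math.log1p(y)) - y for y in premiums]
```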

**FIT THE MODEL**

In [None]:
kf = RepeatedKFold(n_splits=n_splits, n_repeats=n_repeats, random_state=42)
rmsle = []

# Initialize the Stack
df_subm_stack['Premium Amount'] = 0.0

i=0

oof_results_stack = pd.DataFrame(columns=list(range(n_splits*n_repeats)), index=X.index)

for idx_train, idx_valid in kf.split(X, log_y):

    print(f"Working on Fold {i}")

    # Split the data into training and validation sets for the current fold
    # (explicit copies, so the scaling below does not write into a slice of X)
    X_train, y_train = X.iloc[idx_train].copy(), log_y.iloc[idx_train].to_numpy().reshape(-1,1)
    X_valid, y_valid = X.iloc[idx_valid].copy(), log_y.iloc[idx_valid].to_numpy().reshape(-1,1)
    X_test_ = X_test.copy()

    scaler = StandardScaler()
    X_train[num_cols] = scaler.fit_transform(X_train[num_cols])
    X_valid[num_cols] = scaler.transform(X_valid[num_cols])
    X_test_[num_cols] = scaler.transform(X_test[num_cols])


    X_train = X_train.to_numpy()
    X_valid = X_valid.to_numpy()
    X_test_ = X_test_.to_numpy()


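    # Fold-level cache switch: with n_splits*n_repeats = 9 folds, i runs from 0 to 8,
    # so `i >= 9` is never true and every fold loads its saved model. Lower the threshold to retrain.
    if i >= 9:
        # Fit the LightGBM model and persist it
        LGBM_model = train_lgbm(params, X_train, y_train, cat_loc, use_gpu=False, X_val=X_valid, y_val=y_valid, es=101)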
        obj = sio.dump(LGBM_model, f"/content/drive/MyDrive/Exercises/Studies_Structured_Data/Models/S4E12/LGBM_base_{i}.skops")

    else:
        unknown_types = sio.get_untrusted_types(file=f"/content/drive/MyDrive/Exercises/Studies_Structured_Data/Models/S4E12/LGBM_base_{i}.skops")
        LGBM_model = sio.load(f"/content/drive/MyDrive/Exercises/Studies_Structured_Data/Models/S4E12/LGBM_base_{i}.skops", trusted=unknown_types)

    # Predictions are in log1p space; invert with expm1 to match log_y = np.log1p(y)
    stack_preds = np.expm1(LGBM_model.predict(X_valid))

    oof_results_stack.iloc[idx_valid,i] = stack_preds.flatten()
    # Score the fold on the original scale
    error = root_mean_squared_log_error(np.expm1(y_valid), stack_preds)

    rmsle.append(error)
    print(f"RMSLE fold {i}: {error}")

    # Average the test predictions across all n_splits*n_repeats folds
    df_subm_stack['Premium Amount'] += np.expm1(LGBM_model.predict(X_test_)) / (n_splits*n_repeats)
    i += 1
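
Inside each fold the scaler is fitted on the training rows only and then reused for the validation and test rows, which keeps validation statistics out of the preprocessing. The sketch below shows the same idea with the standard library. `fit_standardizer` and `standardize` are hypothetical stand-ins for `StandardScaler.fit` and `.transform`, and the ages are made up.

```python
import statistics


def fit_standardizer(values):
    """Learn mean and population std (ddof=0) from the training values only."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    # Guard against constant columns so the division below is always defined
    return mean, (std if std > 0 else 1.0)


def standardize(values, mean, std):
    """Apply previously learned statistics to any split."""
    return [(v - mean) / std for v in values]


train_age = [19.0, 39.0, 23.0, 51.0]  # illustrative training fold
valid_age = [30.0, 64.0]              # illustrative validation fold

mean, std = fit_standardizer(train_age)
train_scaled = standardize(train_age, mean, std)
valid_scaled = standardize(valid_age, mean, std)  # reuses the training statistics

print(mean, round(std, 4))
print([round(v, 4) for v in valid_scaled])
```

The validation values are not centred on zero after scaling, which is expected: they are measured against the training distribution.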

In [None]:
np.mean(rmsle), np.std(rmsle)

### **1.3 Save Results:**

In [None]:
(df_test_for,train_forecast_to_store) = store_results(for_test=df_subm_stack,for_train=oof_results_stack, model="LGBM", experiment=0)
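
The out-of-fold frame has one column per fold (`n_splits * n_repeats` of them), and each row is filled only in the folds where it was held out. With repeated K-fold that happens exactly `n_repeats` times per row. The sketch below checks that bookkeeping with a simplified, standard-library stand-in for `RepeatedKFold`. `repeated_kfold_indices` is hypothetical and does not reproduce scikit-learn's shuffling.

```python
import random


def repeated_kfold_indices(n_samples, n_splits, n_repeats, seed=42):
    """Yield (train_idx, valid_idx): n_splits folds for each of n_repeats shuffles."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        order = list(range(n_samples))
        rng.shuffle(order)
        for k in range(n_splits):
            # Every n_splits-th element of the shuffled order forms one validation block
            valid = sorted(order[k::n_splits])
            valid_set = set(valid)
            train = [j for j in range(n_samples) if j not in valid_set]
            yield train, valid


n_samples, n_splits, n_repeats = 10, 3, 3
times_validated = [0] * n_samples
n_folds = 0
for train_idx, valid_idx in repeated_kfold_indices(n_samples, n_splits, n_repeats):
    n_folds += 1
    for j in valid_idx:
        times_validated[j] += 1

print(n_folds, times_validated)
```

One consequence is that a row-wise mean over the filled columns gives a single out-of-fold prediction per training row, averaged over the repeats.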

In [None]:
df_test_for.max(),train_forecast_to_store.max()

## **3.0 CatBoostRegressor - Time Features sin-cos transformed**

In [None]:
X = df_train_new.drop(columns=["Premium Amount"])
X_test = df_test_new.drop(columns=["Premium Amount"])
y = df_train_new["Premium Amount"]

X = feature_engineering(X)
X_test = feature_engineering(X_test)

log_y = np.log1p(y)

In [None]:
cat_cols = ['Gender', 'Marital Status','Number of Dependents','Occupation','Location','Policy Type','Previous Claims', 'Insurance Duration', 'Customer Feedback',
            'Smoking Status',  'Exercise Frequency', 'Property Type', 'Start_Year', 'Education Level']

num_cols = ['Age', 'Annual Income', 'Health Score', 'Vehicle Age', 'Credit Score', 'Policy Start Date', 'Premium_time_Mean',
            "Start_Month_Sin", "Start_Month_Cos", "Start_Day_Sin", "Start_Day_Cos",
            "Annual_Income_Health_Score_Ratio", "Annual_Income_Health_Score", "Annual_Income_Credit_Score_Ratio",
            "Annual_Income_Credit_Score", "Vehicle_Age_Insurance_Duration", "Annual_Income_Previous_Claims"]

dtypes_num = {c:"float" for c in num_cols}
dtypes_cat = {c:"category" for c in cat_cols}

dtypes_all = {**dtypes_num, **dtypes_cat}

len(cat_cols+num_cols),len(dtypes_all.keys())#,len(df_train_new.columns)

X.head(3)

Unnamed: 0,Age,Gender,Annual Income,Marital Status,Number of Dependents,Education Level,Occupation,Health Score,Location,Policy Type,Previous Claims,Vehicle Age,Credit Score,Insurance Duration,Policy Start Date,Customer Feedback,Smoking Status,Exercise Frequency,Property Type,Start_Year,Premium_time_Mean,Annual_Income_Health_Score_Ratio,Annual_Income_Health_Score,Annual_Income_Credit_Score_Ratio,Annual_Income_Credit_Score,Vehicle_Age_Insurance_Duration,Annual_Income_Previous_Claims,Start_Month_Sin,Start_Month_Cos,Start_Day_Sin,Start_Day_Cos
0,19.0,1,10049.0,1.0,1.0,1,1.0,22.598761,2,2,2.0,17.0,372.0,5.0,0.955634,0.0,0,2,2,2023,1090.306893,444.670402,227094.9,27.013441,3738228.0,3.4,20098.0,-2.449294e-16,1.0,-0.998717,-0.050649
1,39.0,1,31678.0,0.0,3.0,2,0.0,15.569731,0,1,1.0,12.0,694.0,2.0,1.487141,1.0,1,1,2,2023,1106.265011,2034.588781,493217.9,45.645533,21984532.0,6.0,31678.0,1.224647e-16,-1.0,0.651372,-0.758758
2,23.0,0,25602.0,0.0,3.0,0,1.0,47.177549,1,2,1.0,14.0,605.0,3.0,1.185771,2.0,1,2,2,2023,1118.329945,542.673377,1207840.0,42.317355,15489210.0,4.666667,25602.0,-1.0,-1.83697e-16,-0.201299,0.97953
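
The `Start_*_Sin` and `Start_*_Cos` columns place a cyclic value on the unit circle so that the end of a cycle sits next to its start. The `feature_engineering` function is not shown here, so this is only a sketch under stated assumptions. The periods (12 for months, 31 for days) and the example inputs (month 12, day 23) are inferred from the first row of the `X.head(3)` output above, and `cyclical_encode` is a hypothetical helper.

```python
import math


def cyclical_encode(value, period):
    """Map a cyclic value onto the unit circle as (sin, cos)."""
    angle = 2.0 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


# Assumed inputs: month 12 with period 12, day 23 with period 31
month_sin, month_cos = cyclical_encode(12, 12)
day_sin, day_cos = cyclical_encode(23, 31)

print(month_sin, month_cos)
print(round(day_sin, 6), round(day_cos, 6))
```

Under these assumptions the results agree with the first displayed row to the printed precision (a sine of about -2.4e-16 with cosine 1.0 for the month, and roughly -0.998717 and -0.050649 for the day).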


In [None]:
cat_loc = [X.columns.get_loc(i) for i in cat_cols]

### 1.1 Optuna Optimization:

In [None]:
def objective_catboost(trial, X, y, n_splits, n_repeats, model=CatBoostRegressor, use_gpu=True):

    model_class = model

    categorical_features = cat_cols.copy()
    tot_cat = categorical_features

    numeric_features = [col for col in X.columns if col not in tot_cat]

    params = {
        'iterations': 2501,
        'learning_rate': 0.025, #trial.suggest_float('learning_rate', 0.01, 0.3, log=True),
        'depth': trial.suggest_int('depth', 5, 13),
        'l2_leaf_reg': trial.suggest_float('l2_leaf_reg', 1e-4, 1.0, log=True),
        "bootstrap_type": "Bernoulli",
        'subsample': trial.suggest_float('subsample', 0.65, 0.90, step=0.025),
        'random_strength': trial.suggest_float('random_strength', 0.0, 1.5),
        #'border_count': trial.suggest_int('border_count', 32, 255),
        'cat_features': categorical_features,
        'task_type': 'GPU' if use_gpu else 'CPU',
        'verbose': 100
    }

    kf = RepeatedKFold(n_splits=n_splits, n_repeats=n_repeats, random_state=42)
    rmsle_scores = []

    for idx_train, idx_valid in kf.split(X, y):

        # Split the data into training and validation sets for the current fold
        # (explicit copies, so the scaling below does not write into a slice of X)
        X_train, y_train = X.iloc[idx_train].copy(), y.iloc[idx_train].to_numpy().reshape(-1,1)
        X_valid, y_valid = X.iloc[idx_valid].copy(), y.iloc[idx_valid].to_numpy().reshape(-1,1)

        scaler = StandardScaler()
        X_train[num_cols] = scaler.fit_transform(X_train[num_cols])
        X_valid[num_cols] = scaler.transform(X_valid[num_cols])

        # Create the Pool objects for CatBoost
        train_pool = Pool(data=X_train, label=y_train, cat_features=categorical_features)
        valid_pool = Pool(data=X_valid, label=y_valid, cat_features=categorical_features)

        # Create the model
        model = model_class(**params, loss_function='RMSE', eval_metric='RMSE')
        # Fit the model:
        model.fit(train_pool, eval_set=valid_pool, early_stopping_rounds=101,
                  #callbacks=[optuna.integration.CatBoostPruningCallback(trial, "RMSE")]
                  )

        # Make predictions on the validation set
        y_pred = model.predict(X_valid)

        # Undo the log1p transform so the score is computed on the original scale
        y_pred = np.expm1(y_pred)
        y_valid = np.expm1(y_valid)

        # Calculate the RMSLE for the current fold
        rmsle_score = root_mean_squared_log_error(y_valid, y_pred)
        rmsle_scores.append(rmsle_score)

    # Calculate the mean RMSLE score across all folds
    mean_rmsle_score = np.mean(rmsle_scores)

    return mean_rmsle_score
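
The objective trains with an RMSE loss on the `log1p` target and then scores with RMSLE after `expm1`. The two are the same quantity: RMSLE on the original scale is RMSE measured in `log1p` space. The standard-library check below uses made-up numbers, and `rmse` and `rmsle` are hypothetical helpers written out from the metric definitions.

```python
import math


def rmse(a, b):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error on the original scale."""
    return math.sqrt(
        sum((math.log1p(p) - math.log1p(t)) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    )


y_true = [100.0, 250.0, 1200.0]  # illustrative premiums
y_pred = [120.0, 200.0, 1000.0]  # illustrative predictions

# What the model sees: targets and predictions in log1p space
log_true = [math.log1p(v) for v in y_true]
log_pred = [math.log1p(v) for v in y_pred]
rmse_log = rmse(log_true, log_pred)

# What the objective reports: back-transform, then RMSLE
rmsle_orig = rmsle([math.expm1(v) for v in log_true], [math.expm1(v) for v in log_pred])

print(rmse_log, rmsle_orig)
```

This is why the `test:` values in the CatBoost logs and the final trial values live on the same scale.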

In [None]:
# Step 2: Tuning Hyperparameters with Optuna
def tune_hyperparameters(X, y, model_class, n_trials, n_splits_ ,n_repeats_, use_gpu=True):  #use_gpu
    study = optuna.create_study(direction='minimize', sampler=optuna.samplers.TPESampler(), pruner=optuna.pruners.MedianPruner(n_warmup_steps=50))
    study.optimize(lambda trial: objective_catboost(trial, X, y, n_splits=n_splits_, n_repeats=n_repeats_, model=model_class, use_gpu=use_gpu), n_trials=n_trials)
    return study  # Return the study object

# Step 3: Saving Best Results and Models
def save_results(study, model_class, model_name):
    best_params_file = f"{model_name}_best_params.joblib"
    joblib.dump(study.best_params, best_params_file)
    print(f"Best parameters for {model_name} saved to {best_params_file}")

    verbose_file = f"{model_name}_optuna_verbose.log"
    with open(verbose_file, "w") as f:
        f.write(str(study.trials))
    print(f"Optuna verbose for {model_name} saved to {verbose_file}")
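
The search space in `objective_catboost` draws `l2_leaf_reg` between 1e-4 and 1.0 with `log=True`, which treats the range in the log domain so that each decade gets comparable attention. The rough illustration below shows what that means for plain random draws, using only the standard library. `sample_log_uniform` is a hypothetical helper, and the study's TPE sampler is model-based rather than purely random, so this shows the shape of the search axis and not Optuna's internals.

```python
import math
import random


def sample_log_uniform(rng, low, high):
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)
draws = [sample_log_uniform(rng, 1e-4, 1.0) for _ in range(1000)]

# 1e-2 is the geometric midpoint of [1e-4, 1.0]; about half the draws should fall below it
below_geometric_mid = sum(d < 1e-2 for d in draws)
print(below_geometric_mid, min(draws), max(draws))
```

A linear-uniform draw over the same interval would put roughly 99% of the samples above 1e-2 and barely explore the small regularisation values.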

In [None]:
# usage with CatBoostRegressor
cat_study = tune_hyperparameters(X, log_y, model_class=CatBoostRegressor, n_trials=101, n_splits_ = 3 ,n_repeats_=3, use_gpu=False)
save_results(cat_study, CatBoostRegressor, "CatBoost_ext_v0")
cat_params = cat_study.best_params

[I 2024-12-12 22:54:00,470] A new study created in memory with name: no-name-8179aabd-6c64-47df-9a09-bfdb29b50964


0:	learn: 1.0942844	test: 1.0959495	best: 1.0959495 (0)	total: 1.16s	remaining: 48m 27s
100:	learn: 1.0718070	test: 1.0741431	best: 1.0741431 (100)	total: 1m 21s	remaining: 32m 9s
200:	learn: 1.0693650	test: 1.0731046	best: 1.0731028 (199)	total: 2m 40s	remaining: 30m 37s
300:	learn: 1.0674306	test: 1.0727124	best: 1.0727124 (300)	total: 4m 1s	remaining: 29m 21s
400:	learn: 1.0656057	test: 1.0723403	best: 1.0723403 (400)	total: 5m 17s	remaining: 27m 40s
500:	learn: 1.0636513	test: 1.0721095	best: 1.0721091 (496)	total: 6m 37s	remaining: 26m 28s
600:	learn: 1.0611861	test: 1.0717924	best: 1.0717900 (593)	total: 8m 2s	remaining: 25m 26s
700:	learn: 1.0587513	test: 1.0714787	best: 1.0714787 (700)	total: 9m 34s	remaining: 24m 34s
800:	learn: 1.0562398	test: 1.0712177	best: 1.0712113 (795)	total: 11m 11s	remaining: 23m 44s
900:	learn: 1.0537104	test: 1.0710781	best: 1.0710762 (898)	total: 12m 46s	remaining: 22m 41s
1000:	learn: 1.0512454	test: 1.0709996	best: 1.0709986 (993)	total: 14m 6s	r

In [None]:
print(cat_params)

In [None]:
trial = cat_study.best_trial
print('RMSLE: {}'.format(trial.value))  # the objective returns the mean RMSLE across folds
print("Best hyperparameters: {}".format(trial.params))

* Experiment 1:

    - RMSLE: 1.0685007914142697
    - Best hyperparameters: {'num_leaves': 111, 'min_child_samples': 35, 'reg_alpha': 005448690915044739, 'reg_lambda': 0.016061276667668913, 'max_depth': 12, 'colsample_bytree': 0.8500000000000001, 'subsample': 0.75}

* Experiment 2:

    - RMSLE: 1.068389091315945
    - Best hyperparameters: {'reg_alpha': 0.4932, 'reg_lambda': 0.002076, 'max_depth': 21, 'colsample_bytree': 0.825, 'subsample': 0.75,  'num_leaves': 101, 'n_estimators': 2501,'learning_rate': 0.01, 'min_child_samples':  35}    

In [None]:
fig = optuna.visualization.plot_optimization_history(cat_study)
fig.show()

In [None]:
fig = optuna.visualization.plot_param_importances(cat_study)
fig.show()

In [None]:
#del xgb_study
gc.collect()

### 1.2 Fit Best Model:

**PARAMETERS**

In [None]:
# Define a common random seed for reproducibility
RANDOM_SEED = 42
N_ESTIMATORS = 3000  # Number of estimators for the ensemble models
n_splits = 3
n_repeats = 3

params ={'reg_alpha': 0.4932, 'reg_lambda': 0.002076, 'max_depth': 21, 'colsample_bytree': 0.825, 'subsample': 0.75, 'num_leaves': 101, 'n_estimators': 2501, 'learning_rate': 0.01,
         'min_child_samples':35, 'random_state': 42, 'force_col_wise':True, 'verbose':0}

cat_cols = ['Gender', 'Marital Status','Number of Dependents','Occupation','Location','Policy Type','Previous Claims', 'Insurance Duration', 'Customer Feedback',
            'Smoking Status',  'Exercise Frequency', 'Property Type', 'Start_Year', 'Start_Month', 'Start_Day','Education Level']

num_cols = ['Age', 'Annual Income', 'Health Score', 'Vehicle Age', 'Credit Score', 'Policy Start Date', 'Premium_time_Mean']

cat_loc = [X.columns.get_loc(i) for i in cat_cols]

**DATA**

In [None]:
df_subm_stack = df_subm.copy()

X = df_train_new.drop(columns=["Premium Amount"])
X_test = df_test_new.drop(columns=["Premium Amount"])
y = df_train_new["Premium Amount"]

log_y = np.log1p(y)

**FIT THE MODEL**

In [None]:
kf = RepeatedKFold(n_splits=n_splits, n_repeats=n_repeats, random_state=42)
rmsle = []

# Initialize the Stack
df_subm_stack['Premium Amount'] = 0.0

i=0

oof_results_stack = pd.DataFrame(columns=list(range(n_splits*n_repeats)), index=X.index)

for idx_train, idx_valid in kf.split(X, log_y):

    print(f"Working on Fold {i}")

    # Split the data into training and validation sets for the current fold
    # (explicit copies, so the scaling below does not write into a slice of X)
    X_train, y_train = X.iloc[idx_train].copy(), log_y.iloc[idx_train].to_numpy().reshape(-1,1)
    X_valid, y_valid = X.iloc[idx_valid].copy(), log_y.iloc[idx_valid].to_numpy().reshape(-1,1)
    X_test_ = X_test.copy()

    scaler = StandardScaler()
    X_train[num_cols] = scaler.fit_transform(X_train[num_cols])
    X_valid[num_cols] = scaler.transform(X_valid[num_cols])
    X_test_[num_cols] = scaler.transform(X_test[num_cols])


    X_train = X_train.to_numpy()
    X_valid = X_valid.to_numpy()
    X_test_ = X_test_.to_numpy()


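    # Fold-level cache switch: with n_splits*n_repeats = 9 folds, i runs from 0 to 8,
    # so `i >= 9` is never true and every fold loads its saved model. Lower the threshold to retrain.
    if i >= 9:
        # Fit the LightGBM model and persist it
        LGBM_model = train_lgbm(params, X_train, y_train, cat_loc, use_gpu=False, X_val=X_valid, y_val=y_valid, es=101)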
        obj = sio.dump(LGBM_model, f"/content/drive/MyDrive/Exercises/Studies_Structured_Data/Models/S4E12/LGBM_base_{i}.skops")

    else:
        unknown_types = sio.get_untrusted_types(file=f"/content/drive/MyDrive/Exercises/Studies_Structured_Data/Models/S4E12/LGBM_base_{i}.skops")
        LGBM_model = sio.load(f"/content/drive/MyDrive/Exercises/Studies_Structured_Data/Models/S4E12/LGBM_base_{i}.skops", trusted=unknown_types)

    # Predictions are in log1p space; invert with expm1 to match log_y = np.log1p(y)
    stack_preds = np.expm1(LGBM_model.predict(X_valid))

    oof_results_stack.iloc[idx_valid,i] = stack_preds.flatten()
    # Score the fold on the original scale
    error = root_mean_squared_log_error(np.expm1(y_valid), stack_preds)

    rmsle.append(error)
    print(f"RMSLE fold {i}: {error}")

    # Average the test predictions across all n_splits*n_repeats folds
    df_subm_stack['Premium Amount'] += np.expm1(LGBM_model.predict(X_test_)) / (n_splits*n_repeats)
    i += 1

In [None]:
np.mean(rmsle), np.std(rmsle)

### **1.3 Save Results:**

In [None]:
(df_test_for,train_forecast_to_store) = store_results(for_test=df_subm_stack,for_train=oof_results_stack, model="LGBM", experiment=0)

## **3.0 StackedRegressor - Time Features sin-cos transformed**

In [None]:
X = df_train_new.drop(columns=["Premium Amount"])
X_test = df_test_new.drop(columns=["Premium Amount"])
y = df_train_new["Premium Amount"]

X = feature_engineering(X)
X_test = feature_engineering(X_test)

log_y = np.log1p(y)

In [None]:
cat_cols = ['Gender', 'Marital Status','Number of Dependents','Occupation','Location','Policy Type','Previous Claims', 'Insurance Duration', 'Customer Feedback',
            'Smoking Status',  'Exercise Frequency', 'Property Type', 'Start_Year', 'Start_Month', 'Start_Day','Education Level']

num_cols = ['Age', 'Annual Income', 'Health Score', 'Vehicle Age', 'Credit Score', 'Policy Start Date', 'Premium_time_Mean']

dtypes_num = {c:"float" for c in num_cols}
dtypes_cat = {c:"category" for c in cat_cols}

dtypes_all = {**dtypes_num, **dtypes_cat}

len(cat_cols+num_cols),len(dtypes_all.keys())#,len(df_train_new.columns)

X.head(3)

Unnamed: 0,Age,Gender,Annual Income,Marital Status,Number of Dependents,Education Level,Occupation,Health Score,Location,Policy Type,Previous Claims,Vehicle Age,Credit Score,Insurance Duration,Policy Start Date,Customer Feedback,Smoking Status,Exercise Frequency,Property Type,Start_Year,Premium_time_Mean,Annual_Income_Health_Score_Ratio,Annual_Income_Health_Score,Annual_Income_Credit_Score_Ratio,Annual_Income_Credit_Score,Vehicle_Age_Insurance_Duration,Annual_Income_Previous_Claims,Start_Month_Sin,Start_Month_Cos,Start_Day_Sin,Start_Day_Cos
0,19.0,1,10049.0,1.0,1.0,1,1.0,22.598761,2,2,2.0,17.0,372.0,5.0,0.955634,0.0,0,2,2,2023,1090.306893,444.670402,227094.9,27.013441,3738228.0,3.4,20098.0,-2.449294e-16,1.0,-0.998717,-0.050649
1,39.0,1,31678.0,0.0,3.0,2,0.0,15.569731,0,1,1.0,12.0,694.0,2.0,1.487141,1.0,1,1,2,2023,1106.265011,2034.588781,493217.9,45.645533,21984532.0,6.0,31678.0,1.224647e-16,-1.0,0.651372,-0.758758
2,23.0,0,25602.0,0.0,3.0,0,1.0,47.177549,1,2,1.0,14.0,605.0,3.0,1.185771,2.0,1,2,2,2023,1118.329945,542.673377,1207840.0,42.317355,15489210.0,4.666667,25602.0,-1.0,-1.83697e-16,-0.201299,0.97953


In [None]:
cat_loc = [X.columns.get_loc(i) for i in cat_cols]

### 1.1 Optuna Optimization:

In [None]:
def objective_catboost(trial, X, y, n_splits, n_repeats, model=CatBoostRegressor, use_gpu=True):

    model_class = model

    categorical_features = cat_cols.copy()
    tot_cat = categorical_features

    numeric_features = [col for col in X.columns if col not in tot_cat]

    params = {
        'iterations': 2501,
        'learning_rate': 0.025, #trial.suggest_float('learning_rate', 0.01, 0.3, log=True),
        'depth': trial.suggest_int('depth', 5, 13),
        'l2_leaf_reg': trial.suggest_float('l2_leaf_reg', 1e-4, 1.0, log=True),
        "bootstrap_type": "Bernoulli",
        'subsample': trial.suggest_float('subsample', 0.65, 0.90, step=0.025),
        'random_strength': trial.suggest_float('random_strength', 0.0, 1.5),
        #'border_count': trial.suggest_int('border_count', 32, 255),
        'cat_features': categorical_features,
        'task_type': 'GPU' if use_gpu else 'CPU',
        'verbose': 100
    }

    kf = RepeatedKFold(n_splits=n_splits, n_repeats=n_repeats, random_state=42)
    rmsle_scores = []

    for idx_train, idx_valid in kf.split(X, y):

        # Split the data into training and validation sets for the current fold
        # (explicit copies, so the scaling below does not write into a slice of X)
        X_train, y_train = X.iloc[idx_train].copy(), y.iloc[idx_train].to_numpy().reshape(-1,1)
        X_valid, y_valid = X.iloc[idx_valid].copy(), y.iloc[idx_valid].to_numpy().reshape(-1,1)

        scaler = StandardScaler()
        X_train[num_cols] = scaler.fit_transform(X_train[num_cols])
        X_valid[num_cols] = scaler.transform(X_valid[num_cols])

        # Create the Pool objects for CatBoost
        train_pool = Pool(data=X_train, label=y_train, cat_features=categorical_features)
        valid_pool = Pool(data=X_valid, label=y_valid, cat_features=categorical_features)

        # Create the model
        model = model_class(**params, loss_function='RMSE', eval_metric='RMSE')
        # Fit the model:
        model.fit(train_pool, eval_set=valid_pool, early_stopping_rounds=101,
                  #callbacks=[optuna.integration.CatBoostPruningCallback(trial, "RMSE")]
                  )

        # Make predictions on the validation set
        y_pred = model.predict(X_valid)

        # Undo the log1p transform so the score is computed on the original scale
        y_pred = np.expm1(y_pred)
        y_valid = np.expm1(y_valid)

        # Calculate the RMSLE for the current fold
        rmsle_score = root_mean_squared_log_error(y_valid, y_pred)
        rmsle_scores.append(rmsle_score)

    # Calculate the mean RMSLE score across all folds
    mean_rmsle_score = np.mean(rmsle_scores)

    return mean_rmsle_score

In [None]:
# Step 2: Tuning Hyperparameters with Optuna
def tune_hyperparameters(X, y, model_class, n_trials, n_splits_ ,n_repeats_, use_gpu=True):  #use_gpu
    study = optuna.create_study(direction='minimize', sampler=optuna.samplers.TPESampler(), pruner=optuna.pruners.MedianPruner(n_warmup_steps=50))
    study.optimize(lambda trial: objective_catboost(trial, X, y, n_splits=n_splits_, n_repeats=n_repeats_, model=model_class, use_gpu=use_gpu), n_trials=n_trials)
    return study  # Return the study object

# Step 3: Saving Best Results and Models
def save_results(study, model_class, model_name):
    best_params_file = f"{model_name}_best_params.joblib"
    joblib.dump(study.best_params, best_params_file)
    print(f"Best parameters for {model_name} saved to {best_params_file}")

    verbose_file = f"{model_name}_optuna_verbose.log"
    with open(verbose_file, "w") as f:
        f.write(str(study.trials))
    print(f"Optuna verbose for {model_name} saved to {verbose_file}")

In [None]:
# usage with CatBoostRegressor
cat_study = tune_hyperparameters(X, log_y, model_class=CatBoostRegressor, n_trials=101, n_splits_ = 3 ,n_repeats_=3, use_gpu=False)
save_results(cat_study, CatBoostRegressor, "CatBoost_ext_v0")
cat_params = cat_study.best_params

[I 2024-12-12 22:54:00,470] A new study created in memory with name: no-name-8179aabd-6c64-47df-9a09-bfdb29b50964


0:	learn: 1.0942844	test: 1.0959495	best: 1.0959495 (0)	total: 1.16s	remaining: 48m 27s
100:	learn: 1.0718070	test: 1.0741431	best: 1.0741431 (100)	total: 1m 21s	remaining: 32m 9s
200:	learn: 1.0693650	test: 1.0731046	best: 1.0731028 (199)	total: 2m 40s	remaining: 30m 37s
300:	learn: 1.0674306	test: 1.0727124	best: 1.0727124 (300)	total: 4m 1s	remaining: 29m 21s
400:	learn: 1.0656057	test: 1.0723403	best: 1.0723403 (400)	total: 5m 17s	remaining: 27m 40s
500:	learn: 1.0636513	test: 1.0721095	best: 1.0721091 (496)	total: 6m 37s	remaining: 26m 28s
600:	learn: 1.0611861	test: 1.0717924	best: 1.0717900 (593)	total: 8m 2s	remaining: 25m 26s
700:	learn: 1.0587513	test: 1.0714787	best: 1.0714787 (700)	total: 9m 34s	remaining: 24m 34s
800:	learn: 1.0562398	test: 1.0712177	best: 1.0712113 (795)	total: 11m 11s	remaining: 23m 44s
900:	learn: 1.0537104	test: 1.0710781	best: 1.0710762 (898)	total: 12m 46s	remaining: 22m 41s
1000:	learn: 1.0512454	test: 1.0709996	best: 1.0709986 (993)	total: 14m 6s	r

In [None]:
print(cat_params)

In [None]:
trial = cat_study.best_trial
print('RMSLE: {}'.format(trial.value))  # the objective returns the mean RMSLE across folds
print("Best hyperparameters: {}".format(trial.params))

* Experiment 1:

    - RMSLE: 1.0685007914142697
    - Best hyperparameters: {'num_leaves': 111, 'min_child_samples': 35, 'reg_alpha': 005448690915044739, 'reg_lambda': 0.016061276667668913, 'max_depth': 12, 'colsample_bytree': 0.8500000000000001, 'subsample': 0.75}

* Experiment 2:

    - RMSLE: 1.068389091315945
    - Best hyperparameters: {'reg_alpha': 0.4932, 'reg_lambda': 0.002076, 'max_depth': 21, 'colsample_bytree': 0.825, 'subsample': 0.75,  'num_leaves': 101, 'n_estimators': 2501,'learning_rate': 0.01, 'min_child_samples':  35}    

In [None]:
fig = optuna.visualization.plot_optimization_history(cat_study)
fig.show()

In [None]:
fig = optuna.visualization.plot_param_importances(cat_study)
fig.show()

In [None]:
#del xgb_study
gc.collect()

### 1.2 Fit Best Model:

**PARAMETERS**

In [None]:
# Define a common random seed for reproducibility
RANDOM_SEED = 42
N_ESTIMATORS = 3000  # Number of estimators for the ensemble models
n_splits = 3
n_repeats = 3

params_lgbm ={'reg_alpha': 0.4932, 'reg_lambda': 0.002076, 'max_depth': 21, 'colsample_bytree': 0.825,
              'subsample': 0.75, 'num_leaves': 101, 'n_estimators': 950, 'learning_rate': 0.01,
              'min_child_samples':35, 'random_state': 42, 'force_col_wise':True, 'verbose':0}

cat_cols = ['Gender', 'Marital Status','Number of Dependents','Occupation','Location','Policy Type','Previous Claims', 'Insurance Duration', 'Customer Feedback',
            'Smoking Status',  'Exercise Frequency', 'Property Type', 'Start_Year', 'Start_Month', 'Start_Day','Education Level']

num_cols = ['Age', 'Annual Income', 'Health Score', 'Vehicle Age', 'Credit Score', 'Policy Start Date', 'Premium_time_Mean']

cat_loc = [X.columns.get_loc(i) for i in cat_cols]

**DATA**

In [None]:
df_subm_stack = df_subm.copy()

X = df_train_new.drop(columns=["Premium Amount"])
X_test = df_test_new.drop(columns=["Premium Amount"])
y = df_train_new["Premium Amount"]

log_y = np.log1p(y)

**FIT THE MODEL**

In [None]:
cv = RepeatedKFold(n_splits=n_splits, n_repeats=n_repeats, random_state=42)
rmsle = []

# Initialize the Stack
df_subm_stack['Premium Amount'] = 0.0

i=0

oof_results_stack = pd.DataFrame(columns=list(range(n_repeats*n_splits)), index=X.index)

for idx_train, idx_valid in cv.split(X, log_y):

    print(f"Fold {i}")

    # Split the data into training and validation sets for the current fold.
    # The target is log_y, so the np.expm1 calls below return predictions to the original scale.
    X_train, y_train = X.iloc[idx_train].to_numpy(), log_y.iloc[idx_train].to_numpy()
    X_valid, y_valid = X.iloc[idx_valid].to_numpy(), log_y.iloc[idx_valid].to_numpy()

    # Define base estimators with random seed and number of estimators
    estimators = [
        ('lgbm', LGBMRegressor(objective="root_mean_squared_error", metric="rmse",boosting_type='gbdt', categorical_feature=cat_loc, **params_lgbm)),
        ('xgb', XGBRegressor(random_state=RANDOM_SEED, n_estimators=N_ESTIMATORS, colsample_bytree = 0.95, subsample= 0.90, learning_rate=0.015)),
        ('catboost', CatBoostRegressor(random_seed=RANDOM_SEED, iterations=N_ESTIMATORS, subsample=0.9, learning_rate=0.015))
    ]

    meta_model = Ridge(alpha=0.1, positive=True)

    # Create StackingRegressor
    stacking_model = StackingRegressor(estimators=estimators, final_estimator=meta_model)
    # `i >= 0` always holds, so every fold is refit; raise the threshold to reload saved folds instead
    if i >= 0:
        # Fit the StackingRegressor
        stacking_model.fit(X_train, y_train)
        obj = sio.dump(stacking_model, f"/content/drive/MyDrive/Exercises/Studies_Structured_Data/Models/S4E12/staked_v0_{i}.skops")

    else:
        unknown_types = sio.get_untrusted_types(file=f"/content/drive/MyDrive/Exercises/Studies_Structured_Data/Models/S4E12/staked_v0_{i}.skops")
        stacking_model = sio.load(f"/content/drive/MyDrive/Exercises/Studies_Structured_Data/Models/S4E12/staked_v0_{i}.skops", trusted=unknown_types)


    # Predictions are in log1p space; invert with expm1 to match log_y = np.log1p(y)
    stack_preds = np.expm1(stacking_model.predict(X_valid))

    oof_results_stack.iloc[idx_valid,i] = stack_preds.flatten()
    # Score the fold on the original scale
    error = root_mean_squared_log_error(np.expm1(y_valid), stack_preds)

    rmsle.append(error)
    print(f"RMSLE fold {i}: {error}")

    # Average the test predictions across all n_splits*n_repeats folds
    df_subm_stack['Premium Amount'] += np.expm1(stacking_model.predict(X_test.to_numpy())) / (n_splits*n_repeats)
    i += 1

In [None]:
np.mean(rmsle), np.std(rmsle)
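
The running sum over test predictions is a mean only when the divisor equals the number of folds, which is `n_splits * n_repeats = 9` with the settings above. A divisor that does not match the fold count (say 5 when there are 9 folds) scales every prediction by a constant factor. The standard-library sketch below uses made-up predictions, and `accumulate` is a hypothetical helper mirroring the `+=` pattern in the loop.

```python
n_splits, n_repeats = 3, 3
n_folds = n_splits * n_repeats

# Made-up test predictions for two rows, one list per fold
fold_predictions = [[1000.0 + f, 1100.0 + 2 * f] for f in range(n_folds)]


def accumulate(preds_per_fold, divisor):
    """Add each fold's predictions divided by `divisor`, as the loop does with +=."""
    total = [0.0] * len(preds_per_fold[0])
    for preds in preds_per_fold:
        for j, p in enumerate(preds):
            total[j] += p / divisor
    return total


true_mean = [sum(col) / n_folds for col in zip(*fold_predictions)]
with_fold_count = accumulate(fold_predictions, n_folds)
with_five = accumulate(fold_predictions, 5)

print(true_mean, with_fold_count, with_five)
```

With nine folds and a divisor of 5 the result is 9/5 = 1.8 times the true average, which would carry straight into the submission file.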

### **1.3 Save Results:**

In [None]:
(df_test_for,train_forecast_to_store) = store_results(for_test=df_subm_stack,for_train=oof_results_stack, model="LGBM", experiment=0)

In [None]:
df_test_for.max(),train_forecast_to_store.max()