# End-to-end ML miniproject

We will work through a sample data science problem, the Kaggle dataset https://www.kaggle.com/datasets/anmolkumar/health-insurance-cross-sell-prediction, step by step:

- Establishing the problem statement and choosing a suitable metric
- Exploring the dataset
- Building a data processing pipeline with pandas and scikit-learn
- Trying different ML algorithms for tabular data, such as random forests and gradient boosting
- Tuning hyperparameters to optimize the chosen metric under cross-validation
- Presenting the results to the wider community (skipped here, since the problem is somewhat artificial and not very interesting in itself)

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/health-insurance-cross-sell-prediction/sample_submission.csv
/kaggle/input/health-insurance-cross-sell-prediction/train.csv
/kaggle/input/health-insurance-cross-sell-prediction/test.csv


In [2]:
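# Build the paths explicitly instead of relying on the leftover loop variable `dirname`
data_dir = '/kaggle/input/health-insurance-cross-sell-prediction'
train_set = pd.read_csv(os.path.join(data_dir, 'train.csv'))
test_set = pd.read_csv(os.path.join(data_dir, 'test.csv'))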

## Problem statement
Our goal is to predict the probability that a customer will buy car insurance, given some details about them. This is a binary classification task with `Response` as the target.

Because the target is imbalanced (about 12% of the responses are positive), accuracy alone is not a good performance indicator, so the choice of metric needs care. Standard choices are the ROC AUC score, which is independent of the decision threshold, and the F1 score, which can often be improved by moving the threshold from 0.5 to something lower.
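To make the point about accuracy concrete, here is a minimal sketch on made-up labels with roughly the same 12% positive rate. The labels and the `scores` array are synthetic assumptions chosen for illustration; nothing here is computed from the insurance data.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

rng = np.random.default_rng(0)

# Synthetic labels: 120 positives out of 1000 (about 12%).
y_true = np.zeros(1000, dtype=int)
y_true[:120] = 1

# A "model" that always predicts the majority class.
y_majority = np.zeros_like(y_true)
acc_majority = accuracy_score(y_true, y_majority)            # high, yet useless
f1_majority = f1_score(y_true, y_majority, zero_division=0)  # no positive is found
auc_majority = roc_auc_score(y_true, np.zeros(len(y_true)))  # constant scores cannot rank

# Hypothetical scores: positives tend to score higher, but rarely above 0.5.
scores = np.clip(rng.normal(0.2 + 0.2 * y_true, 0.1), 0.0, 1.0)

# F1 depends on the decision threshold, so we sweep it.
thresholds = np.linspace(0.05, 0.95, 19)
f1_by_threshold = [
    f1_score(y_true, (scores >= t).astype(int), zero_division=0)
    for t in thresholds
]
best_threshold = thresholds[int(np.argmax(f1_by_threshold))]
f1_at_half = f1_score(y_true, (scores >= 0.5).astype(int), zero_division=0)

print(f"majority: acc={acc_majority:.2f} f1={f1_majority:.2f} auc={auc_majority:.2f}")
print(f"F1 at 0.5: {f1_at_half:.3f}, best F1: {max(f1_by_threshold):.3f} "
      f"at threshold {best_threshold:.2f}")
```

The always-negative model reaches 88% accuracy without finding a single buyer, while its F1 is 0 and its ROC AUC is 0.5. The sweep also shows why a threshold below 0.5 can help F1 when the scores of the rare class are low.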


## EDA and data cleaning


In [3]:
train_set.head()

Unnamed: 0,id,Gender,Age,Driving_License,Region_Code,Previously_Insured,Vehicle_Age,Vehicle_Damage,Annual_Premium,Policy_Sales_Channel,Vintage,Response
0,1,Male,44,1,28.0,0,> 2 Years,Yes,40454.0,26.0,217,1
1,2,Male,76,1,3.0,0,1-2 Year,No,33536.0,26.0,183,0
2,3,Male,47,1,28.0,0,> 2 Years,Yes,38294.0,26.0,27,1
3,4,Male,21,1,11.0,1,< 1 Year,No,28619.0,152.0,203,0
4,5,Female,29,1,41.0,1,< 1 Year,No,27496.0,152.0,39,0


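Gender, Vehicle_Age and Vehicle_Damage are categorical features that take only a few values, so they are easy to encode numerically, for example with one-hot encoding. The pipeline below uses an ordinal encoder with its default category order instead. Another possibility, which I do not pursue here, is to encode Vehicle_Age with an explicitly ordered ordinal encoder (0 for < 1 Year, 1 for 1-2 Year, 2 for > 2 Years).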
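As a sketch of the ordered encoding mentioned above: scikit-learn's `OrdinalEncoder` sorts string categories by default, which does not match the natural age order, but it accepts an explicit `categories` list. The three labels are the ones visible in `train_set.head()`; the tiny frame itself is made up for the example.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# The three Vehicle_Age labels as they appear in train_set.head().
vehicle_age = pd.DataFrame({"Vehicle_Age": ["> 2 Years", "1-2 Year", "< 1 Year"]})

# Default behaviour: categories are sorted as strings, which is not the age order.
default_encoder = OrdinalEncoder()
default_codes = default_encoder.fit_transform(vehicle_age).ravel()

# Explicit order: youngest vehicle -> 0, oldest vehicle -> 2.
age_order = [["< 1 Year", "1-2 Year", "> 2 Years"]]
ordered_encoder = OrdinalEncoder(categories=age_order)
ordered_codes = ordered_encoder.fit_transform(vehicle_age).ravel()

print(list(default_encoder.categories_[0]))
print(default_codes, ordered_codes)
```

With the default order, "1-2 Year" gets code 0 and "< 1 Year" gets code 1. Tree models can still split on such codes, but the explicit order keeps the feature monotone in vehicle age.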

In [4]:
train_set.describe()

Unnamed: 0,id,Age,Driving_License,Region_Code,Previously_Insured,Annual_Premium,Policy_Sales_Channel,Vintage,Response
count,381109.0,381109.0,381109.0,381109.0,381109.0,381109.0,381109.0,381109.0,381109.0
mean,190555.0,38.822584,0.997869,26.388807,0.45821,30564.389581,112.034295,154.347397,0.122563
std,110016.836208,15.511611,0.04611,13.229888,0.498251,17213.155057,54.203995,83.671304,0.327936
min,1.0,20.0,0.0,0.0,0.0,2630.0,1.0,10.0,0.0
25%,95278.0,25.0,1.0,15.0,0.0,24405.0,29.0,82.0,0.0
50%,190555.0,36.0,1.0,28.0,0.0,31669.0,133.0,154.0,0.0
75%,285832.0,49.0,1.0,35.0,1.0,39400.0,152.0,227.0,0.0
max,381109.0,85.0,1.0,52.0,1.0,540165.0,163.0,299.0,1.0


`id` carries no information, so we will drop it. The `Driving_License` mean is about 0.998, so almost every customer holds a driver's license; the firm has probably never sold car insurance to someone without one. We could therefore remove the few instances without a license as likely database errors, although the cleaning step below only drops `id`.
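A minimal sketch of what removing the unlicensed rows could look like. The toy frame and the helper name `drop_unlicensed` are assumptions made for this example; only the column names come from the dataset.

```python
import pandas as pd

# Toy frame with column names from the dataset; the values are made up.
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "Driving_License": [1, 1, 0, 1],
    "Response": [1, 0, 0, 0],
})

def drop_unlicensed(df):
    """Return a copy without the `id` column and without unlicensed customers."""
    licensed = df[df["Driving_License"] == 1]
    return licensed.drop(columns=["id"]).reset_index(drop=True)

# Before dropping rows, it is worth counting how many would be lost.
n_unlicensed = int((toy["Driving_License"] == 0).sum())
cleaned = drop_unlicensed(toy)
print(n_unlicensed, cleaned.shape)
```

On the real data, a mean of 0.997869 over 381109 rows corresponds to roughly 800 unlicensed customers. After such a filter `Driving_License` is constant, so the column itself could be dropped as well.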

In [5]:
# No column has missing values (see the non-null counts), so no imputer is needed
train_set.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 381109 entries, 0 to 381108
Data columns (total 12 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   id                    381109 non-null  int64  
 1   Gender                381109 non-null  object 
 2   Age                   381109 non-null  int64  
 3   Driving_License       381109 non-null  int64  
 4   Region_Code           381109 non-null  float64
 5   Previously_Insured    381109 non-null  int64  
 6   Vehicle_Age           381109 non-null  object 
 7   Vehicle_Damage        381109 non-null  object 
 8   Annual_Premium        381109 non-null  float64
 9   Policy_Sales_Channel  381109 non-null  float64
 10  Vintage               381109 non-null  int64  
 11  Response              381109 non-null  int64  
dtypes: float64(3), int64(6), object(3)
memory usage: 34.9+ MB


In [6]:
def cleaning(train_set):
    train = train_set.drop('id', axis=1)
    return train

train_copy = cleaning(train_set)
y = train_copy.Response
X = train_copy.drop('Response', axis=1)

## Data Preprocessing

In [7]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, MinMaxScaler


num_att = ['Age', 'Driving_License', 'Region_Code', 'Previously_Insured', 
           'Annual_Premium', 'Policy_Sales_Channel', 'Vintage']
cat_att = ['Gender', 'Vehicle_Age', 'Vehicle_Damage']

cat_encoder = ColumnTransformer([("cat", OrdinalEncoder(), cat_att)], 
                                remainder='passthrough')

engineering_pipeline = Pipeline([
    ("cat_encoder", cat_encoder),
    ("scaler", MinMaxScaler())
])
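To see what this pipeline produces, here is a small sketch that runs the same two steps on the five rows shown by `train_set.head()` (without `id` and `Response`). The `column_order` list is my own bookkeeping, written under the assumption that `ColumnTransformer` puts the encoded columns first and the passthrough columns after them in their original order.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OrdinalEncoder

# The five rows from train_set.head(), without `id` and `Response`.
sample = pd.DataFrame({
    "Gender": ["Male", "Male", "Male", "Male", "Female"],
    "Age": [44, 76, 47, 21, 29],
    "Driving_License": [1, 1, 1, 1, 1],
    "Region_Code": [28.0, 3.0, 28.0, 11.0, 41.0],
    "Previously_Insured": [0, 0, 0, 1, 1],
    "Vehicle_Age": ["> 2 Years", "1-2 Year", "> 2 Years", "< 1 Year", "< 1 Year"],
    "Vehicle_Damage": ["Yes", "No", "Yes", "No", "No"],
    "Annual_Premium": [40454.0, 33536.0, 38294.0, 28619.0, 27496.0],
    "Policy_Sales_Channel": [26.0, 26.0, 26.0, 152.0, 152.0],
    "Vintage": [217, 183, 27, 203, 39],
})
cat_att = ["Gender", "Vehicle_Age", "Vehicle_Damage"]

cat_encoder = ColumnTransformer([("cat", OrdinalEncoder(), cat_att)],
                                remainder="passthrough")
pipeline = Pipeline([("cat_encoder", cat_encoder), ("scaler", MinMaxScaler())])
transformed = pipeline.fit_transform(sample)

# Encoded categorical columns come first, then the passthrough columns.
column_order = cat_att + [c for c in sample.columns if c not in cat_att]
print(transformed.shape)
print(pd.DataFrame(transformed, columns=column_order).round(2))
```

The result is a plain NumPy array with 10 columns scaled to [0, 1], so the column names are lost and the order has to be tracked by hand. Scaling does not matter much for tree models, but it does affect distance-based steps such as nearest-neighbour oversampling.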


In [8]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV

# To avoid data leakage, fit the engineering pipeline on the training split only
_X_train, _X_val, y_train, y_val = train_test_split(X, y, random_state=123)
X_train = engineering_pipeline.fit_transform(_X_train)
X_val = engineering_pipeline.transform(_X_val)
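One consequence of fitting on the training split only is that validation values can fall outside the [0, 1] range. The sketch below illustrates this with `Annual_Premium` values borrowed from the `describe()` output; which values end up in which split is a hypothetical assumption for the example.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical split: the extreme premium happens to land in the validation part.
train_premium = np.array([[2630.0], [24405.0], [39400.0]])
val_premium = np.array([[31669.0], [540165.0]])

# The scaler learns min and max from the training values only.
scaler = MinMaxScaler().fit(train_premium)
train_scaled = scaler.transform(train_premium)
val_scaled = scaler.transform(val_premium)

print(train_scaled.ravel())
print(val_scaled.ravel())  # the second value is far above 1
```

This is expected rather than a bug: the validation set plays the role of unseen data, so its statistics must not influence the preprocessing.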

We will apply a standard oversampling technique, SMOTE, from the imbalanced-learn library, to the training split only.

In [9]:
from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=3301)

Xt_smoted, yt_smoted = smote.fit_resample(X_train, y_train)

print(Xt_smoted.shape)

(501466, 10)
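For intuition, here is a simplified NumPy sketch of the idea behind SMOTE: each synthetic row is a random point on the segment between a minority sample and one of its nearest minority neighbours. This is not the imbalanced-learn implementation; the function name `smote_like_oversample`, the toy data and the choice of `k` are assumptions for illustration.

```python
import numpy as np

def smote_like_oversample(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances between minority samples.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, len(X_min), size=n_new)              # random minority rows
    picked = neighbours[base, rng.integers(0, k, size=n_new)]   # one neighbour for each
    gap = rng.random((n_new, 1))                                # position on the segment
    return X_min[base] + gap * (X_min[picked] - X_min[base])

rng = np.random.default_rng(3301)
X_majority = rng.random((88, 3))
X_minority = rng.random((12, 3))

# Add as many synthetic minority rows as needed to match the majority class.
synthetic = smote_like_oversample(X_minority, n_new=len(X_majority) - len(X_minority))
X_balanced = np.vstack([X_majority, X_minority, synthetic])
y_balanced = np.array([0] * len(X_majority) + [1] * (len(X_minority) + len(synthetic)))
print(X_balanced.shape, np.bincount(y_balanced))
```

The shape printed by the real `SMOTE` call fits the same picture: 501466 = 2 × 250733, that is, two classes of equal size. Note that interpolation also produces fractional values for the ordinal-coded categorical columns.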


## RFC random search grid 

In [10]:


# Search space for the random forest; n_jobs and random_state are held fixed
params = {
    'max_depth': range(5, 30),
    'min_samples_leaf': [1, 2, 3, 5, 10, 20, 30, 50],
    'max_leaf_nodes': [10, 100, 1000, 10000],
    'n_jobs': [-1],
    'random_state': [3301]
}

# Note: the search is only defined here; it is not fitted in this notebook
rfc = RandomizedSearchCV(RandomForestClassifier(), params,
                         cv=3,
                         return_train_score=True,
                         scoring='roc_auc',
                         n_iter=15,
                         verbose=2,
                         random_state=3301)



## XGBoost grid

In [12]:
from xgboost import XGBClassifier

param_xgb = {
 'max_depth': range(3,10,2),
 'min_child_weight': range(1,6,2), 
 'gamma': [0, 0.5, 1, 10]
}

gsearch1 = GridSearchCV(estimator = XGBClassifier( learning_rate =0.05, 
                                                  n_estimators=500, 
                                                  max_depth=5,
                                                  min_child_weight=1, 
                                                  gamma=0, 
                                                  subsample=0.8, 
                                                  colsample_bytree=0.8,
                                                  objective= 'binary:logistic', 
                                                  nthread=4, 
                                                  scale_pos_weight=1, 
                                                  seed=3301), 
                        param_grid=param_xgb, 
                        scoring='roc_auc',
                        n_jobs=4,
                        cv=3,
                        verbose=1)

gsearch1.fit(X_train, y_train,
             eval_set=[(X_val, y_val)],
             eval_metric='auc',
             early_stopping_rounds=30,
             verbose=False)
print(gsearch1.best_params_, gsearch1.best_score_)

Fitting 3 folds for each of 48 candidates, totalling 144 fits


[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed: 46.8min
[Parallel(n_jobs=4)]: Done 144 out of 144 | elapsed: 157.9min finished


{'gamma': 0, 'max_depth': 5, 'min_child_weight': 1} 0.857535794853251
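The "48 candidates, totalling 144 fits" in the log follows directly from the grid. A quick way to check the size of a grid before launching a multi-hour search is scikit-learn's `ParameterGrid`; the dictionary below repeats `param_xgb` with the ranges written out as lists.

```python
from sklearn.model_selection import ParameterGrid

param_xgb = {
    "max_depth": list(range(3, 10, 2)),        # 3, 5, 7, 9
    "min_child_weight": list(range(1, 6, 2)),  # 1, 3, 5
    "gamma": [0, 0.5, 1, 10],
}

n_candidates = len(ParameterGrid(param_xgb))  # 4 * 3 * 4 combinations
n_fits = n_candidates * 3                     # one fit per fold with cv=3
print(n_candidates, n_fits)  # 48 144
```

Since every extra value multiplies the number of fits, it can pay off to search a few parameters at a time, or to switch to a randomized search with a fixed budget.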


The parameters below worked decently in my experiments.

In [13]:
from xgboost import XGBClassifier

# Random search result (score 0.8577270969019691); kept for reference, not used below
rs_params = {'subsample': 0.72, 
             'min_child_weight': 3, 'max_depth': 5, 'gamma': 3, 
             'colsample_bytree': 0.76} 

# Grid search result when searching only over min_child_weight and max_depth
gsparams = {'subsample': 0.8, 
            'min_child_weight': 5, 'max_depth': 5, 'gamma': 0, 
            'colsample_bytree': 0.8}

estimator = XGBClassifier( learning_rate =0.05, 
                          n_estimators=500, 
                          max_depth=5,
                          min_child_weight=5, 
                          gamma=0, 
                          subsample=0.8, 
                          colsample_bytree=0.8,
                          objective= 'binary:logistic', 
                          nthread=4, 
                          scale_pos_weight=1, 
                          seed=27)

estimator.fit(X_train, y_train,
             eval_set=[(X_val, y_val)],
             eval_metric='auc',
             early_stopping_rounds=30)

[0]	validation_0-auc:0.82267
[1]	validation_0-auc:0.82389
[2]	validation_0-auc:0.83751
[3]	validation_0-auc:0.84132
[4]	validation_0-auc:0.84354
[5]	validation_0-auc:0.84471
[6]	validation_0-auc:0.84841
[7]	validation_0-auc:0.84801
[8]	validation_0-auc:0.84786
[9]	validation_0-auc:0.84820
[10]	validation_0-auc:0.84968
[11]	validation_0-auc:0.84948
[12]	validation_0-auc:0.84944
[13]	validation_0-auc:0.84923
[14]	validation_0-auc:0.84956
[15]	validation_0-auc:0.85005
[16]	validation_0-auc:0.85087
[17]	validation_0-auc:0.85112
[18]	validation_0-auc:0.85122
[19]	validation_0-auc:0.85102
[20]	validation_0-auc:0.85091
[21]	validation_0-auc:0.85084
[22]	validation_0-auc:0.85066
[23]	validation_0-auc:0.85077
[24]	validation_0-auc:0.85074
[25]	validation_0-auc:0.85064
[26]	validation_0-auc:0.85081
[27]	validation_0-auc:0.85108
[28]	validation_0-auc:0.85120
[29]	validation_0-auc:0.85123
[30]	validation_0-auc:0.85137
[31]	validation_0-auc:0.85142
[32]	validation_0-auc:0.85137
[... output truncated ...]

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=0.8, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.05, max_delta_step=0, max_depth=5,
              min_child_weight=5, missing=nan, monotone_constraints='()',
              n_estimators=500, n_jobs=4, nthread=4, num_parallel_tree=1,
              random_state=27, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
              seed=27, subsample=0.8, tree_method='exact',
              validate_parameters=1, verbosity=None)

In [14]:
gsearch1.best_score_

0.857535794853251

In [15]:
gsearch1.best_params_

{'gamma': 0, 'max_depth': 5, 'min_child_weight': 1}

## Using oversampling
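The first cell below is wrong, but I kept it to highlight a mistake I made at first. It splits the already oversampled data into training and validation sets, so the validation set contains synthetic samples interpolated from training points and no longer has the original class balance. Evaluating on it grossly underestimates the generalization error: the validation AUC climbs above 0.95, far higher than anything measured on real data. The cells after it oversample only the training set and validate on untouched data.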
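The leak can be made visible with a small experiment. To keep the sketch dependency-free, I swap SMOTE for plain random duplication of minority rows, which makes leaked rows exact copies and therefore easy to count; with SMOTE the leaked rows are near-copies instead. The data and both helper functions are made up for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_demo = rng.random((1000, 2))  # continuous features, so the original rows are unique
y_demo = np.array([1] * 120 + [0] * 880)

def oversample_by_duplication(X, y, seed=0):
    """Balance the classes by repeating random minority rows (a stand-in for SMOTE)."""
    rng = np.random.default_rng(seed)
    minority = np.flatnonzero(y == 1)
    extra = rng.choice(minority, size=(y == 0).sum() - len(minority), replace=True)
    keep = np.concatenate([np.arange(len(y)), extra])
    return X[keep], y[keep]

def shared_rows(A, B):
    """Number of rows of B that also occur in A."""
    seen = {tuple(row) for row in A}
    return sum(tuple(row) in seen for row in B)

# Wrong order: oversample, then split. Copies of one row land on both sides.
X_os, y_os = oversample_by_duplication(X_demo, y_demo)
Xw_tr, Xw_va, yw_tr, yw_va = train_test_split(X_os, y_os, random_state=123)
leaked_wrong = shared_rows(Xw_tr, Xw_va)

# Right order: split, then oversample the training part only.
Xr_tr, Xr_va, yr_tr, yr_va = train_test_split(X_demo, y_demo, random_state=123)
Xr_tr_os, yr_tr_os = oversample_by_duplication(Xr_tr, yr_tr)
leaked_right = shared_rows(Xr_tr_os, Xr_va)

print(leaked_wrong, leaked_right)
```

In the wrong order many validation rows have a twin in the training set, so the model is partly graded on data it has already seen. In the right order the validation set shares no rows with the training set and keeps its original class ratio.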

In [16]:
from xgboost import XGBClassifier

from sklearn.model_selection import RandomizedSearchCV, GridSearchCV

Xs_train, Xs_val, ys_train, ys_val = train_test_split(Xt_smoted, yt_smoted, random_state=123)

estimator = XGBClassifier( learning_rate =0.05, 
                          n_estimators=300, 
                          max_depth=5,
                          min_child_weight=5, 
                          gamma=0, 
                          subsample=0.8, 
                          colsample_bytree=0.8,
                          objective= 'binary:logistic', 
                          nthread=4, 
                          scale_pos_weight=1, 
                          seed=27)

estimator.fit(Xs_train, ys_train,
             eval_set=[(Xs_val, ys_val)],
             eval_metric='auc',
             early_stopping_rounds=30)

[0]	validation_0-auc:0.84291
[1]	validation_0-auc:0.85778
[2]	validation_0-auc:0.85822
[3]	validation_0-auc:0.85799
[4]	validation_0-auc:0.85792
[5]	validation_0-auc:0.85832
[6]	validation_0-auc:0.85908
[7]	validation_0-auc:0.86190
[8]	validation_0-auc:0.86240
[9]	validation_0-auc:0.86586
[10]	validation_0-auc:0.86605
[11]	validation_0-auc:0.86702
[12]	validation_0-auc:0.86742
[13]	validation_0-auc:0.86816
[14]	validation_0-auc:0.86964
[15]	validation_0-auc:0.86908
[16]	validation_0-auc:0.87335
[17]	validation_0-auc:0.87382
[18]	validation_0-auc:0.87446
[19]	validation_0-auc:0.87482
[20]	validation_0-auc:0.87563
[21]	validation_0-auc:0.87573
[22]	validation_0-auc:0.87644
[23]	validation_0-auc:0.87840
[24]	validation_0-auc:0.87845
[25]	validation_0-auc:0.88037
[26]	validation_0-auc:0.88014
[27]	validation_0-auc:0.88047
[28]	validation_0-auc:0.88147
[29]	validation_0-auc:0.88211
[30]	validation_0-auc:0.88231
[31]	validation_0-auc:0.88377
[32]	validation_0-auc:0.88363
[... output truncated ...]



[267]	validation_0-auc:0.95269
[268]	validation_0-auc:0.95282
[269]	validation_0-auc:0.95300
[270]	validation_0-auc:0.95303
[271]	validation_0-auc:0.95312
[272]	validation_0-auc:0.95314
[273]	validation_0-auc:0.95317
[274]	validation_0-auc:0.95317
[275]	validation_0-auc:0.95325
[276]	validation_0-auc:0.95328
[277]	validation_0-auc:0.95338
[278]	validation_0-auc:0.95339
[279]	validation_0-auc:0.95353
[280]	validation_0-auc:0.95353
[281]	validation_0-auc:0.95356
[282]	validation_0-auc:0.95378
[283]	validation_0-auc:0.95382
[284]	validation_0-auc:0.95387
[285]	validation_0-auc:0.95403
[286]	validation_0-auc:0.95448
[287]	validation_0-auc:0.95469
[288]	validation_0-auc:0.95480
[289]	validation_0-auc:0.95486
[290]	validation_0-auc:0.95499
[291]	validation_0-auc:0.95503
[292]	validation_0-auc:0.95510
[293]	validation_0-auc:0.95515
[294]	validation_0-auc:0.95517
[295]	validation_0-auc:0.95536
[296]	validation_0-auc:0.95539
[297]	validation_0-auc:0.95554
[298]	validation_0-auc:0.95565
[... output truncated ...]

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=0.8, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.05, max_delta_step=0, max_depth=5,
              min_child_weight=5, missing=nan, monotone_constraints='()',
              n_estimators=300, n_jobs=4, nthread=4, num_parallel_tree=1,
              random_state=27, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
              seed=27, subsample=0.8, tree_method='exact',
              validate_parameters=1, verbosity=None)

In [17]:
from xgboost import XGBClassifier
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV

Xs_train, ys_train = smote.fit_resample(X_train, y_train)

In [18]:
estimator = XGBClassifier( learning_rate =0.2, 
                          n_estimators=300,
                          max_depth=5,
                          min_child_weight=5, 
                          gamma=0, 
                          subsample=0.8, 
                          colsample_bytree=0.8,
                          objective= 'binary:logistic', 
                          nthread=4, 
                          scale_pos_weight=1, 
                          seed=27)

estimator.fit(Xs_train, ys_train,
             eval_set=[(X_val, y_val)],
             eval_metric='auc',
             early_stopping_rounds=50)



[0]	validation_0-auc:0.81396
[1]	validation_0-auc:0.83304
[2]	validation_0-auc:0.83546
[3]	validation_0-auc:0.83714
[4]	validation_0-auc:0.84048
[5]	validation_0-auc:0.84062
[6]	validation_0-auc:0.84127
[7]	validation_0-auc:0.84141
[8]	validation_0-auc:0.84231
[9]	validation_0-auc:0.84273
[10]	validation_0-auc:0.84263
[11]	validation_0-auc:0.84284
[12]	validation_0-auc:0.84307
[13]	validation_0-auc:0.84308
[14]	validation_0-auc:0.84527
[15]	validation_0-auc:0.84530
[16]	validation_0-auc:0.84499
[17]	validation_0-auc:0.84491
[18]	validation_0-auc:0.84520
[19]	validation_0-auc:0.84541
[20]	validation_0-auc:0.84552
[21]	validation_0-auc:0.84622
[22]	validation_0-auc:0.84644
[23]	validation_0-auc:0.84612
[24]	validation_0-auc:0.84590
[25]	validation_0-auc:0.84564
[26]	validation_0-auc:0.84553
[27]	validation_0-auc:0.84561
[28]	validation_0-auc:0.84560
[29]	validation_0-auc:0.84566
[30]	validation_0-auc:0.84535
[31]	validation_0-auc:0.84554
[32]	validation_0-auc:0.84592
[... output truncated ...]

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=0.8, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.2, max_delta_step=0, max_depth=5,
              min_child_weight=5, missing=nan, monotone_constraints='()',
              n_estimators=300, n_jobs=4, nthread=4, num_parallel_tree=1,
              random_state=27, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
              seed=27, subsample=0.8, tree_method='exact',
              validate_parameters=1, verbosity=None)

In [19]:
from sklearn.metrics import confusion_matrix

y_pred = estimator.predict(X_val)
# Only the last expression of a cell is displayed, so keep the F1 score in a variable
f1 = f1_score(y_val, y_pred)
confusion_matrix(y_val, y_pred)

array([[66600, 17066],
       [ 3572,  8040]])
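Since the F1 score itself is not displayed above, it can be recovered from the confusion matrix. The sketch below copies the four counts by hand and assumes scikit-learn's layout, where rows are true classes and columns are predicted classes.

```python
# Counts copied from the confusion matrix above (rows: true class, columns: predicted).
tn, fp = 66600, 17066
fn, tp = 3572, 8040

precision = tp / (tp + fp)  # share of predicted buyers who really are buyers
recall = tp / (tp + fn)     # share of real buyers that the model finds
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tn + fp + fn + tp)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} accuracy={accuracy:.3f}")
# precision=0.320 recall=0.692 f1=0.438 accuracy=0.783
```

At the default threshold the model finds about 69% of the buyers, but only about a third of its positive predictions are correct. Its accuracy of 0.783 is lower than the 0.878 that an always-negative model would reach on this validation set (83666 of 95278 rows are negative), which again shows why accuracy is not the metric to watch here.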

## Conclusion
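Using XGBoost, we obtained a ROC AUC of about 0.84 on the held-out validation set with the model trained on SMOTE-oversampled data, and 0.8575 in 3-fold cross-validation with the grid-searched model trained on the original data. At the default 0.5 threshold, the SMOTE-trained model recovers 8040 of the 11612 positive validation cases at the cost of 17066 false positives.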