<div class="alert alert-block alert-success">
<b>Kernel Author:</b>  <br>
<a href="https://bhishanpdl.github.io/" target="_blank">Bhishan Poudel, Ph.D Astrophysics</a>.
</div>

# Description
In this project we use multiclass classification to predict one of the 8 possible values of `Response`.

The data is taken from the Kaggle Prudential Life Insurance Assessment competition.

Only about 40% of households in the USA have a life insurance policy. Based on the applicant's attributes, one of 8 different quotes is granted.

Category 8 has the highest count; I assume it is the quote that is granted.
```
Records: 60k
Features: 127
Target: Response (has 8 categories, 1-8)

```

Features:
```
1 Misc             : Age ht wt bmi              4
2 Product Info     : Product_Info_1 to 7        7
3 Employment Info  : Employment_Info_1 to 6     6
4 Insured Info     : InsuredInfo_1 to 7         7
5 Insurance History: Insurance_History_1 to 9   9
6 Family History   : Family_Hist_1 to 5         5
7 Medical History  : Medical_History_1 to 41    41
8 Medical Keywords : Medical_Keyword_1 to 48    48
Target: Response                                1
ID    : ID                                      1
---------------------------------------------------
Total Features: 127
Dependent Variable: 1 (Response)
```

Method Used:
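- XGBoost (trained with a regression objective; the continuous predictions are rounded to the 8 classes and adjusted with per-bin offsets)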

Metric Used:
- Quadratic Weighted Kappa (Cohen's kappa with quadratic weights)
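
For reference, the usual definition of the quadratic weighted kappa for $N$ ordered categories (here $N = 8$) can be written as below. The symbols are notation chosen for this note, not taken from the competition page: $O_{ij}$ is the number of samples with true class $i$ and predicted class $j$, and $E_{ij}$ is the count expected by chance, i.e. the outer product of the two marginal histograms scaled to the same total as $O$.

$$
w_{ij} = \frac{(i - j)^2}{(N - 1)^2},
\qquad
\kappa = 1 - \frac{\sum_{i,j} w_{ij}\, O_{ij}}{\sum_{i,j} w_{ij}\, E_{ij}}
$$

A value of $\kappa = 1$ means perfect agreement and $\kappa = 0$ means agreement no better than chance. Because of the squared distance in $w_{ij}$, predicting class 8 for a true class 1 is penalised far more than predicting class 2.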

**References**  
- https://www.kaggle.com/zeroblue/xgboost-with-optimized-offsets

# Imports

In [1]:
%%capture
# %%capture suppresses the output of this cell in the notebook

import os
import sys
ENV_COLAB = 'google.colab' in sys.modules

if ENV_COLAB:
    ### mount google drive
    from google.colab import drive
    drive.mount('/content/drive')

    ### load the data dir
    dat_dir = 'drive/My Drive/Colab Notebooks/data/'
    sys.path.append(dat_dir)

    ### Image dir
    img_dir = 'drive/My Drive/Colab Notebooks/images/'
    if not os.path.isdir(img_dir): os.makedirs(img_dir)
    sys.path.append(img_dir)

    ### Output dir
    out_dir = 'drive/My Drive/Colab Notebooks/outputs/'
    if not os.path.isdir(out_dir): os.makedirs(out_dir)
    sys.path.append(out_dir)

    ### Also install my custom module
    module_dir = 'drive/My Drive/Colab Notebooks/Bhishan_Modules/' 
    sys.path.append(module_dir)
    # `!cd` runs in a throwaway subshell and the path contains spaces,
    # so install the module directly from its (quoted) path instead
    !pip install -e "{module_dir}bhishan"
    import bhishan
    from bhishan import bp

    ## upgrade
    !pip install -U xgboost

    #### print
    print('Environment: Google Colaboratory.')

# NOTE: If we update modules in gcolab, we need to restart runtime.

In [2]:
import time
import numpy as np
import pandas as pd
import seaborn as sns
import os
import json

from tqdm.notebook import tqdm  # tqdm_notebook is deprecated in newer tqdm
import matplotlib.pyplot as plt

%matplotlib inline
%config InlineBackend.figure_format = 'retina'
plt.style.use('ggplot') 

# random state
SEED=100

home = os.path.expanduser('~')

[(x.__name__,x.__version__) for x in [np,pd,sns]]

[('numpy', '1.18.5'), ('pandas', '1.0.4'), ('seaborn', '0.10.1')]

In [3]:
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedKFold

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


from mlxtend.feature_selection import ColumnSelector

from sklearn import metrics

from sklearn.metrics import make_scorer
from sklearn.model_selection import cross_val_score





In [4]:
import xgboost as xgb

xgb.__version__

'1.1.1'

In [5]:
from scipy.optimize import fmin_powell

# Data Cleaning

In [7]:
def data_cleaning():
    df = pd.read_csv('https://github.com/bhishanpdl/Datasets/blob/master/Prudential_Insurance/raw/train.csv.zip?raw=true',compression='zip')
    columns_to_drop = ['Id', 'Medical_History_10','Medical_History_24']
    df = df.drop(columns_to_drop,axis=1)
    df['Product_Info_2_char'] = df.Product_Info_2.str[0]
    df['Product_Info_2_num'] = df.Product_Info_2.str[1]

    # factorize categorical variables
    df['Product_Info_2'] = pd.factorize(df['Product_Info_2'])[0]
    df['Product_Info_2_char'] = pd.factorize(df['Product_Info_2_char'])[0]
    df['Product_Info_2_num'] = pd.factorize(df['Product_Info_2_num'])[0]

    df['BMI_Age'] = df['BMI'] * df['Ins_Age']

    med_keyword_columns = df.columns[df.columns.str.startswith('Medical_Keyword_')]
    df['Med_Keywords_Count'] = df[med_keyword_columns].sum(axis=1)
    df = df.fillna(-1)

    return df

df = data_cleaning()
print(df.shape)
df.isna().sum().sum(), df.sum().sum()

(59381, 129)


(0, 26897356.818315115)
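
To make the cleaning steps easier to follow, here is a minimal sketch that applies the same operations to a tiny hand-made frame. The rows are hypothetical; only the column names follow the real data set. It shows how the two-character `Product_Info_2` code is split, how `pd.factorize` turns strings into integer codes, and how the keyword count and the `-1` fill behave.

```python
import pandas as pd

# Hypothetical toy rows; only the column names follow the real data set.
toy = pd.DataFrame({
    'Product_Info_2': ['D3', 'A1', 'D3', 'E1'],
    'Medical_Keyword_1': [0, 1, 0, 1],
    'Medical_Keyword_2': [1, 1, 0, 0],
    'Family_Hist_2': [0.5, None, 0.3, None],
})

# Split the two-character code into its letter and its digit.
toy['Product_Info_2_char'] = toy['Product_Info_2'].str[0]
toy['Product_Info_2_num'] = toy['Product_Info_2'].str[1]

# pd.factorize assigns integer codes in order of first appearance.
for col in ['Product_Info_2', 'Product_Info_2_char', 'Product_Info_2_num']:
    toy[col] = pd.factorize(toy[col])[0]

# Count how many medical keywords are set for each row.
kw_cols = toy.columns[toy.columns.str.startswith('Medical_Keyword_')]
toy['Med_Keywords_Count'] = toy[kw_cols].sum(axis=1)

# Use -1 as a sentinel for missing values.
toy = toy.fillna(-1)
print(toy)
```

Note that the factorized codes carry no order: `D3` gets code 0 only because it appears first. Tree models can still split on such codes, which is presumably why this simple encoding is enough here.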

In [8]:
def get_already_cleaned_data():
    file_data = out_dir + 'Prudential/' + 'clean_data.csv'
    df = pd.read_csv(file_data,compression='zip')
    df = df.drop('Id',axis=1)
    file_features = out_dir + 'Prudential/'+'categorical_features.json'
    with open(file_features) as fi:
        cols_cat = json.load(fi)
    df = pd.get_dummies(df,columns=cols_cat,drop_first=True)

    return df

# this gives worse result. 
# df = get_already_cleaned_data()
# print(df.shape)
# df.isna().sum().sum(), df.sum().sum()

# Train-test Split with Stratify

In [10]:
from sklearn.model_selection import train_test_split

target = 'Response'
df_Xtrain, df_Xtest, ser_ytrain, ser_ytest = train_test_split(
    df.drop(target,axis=1), df[target],
    test_size=0.2, random_state=SEED, stratify=df[target])


ytrain = ser_ytrain.to_numpy().ravel()
ytest = ser_ytest.to_numpy().ravel()

print(f"df             : {df.shape}")
print(f"\ndf_Xtrain      : {df_Xtrain.shape}")
print(f"ser_ytrain     : {ser_ytrain.shape}")

print(f"\ndf_Xtest       : {df_Xtest.shape}")
print(f"ser_ytest      : {ser_ytest.shape}")

df_Xtrain.head(2)

df             : (59381, 129)

df_Xtrain      : (47504, 128)
ser_ytrain     : (47504,)

df_Xtest       : (11877, 128)
ser_ytest      : (11877,)


Unnamed: 0,Product_Info_1,Product_Info_2,Product_Info_3,Product_Info_4,Product_Info_5,Product_Info_6,Product_Info_7,Ins_Age,Ht,Wt,BMI,Employment_Info_1,Employment_Info_2,Employment_Info_3,Employment_Info_4,Employment_Info_5,Employment_Info_6,InsuredInfo_1,InsuredInfo_2,InsuredInfo_3,InsuredInfo_4,InsuredInfo_5,InsuredInfo_6,InsuredInfo_7,Insurance_History_1,Insurance_History_2,Insurance_History_3,Insurance_History_4,Insurance_History_5,Insurance_History_7,Insurance_History_8,Insurance_History_9,Family_Hist_1,Family_Hist_2,Family_Hist_3,Family_Hist_4,Family_Hist_5,Medical_History_1,Medical_History_2,Medical_History_3,...,Medical_Keyword_13,Medical_Keyword_14,Medical_Keyword_15,Medical_Keyword_16,Medical_Keyword_17,Medical_Keyword_18,Medical_Keyword_19,Medical_Keyword_20,Medical_Keyword_21,Medical_Keyword_22,Medical_Keyword_23,Medical_Keyword_24,Medical_Keyword_25,Medical_Keyword_26,Medical_Keyword_27,Medical_Keyword_28,Medical_Keyword_29,Medical_Keyword_30,Medical_Keyword_31,Medical_Keyword_32,Medical_Keyword_33,Medical_Keyword_34,Medical_Keyword_35,Medical_Keyword_36,Medical_Keyword_37,Medical_Keyword_38,Medical_Keyword_39,Medical_Keyword_40,Medical_Keyword_41,Medical_Keyword_42,Medical_Keyword_43,Medical_Keyword_44,Medical_Keyword_45,Medical_Keyword_46,Medical_Keyword_47,Medical_Keyword_48,Product_Info_2_char,Product_Info_2_num,BMI_Age,Med_Keywords_Count
616,1,10,26,0.230769,2,3,1,0.059701,0.727273,0.225941,0.341911,0.03,9,1,0.0,2,0.015,1,2,3,3,1,1,1,2,1,1,3,-1.0,3,2,3,3,0.376812,-1.0,0.366197,-1.0,-1.0,162,2,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0.020413,0
3239,1,0,26,0.230769,2,3,1,0.41791,0.654545,0.209205,0.376858,0.09,14,1,0.0,2,0.6,1,2,8,3,1,2,1,2,1,3,1,0.000333,1,3,2,2,0.710145,-1.0,-1.0,0.419643,3.0,413,2,...,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.157493,2
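
The effect of `stratify` can be checked on synthetic labels. The sketch below is an illustration only: the class probabilities are made up and are not the real `Response` frequencies. It compares the per-class shares of the full data with those of the two splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(100)

# Hypothetical imbalanced labels in 1..8 (not the real class frequencies).
y = rng.choice(np.arange(1, 9), size=4000,
               p=[0.10, 0.11, 0.02, 0.02, 0.09, 0.19, 0.14, 0.33])
X = rng.rand(y.size, 3)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=100, stratify=y)

def class_shares(labels):
    """Fraction of samples in each class 1..8."""
    return np.bincount(labels, minlength=9)[1:] / labels.size

# With stratify=y the shares of both splits stay close to the full data.
print(np.round(class_shares(y), 3))
print(np.round(class_shares(y_tr), 3))
print(np.round(class_shares(y_te), 3))
```

Without stratification, the rare classes could end up over- or under-represented in the 20% test split, which would make the kappa score on the test set noisier.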


In [19]:
dtrain = xgb.DMatrix(df_Xtrain, label=ser_ytrain)
dtest = xgb.DMatrix(df_Xtest, label=ser_ytest)

# Evaluation Metric

In [12]:
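def eval_wrapper(yhat, y):
    """Quadratic weighted kappa between (rounded) predictions yhat and labels y."""
    y = np.array(y).astype(int)
    yhat = np.array(yhat)
    # round to the nearest class and clip to the observed label range
    yhat = np.clip(np.round(yhat), np.min(y), np.max(y)).astype(int)
    return metrics.cohen_kappa_score(yhat, y, weights='quadratic')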
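
To see what the metric computes, the following sketch builds the quadratic weighted kappa directly from the confusion matrix with NumPy. The function name `quadratic_weighted_kappa` and the sample values are made up for illustration; the result should agree with `cohen_kappa_score(..., weights='quadratic')`, which is what `eval_wrapper` calls.

```python
import numpy as np

def quadratic_weighted_kappa(y_true, y_pred, min_rating=1, max_rating=8):
    """Quadratic weighted kappa from the confusion matrix (illustrative sketch)."""
    y_true = np.asarray(y_true, dtype=int) - min_rating
    y_pred = np.asarray(y_pred, dtype=int) - min_rating
    n = max_rating - min_rating + 1

    # Observed counts: rows are true classes, columns are predicted classes.
    observed = np.zeros((n, n))
    np.add.at(observed, (y_true, y_pred), 1)

    # Counts expected by chance if the two ratings were independent.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()

    # Quadratic penalty: grows with the squared distance between classes.
    idx = np.arange(n)
    weights = (idx[:, None] - idx[None, :]) ** 2 / (n - 1) ** 2

    return 1.0 - (weights * observed).sum() / (weights * expected).sum()

y_true = np.array([1, 2, 3, 4, 5, 6, 7, 8])
raw_pred = np.array([0.2, 2.4, 3.1, 5.6, 4.9, 6.0, 7.7, 9.3])

# Same post-processing as eval_wrapper: round, then clip to the label range.
y_pred = np.clip(np.round(raw_pred), 1, 8).astype(int)
print(y_pred)
print(quadratic_weighted_kappa(y_true, y_pred))
```

In this toy case two of the eight predictions are off (by 2 and by 1 class), and the miss by 2 classes costs four times as much as the miss by 1.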

# Modelling xgboost classifier

In [13]:
import xgboost
from xgboost import XGBClassifier

xgboost.__version__

'1.1.1'

In [14]:
def get_params():
    params = {}
    params["objective"] = "reg:squarederror"  
    params["eta"] = 0.05
    params["min_child_weight"] = 240
    params["subsample"] = 0.9
    params["colsample_bytree"] = 0.67
    params["max_depth"] = 6
    params_lst = list(params.items())

    return params_lst

params_lst = get_params()
print(params_lst)

[('objective', 'reg:squarederror'), ('eta', 0.05), ('min_child_weight', 240), ('subsample', 0.9), ('colsample_bytree', 0.67), ('max_depth', 6)]


In [15]:
xgb_num_rounds = 800
num_classes = 8

In [17]:
%%time
# train model
model = xgb.train(params_lst, dtrain, xgb_num_rounds) 

CPU times: user 4min 12s, sys: 144 ms, total: 4min 13s
Wall time: 2min 8s


In [20]:
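# get preds
# NOTE: ntree_limit works with xgboost 1.1.1 (used here); later releases
# deprecate it in favour of the iteration_range argument.
train_preds = model.predict(dtrain, ntree_limit=model.best_iteration)
test_preds = model.predict(dtest, ntree_limit=model.best_iteration)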

train_preds = np.clip(train_preds, -0.99, 8.99)
test_preds = np.clip(test_preds, -0.99, 8.99)

print('Train score is:', eval_wrapper(train_preds, ytrain))
print('Test score is :', eval_wrapper(test_preds, ytest))

"""
Train score is: 0.6691834159910361
Test score is : 0.6040096397174282

Using already cleaned data:

Train score is: 0.6594926527156114
Test score is : 0.5903855023812505
""";

Train score is: 0.6691834159910361
Test score is : 0.6040096397174282


# Training Offsets

In [21]:
def apply_offset(data, bin_offset, sv, scorer=eval_wrapper):
    """Shift the predictions that fall in bin `sv` by `bin_offset` and score them.

    data has shape (3, N):
      row 0 = raw predictions, row 1 = offset predictions, row 2 = true labels
    e.g. data = np.vstack((train_preds, train_preds, ytrain))

    NOTE: data is modified in place (row 1 is overwritten for bin `sv`).
    """
    # a prediction belongs to bin sv if its integer part equals sv
    mask = data[0].astype(int) == sv
    data[1, mask] = data[0, mask] + bin_offset

    return scorer(data[1], data[2])

In [23]:
offsets = np.array([0.1, -1, -2, -1, -0.8, 0.02, 0.8, 1])
data = np.vstack((train_preds, train_preds, ytrain))
data.shape

(3, 47504)

In [24]:
for i in range(num_classes):
    data[1, data[0].astype(int)==i] = \
    data[0, data[0].astype(int)==i] + offsets[i] 
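
The binning rule used above is easy to miss, so here is a small sketch with made-up predictions and the same starting offsets. `astype(int)` truncates toward zero, so every value in (-1, 1) lands in bin 0, and because the loop runs over `range(8)`, bin 8 receives no offset.

```python
import numpy as np

# Hypothetical continuous predictions, already clipped to [-0.99, 8.99].
preds = np.array([-0.99, 0.4, 1.7, 2.2, 5.9, 7.5, 8.99])
offsets = np.array([0.1, -1, -2, -1, -0.8, 0.02, 0.8, 1])

# astype(int) truncates toward zero, so every value in (-1, 1) lands in bin 0.
bins = preds.astype(int)
print(bins)

shifted = preds.copy()
for i in range(len(offsets)):
    mask = bins == i
    shifted[mask] = preds[mask] + offsets[i]

# Bin 8 has no offset here, so values in [8, 8.99] pass through unchanged.
print(shifted)
```

The bin membership is always taken from the raw predictions (row 0), never from the shifted ones, so the order in which the offsets are applied does not change which samples belong to which bin.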

In [25]:
for i in range(num_classes):
    func_train_offset = lambda x: -apply_offset(data, x, i)
    # NOTE: fmin_powell needs an initial guess for the minimization;
    # here the initial guess is the hand-picked offsets[i].
    # The result depends on the starting value, so the search may have to be
    # repeated with different guesses to get useful offsets.
    offsets[i] = fmin_powell(func_train_offset, offsets[i]) 

Optimization terminated successfully.
         Current function value: -0.718521
         Iterations: 1
         Function evaluations: 14
Optimization terminated successfully.
         Current function value: -0.718521
         Iterations: 1
         Function evaluations: 14
Optimization terminated successfully.
         Current function value: -0.718521
         Iterations: 1
         Function evaluations: 14
Optimization terminated successfully.
         Current function value: -0.718849
         Iterations: 2
         Function evaluations: 81
Optimization terminated successfully.
         Current function value: -0.718879
         Iterations: 1
         Function evaluations: 24
Optimization terminated successfully.
         Current function value: -0.718900
         Iterations: 1
         Function evaluations: 18
Optimization terminated successfully.
         Current function value: -0.719639
         Iterations: 2
         Function evaluations: 50
Optimization terminated successful
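
The offset search can be reproduced on synthetic data. The sketch below is a simplified variation, not the notebook's exact procedure: the data are made up, the objective works on a copy instead of mutating a shared array, and a new offset is kept only if it does not lower the score (a guard added here for illustration). All names in it (`kappa`, `neg_score`, `shifted`) are hypothetical.

```python
import numpy as np
from scipy.optimize import fmin_powell
from sklearn.metrics import cohen_kappa_score

def kappa(pred, y):
    """Quadratic weighted kappa after rounding and clipping to 1..8."""
    pred = np.clip(np.round(pred), 1, 8).astype(int)
    return cohen_kappa_score(pred, y, weights='quadratic')

# Hypothetical data: predictions are the labels shrunk towards the mean plus noise.
rng = np.random.RandomState(100)
y = rng.randint(1, 9, size=2000)
raw = np.clip(4.5 + 0.6 * (y - 4.5) + rng.normal(0, 0.8, y.size), -0.99, 8.99)

bins = raw.astype(int)   # fixed bin membership from the raw predictions
shifted = raw.copy()     # predictions with the offsets applied so far
offsets = np.zeros(8)
baseline = kappa(raw, y)

def neg_score(x, i):
    """Negative kappa when bin i is shifted by x (fmin_powell minimises)."""
    trial = shifted.copy()
    trial[bins == i] = raw[bins == i] + np.ravel(x)[0]
    return -kappa(trial, y)

for i in range(8):
    found = fmin_powell(neg_score, offsets[i], args=(i,), disp=False)
    best = float(np.ravel(found)[0])
    # Keep the new offset only if it does not lower the score.
    if -neg_score(best, i) >= -neg_score(offsets[i], i):
        offsets[i] = best
    shifted[bins == i] = raw[bins == i] + offsets[i]

print('kappa without offsets:', baseline)
print('kappa with offsets   :', kappa(shifted, y))
```

Because the predictions are rounded before scoring, the objective is piecewise constant in the offset, which is one reason the outcome is sensitive to the starting guess. Offsets fitted this way are tuned on the training labels, so the gain should always be confirmed on held-out data, as the next cell does.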

In [26]:
# apply offsets to test
data = np.vstack((test_preds, test_preds, ytest))
for i in range(num_classes):
    data[1, data[0].astype(int)==i] = \
    data[0, data[0].astype(int)==i] + offsets[i] 

final_test_preds = np.round(np.clip(data[1], 1, 8)).astype(int)

In [27]:
print('Test score using offset is :', eval_wrapper(final_test_preds, ytest))

Test score using offset is : 0.6490565088506439
