<div class="alert alert-block alert-success">
<b>Kernel Author:</b>  <br>
<a href="https://bhishanpdl.github.io/" target="_blank">Bhishan Poudel, Ph.D Astrophysics</a>.
</div>

# Description
In this project we use multiclass classification to predict one of the 8 possible values of `Response`.

The data is taken from the Kaggle Prudential Life Insurance project.

Only about 40% of households in the USA have a life insurance policy. Based on different attributes of the applicant, one of 8 different quotes is granted.

Category 8 has the highest count, so I assume it is the quote that is granted.
```
Records: 60k
Features: 127
Target: Response (has 8 categories, 1-8)

```

Features:
```
1 Misc             : Age ht wt bmi              4
2 Product Info     : Product_Info_1 to 7        7
3 Employment Info  : Employment_Info_1 to 6     6
4 Insured Info     : InsuredInfo_1 to 7         7
5 Insurance History: Insurance_History_1 to 9   9
6 Family History   : Family_Hist_1 to 5         5
7 Medical History  : Medical_History_1 to 41    41
8 Medical Keywords : Medical_Keyword_1 to 48    48
Target: Response                                1
ID    : ID                                      1
---------------------------------------------------
Total Features: 127
Dependent Variable: 1 (Response)
```

Method Used:
- XGBoost

Metric Used:
- Quadratic Weighted Kappa (Cohen's kappa with quadratic weights)
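
To make the metric concrete, here is a minimal sketch that computes quadratic weighted kappa from its definition and compares it with `metrics.cohen_kappa_score`. The helper name `qwk_by_hand` and the toy labels are illustrative assumptions and are not part of the original notebook.

```python
import numpy as np
from sklearn import metrics

def qwk_by_hand(y_true, y_pred, n_classes):
    """Quadratic weighted kappa for integer labels in 1..n_classes."""
    y_true = np.asarray(y_true) - 1  # shift labels to 0-based indices
    y_pred = np.asarray(y_pred) - 1

    # Observed agreement matrix: rows are true labels, columns are predictions.
    observed = np.zeros((n_classes, n_classes))
    for t, p in zip(y_true, y_pred):
        observed[t, p] += 1

    # Expected matrix if true labels and predictions were independent.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()

    # Quadratic penalty: grows with the squared distance between classes.
    idx = np.arange(n_classes)
    weights = (idx[:, None] - idx[None, :]) ** 2

    return 1.0 - (weights * observed).sum() / (weights * expected).sum()

# Toy labels (made up for illustration).
y_true = [1, 2, 3, 4, 5, 6, 7, 8, 8, 8]
y_pred = [1, 2, 4, 4, 5, 7, 7, 8, 6, 8]

by_hand = qwk_by_hand(y_true, y_pred, n_classes=8)
from_sklearn = metrics.cohen_kappa_score(y_true, y_pred, weights='quadratic')
print(by_hand, from_sklearn)
```

Because of the quadratic weights, predicting class 6 for a true class 8 is penalized four times as much as predicting class 7.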

**References**  
- https://www.kaggle.com/zeroblue/xgboost-with-optimized-offsets

**Notes about offset**  
In this project the evaluation metric is kappa, but when we fit a regression model with xgboost the loss function is squared error (MSE). Predictions that are optimal for MSE may not be optimal for kappa.

For an ordinal metric such as kappa, we assume that shifting the MSE predictions before converting them to classes can give a better score. For example, a prediction of 1.6 rounds to class 2. With an offset of 1, the prediction becomes 1.6 + 1 = 2.6, which rounds to class 3. Changing the class in this way may give better results. In this notebook the shift is expressed as seven cut points between the 8 classes, fitted on the train predictions.
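
The 1.6 example can be sketched in a few lines. The helper `to_class` and the sample predictions are assumptions made for illustration. The sketch shows that adding an offset before rounding is the same as moving the cut points between classes.

```python
import numpy as np

def to_class(preds, cut_points):
    """Map continuous predictions to classes 1..8 using 7 sorted cut points."""
    # side='right' counts the cut points that are <= each prediction.
    return np.searchsorted(cut_points, preds, side='right') + 1

preds = np.array([1.6, 2.4, 4.9, 7.2])

# Cut points that reproduce plain rounding: 1.5, 2.5, ..., 7.5.
rounding_cuts = np.arange(1.5, 8.0, 1.0)

# Adding an offset of 1 to every prediction before rounding ...
shifted = to_class(preds + 1.0, rounding_cuts)
# ... is the same as lowering every cut point by 1.
lowered = to_class(preds, rounding_cuts - 1.0)

print(to_class(preds, rounding_cuts))  # [2 2 5 7]
print(shifted)                         # [3 3 6 8]
print(lowered)                         # [3 3 6 8]
```

Fitting each cut point separately is more flexible than one global offset, because each class boundary can move by a different amount.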

# Imports

In [1]:
import time
import numpy as np
import pandas as pd
import seaborn as sns
import os
import json
from tqdm import tqdm_notebook as tqdm
import matplotlib.pyplot as plt

%matplotlib inline
%config InlineBackend.figure_format = 'retina'
plt.style.use('ggplot') 
SEED=100
home = os.path.expanduser('~')
time_start_notebook = time.time()

from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedKFold
from sklearn import metrics
from scipy.optimize import fmin_powell
import xgboost as xgb
xgb.__version__

'1.1.1'

# Data Cleaning

In [2]:
def data_cleaning():
    df = pd.read_csv('https://github.com/bhishanpdl/Datasets/blob/master/Prudential_Insurance/raw/train.csv.zip?raw=true',compression='zip')
    columns_to_drop = ['Id', 'Medical_History_10','Medical_History_24']
    df = df.drop(columns_to_drop,axis=1)
    df['Product_Info_2_char'] = df.Product_Info_2.str[0]
    df['Product_Info_2_num'] = df.Product_Info_2.str[1]

    # factorize categorical variables
    df['Product_Info_2'] = pd.factorize(df['Product_Info_2'])[0]
    df['Product_Info_2_char'] = pd.factorize(df['Product_Info_2_char'])[0]
    df['Product_Info_2_num'] = pd.factorize(df['Product_Info_2_num'])[0]

    df['BMI_Age'] = df['BMI'] * df['Ins_Age']

    med_keyword_columns = df.columns[df.columns.str.startswith('Medical_Keyword_')]
    df['Med_Keywords_Count'] = df[med_keyword_columns].sum(axis=1)
    df = df.fillna(-1)

    return df

df = data_cleaning()
print(df.shape)
df.isna().sum().sum(), df.sum().sum()

(59381, 129)


(0, 26897356.818315115)
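
As a rough illustration of what the `Product_Info_2` steps in `data_cleaning` do, the sketch below applies the same split and `pd.factorize` calls to a tiny frame. The toy values are made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({'Product_Info_2': ['D3', 'A1', 'D3', 'E1', 'A1']})

# Split the two-character code into its letter and digit parts.
toy['Product_Info_2_char'] = toy['Product_Info_2'].str[0]
toy['Product_Info_2_num'] = toy['Product_Info_2'].str[1]

# pd.factorize returns (codes, uniques); codes follow order of first appearance.
for col in ['Product_Info_2', 'Product_Info_2_char', 'Product_Info_2_num']:
    toy[col] = pd.factorize(toy[col])[0]

print(toy)
```

The integer codes carry no ordering information; they only depend on which value appears first in the column.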

# Train Test Split with Stratify

In [3]:
from sklearn.model_selection import train_test_split

target = 'Response'
df_Xtrain, df_Xtest, ser_ytrain, ser_ytest = train_test_split(
    df.drop(target,axis=1), df[target],
    test_size=0.2, random_state=SEED, stratify=df[target])


ytrain = ser_ytrain.to_numpy().ravel()
ytest = ser_ytest.to_numpy().ravel()
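
A quick way to see what `stratify` buys is to compare class shares in the two splits. This is a minimal sketch on a made-up imbalanced target, not on the notebook's data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Imbalanced toy target: 80% class 8, 15% class 2, 5% class 1 (made-up shares).
y = pd.Series([8] * 800 + [2] * 150 + [1] * 50)
X = pd.DataFrame({'feature': np.arange(len(y))})

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=100, stratify=y)

# With stratify=y the class shares should be close to identical in both splits.
print(y_tr.value_counts(normalize=True).sort_index())
print(y_te.value_counts(normalize=True).sort_index())
```

The same check can be run on `ser_ytrain` and `ser_ytest` to confirm that all 8 `Response` classes keep their proportions.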

In [4]:
dtrain = xgb.DMatrix(df_Xtrain, label=ser_ytrain)
dtest = xgb.DMatrix(df_Xtest, label=ser_ytest)

# Evaluation Metric

In [5]:
def eval_wrapper(y, yhat):
    # Cohen's kappa is symmetric: swapping y and yhat gives the same result.
    y = np.array(y).astype(int)
    yhat = np.array(yhat)
    yhat = np.clip(np.round(yhat), np.min(y), np.max(y)).astype(int)   
    return metrics.cohen_kappa_score(y, yhat,weights='quadratic')
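
The rounding and clipping step inside `eval_wrapper` can be checked in isolation. The arrays below are made-up values for illustration.

```python
import numpy as np

y = np.array([1, 2, 5, 8, 8])
yhat = np.array([0.2, 2.4, 5.5, 7.6, 9.3])  # raw regression outputs

# Round to the nearest integer, then clip to the label range seen in y.
rounded = np.round(yhat)
clipped = np.clip(rounded, y.min(), y.max()).astype(int)

print(rounded)  # [0. 2. 6. 8. 9.]
print(clipped)  # [1 2 6 8 8]

# np.round sends exact halves to the nearest even integer.
print(np.round(np.array([2.5, 3.5])))  # [2. 4.]
```

Without the clip, predictions below 1 or above 8 would create labels that do not exist in the target.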

# Modelling with xgboost (regression objective)

In [6]:
import xgboost
from xgboost import XGBClassifier

xgboost.__version__

'1.1.1'

In [7]:
params_dict = {'objective': 'reg:squarederror',
              'eta': 0.05,
              'min_child_weight': 240,
              'subsample': 0.9,
              'colsample_bytree': 0.67,
              'max_depth': 6
}
xgb_num_rounds = 800

In [8]:
%%time
bst = xgb.train(params_dict, dtrain, xgb_num_rounds)

CPU times: user 4min 15s, sys: 167 ms, total: 4min 15s
Wall time: 2min 9s


# Model Evaluation

In [9]:
# get preds
train_preds = bst.predict(dtrain, ntree_limit=bst.best_iteration)
test_preds = bst.predict(dtest, ntree_limit=bst.best_iteration)

print('Train score is:', eval_wrapper(ytrain,train_preds))
print('Test score is :', eval_wrapper(ytest, test_preds))

Train score is: 0.669183415991036
Test score is : 0.6040096397174282


# Find Offsets for Train
- https://www.kaggle.com/c/prudential-life-insurance-assessment/discussion/19003
- https://github.com/zhurak/kaggle-prudential/blob/master/code/predict.py

In [10]:
def quadratic_weighted_kappa(ytrue,ypreds):
    return metrics.cohen_kappa_score(ytrue, ypreds,weights='quadratic')

In [11]:
def digitize_train(train_preds, guess_lst):
    (x1,x2,x3,x4,x5,x6,x7) = list(guess_lst)   
    res = []
    for y in list(train_preds):
        if y < x1:
            res.append(1)
        elif y < x2:
            res.append(2)
        elif y < x3:
            res.append(3)
        elif y < x4:
            res.append(4)
        elif y < x5:
            res.append(5)
        elif y < x6:
            res.append(6)
        elif y < x7:
            res.append(7)
        else: res.append(8)
    return res
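
The Python loop is called once per optimizer step, so it dominates the run time. A vectorized alternative is sketched below. It is an assumption of this sketch, not the notebook's method, and it matches the if-chain only when the cut points are sorted in ascending order. The names `digitize_loop` and `digitize_vectorized` are hypothetical.

```python
import numpy as np

def digitize_loop(preds, cuts):
    """Reference version: the first cut point above the prediction decides the class."""
    res = []
    for y in preds:
        for label, cut in enumerate(cuts, start=1):
            if y < cut:
                res.append(label)
                break
        else:
            # No cut point exceeded y, so it falls in the last class.
            res.append(len(cuts) + 1)
    return res

def digitize_vectorized(preds, cuts):
    """Vectorized version; only valid when cuts are sorted ascending."""
    return (np.searchsorted(cuts, preds, side='right') + 1).tolist()

rng = np.random.default_rng(0)
preds = rng.uniform(0, 9, size=1000)
cuts = [1.5, 2.9, 3.1, 4.5, 5.5, 6.1, 7.1]  # the notebook's initial guess

assert digitize_loop(preds, cuts) == digitize_vectorized(preds, cuts)
```

An optimizer may propose cut points that are out of order. The if-chain still returns a result in that case, while the vectorized version would need the cut points sorted first.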

In [12]:
def get_offsets_minimizing_train_preds_kappa(guess_lst):
    res = digitize_train(train_preds, guess_lst)
    return -quadratic_weighted_kappa(ytrain, res)  

In [13]:
%%time
"""
We already have the train predictions. Comparing them with the original
train labels gives some kappa value. We want to re-map the train predictions
to classes so that the kappa against the original train labels improves.

For that we use the scipy function "fmin_powell". It needs an initial guess
for the seven cut points between the 8 classes (1-8). Plain rounding puts
the cut points at (1.5, 2.5, ..., 7.5), so we can start near those values,
then use the result as the next initial guess and run the function again.

fmin_powell is a costly function here. It takes about 7 minutes to run.

""";
x0 = (1.5,2.9,3.1,4.5,5.5,6.1,7.1)    # initial guess 

# Run the optimizer (slow). To skip it, comment out this call and
# uncomment the cached result below.
offsets = fmin_powell(get_offsets_minimizing_train_preds_kappa, x0, disp=True)

# offsets = [3.11768886, 3.5742616, 4.34722233, 4.91914813,
#            5.5290772,  6.16230137, 6.82661745]
print(offsets)

Optimization terminated successfully.
         Current function value: -0.720997
         Iterations: 4
         Function evaluations: 779
[3.11768886 3.5742616  4.34722233 4.91914813 5.5290772  6.16230137
 6.82661745]
CPU times: user 6min 30s, sys: 666 ms, total: 6min 30s
Wall time: 6min 31s
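
The same idea can be tried in a few seconds on synthetic data. The sketch below is an illustration under stated assumptions: the data is randomly generated, the helper names are hypothetical, the cut points are sorted inside the objective, and `scipy.optimize.minimize(method='Powell')` is used as the newer interface to the same algorithm as `fmin_powell`.

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(0)

# Synthetic stand-in for the notebook's data: true classes 1..8 and
# regression-style predictions that are noisy and shrunk toward the middle.
y_true = rng.integers(1, 9, size=1000)
preds = 4.5 + 0.6 * (y_true - 4.5) + rng.normal(0, 0.8, size=1000)

def to_class(preds, cuts):
    # Sort so that out-of-order proposals from the optimizer stay valid.
    return np.searchsorted(np.sort(cuts), preds, side='right') + 1

def neg_kappa(cuts):
    # The optimizer minimizes, so return the negative kappa.
    return -cohen_kappa_score(y_true, to_class(preds, cuts), weights='quadratic')

x0 = np.arange(1.5, 8.0, 1.0)  # plain rounding: 1.5, 2.5, ..., 7.5
result = minimize(neg_kappa, x0, method='Powell')

baseline_kappa = -neg_kappa(x0)
tuned_kappa = -result.fun
print('kappa with rounding cuts :', baseline_kappa)
print('kappa with tuned cuts    :', tuned_kappa)
print('tuned cuts               :', np.sort(result.x))
```

The cut points here are tuned and scored on the same data, as in the train step above, so any gain should be confirmed on held-out predictions.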


# Apply Offsets to Test Data

In [14]:
def apply_offsets_to_test(test_preds, offsets):
    (x1,x2,x3,x4,x5,x6,x7) = offsets  
    res = []
    for y in list(test_preds):
        if y < x1:
            res.append(1)
        elif y < x2:
            res.append(2)
        elif y < x3:
            res.append(3)
        elif y < x4:
            res.append(4)
        elif y < x5:
            res.append(5)
        elif y < x6:
            res.append(6)
        elif y < x7:
            res.append(7)
        else: res.append(8)
    return res

final_test_preds = apply_offsets_to_test(test_preds, offsets)

# Model evaluation after applying offset

In [15]:
kappa = quadratic_weighted_kappa(ytest,final_test_preds)

In [16]:
print('Test score using offset is :', kappa)
# Test score using offset is : 0.6510844397754416

Test score using offset is : 0.6510844397754416


# Time taken

In [17]:
time_taken = time.time() - time_start_notebook
h,m = divmod(time_taken,60*60)
print('Time taken to run whole notebook: {:.0f} hr '\
      '{:.0f} min {:.0f} secs'.format(h, *divmod(m,60)))

Time taken to run whole notebook: 0 hr 8 min 46 secs
