<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Description" data-toc-modified-id="Description-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Description</a></span></li><li><span><a href="#Imports" data-toc-modified-id="Imports-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Imports</a></span></li><li><span><a href="#Load-the-data" data-toc-modified-id="Load-the-data-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Load the data</a></span></li><li><span><a href="#Data-Preparation" data-toc-modified-id="Data-Preparation-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Data Preparation</a></span></li><li><span><a href="#Train-Validation-Split" data-toc-modified-id="Train-Validation-Split-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Train-Validation Split</a></span></li><li><span><a href="#Scale-the-data" data-toc-modified-id="Scale-the-data-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Scale the data</a></span></li><li><span><a href="#Modelling" data-toc-modified-id="Modelling-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Modelling</a></span><ul class="toc-item"><li><span><a href="#Wrapper-Functions" data-toc-modified-id="Wrapper-Functions-7.1"><span class="toc-item-num">7.1&nbsp;&nbsp;</span>Wrapper Functions</a></span></li><li><span><a href="#Modelling:-Random-Forest" data-toc-modified-id="Modelling:-Random-Forest-7.2"><span class="toc-item-num">7.2&nbsp;&nbsp;</span>Modelling: Random Forest</a></span></li><li><span><a href="#Modelling:-XGBoost" data-toc-modified-id="Modelling:-XGBoost-7.3"><span class="toc-item-num">7.3&nbsp;&nbsp;</span>Modelling: XGBoost</a></span></li><li><span><a href="#Modelling:-Extra-Trees-Regressor" data-toc-modified-id="Modelling:-Extra-Trees-Regressor-7.4"><span class="toc-item-num">7.4&nbsp;&nbsp;</span>Modelling: Extra Trees Regressor</a></span></li><li><span><a href="#Stacking-multiple-models" data-toc-modified-id="Stacking-multiple-models-7.5"><span class="toc-item-num">7.5&nbsp;&nbsp;</span>Stacking 
multiple models</a></span></li><li><span><a href="#Model-Estimation-on-Test-Set" data-toc-modified-id="Model-Estimation-on-Test-Set-7.6"><span class="toc-item-num">7.6&nbsp;&nbsp;</span>Model Estimation on Test Set</a></span></li></ul></li><li><span><a href="#Time-Taken" data-toc-modified-id="Time-Taken-8"><span class="toc-item-num">8&nbsp;&nbsp;</span>Time Taken</a></span></li></ul></div>

# Description
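[The Allstate Corporation](https://www.allstate.com/) is an insurance company based in the United States; it also has personal lines insurance operations in Canada. This notebook uses the Kaggle Allstate Claims Severity data to predict `loss` by stacking Random Forest, XGBoost, and Extra Trees regressors.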

**References**
- https://www.kaggle.com/c/allstate-claims-severity/data
- https://www.kaggle.com/sharmasanthosh/exploratory-study-on-ml-algorithms
- https://www.kaggle.com/mmueller/stacking-starter

# Imports

In [1]:
import time

time_start_notebook = time.time()

In [2]:
import numpy as np
import pandas as pd
import seaborn as sns

from tqdm import tqdm_notebook as tqdm
import matplotlib.pyplot as plt

%matplotlib inline
%config InlineBackend.figure_format = 'retina'
plt.style.use('ggplot') 

# random state
SEED=100

[(x.__name__,x.__version__) for x in [np,pd,sns]]

[('numpy', '1.18.4'), ('pandas', '1.1.0'), ('seaborn', '0.10.1')]

In [3]:
from scipy.special import boxcox1p

In [4]:
import scipy
import sklearn

print([(x.__name__,x.__version__) for x in [scipy, sklearn]])

[('scipy', '1.4.1'), ('sklearn', '0.23.1')]


In [5]:
# scale and split
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

In [6]:
# regressors
from sklearn import linear_model
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.linear_model import ElasticNet
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.ensemble import GradientBoostingRegressor

In [7]:
# pipeline
from sklearn.pipeline import Pipeline

In [8]:
# metrics
from sklearn import metrics
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error

In [9]:
# cross validation
from sklearn.model_selection import KFold, StratifiedKFold
from sklearn.model_selection import cross_val_score

In [10]:
# xgboost (used as a regressor here)
import xgboost as xgb

In [11]:
# six and pickle
import six
import pickle
import joblib

In [12]:
%load_ext autoreload

In [13]:
%autoreload 2

In [14]:
# my personal library
from bhishan import bp

# Load the data

In [15]:
path_pro = '../data/processed'

df_train = pd.read_csv(f'{path_pro}/train_cleaned_encoded.csv.zip',compression='zip')

df_test = pd.read_csv(f'{path_pro}/test_cleaned_encoded.csv.zip',compression='zip')

print(df_train.shape)
pd.concat([df_train.head(2), df_train.tail(2)])

(188318, 816)


Unnamed: 0,id,cat1,cat2,cat3,cat4,cat5,cat6,cat7,cat8,cat9,...,dummy_cat116_KC,dummy_cat116_KW,dummy_cat116_LB,dummy_cat116_LF,dummy_cat116_LM,dummy_cat116_LN,dummy_cat116_LO,dummy_cat116_LY,dummy_cat116_MD,dummy_cat116_Others
0,1,A,B,A,B,A,A,A,A,B,...,0,0,1,0,0,0,0,0,0,0
1,2,A,B,A,A,A,A,A,A,B,...,0,0,0,0,0,0,0,0,0,0
188316,587632,A,B,A,A,A,A,A,A,B,...,0,0,0,0,0,0,0,0,1,0
188317,587633,B,A,A,B,A,A,A,A,A,...,0,0,0,0,0,0,0,0,0,1


In [16]:
df_train.shape, df_test.shape

((188318, 816), (125546, 811))

In [17]:
features_only_train = df_train.columns.difference(df_test.columns).tolist()

print(features_only_train)

['dummy_cat101_N', 'dummy_cat101_U', 'dummy_cat102_H', 'dummy_cat102_J', 'dummy_cat105_R', 'dummy_cat105_S', 'dummy_cat111_D', 'dummy_cat114_X', 'dummy_cat89_I', 'dummy_cat90_G', 'dummy_cat92_F', 'loss', 'loss_log1p']


In [18]:
features_only_test = df_test.columns.difference(df_train.columns).tolist()

print(features_only_test)

['dummy_cat103_M', 'dummy_cat106_Q', 'dummy_cat111_L', 'dummy_cat89_F', 'dummy_cat92_E', 'dummy_cat92_G', 'dummy_cat96_H', 'dummy_cat99_U']


In [19]:
features_only_one = features_only_train + features_only_test

print(features_only_one)

['dummy_cat101_N', 'dummy_cat101_U', 'dummy_cat102_H', 'dummy_cat102_J', 'dummy_cat105_R', 'dummy_cat105_S', 'dummy_cat111_D', 'dummy_cat114_X', 'dummy_cat89_I', 'dummy_cat90_G', 'dummy_cat92_F', 'loss', 'loss_log1p', 'dummy_cat103_M', 'dummy_cat106_Q', 'dummy_cat111_L', 'dummy_cat89_F', 'dummy_cat92_E', 'dummy_cat92_G', 'dummy_cat96_H', 'dummy_cat99_U']
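
The mismatch above arises because train and test were one-hot encoded separately, so some category levels appear in only one file. This notebook later keeps only the columns shared by both frames. As a minimal sketch on toy frames (the frames and column names below are hypothetical), the shared-column approach can be compared with reindexing the test frame to the training features:

```python
import pandas as pd

# Toy frames standing in for separately encoded train/test data (hypothetical).
toy_train = pd.DataFrame({'dummy_c_A': [1, 0], 'dummy_c_B': [0, 1], 'loss': [10.0, 20.0]})
toy_test = pd.DataFrame({'dummy_c_A': [1, 0], 'dummy_c_C': [0, 1]})

# Option 1 (the approach taken in this notebook): keep columns present in both frames.
shared = toy_train.columns.intersection(toy_test.columns).tolist()

# Option 2: give test the training feature columns, filling unseen dummies with 0.
train_features = [c for c in toy_train.columns if c != 'loss']
toy_test_aligned = toy_test.reindex(columns=train_features, fill_value=0)

print(shared)
print(toy_test_aligned.columns.tolist())
```

Option 2 keeps levels that were seen only in training, at the cost of all-zero columns in test; which is preferable depends on how rare those levels are.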


# Data Preparation

In [20]:
"""
continuous features: xxx_boxcox1p
categorical features: dummy_xxx

target: loss (models are fit on log(loss) inside the wrappers)
""";

In [21]:
features_raw = df_train.columns.tolist()
target = 'loss'
target_log = 'loss_log1p'

# orig
features_orig_cont = [i for i in features_raw 
                    if i.startswith('cont')
                    if not i.endswith('_boxcox1p')
                     ]
                      
features_orig_cat = [i for i in features_raw 
                    if i.startswith('cat') ]                     
features_orig = features_orig_cat + features_orig_cont

# processed
features_cont = [i for i in features_raw if i.endswith('_boxcox1p') ]

features_cat = [i for i in features_raw if i.startswith('dummy_') ]

features = features_cat + features_cont
features = [i for i in features if i not in features_only_one]


# print
 
# print(features_orig_cont)
# print(features_orig_cat)
# print(features_cont)
# print(features_cat)

# Train-Validation Split

In [22]:
df_Xtr, df_Xval, ser_ytr, ser_yval = train_test_split(
    df_train[features],df_train[target],
    train_size=0.8,random_state=SEED)

Xtr = df_Xtr.to_numpy()
ytr = ser_ytr.to_numpy().ravel()
yval = ser_yval.to_numpy().ravel()

print(df_Xtr.shape, ser_ytr.shape,ytr.shape)
df_Xtr.head(2)

(150654, 672) (150654,) (150654,)


Unnamed: 0,dummy_cat1_A,dummy_cat1_B,dummy_cat2_A,dummy_cat2_B,dummy_cat3_A,dummy_cat3_B,dummy_cat4_A,dummy_cat4_B,dummy_cat5_A,dummy_cat5_B,...,cont5_boxcox1p,cont6_boxcox1p,cont7_boxcox1p,cont8_boxcox1p,cont9_boxcox1p,cont10_boxcox1p,cont11_boxcox1p,cont12_boxcox1p,cont13_boxcox1p,cont14_boxcox1p
180302,1,0,0,1,0,1,1,0,1,0,...,0.273358,0.170764,0.278357,0.269171,0.288533,0.197228,0.169389,0.167836,0.238885,0.650623
105000,1,0,0,1,1,0,0,1,1,0,...,0.361725,0.350364,0.320341,0.499342,0.378155,0.393653,0.310744,0.307905,0.327325,0.257707


In [23]:
df_Xtx = df_test[features]
Xtx = df_Xtx.to_numpy()

ser_ytx = None
ytx = None

# Kaggle does not provide test labels; predictions must be submitted for scoring.
print(df_Xtx.shape)
df_Xtx.head(2)

(125546, 672)


Unnamed: 0,dummy_cat1_A,dummy_cat1_B,dummy_cat2_A,dummy_cat2_B,dummy_cat3_A,dummy_cat3_B,dummy_cat4_A,dummy_cat4_B,dummy_cat5_A,dummy_cat5_B,...,cont5_boxcox1p,cont6_boxcox1p,cont7_boxcox1p,cont8_boxcox1p,cont9_boxcox1p,cont10_boxcox1p,cont11_boxcox1p,cont12_boxcox1p,cont13_boxcox1p,cont14_boxcox1p
0,1,0,0,1,1,0,1,0,1,0,...,0.255586,0.40187,0.285609,0.507345,0.30657,0.335532,0.333617,0.327418,0.570153,0.345239
1,1,0,0,1,1,0,0,1,1,0,...,0.656444,0.413704,0.384638,0.576341,0.440606,0.501547,0.560053,0.551063,0.39199,0.19354


# Scale the data

In [24]:
scaler = StandardScaler()

Xtr_scaled = scaler.fit_transform(df_Xtr)
Xval_scaled = scaler.transform(df_Xval)  # reuse training mean/std; do not refit

print(type(Xtr_scaled))

df_Xtr_scaled = pd.DataFrame(Xtr_scaled,columns=df_Xtr.columns)

df_Xval_scaled = pd.DataFrame(Xval_scaled,columns=df_Xval.columns)

df_Xtr_scaled.head(2)

<class 'numpy.ndarray'>


Unnamed: 0,dummy_cat1_A,dummy_cat1_B,dummy_cat2_A,dummy_cat2_B,dummy_cat3_A,dummy_cat3_B,dummy_cat4_A,dummy_cat4_B,dummy_cat5_A,dummy_cat5_B,...,cont5_boxcox1p,cont6_boxcox1p,cont7_boxcox1p,cont8_boxcox1p,cont9_boxcox1p,cont10_boxcox1p,cont11_boxcox1p,cont12_boxcox1p,cont13_boxcox1p,cont14_boxcox1p
0,0.574198,-0.574198,-1.141867,1.141867,-4.159984,4.159984,0.682029,-0.682029,0.722126,-0.722126,...,-0.902536,-1.611161,-1.020371,-0.971494,-0.936657,-1.631879,-1.584026,-1.595692,-1.124155,1.446586
1,0.574198,-0.574198,-1.141867,1.141867,0.240386,-0.240386,-1.466213,1.466213,0.722126,-0.722126,...,-0.317716,-0.413701,-0.69465,0.622825,-0.24672,-0.18811,-0.668667,-0.686765,-0.555165,-0.958409
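
Scaling statistics are normally learned from the training split only and then reused for validation and test data, so that all splits share one scale. A minimal sketch on toy arrays (the values are made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical values), one feature.
X_train_toy = np.array([[0.0], [2.0], [4.0]])
X_val_toy = np.array([[2.0], [6.0]])

toy_scaler = StandardScaler()
X_train_toy_scaled = toy_scaler.fit_transform(X_train_toy)  # learns mean and std from train
X_val_toy_scaled = toy_scaler.transform(X_val_toy)          # reuses the training statistics

print(toy_scaler.mean_)
print(X_val_toy_scaled.ravel())
```

In the cells shown in this notebook, the tree-based models are trained on the unscaled `Xtr`, so the scaled frames matter only for models that are sensitive to feature scale.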


# Modelling

## Wrapper Functions

In [25]:
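class SklearnWrapper(object):
    def __init__(self, clf, seed=SEED, params=None):
        # copy so the caller's dict is not mutated; tolerate params=None
        params = dict(params or {})
        params['random_state'] = seed
        self.clf = clf(**params)

    def train(self, Xtrain, ytrain):
        # fit on log(loss); requires ytrain > 0
        self.clf.fit(Xtrain, np.log(ytrain))

    def predict(self, x):
        # map predictions back to the original loss scale
        return np.exp(self.clf.predict(x))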

In [26]:
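class XgbWrapper(object):
    def __init__(self, seed=SEED, params=None):
        # copy so the caller's dict is not mutated; tolerate params=None
        self.param = dict(params or {})
        self.param['seed'] = seed
        # 'nrounds' is not a booster parameter, so remove it before training
        self.nrounds = self.param.pop('nrounds', 250)

    def train(self, Xtrain, ytrain):
        # fit on log(loss); requires ytrain > 0
        dtrain = xgb.DMatrix(Xtrain, label=np.log(ytrain))
        self.gbdt = xgb.train(self.param, dtrain, self.nrounds)

    def predict(self, x):
        # map predictions back to the original loss scale
        return np.exp(self.gbdt.predict(xgb.DMatrix(x)))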
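
Both wrappers fit on `log(y)` and exponentiate the predictions, so errors are reported on the original loss scale. As a rough sketch of the same idea, scikit-learn's `TransformedTargetRegressor` can apply the transform and its inverse automatically. This is a swapped-in alternative rather than what this notebook uses, and the data below is synthetic:

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

# Synthetic data where log(y) is exactly linear in x (illustrative only).
rng = np.random.RandomState(0)
X_toy = rng.uniform(0, 1, size=(50, 1))
y_toy = np.exp(1.0 + 2.0 * X_toy[:, 0])  # strictly positive, like a loss amount

# Fit on log(y); predictions are mapped back with exp automatically.
log_model = TransformedTargetRegressor(
    regressor=LinearRegression(), func=np.log, inverse_func=np.exp)
log_model.fit(X_toy, y_toy)

y_pred_toy = log_model.predict(X_toy)
print(np.abs(y_pred_toy - y_toy).max())
```

`np.log` needs strictly positive targets; `np.log1p` with `np.expm1` is the usual pairing when zeros can occur.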

In [27]:
def get_oof(clf,Xtr, ytr,
            Xtx,ytx,
            NFOLDS):
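    """Out-of-fold (OOF) estimation.
    
    Parameters:
    ------------
    clf: Wrapper object with train and predict methods.
    Xtr: Numpy array of training features
    ytr: Numpy array of training targets
    Xtx: Numpy array of test features
    ytx: Test targets (unused; None for Kaggle test data)
    NFOLDS: Number of folds

    Returns:
    ------------
    oof_train: OOF predictions for Xtr, shape (ntrain, 1)
    oof_test: Fold-averaged predictions for Xtx, shape (ntest, 1)
    """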
    ntrain = Xtr.shape[0]
    ntest = Xtx.shape[0]

    oof_train = np.zeros((ntrain,))
    oof_test = np.zeros((ntest,))

    oof_test_kf = np.empty((NFOLDS, ntest))

    kf = KFold(n_splits=NFOLDS,
                          shuffle=True,
                          random_state=SEED)

    for i, (idx_tr, idx_tx) in enumerate(kf.split(Xtr,ytr)):
        xtr_now = Xtr[idx_tr]
        ytr_now = ytr[idx_tr]
        xtx_now = Xtr[idx_tx]

        clf.train(xtr_now, ytr_now)

        oof_train[idx_tx] = clf.predict(xtx_now)
        oof_test_kf[i, :] = clf.predict(Xtx)

    oof_test[:] = oof_test_kf.mean(axis=0)

    return oof_train.reshape(-1, 1), oof_test.reshape(-1, 1)
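
The out-of-fold idea can be seen on a small scale: each training row is predicted by the one fold model that did not see it, while test rows are predicted by every fold model and averaged. The sketch below is a compact standalone variant on synthetic data, using `LinearRegression` as a stand-in for the wrappers (all names in it are illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

# Synthetic, noise-free regression data (illustrative only).
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(20, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5])
X_new = rng.normal(size=(5, 3))  # stands in for the test set

n_folds = 4
oof_pred = np.zeros(len(X_demo))
new_pred_per_fold = np.empty((n_folds, len(X_new)))

kf_demo = KFold(n_splits=n_folds, shuffle=True, random_state=0)
for fold, (idx_fit, idx_holdout) in enumerate(kf_demo.split(X_demo)):
    model = LinearRegression().fit(X_demo[idx_fit], y_demo[idx_fit])
    # Each training row is predicted by the one model that never saw it.
    oof_pred[idx_holdout] = model.predict(X_demo[idx_holdout])
    # Test rows are predicted by every fold model.
    new_pred_per_fold[fold] = model.predict(X_new)

new_pred = new_pred_per_fold.mean(axis=0)  # average over folds
print(oof_pred.shape, new_pred.shape)
```

Because no row's prediction comes from a model trained on that row, the OOF column can be used as a feature for a second-level model without leaking the target.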

## Modelling: Random Forest

In [28]:
rf_params = {
    'n_jobs': -1,
    'n_estimators': 100,
    'max_features': 0.2,
    'max_depth': 8,
    'min_samples_leaf': 2,
}

In [29]:
rf = SklearnWrapper(clf=RandomForestRegressor,
                    seed=SEED,
                    params=rf_params)

In [30]:
%%time

NFOLDS = 2

rf = SklearnWrapper(clf=RandomForestRegressor,
                    seed=SEED,
                    params=rf_params)

rf_oof_train, rf_oof_test = get_oof(rf,Xtr,ytr,
                                    Xtx,ytx,NFOLDS)

print("RF-CV: {}".format(mean_absolute_error(ytr, rf_oof_train)))

RF-CV: 1292.8756371081363
CPU times: user 4min 49s, sys: 2.93 s, total: 4min 52s
Wall time: 1min 37s


## Modelling: XGBoost

In [31]:
xgb_params = {
    'seed': 0,  # overridden by XgbWrapper(seed=SEED)
    'colsample_bytree': 0.7,
    'verbosity': 0,  # replaces the deprecated 'silent' flag
    'subsample': 0.7,
    'learning_rate': 0.075,
    'objective': 'reg:squarederror',
    'max_depth': 7,
    'num_parallel_tree': 1,
    'min_child_weight': 1,
    'eval_metric': 'mae',
    'nrounds': 350
}

In [32]:
%%time

xg = XgbWrapper(seed=SEED,
                params=xgb_params)

xg_oof_train, xg_oof_test = get_oof(xg,Xtr,ytr,
                                    Xtx,ytx,NFOLDS)

print("XG-CV: {}".format(mean_absolute_error(ytr,xg_oof_train)))

XG-CV: 1156.7888456671442
CPU times: user 40min 30s, sys: 9.29 s, total: 40min 39s
Wall time: 12min 25s


## Modelling: Extra Trees Regressor

In [33]:
et_params = {
    'n_jobs': -1,
    'n_estimators': 100,
    'max_features': 0.5,
    'max_depth': 12,
    'min_samples_leaf': 2,
}

In [34]:
%%time

et = SklearnWrapper(clf=ExtraTreesRegressor,
                    seed=SEED,
                    params=et_params)

et_oof_train, et_oof_test = get_oof(et,Xtr,ytr,
                                    Xtx,ytx,NFOLDS)

print("ET-CV: {}".format(mean_absolute_error(ytr,et_oof_train)))

ET-CV: 1243.4451960986753
CPU times: user 23min 42s, sys: 4.83 s, total: 23min 47s
Wall time: 6min 32s


## Stacking multiple models

In [35]:
def xg_eval_mae(yhat, dtrain):
    y = dtrain.get_label()
    return 'mae', mean_absolute_error(np.exp(y), np.exp(yhat))

In [36]:
%%time

Xtr_stacked = np.c_[xg_oof_train,
                    et_oof_train,
                    rf_oof_train]
Xtx_stacked = np.c_[xg_oof_test,
                    et_oof_test,
                    rf_oof_test]

dtrain = xgb.DMatrix(Xtr_stacked, label=np.log(ytr))
dtest = xgb.DMatrix(Xtx_stacked)  # test matrix must use the stacked features

xgb_params = {
    'seed': SEED,
    'colsample_bytree': 0.8,
    'subsample': 0.6,
    'learning_rate': 0.01,
    'objective': 'reg:squarederror',  # current name for the deprecated 'reg:linear'
    'max_depth': 4,
    'num_parallel_tree': 1,
    'min_child_weight': 1,
    'eval_metric': 'mae',
}

res = xgb.cv(xgb_params,
             dtrain,
             num_boost_round=500,
             nfold=4,
             seed=SEED,
             stratified=False,
             early_stopping_rounds=25,
             verbose_eval=100,
             show_stdv=True,
             feval=xg_eval_mae,
             maximize=False)

best_nrounds = res.shape[0] - 1
cv_mean = res.iloc[-1, 0]
cv_std = res.iloc[-1, 1]

print('Stacked-CV: {0}+{1}'.format(cv_mean, cv_std))

Parameters: { silent } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.


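
The stacking step treats each base model's out-of-fold predictions as one feature column for a second-level model. A minimal sketch with a linear meta-learner on synthetic predictions follows; the notebook uses XGBoost at this level, and every array here is made up for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
y_true = rng.uniform(100, 1000, size=200)

# Pretend out-of-fold predictions from three base models with different noise levels.
pred_a = y_true + rng.normal(0, 50, size=200)
pred_b = y_true + rng.normal(0, 80, size=200)
pred_c = y_true + rng.normal(0, 120, size=200)

# One column per base model, as with np.c_[xg_oof_train, et_oof_train, rf_oof_train].
X_meta = np.c_[pred_a, pred_b, pred_c]

# Second-level model fitted on the stacked predictions.
meta = LinearRegression().fit(X_meta, y_true)
blend = meta.predict(X_meta)

mae_single = np.abs(pred_a - y_true).mean()
mae_blend = np.abs(blend - y_true).mean()
print(X_meta.shape, mae_single, mae_blend)
```

The errors printed here are in-sample and only illustrate the mechanics; an honest estimate of the stacked model needs its own cross-validation, which is the role `xgb.cv` plays in the cell above.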

# Time Taken

In [39]:
time_taken = time.time() - time_start_notebook
h,m = divmod(time_taken,60*60)
print('Time taken to run whole notebook: {:.0f} hr '\
      '{:.0f} min {:.0f} secs'.format(h, *divmod(m,60)))

Time taken to run whole notebook: 0 hr 31 min 32 secs
