The purpose of this notebook is to optimize hyperparameters for the XGBoost model. The revamped feature engineering process produced data with the following features:

Null count feature for each customer and column, scaled with StandardScaler to return values similar to continuous features.  
Dummy variable columns for each categorical feature, including one for Null.  
Continuous features capped at -3 and 3 to reduce the impact of outliers.  
Aggregate features calculated from the continuous columns: minimum, maximum, median, standard deviation, last, and change over course of account.
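The capping and aggregation steps listed above could look roughly like the sketch below. It is not the code from the feature engineering notebook: the toy values are made up, the rows are assumed to be in statement order for each customer, and "change" is assumed to mean the last value minus the first.

```python
import pandas as pd

# Toy statement-level data; 'customer_ID' and 'P_2' mirror names used in this
# notebook, but the values are invented for illustration.
raw = pd.DataFrame({
    'customer_ID': ['a', 'a', 'a', 'b', 'b'],
    'P_2': [0.5, 4.2, 1.0, -5.0, 2.0],
})

# Cap the (assumed already standardized) column at -3 and 3.
raw['P_2'] = raw['P_2'].clip(lower=-3, upper=3)

# One row per customer with the aggregate features.
grouped = raw.groupby('customer_ID')['P_2']
features = grouped.agg(['min', 'max', 'median', 'std', 'last'])

# Assumption: "change over course of account" is last value minus first value.
features['change'] = grouped.last() - grouped.first()

# Name the columns the way they appear later in this notebook (P_2_min, ...).
features.columns = [f'P_2_{name}' for name in features.columns]
print(features)
```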

I achieved the best amex metric score of 0.77 on the validation set with this data. My model appeared to be overfit to the training sample, which scored 0.87. I have investigated useful hyperparameters to adjust for overfitting and I'm left with the following list:

colsample_bytree  
subsample  
max_depth  
gamma  
eta  
min_child_weight  
scale_pos_weight

I will investigate different values for these hyperparameters and combinations of these changes to optimize my model for the dataset.

Starting hyperparameters from the best model:

params = {  
    'verbosity': 1,  
    'max_depth': 4,  
    'objective': 'binary:logistic',  
    'eta': 0.15,  
    'random_state': seed,  
    'colsample_bytree': 0.8,  
    'colsample_bylevel': 0.8  
}
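The combinations to try can be generated from the starting dictionary instead of being retyped for every run. This is a minimal sketch, not what the notebook does below (it changes one value at a time by hand). The names `base_params`, `candidates`, and `param_grid` are introduced here for illustration, the candidate values are hypothetical, and the seed of 42 is taken from a later cell.

```python
from itertools import product

seed = 42  # assumed here; the notebook defines it in a later cell

base_params = {
    'verbosity': 1,
    'max_depth': 4,
    'objective': 'binary:logistic',
    'eta': 0.15,
    'random_state': seed,
    'colsample_bytree': 0.8,
    'colsample_bylevel': 0.8,
}

# Hypothetical candidate values, chosen only for illustration.
candidates = {
    'colsample_bytree': [0.4, 0.8],
    'subsample': [0.4, 0.8, 1.0],
    'max_depth': [3, 4],
}

def param_grid(base, candidates):
    """Yield one parameter dict per combination of candidate values."""
    names = list(candidates)
    for values in product(*(candidates[name] for name in names)):
        # Overrides are applied to a copy, so the base dict is left untouched.
        yield {**base, **dict(zip(names, values))}

grid = list(param_grid(base_params, candidates))
print(len(grid))  # 2 * 3 * 2 combinations
```

Each dictionary in `grid` can be passed to `xgb.train` in place of a hand-written `paramsN`. A full grid trains many more models than one-at-a-time tuning, so it is only practical for a few values per hyperparameter.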

In [1]:
%autosave 0

Autosave disabled


In [7]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import xgboost as xgb
import myfunctions as mf

Read in the processed data from the boosting3 notebook.

In [3]:
train = pd.read_csv('train_final.csv')
valid = pd.read_csv('valid_final.csv')

train.shape, valid.shape

((39007, 1272), (214152, 1272))

Read in the targets.

In [4]:
base_url = '../../data/prepared/'

train_labels = pd.read_csv(base_url + 'train_labels.csv')
valid_labels = pd.read_csv(base_url + 'val_labels.csv')

train_labels.shape, valid_labels.shape

((39007, 2), (214152, 2))

Match up the features with the labels.

In [5]:
train = train.merge(train_labels, how='left', on='customer_ID')
valid = valid.merge(valid_labels, how='left', on='customer_ID')
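A left merge silently leaves missing targets when a customer is absent from the label file, and it duplicates rows when a customer_ID appears more than once. The sketch below shows one way to guard against both, using toy tables in place of the real data.

```python
import pandas as pd

# Toy stand-ins for the feature and label tables (values are made up).
features = pd.DataFrame({'customer_ID': ['a', 'b', 'c'],
                         'P_2_last': [0.1, 0.4, 0.7]})
labels = pd.DataFrame({'customer_ID': ['c', 'a', 'b'],
                       'target': [1, 0, 0]})

# validate='one_to_one' raises a MergeError if a customer_ID is duplicated
# on either side of the merge.
merged = features.merge(labels, how='left', on='customer_ID',
                        validate='one_to_one')

# Count customers that did not find a label.
missing = int(merged['target'].isna().sum())
print(merged.shape, missing)
```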

Create the DMatrices.

In [8]:
train_matrix = xgb.DMatrix(train.drop(columns=['customer_ID', 'target']), label=train['target'])
valid_matrix = xgb.DMatrix(valid.drop(columns=['customer_ID', 'target']), label=valid['target'])

My first pass will try lowering the colsample_bytree to pull from fewer features every time a new tree is created in an attempt to increase my model's ability to generalize to new data.

In [9]:
steps = 1000
seed = 42

params1 = {
    'verbosity': 1,
    'max_depth': 4,
    'objective': 'binary:logistic',
    'eta': 0.15,
    'random_state': seed,
    'colsample_bytree': 0.4,
    'colsample_bylevel': 0.8
}

In [10]:
model1 = xgb.train(params1, train_matrix, steps, early_stopping_rounds=3,
                   evals=[(train_matrix, 'Train'), (valid_matrix, 'Valid')])

[0]	Train-logloss:0.61053	Valid-logloss:0.61163
[1]	Train-logloss:0.54265	Valid-logloss:0.54423
[2]	Train-logloss:0.49193	Valid-logloss:0.49434
[3]	Train-logloss:0.45108	Valid-logloss:0.45391
[4]	Train-logloss:0.41864	Valid-logloss:0.42197
[5]	Train-logloss:0.39005	Valid-logloss:0.39399
[6]	Train-logloss:0.36667	Valid-logloss:0.37108
[7]	Train-logloss:0.34801	Valid-logloss:0.35308
[8]	Train-logloss:0.33170	Valid-logloss:0.33718
[9]	Train-logloss:0.31815	Valid-logloss:0.32411
[10]	Train-logloss:0.30680	Valid-logloss:0.31320
[11]	Train-logloss:0.29691	Valid-logloss:0.30391
[12]	Train-logloss:0.28853	Valid-logloss:0.29599
[13]	Train-logloss:0.28119	Valid-logloss:0.28911
[14]	Train-logloss:0.27442	Valid-logloss:0.28296
[15]	Train-logloss:0.26837	Valid-logloss:0.27751
[16]	Train-logloss:0.26337	Valid-logloss:0.27306
[17]	Train-logloss:0.25887	Valid-logloss:0.26899
[18]	Train-logloss:0.25492	Valid-logloss:0.26556
[19]	Train-logloss:0.25128	Valid-logloss:0.26236
[20]	Train-logloss:0.24810	Val

In [11]:
result1 = mf.model_evaluator(model1, valid_matrix, valid['target'])
result1

0.7710386427762723

Roughly the same as my best result. No drop in performance on the out-of-set data and the difference between my train loss/validation loss is less severe. I will keep this new value for colsample_bytree for future tuning.
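The `mf.model_evaluator` helper is not shown in this notebook. As I understand the competition's published metric, the amex score is the mean of a normalized weighted Gini coefficient and the share of defaults captured in the top 4% of predictions, with each negative case weighted 20. The pure-Python sketch below follows that understanding; treat it as an assumption about what the helper computes, not as its actual code.

```python
def _sorted_targets(y_true, scores):
    """Return the targets ordered by score, highest score first."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    return [y_true[i] for i in order]

def _weights(targets):
    # Assumption: negatives are weighted 20 to undo the competition's downsampling.
    return [20.0 if t == 0 else 1.0 for t in targets]

def _weighted_gini(targets):
    w = _weights(targets)
    total_w = sum(w)
    total_pos = sum(t * wi for t, wi in zip(targets, w))
    cum_w = cum_pos = gini = 0.0
    for t, wi in zip(targets, w):
        cum_w += wi
        cum_pos += t * wi
        # Gap between the Lorenz curve and the random baseline at this row.
        gini += (cum_pos / total_pos - cum_w / total_w) * wi
    return gini

def _top_four_percent_captured(targets):
    w = _weights(targets)
    cutoff = int(0.04 * sum(w))
    cum_w = captured = 0.0
    for t, wi in zip(targets, w):
        cum_w += wi
        if cum_w > cutoff:
            break
        captured += t
    return captured / sum(targets)

def amex_metric(y_true, y_pred):
    by_pred = _sorted_targets(y_true, y_pred)
    by_true = _sorted_targets(y_true, y_true)  # the best possible ordering
    g = _weighted_gini(by_pred) / _weighted_gini(by_true)
    d = _top_four_percent_captured(by_pred)
    return 0.5 * (g + d)

# Toy check: one default ranked first among 100 customers.
y_true = [1] + [0] * 99
perfect = amex_metric(y_true, [0.9] + [0.1] * 99)
reversed_score = amex_metric(y_true, [0.1] + [0.9] * 99)
print(perfect, reversed_score)
```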

In my next pass, I will introduce the subsample hyperparameter and set it to 0.8, which means a random 80% of my training rows will be sampled before each new tree is grown. The default value is 1; choosing a value less than 1 is a way to reduce overfitting.

In [14]:
params2 = {
    'verbosity': 1,
    'max_depth': 4,
    'objective': 'binary:logistic',
    'eta': 0.15,
    'random_state': seed,
    'colsample_bytree': 0.4,
    'colsample_bylevel': 0.8,
    'subsample': 0.8
}

In [16]:
model2 = xgb.train(params2, train_matrix, steps, early_stopping_rounds=3,
                   evals=[(train_matrix, 'Train'), (valid_matrix, 'Valid')])

[0]	Train-logloss:0.61091	Valid-logloss:0.61205
[1]	Train-logloss:0.54526	Valid-logloss:0.54706
[2]	Train-logloss:0.49199	Valid-logloss:0.49420
[3]	Train-logloss:0.45241	Valid-logloss:0.45507
[4]	Train-logloss:0.41840	Valid-logloss:0.42169
[5]	Train-logloss:0.39033	Valid-logloss:0.39404
[6]	Train-logloss:0.36715	Valid-logloss:0.37153
[7]	Train-logloss:0.34798	Valid-logloss:0.35279
[8]	Train-logloss:0.33173	Valid-logloss:0.33706
[9]	Train-logloss:0.31811	Valid-logloss:0.32398
[10]	Train-logloss:0.30675	Valid-logloss:0.31314
[11]	Train-logloss:0.29658	Valid-logloss:0.30370
[12]	Train-logloss:0.28823	Valid-logloss:0.29587
[13]	Train-logloss:0.28060	Valid-logloss:0.28887
[14]	Train-logloss:0.27392	Valid-logloss:0.28281
[15]	Train-logloss:0.26824	Valid-logloss:0.27760
[16]	Train-logloss:0.26324	Valid-logloss:0.27314
[17]	Train-logloss:0.25899	Valid-logloss:0.26926
[18]	Train-logloss:0.25497	Valid-logloss:0.26554
[19]	Train-logloss:0.25160	Valid-logloss:0.26268
[20]	Train-logloss:0.24863	Val

In [17]:
result2 = mf.model_evaluator(model2, valid_matrix, valid['target'])
result2

0.7700153701869945

My results are still the same. I will try a more drastic subsample value before moving on to explore other hyperparameters.

In [18]:
params3 = {
    'verbosity': 1,
    'max_depth': 4,
    'objective': 'binary:logistic',
    'eta': 0.15,
    'random_state': seed,
    'colsample_bytree': 0.4,
    'colsample_bylevel': 0.8,
    'subsample': 0.4
}

In [19]:
model3 = xgb.train(params3, train_matrix, steps, early_stopping_rounds=3,
                   evals=[(train_matrix, 'Train'), (valid_matrix, 'Valid')])

[0]	Train-logloss:0.61101	Valid-logloss:0.61170
[1]	Train-logloss:0.54532	Valid-logloss:0.54648
[2]	Train-logloss:0.49230	Valid-logloss:0.49379
[3]	Train-logloss:0.45289	Valid-logloss:0.45496
[4]	Train-logloss:0.41891	Valid-logloss:0.42154
[5]	Train-logloss:0.39065	Valid-logloss:0.39390
[6]	Train-logloss:0.36804	Valid-logloss:0.37190
[7]	Train-logloss:0.34787	Valid-logloss:0.35234
[8]	Train-logloss:0.33199	Valid-logloss:0.33686
[9]	Train-logloss:0.31848	Valid-logloss:0.32382
[10]	Train-logloss:0.30663	Valid-logloss:0.31268
[11]	Train-logloss:0.29689	Valid-logloss:0.30332
[12]	Train-logloss:0.28872	Valid-logloss:0.29556
[13]	Train-logloss:0.28125	Valid-logloss:0.28887
[14]	Train-logloss:0.27488	Valid-logloss:0.28292
[15]	Train-logloss:0.26941	Valid-logloss:0.27777
[16]	Train-logloss:0.26464	Valid-logloss:0.27354
[17]	Train-logloss:0.26046	Valid-logloss:0.26983
[18]	Train-logloss:0.25646	Valid-logloss:0.26602
[19]	Train-logloss:0.25304	Valid-logloss:0.26317
[20]	Train-logloss:0.25010	Val

In [20]:
result3 = mf.model_evaluator(model3, valid_matrix, valid['target'])
result3

0.7635269140504961

Sampling only a small portion of my training rows (40%) when growing a new tree had a clear negative effect on my results: the amex score fell from 0.770 to 0.764. I think it will be appropriate to keep subsample at 0.8 for future tuning steps.

In the next training pass, I will try a max depth of 3 to curb overfitting.

In [21]:
params4 = {
    'verbosity': 1,
    'max_depth': 3,
    'objective': 'binary:logistic',
    'eta': 0.15,
    'random_state': seed,
    'colsample_bytree': 0.4,
    'colsample_bylevel': 0.8,
    'subsample': 0.8
}

model4 = xgb.train(params4, train_matrix, steps, early_stopping_rounds=3,
                   evals=[(train_matrix, 'Train'), (valid_matrix, 'Valid')])

result4 = mf.model_evaluator(model4, valid_matrix, valid['target'])
result4

[0]	Train-logloss:0.61498	Valid-logloss:0.61561
[1]	Train-logloss:0.55141	Valid-logloss:0.55250
[2]	Train-logloss:0.49932	Valid-logloss:0.50049
[3]	Train-logloss:0.46147	Valid-logloss:0.46269
[4]	Train-logloss:0.42831	Valid-logloss:0.42995
[5]	Train-logloss:0.40051	Valid-logloss:0.40238
[6]	Train-logloss:0.37864	Valid-logloss:0.38066
[7]	Train-logloss:0.36005	Valid-logloss:0.36218
[8]	Train-logloss:0.34420	Valid-logloss:0.34658
[9]	Train-logloss:0.33148	Valid-logloss:0.33407
[10]	Train-logloss:0.32063	Valid-logloss:0.32369
[11]	Train-logloss:0.31069	Valid-logloss:0.31411
[12]	Train-logloss:0.30244	Valid-logloss:0.30619
[13]	Train-logloss:0.29504	Valid-logloss:0.29918
[14]	Train-logloss:0.28869	Valid-logloss:0.29314
[15]	Train-logloss:0.28298	Valid-logloss:0.28777
[16]	Train-logloss:0.27815	Valid-logloss:0.28329
[17]	Train-logloss:0.27382	Valid-logloss:0.27928
[18]	Train-logloss:0.26943	Valid-logloss:0.27511
[19]	Train-logloss:0.26574	Valid-logloss:0.27164
[20]	Train-logloss:0.26272	Val

0.7720946805913114

Reducing my max depth results in a slight increase in model performance on the validation set (0.770 to 0.772). I will keep a max depth of 3 moving forward with the other hyperparameters.

I will experiment with gamma next. Gamma is set to 0 by default, and it is defined as the minimum loss reduction required to make a further split on a leaf node of a tree. A larger value of gamma is associated with a more conservative algorithm. Increasing gamma may be a way to curb the complexity of my trees and reduce overfitting. Gamma accepts any non-negative value, but I will limit myself to the range 0 to 10 initially.
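To make the role of gamma concrete, the sketch below evaluates the split-gain expression from the XGBoost documentation, where G and H are the sums of gradients and hessians in each child and lambda is the L2 regularization term. The function name and the numbers are made up for illustration. A split is only worth keeping when the result is positive.

```python
def split_gain(g_left, h_left, g_right, h_right, reg_lambda=1.0, gamma=0.0):
    """Loss reduction from a split, minus the gamma penalty for the extra leaf."""
    def score(g, h):
        return g * g / (h + reg_lambda)

    raw = 0.5 * (score(g_left, h_left)
                 + score(g_right, h_right)
                 - score(g_left + g_right, h_left + h_right))
    return raw - gamma

# Made-up gradient and hessian sums for the two children of a candidate split.
no_penalty = split_gain(-4.0, 2.0, 4.0, 2.0)
small_gamma = split_gain(-4.0, 2.0, 4.0, 2.0, gamma=2.0)    # still positive: split kept
large_gamma = split_gain(-4.0, 2.0, 4.0, 2.0, gamma=64.0)   # negative: split pruned
print(no_penalty, small_gamma, large_gamma)
```

A small gamma only removes splits whose raw gain is already close to zero, while a large gamma can remove splits that carry real signal.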

In [22]:
params5 = {
    'verbosity': 1,
    'max_depth': 3,
    'objective': 'binary:logistic',
    'eta': 0.15,
    'random_state': seed,
    'colsample_bytree': 0.4,
    'colsample_bylevel': 0.8,
    'subsample': 0.8,
    'gamma': 2
}

model5 = xgb.train(params5, train_matrix, steps, early_stopping_rounds=3,
                   evals=[(train_matrix, 'Train'), (valid_matrix, 'Valid')])

result5 = mf.model_evaluator(model5, valid_matrix, valid['target'])
result5

[0]	Train-logloss:0.61498	Valid-logloss:0.61561
[1]	Train-logloss:0.55141	Valid-logloss:0.55250
[2]	Train-logloss:0.49932	Valid-logloss:0.50049
[3]	Train-logloss:0.46147	Valid-logloss:0.46269
[4]	Train-logloss:0.42831	Valid-logloss:0.42995
[5]	Train-logloss:0.40051	Valid-logloss:0.40238
[6]	Train-logloss:0.37864	Valid-logloss:0.38066
[7]	Train-logloss:0.36005	Valid-logloss:0.36218
[8]	Train-logloss:0.34420	Valid-logloss:0.34658
[9]	Train-logloss:0.33148	Valid-logloss:0.33407
[10]	Train-logloss:0.32063	Valid-logloss:0.32369
[11]	Train-logloss:0.31069	Valid-logloss:0.31411
[12]	Train-logloss:0.30244	Valid-logloss:0.30619
[13]	Train-logloss:0.29504	Valid-logloss:0.29918
[14]	Train-logloss:0.28869	Valid-logloss:0.29314
[15]	Train-logloss:0.28298	Valid-logloss:0.28777
[16]	Train-logloss:0.27815	Valid-logloss:0.28329
[17]	Train-logloss:0.27382	Valid-logloss:0.27928
[18]	Train-logloss:0.26943	Valid-logloss:0.27511
[19]	Train-logloss:0.26574	Valid-logloss:0.27164
[20]	Train-logloss:0.26272	Val

0.7720945900961371

I see very little difference between my results with and without gamma (all other hyperparameters being the same). I will try an extreme value of gamma, well outside my initial range, to see whether it changes anything.

In [23]:
params6 = {
    'verbosity': 1,
    'max_depth': 3,
    'objective': 'binary:logistic',
    'eta': 0.15,
    'random_state': seed,
    'colsample_bytree': 0.4,
    'colsample_bylevel': 0.8,
    'subsample': 0.8,
    'gamma': 64
}

model6 = xgb.train(params6, train_matrix, steps, early_stopping_rounds=3,
                   evals=[(train_matrix, 'Train'), (valid_matrix, 'Valid')])

result6 = mf.model_evaluator(model6, valid_matrix, valid['target'])
result6

[0]	Train-logloss:0.61498	Valid-logloss:0.61561
[1]	Train-logloss:0.55141	Valid-logloss:0.55250
[2]	Train-logloss:0.49932	Valid-logloss:0.50049
[3]	Train-logloss:0.46147	Valid-logloss:0.46269
[4]	Train-logloss:0.42831	Valid-logloss:0.42995
[5]	Train-logloss:0.40092	Valid-logloss:0.40280
[6]	Train-logloss:0.37901	Valid-logloss:0.38104
[7]	Train-logloss:0.36069	Valid-logloss:0.36275
[8]	Train-logloss:0.34528	Valid-logloss:0.34747
[9]	Train-logloss:0.33279	Valid-logloss:0.33513
[10]	Train-logloss:0.32209	Valid-logloss:0.32480
[11]	Train-logloss:0.31211	Valid-logloss:0.31510
[12]	Train-logloss:0.30435	Valid-logloss:0.30757
[13]	Train-logloss:0.29734	Valid-logloss:0.30085
[14]	Train-logloss:0.29126	Valid-logloss:0.29497
[15]	Train-logloss:0.28571	Valid-logloss:0.28974
[16]	Train-logloss:0.28111	Valid-logloss:0.28545
[17]	Train-logloss:0.27732	Valid-logloss:0.28184
[18]	Train-logloss:0.27332	Valid-logloss:0.27799
[19]	Train-logloss:0.27007	Valid-logloss:0.27474
[20]	Train-logloss:0.26700	Val

0.7524617273872352

I got my answer. A gamma this large negatively impacts my score by a great margin. I think I will abandon the gamma hyperparameter for the time being. A small value had no effect, and a large value had a negative effect.

Next I will try cutting the learning rate ('eta') in half. A lower learning rate should increase the number of training steps required to reach convergence and decrease the amount of overfitting.

In [24]:
params7 = {
    'verbosity': 1,
    'max_depth': 3,
    'objective': 'binary:logistic',
    'eta': 0.075,
    'random_state': seed,
    'colsample_bytree': 0.4,
    'colsample_bylevel': 0.8,
    'subsample': 0.8,
}

model7 = xgb.train(params7, train_matrix, steps, early_stopping_rounds=3,
                   evals=[(train_matrix, 'Train'), (valid_matrix, 'Valid')])

result7 = mf.model_evaluator(model7, valid_matrix, valid['target'])
result7

[0]	Train-logloss:0.65249	Valid-logloss:0.65280
[1]	Train-logloss:0.61491	Valid-logloss:0.61545
[2]	Train-logloss:0.58044	Valid-logloss:0.58098
[3]	Train-logloss:0.55261	Valid-logloss:0.55325
[4]	Train-logloss:0.52609	Valid-logloss:0.52691
[5]	Train-logloss:0.50175	Valid-logloss:0.50266
[6]	Train-logloss:0.48111	Valid-logloss:0.48210
[7]	Train-logloss:0.46171	Valid-logloss:0.46278
[8]	Train-logloss:0.44418	Valid-logloss:0.44535
[9]	Train-logloss:0.42871	Valid-logloss:0.43000
[10]	Train-logloss:0.41480	Valid-logloss:0.41639
[11]	Train-logloss:0.40239	Valid-logloss:0.40407
[12]	Train-logloss:0.39088	Valid-logloss:0.39280
[13]	Train-logloss:0.38003	Valid-logloss:0.38218
[14]	Train-logloss:0.36999	Valid-logloss:0.37227
[15]	Train-logloss:0.36130	Valid-logloss:0.36366
[16]	Train-logloss:0.35315	Valid-logloss:0.35560
[17]	Train-logloss:0.34564	Valid-logloss:0.34819
[18]	Train-logloss:0.33870	Valid-logloss:0.34145
[19]	Train-logloss:0.33225	Valid-logloss:0.33516
[20]	Train-logloss:0.32639	Val

[167]	Train-logloss:0.20974	Valid-logloss:0.23241
[168]	Train-logloss:0.20955	Valid-logloss:0.23236
[169]	Train-logloss:0.20939	Valid-logloss:0.23228
[170]	Train-logloss:0.20922	Valid-logloss:0.23224
[171]	Train-logloss:0.20905	Valid-logloss:0.23221
[172]	Train-logloss:0.20886	Valid-logloss:0.23217
[173]	Train-logloss:0.20866	Valid-logloss:0.23209
[174]	Train-logloss:0.20848	Valid-logloss:0.23202
[175]	Train-logloss:0.20830	Valid-logloss:0.23198
[176]	Train-logloss:0.20812	Valid-logloss:0.23194
[177]	Train-logloss:0.20792	Valid-logloss:0.23190
[178]	Train-logloss:0.20771	Valid-logloss:0.23178
[179]	Train-logloss:0.20759	Valid-logloss:0.23176
[180]	Train-logloss:0.20740	Valid-logloss:0.23173
[181]	Train-logloss:0.20719	Valid-logloss:0.23164
[182]	Train-logloss:0.20700	Valid-logloss:0.23164
[183]	Train-logloss:0.20687	Valid-logloss:0.23158
[184]	Train-logloss:0.20670	Valid-logloss:0.23153
[185]	Train-logloss:0.20654	Valid-logloss:0.23149
[186]	Train-logloss:0.20638	Valid-logloss:0.23146


0.7742527529940828

Decreasing the learning rate had a small positive impact on the score. It may not be worth using the slower learning rate because it took over twice the time to arrive at the same amex score. Since it's technically the best model I've built in this notebook and I have time on my hands, I'll use this smaller value for learning rate moving forward.

Next I will investigate the effect of min_child_weight on my results. This hyperparameter is the minimum sum of instance weights (hessian) required in a child node: if a split would create a child whose sum falls below this value, the tree stops partitioning at that node. A larger value is associated with a more conservative algorithm.
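For the binary:logistic objective with unit instance weights, each row contributes a hessian of p(1 - p), where p is the current predicted probability. The sketch below uses hypothetical helper names and made-up probabilities to show how the same number of rows can pass or fail a min_child_weight of 4 depending on how confident the predictions already are.

```python
def logistic_hessian(p):
    """Second derivative of the logistic loss for one row with prediction p."""
    return p * (1.0 - p)

def child_weight(probabilities):
    """Sum of hessians over the rows that would land in a child node."""
    return sum(logistic_hessian(p) for p in probabilities)

min_child_weight = 4

# 16 rows the model is unsure about: 16 * 0.25 = 4.0, just enough to allow the child.
uncertain_child = child_weight([0.5] * 16)

# 16 rows the model is already confident about: 16 * 0.0099, far below the cutoff.
confident_child = child_weight([0.99] * 16)

print(uncertain_child >= min_child_weight, confident_child >= min_child_weight)
```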

In [25]:
params8 = {
    'verbosity': 1,
    'max_depth': 3,
    'objective': 'binary:logistic',
    'eta': 0.075,
    'random_state': seed,
    'colsample_bytree': 0.4,
    'colsample_bylevel': 0.8,
    'subsample': 0.8,
    'min_child_weight': 4
}

model8 = xgb.train(params8, train_matrix, steps, early_stopping_rounds=3,
                   evals=[(train_matrix, 'Train'), (valid_matrix, 'Valid')])

result8 = mf.model_evaluator(model8, valid_matrix, valid['target'])
result8

[0]	Train-logloss:0.65249	Valid-logloss:0.65280
[1]	Train-logloss:0.61491	Valid-logloss:0.61545
[2]	Train-logloss:0.58044	Valid-logloss:0.58098
[3]	Train-logloss:0.55261	Valid-logloss:0.55325
[4]	Train-logloss:0.52609	Valid-logloss:0.52691
[5]	Train-logloss:0.50175	Valid-logloss:0.50266
[6]	Train-logloss:0.48111	Valid-logloss:0.48210
[7]	Train-logloss:0.46171	Valid-logloss:0.46278
[8]	Train-logloss:0.44418	Valid-logloss:0.44535
[9]	Train-logloss:0.42871	Valid-logloss:0.43000
[10]	Train-logloss:0.41480	Valid-logloss:0.41639
[11]	Train-logloss:0.40239	Valid-logloss:0.40407
[12]	Train-logloss:0.39088	Valid-logloss:0.39280
[13]	Train-logloss:0.38003	Valid-logloss:0.38218
[14]	Train-logloss:0.36999	Valid-logloss:0.37227
[15]	Train-logloss:0.36130	Valid-logloss:0.36366
[16]	Train-logloss:0.35315	Valid-logloss:0.35560
[17]	Train-logloss:0.34564	Valid-logloss:0.34819
[18]	Train-logloss:0.33870	Valid-logloss:0.34145
[19]	Train-logloss:0.33225	Valid-logloss:0.33516
[20]	Train-logloss:0.32639	Val

[167]	Train-logloss:0.21009	Valid-logloss:0.23238
[168]	Train-logloss:0.20991	Valid-logloss:0.23233
[169]	Train-logloss:0.20976	Valid-logloss:0.23229
[170]	Train-logloss:0.20961	Valid-logloss:0.23227
[171]	Train-logloss:0.20944	Valid-logloss:0.23224
[172]	Train-logloss:0.20926	Valid-logloss:0.23219
[173]	Train-logloss:0.20906	Valid-logloss:0.23213
[174]	Train-logloss:0.20889	Valid-logloss:0.23205
[175]	Train-logloss:0.20872	Valid-logloss:0.23201
[176]	Train-logloss:0.20856	Valid-logloss:0.23197
[177]	Train-logloss:0.20837	Valid-logloss:0.23192
[178]	Train-logloss:0.20817	Valid-logloss:0.23184
[179]	Train-logloss:0.20802	Valid-logloss:0.23179
[180]	Train-logloss:0.20786	Valid-logloss:0.23171
[181]	Train-logloss:0.20766	Valid-logloss:0.23162
[182]	Train-logloss:0.20749	Valid-logloss:0.23161
[183]	Train-logloss:0.20728	Valid-logloss:0.23150
[184]	Train-logloss:0.20712	Valid-logloss:0.23146
[185]	Train-logloss:0.20695	Valid-logloss:0.23141
[186]	Train-logloss:0.20680	Valid-logloss:0.23132


0.7752500774789979

Like some experiments before, I've added a hyperparameter and my results have not changed in any meaningful way. I will try cranking this hyperparameter up to 16 to see if it has an impact before I move on to my last hyperparameter.

In [26]:
params9 = {
    'verbosity': 1,
    'max_depth': 3,
    'objective': 'binary:logistic',
    'eta': 0.075,
    'random_state': seed,
    'colsample_bytree': 0.4,
    'colsample_bylevel': 0.8,
    'subsample': 0.8,
    'min_child_weight': 16
}

model9 = xgb.train(params9, train_matrix, steps, early_stopping_rounds=3,
                   evals=[(train_matrix, 'Train'), (valid_matrix, 'Valid')])

result9 = mf.model_evaluator(model9, valid_matrix, valid['target'])
result9

[0]	Train-logloss:0.65249	Valid-logloss:0.65280
[1]	Train-logloss:0.61491	Valid-logloss:0.61545
[2]	Train-logloss:0.58044	Valid-logloss:0.58098
[3]	Train-logloss:0.55261	Valid-logloss:0.55325
[4]	Train-logloss:0.52609	Valid-logloss:0.52691
[5]	Train-logloss:0.50175	Valid-logloss:0.50266
[6]	Train-logloss:0.48111	Valid-logloss:0.48210
[7]	Train-logloss:0.46171	Valid-logloss:0.46278
[8]	Train-logloss:0.44419	Valid-logloss:0.44535
[9]	Train-logloss:0.42868	Valid-logloss:0.42998
[10]	Train-logloss:0.41476	Valid-logloss:0.41636
[11]	Train-logloss:0.40236	Valid-logloss:0.40405
[12]	Train-logloss:0.39086	Valid-logloss:0.39274
[13]	Train-logloss:0.38002	Valid-logloss:0.38211
[14]	Train-logloss:0.37002	Valid-logloss:0.37223
[15]	Train-logloss:0.36140	Valid-logloss:0.36365
[16]	Train-logloss:0.35316	Valid-logloss:0.35552
[17]	Train-logloss:0.34566	Valid-logloss:0.34812
[18]	Train-logloss:0.33877	Valid-logloss:0.34145
[19]	Train-logloss:0.33235	Valid-logloss:0.33518
[20]	Train-logloss:0.32651	Val

[167]	Train-logloss:0.21101	Valid-logloss:0.23255
[168]	Train-logloss:0.21084	Valid-logloss:0.23251
[169]	Train-logloss:0.21074	Valid-logloss:0.23248
[170]	Train-logloss:0.21058	Valid-logloss:0.23243
[171]	Train-logloss:0.21045	Valid-logloss:0.23242
[172]	Train-logloss:0.21025	Valid-logloss:0.23235
[173]	Train-logloss:0.21008	Valid-logloss:0.23231
[174]	Train-logloss:0.20997	Valid-logloss:0.23224
[175]	Train-logloss:0.20982	Valid-logloss:0.23220
[176]	Train-logloss:0.20969	Valid-logloss:0.23217
[177]	Train-logloss:0.20949	Valid-logloss:0.23204
[178]	Train-logloss:0.20931	Valid-logloss:0.23193
[179]	Train-logloss:0.20916	Valid-logloss:0.23189
[180]	Train-logloss:0.20899	Valid-logloss:0.23188
[181]	Train-logloss:0.20881	Valid-logloss:0.23182
[182]	Train-logloss:0.20864	Valid-logloss:0.23181
[183]	Train-logloss:0.20851	Valid-logloss:0.23178
[184]	Train-logloss:0.20834	Valid-logloss:0.23175
[185]	Train-logloss:0.20819	Valid-logloss:0.23171
[186]	Train-logloss:0.20805	Valid-logloss:0.23169


0.7749020011541358

Increasing min_child_weight beyond 4 has a small negative impact on my performance, and I assume it will only get worse if I increase it from there.

My last investigation will be into scale_pos_weight. The dataset has roughly 3 negative cases for every positive one. The equation to calculate scale_pos_weight is sum(negative instances) / sum(positive instances), so I will use 3 for this hyperparameter.
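The ratio can be computed from the labels rather than estimated by eye. This is a small sketch with a toy label list; the function name is introduced here for illustration.

```python
def scale_pos_weight(labels):
    """Return the number of negative cases divided by the number of positive cases."""
    positives = sum(1 for y in labels if y == 1)
    negatives = sum(1 for y in labels if y == 0)
    if positives == 0:
        raise ValueError('no positive cases in labels')
    return negatives / positives

# Toy labels with 3 negatives for every positive.
toy_labels = [0, 0, 0, 1, 0, 0, 0, 1]
weight = scale_pos_weight(toy_labels)
print(weight)
```

With the real data, the same function could be called on `train['target']`.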

In [28]:
params10 = {
    'verbosity': 1,
    'max_depth': 3,
    'objective': 'binary:logistic',
    'eta': 0.075,
    'random_state': seed,
    'colsample_bytree': 0.4,
    'colsample_bylevel': 0.8,
    'subsample': 0.8,
    'min_child_weight': 4,
    'scale_pos_weight': 3
}

model10 = xgb.train(params10, train_matrix, steps, early_stopping_rounds=3,
                    evals=[(train_matrix, 'Train'), (valid_matrix, 'Valid')])

result10 = mf.model_evaluator(model10, valid_matrix, valid['target'])
result10

[0]	Train-logloss:0.65751	Valid-logloss:0.65804
[1]	Train-logloss:0.62315	Valid-logloss:0.62390
[2]	Train-logloss:0.59164	Valid-logloss:0.59242
[3]	Train-logloss:0.56604	Valid-logloss:0.56701
[4]	Train-logloss:0.54191	Valid-logloss:0.54315
[5]	Train-logloss:0.51968	Valid-logloss:0.52106
[6]	Train-logloss:0.50078	Valid-logloss:0.50247
[7]	Train-logloss:0.48326	Valid-logloss:0.48517
[8]	Train-logloss:0.46756	Valid-logloss:0.46966
[9]	Train-logloss:0.45399	Valid-logloss:0.45632
[10]	Train-logloss:0.44219	Valid-logloss:0.44481
[11]	Train-logloss:0.43130	Valid-logloss:0.43403
[12]	Train-logloss:0.42118	Valid-logloss:0.42407
[13]	Train-logloss:0.41215	Valid-logloss:0.41522
[14]	Train-logloss:0.40334	Valid-logloss:0.40655
[15]	Train-logloss:0.39614	Valid-logloss:0.39944
[16]	Train-logloss:0.38894	Valid-logloss:0.39240
[17]	Train-logloss:0.38258	Valid-logloss:0.38618
[18]	Train-logloss:0.37670	Valid-logloss:0.38045
[19]	Train-logloss:0.37131	Valid-logloss:0.37517
[20]	Train-logloss:0.36655	Val

[167]	Train-logloss:0.25034	Valid-logloss:0.27419
[168]	Train-logloss:0.25012	Valid-logloss:0.27414
[169]	Train-logloss:0.24984	Valid-logloss:0.27400
[170]	Train-logloss:0.24967	Valid-logloss:0.27393
[171]	Train-logloss:0.24937	Valid-logloss:0.27379
[172]	Train-logloss:0.24905	Valid-logloss:0.27361
[173]	Train-logloss:0.24879	Valid-logloss:0.27346
[174]	Train-logloss:0.24854	Valid-logloss:0.27334
[175]	Train-logloss:0.24848	Valid-logloss:0.27335
[176]	Train-logloss:0.24835	Valid-logloss:0.27333
[177]	Train-logloss:0.24811	Valid-logloss:0.27323
[178]	Train-logloss:0.24786	Valid-logloss:0.27305
[179]	Train-logloss:0.24770	Valid-logloss:0.27301
[180]	Train-logloss:0.24740	Valid-logloss:0.27289
[181]	Train-logloss:0.24717	Valid-logloss:0.27280
[182]	Train-logloss:0.24685	Valid-logloss:0.27263
[183]	Train-logloss:0.24669	Valid-logloss:0.27255
[184]	Train-logloss:0.24647	Valid-logloss:0.27246
[185]	Train-logloss:0.24631	Valid-logloss:0.27238
[186]	Train-logloss:0.24610	Valid-logloss:0.27227


0.7732946179740022

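Performance got slightly worse. The amex score dropped to 0.7733 from model8's 0.7753, and the validation logloss at round 186 rose from about 0.231 to about 0.272. I will leave scale_pos_weight at its default.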

The main takeaway from this exploration is that only marginal improvements in model performance can be made by hyperparameter tuning. The greatest improvements were achieved from the data side of the project: engineering useful features and determining how to cap values and impute null values. The best way forward is to engineer additional useful features and merge the validation dataset onto the train dataset for a greater body of training data. Currently, the train portion is only about 15% of the labeled data (39,007 customers) and validation is the other 85% (214,152 customers). Combining this data and using the holdout test portion on the Kaggle website as the validation subset could greatly improve model performance.

Model8 had the best performance on the validation subset with an amex score of 0.775. The hyperparameters for this model are as follows:

params8 = {  
    'verbosity': 1,  
    'max_depth': 3,  
    'objective': 'binary:logistic',  
    'eta': 0.075,  
    'random_state': seed,  
    'colsample_bytree': 0.4,  
    'colsample_bylevel': 0.8,  
    'subsample': 0.8,  
    'min_child_weight': 4  
}


I will investigate the importance of the features passed in to the model. I do this with the get_score method of the Booster object. I can choose from five ways to define importance. In this case, I went with weight, which is the number of times each feature was used to split the data. In essence, the features with the greatest weight are used most often to make decisions in my trees.

In [35]:
model8.get_score(importance_type='weight')

{'P_2_min': 24.0,
 'P_2_max': 3.0,
 'P_2_median': 14.0,
 'P_2_std': 1.0,
 'P_2_last': 57.0,
 'P_2_change': 4.0,
 'D_39_min': 1.0,
 'D_39_max': 15.0,
 'D_39_std': 5.0,
 'D_39_last': 32.0,
 'D_39_change': 11.0,
 'B_1_min': 1.0,
 'B_1_max': 2.0,
 'B_1_std': 1.0,
 'B_1_last': 14.0,
 'B_1_change': 3.0,
 'B_2_min': 2.0,
 'B_2_max': 1.0,
 'B_2_median': 1.0,
 'B_2_last': 15.0,
 'B_2_change': 3.0,
 'R_1_min': 3.0,
 'R_1_max': 8.0,
 'R_1_std': 5.0,
 'R_1_last': 11.0,
 'R_1_change': 2.0,
 'S_3_min': 6.0,
 'S_3_median': 9.0,
 'S_3_std': 2.0,
 'S_3_last': 7.0,
 'S_3_change': 5.0,
 'D_41_min': 1.0,
 'D_41_max': 4.0,
 'D_41_median': 1.0,
 'D_41_std': 9.0,
 'D_41_last': 11.0,
 'D_41_change': 14.0,
 'B_3_min': 3.0,
 'B_3_max': 8.0,
 'B_3_median': 2.0,
 'B_3_std': 5.0,
 'B_3_last': 17.0,
 'B_3_change': 10.0,
 'D_42_min': 11.0,
 'D_42_max': 6.0,
 'D_42_median': 7.0,
 'D_42_last': 3.0,
 'D_43_median': 3.0,
 'D_43_std': 4.0,
 'D_43_change': 3.0,
 'D_44_std': 2.0,
 'D_44_last': 4.0,
 'D_44_change': 7.0,
 'B

The result is a dictionary. I use the dictionary to create a dataframe.

In [39]:
model8_performance = pd.DataFrame([model8.get_score(importance_type='weight')])

The resulting dataframe has a column for each feature and one row containing all the weights. I transpose the dataframe to get all weights in one column, allowing for easier sorting and investigation.

In [46]:
model8_performance = model8_performance.T

The documentation says that zero-importance features are not included in the model score (in this case, any feature that was not involved in a single split decision). I will look at the shape of my dataframe to see how many features are considered relevant.

In [59]:
model8_performance.shape

(619, 1)

I started the training process with over 1200 features in my dataframe. It looks like about half of those (619) were used in at least one split. I can potentially trim the number of features I pass into my model to speed up training and possibly improve results.
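Trimming could be as simple as keeping only the columns that appear as keys in the importance dictionary. The sketch below uses a handful of column names and a small stand-in for the `get_score` output.

```python
# Stand-ins: a few column names and a get_score-style dictionary (subset only).
all_columns = ['P_2_min', 'P_2_max', 'D_39_median', 'B_1_median', 'P_2_last']
importance = {'P_2_min': 24.0, 'P_2_max': 3.0, 'P_2_last': 57.0}

# Features that were used in at least one split, in their original column order.
used = [col for col in all_columns if col in importance]

# Features that never appeared in a split and are candidates for removal.
unused = [col for col in all_columns if col not in importance]

print(used)
print(unused)
```

With the real data, `train[used]` would select the trimmed feature set. Whether dropping the unused columns actually improves the score would still need to be checked by retraining.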

In [54]:
model8_performance.rename(columns={0: 'weight'}, inplace=True)

This is great, but I would prefer to build a dataframe containing all available evaluation metrics for feature performance: weight, gain, cover, total_gain, and total_cover. Gain is considered to be the best metric to judge feature performance.

In [80]:
score_columns = ['weight', 'gain', 'cover', 'total_gain', 'total_cover']

model8_scores = pd.DataFrame()

for col in score_columns:
    model8_scores[col] = model8.get_score(importance_type=col).values()
    
my_index = pd.DataFrame(list(model8.get_score(importance_type='weight').keys()))
model8_scores = pd.concat([my_index, model8_scores], axis=1)
model8_scores.head()   

,0,weight,gain,cover,total_gain,total_cover
0,P_2_min,24.0,1046.174927,3564.391602,25108.199219,85545.398438
1,P_2_max,3.0,16.868271,1446.203491,50.604813,4338.610352
2,P_2_median,14.0,378.496765,2358.535889,5298.95459,33019.503906
3,P_2_std,1.0,10.919077,131.254517,10.919077,131.254517
4,P_2_last,57.0,606.084595,2461.301758,34546.820312,140294.203125


I need to set the index as the feature name and I'm good to go.

In [81]:
model8_scores.set_index(0, inplace=True)
model8_scores.head()

0,weight,gain,cover,total_gain,total_cover
P_2_min,24.0,1046.174927,3564.391602,25108.199219,85545.398438
P_2_max,3.0,16.868271,1446.203491,50.604813,4338.610352
P_2_median,14.0,378.496765,2358.535889,5298.95459,33019.503906
P_2_std,1.0,10.919077,131.254517,10.919077,131.254517
P_2_last,57.0,606.084595,2461.301758,34546.820312,140294.203125
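The same table could also be built in one step from a dictionary of dictionaries, which lines the rows up by feature name instead of relying on every `get_score` call returning its keys in the same order. A sketch with stand-in dictionaries in place of the real `get_score` output:

```python
import pandas as pd

# Stand-in for model8.get_score(importance_type=...), using a few values
# copied from the table above. The key order differs on purpose.
fake_scores = {
    'weight': {'P_2_min': 24.0, 'P_2_max': 3.0},
    'gain': {'P_2_max': 16.868271, 'P_2_min': 1046.174927},
}

# Outer keys become columns; inner keys become the index and are aligned by name.
scores = pd.DataFrame(fake_scores)
scores.index.name = 'feature'
print(scores)
```

With the real model, the dictionary could be built as `{col: model8.get_score(importance_type=col) for col in score_columns}`.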


I will save a copy of this dataframe to analyze my best-performing features and determine if I can drop some original features from my data entirely.

In [82]:
model8_scores.to_csv('model8_feature_scores.csv')

I want to take a look at the features that provided the greatest gain to my model. Gain describes how much a split on the feature reduces the training loss. Features with high gain are doing the best job minimizing the error of the trees. For each feature, gain is the average loss reduction across all the splits that use it, while total gain is the sum over those splits, so it represents the benefit the feature offers to the entire ensemble. I will look at the top performers in both categories to get a decent picture.

In [86]:
model8_scores.sort_values('gain', ascending=False).head(25)

0,weight,gain,cover,total_gain,total_cover
B_9_last,5.0,1930.199829,4813.571289,9650.999023,24067.855469
B_9_max,4.0,1725.645752,3515.094238,6902.583008,14060.376953
B_18_last,4.0,1508.544678,3608.366699,6034.178711,14433.466797
D_75_last,1.0,1304.041016,5534.0,1304.041016,5534.0
P_2_min,24.0,1046.174927,3564.391602,25108.199219,85545.398438
R_1_std,5.0,808.695923,2965.106934,4043.479492,14825.535156
P_2_last,57.0,606.084595,2461.301758,34546.820312,140294.203125
D_44_last,4.0,549.951538,3008.22998,2199.806152,12032.919922
P_2_median,14.0,378.496765,2358.535889,5298.95459,33019.503906
P_2_change,4.0,349.812317,2365.248535,1399.249268,9460.994141


In [87]:
model8_scores.sort_values('total_gain', ascending=False).head(25)

0,weight,gain,cover,total_gain,total_cover
P_2_last,57.0,606.084595,2461.301758,34546.820312,140294.203125
P_2_min,24.0,1046.174927,3564.391602,25108.199219,85545.398438
B_9_last,5.0,1930.199829,4813.571289,9650.999023,24067.855469
B_9_max,4.0,1725.645752,3515.094238,6902.583008,14060.376953
B_18_last,4.0,1508.544678,3608.366699,6034.178711,14433.466797
P_2_median,14.0,378.496765,2358.535889,5298.95459,33019.503906
R_1_std,5.0,808.695923,2965.106934,4043.479492,14825.535156
B_7_last,13.0,271.51886,1980.896606,3529.745361,25751.65625
B_2_last,15.0,175.641174,1758.774048,2634.617676,26381.611328
B_1_last,14.0,175.890549,1989.841919,2462.467773,27857.787109


There is a considerable amount of overlap between the top performers in gain and total_gain. There are a few odd cases in the gain table where a feature's gain is equal to its total gain (D_75_last, for example). In these cases the weight is one, which means the feature was used only once in the whole ensemble, to make a single impactful split.
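The relationship between the three columns can be checked directly: total_gain should equal gain multiplied by weight, up to rounding in the printed values. A small sketch using rows copied from the tables above:

```python
import math

# (feature, weight, gain, total_gain) copied from the tables above.
rows = [
    ('P_2_min', 24.0, 1046.174927, 25108.199219),
    ('P_2_last', 57.0, 606.084595, 34546.820312),
    ('B_9_last', 5.0, 1930.199829, 9650.999023),
    ('D_75_last', 1.0, 1304.041016, 1304.041016),
]

for feature, weight, gain, total_gain in rows:
    # Average gain per split times the number of splits should give total gain.
    matches = math.isclose(weight * gain, total_gain, rel_tol=1e-6)
    print(feature, matches)

# Features used in exactly one split, where gain and total_gain coincide.
single_use = [feature for feature, weight, _, _ in rows if weight == 1.0]
print(single_use)
```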