# Implementing XGBoost using Grid Search as well as Hyperopt
- modeltype: **XGBoost** 
- train: **0.886**
- test: **0.865**

## Flow of the notebook
* 0. Rationale behind choosing XGBoost
* 1. Reading Data and importing libraries
* 2. Data Manipulation
* 3. Implementing XGBoost

## 0. Rationale behind choosing XGBoost

Many online articles compare XGBoost, CatBoost and LightGBM. In most of them XGBoost is relatively less efficient, and in a few of them it is also relatively less accurate.

Even so, XGBoost is a highly popular algorithm in the data science community, and several top-performing Kagglers have mentioned it in the past. So although it is a bit older and relatively inefficient compared to CatBoost and LightGBM, I was curious to get my hands on it.

I explain how the algorithm works and how I tuned its parameters in Section 3.

## 1. Reading Data & Importing Libraries

In [1]:
# Basic Data operations
import os
import numpy as np
import pandas as pd
import pickle


# H2O and Grid Search
import h2o
from h2o.estimators import H2OXGBoostEstimator
from h2o.grid.grid_search import H2OGridSearch


# Hyper Opt
import hyperopt
from numpy.random import RandomState
from sklearn.metrics import roc_auc_score

# Uncomment this if the category_encoders version is less than 2.1.0
# ! pip install category_encoders # This will update scikit-learn as well
from category_encoders import TargetEncoder  # only TargetEncoder is used in this notebook

In [2]:
# Set directories
print(os.getcwd())
dirRawData = "../input/"
dirPData   = "../PData/"
dirPOutput = "../POutput/"

/home/jovyan/Final_Project/Pcode


In [3]:
f_name = dirPData + '01_df_250k.pickle'

with (open(f_name, "rb")) as f:
    dict_ = pickle.load(f)

df_train = dict_['df_train']
df_test  = dict_['df_test']

del f_name, dict_

In [4]:
f_name = dirPData + '01_vars.pickle'

with open(f_name, "rb") as f:
    dict_ = pickle.load(f)

vars_ind_numeric     = dict_['vars_ind_numeric']
vars_ind_hccv        = dict_['vars_ind_hccv']
vars_ind_categorical = dict_['vars_ind_categorical']
vars_notToUse        = ['unique_id']
var_dep              = dict_['var_dep']

del f_name, dict_

## 2. Data manipulation

### 2.2 Treating NAs

* **Handling NAs**

One of the advantages of XGBoost over other gradient boosting algorithms is that it handles missing values efficiently. As Tianqi Chen discusses in his paper:
> When a value is missing in the sparse matrix x, the instance is classified into the default direction. There are two choices of default direction in each branch. The optimal default directions are learnt from the data. The key improvement is to only visit the non-missing entries. The presented algorithm treats the non-presence as a missing value and learns the best direction to handle missing values.

And as mentioned in the H2O documentation:
>  Missing values are interpreted as containing information (i.e., missing for a reason), rather than missing at random. During tree building, split decisions for every node are found by minimizing the loss function and treating missing values as a separate category that can go either left or right. XGBoost will automatically learn which is the best direction to go when a value is missing.

Following this, we decided not to treat missing values at all, except for dropping the 'c02' column because more than 50% of its values are missing in the test data.

The -99 values are converted to NAs later, in the H2OFrame; for now we only remove the 'c02' variable.
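To make the "default direction" idea from the quotes concrete, here is a minimal sketch. It is not XGBoost's actual implementation: for one candidate split it tries sending the missing rows left and then right, and keeps whichever side gives the lower squared error. The function `best_default_direction` and the toy data are hypothetical, and XGBoost itself scores splits with a regularised gain on gradients rather than plain squared error.

```python
def sse(values):
    """Sum of squared errors around the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_default_direction(xs, ys, threshold):
    """Pick the side ('left' or 'right') that rows with a missing x should go to.

    xs may contain None for missing values; only non-missing entries are
    compared against the threshold.
    """
    left = [y for x, y in zip(xs, ys) if x is not None and x <= threshold]
    right = [y for x, y in zip(xs, ys) if x is not None and x > threshold]
    missing = [y for x, y in zip(xs, ys) if x is None]

    loss_if_left = sse(left + missing) + sse(right)
    loss_if_right = sse(left) + sse(right + missing)
    return "left" if loss_if_left <= loss_if_right else "right"


# Toy data: rows with a missing feature have targets close to the high-x group
xs = [1.0, 2.0, None, 8.0, 9.0, None]
ys = [0.0, 0.1, 1.0, 0.9, 1.0, 0.95]
direction = best_default_direction(xs, ys, threshold=5.0)
print(direction)
```

In this toy case the missing rows look like the high-x rows, so sending them to the right-hand branch gives the lower loss.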

In [5]:


# Since c02 has a lot of NAs (>50%), we drop it from both train and test
df_test.drop('c02', axis=1, inplace = True)
df_train.drop('c02', axis=1, inplace = True)
vars_ind_categorical.remove('c02')


### 2.3 Cardinality
To treat the high-cardinality categorical variables (hccv) we used target encoding (`TargetEncoder` from the scikit-learn-compatible category_encoders package). We also observed that a few factors are over-sampled while others are under-sampled, so we used a smoothing factor of 4; this value was chosen by following a few online blogs. I did the target encoding outside H2O because I was a bit confused by H2O's own target encoding. The remaining categorical variables have not been treated, because H2O automatically performs one-hot encoding on them.

Note: I needed to update the scikit-learn and category_encoders libraries.
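As a rough illustration of what smoothing does in target encoding, the sketch below blends each level's own target mean with the global mean, using a sigmoid weight that grows with the level's row count. This follows the general shape described for category_encoders' `TargetEncoder`, but the exact formula and defaults differ between versions, so treat the function `smoothed_target_encoding` and its `min_samples_leaf=1` default as assumptions rather than the library's code.

```python
import math
from collections import defaultdict


def smoothed_target_encoding(categories, targets, smoothing=4.0, min_samples_leaf=1):
    """Map each level to a blend of its own target mean and the global mean."""
    prior = sum(targets) / len(targets)
    sums, counts = defaultdict(float), defaultdict(int)
    for level, y in zip(categories, targets):
        sums[level] += y
        counts[level] += 1

    encoding = {}
    for level, n in counts.items():
        level_mean = sums[level] / n
        # Weight approaches 1 for frequent levels and shrinks for rare ones
        weight = 1.0 / (1.0 + math.exp(-(n - min_samples_leaf) / smoothing))
        encoding[level] = prior * (1.0 - weight) + level_mean * weight
    return encoding, prior


# Level "a" is frequent, level "b" has only two rows
cats = ["a"] * 50 + ["b"] * 2
ys = [1] * 40 + [0] * 10 + [1, 1]
encoding, prior = smoothed_target_encoding(cats, ys)
print(encoding, prior)
```

The frequent level keeps essentially its own mean, while the rare level is pulled towards the global mean, which is the behaviour we want for under-sampled factors.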

In [8]:
# To see the distribution of factors in a hcc variable
# df_train[vars_ind_hccv].nunique()
# df_train['e18'].value_counts()

In [6]:
# Target encoder fitted on train
enc = TargetEncoder(cols=vars_ind_hccv, smoothing=4)
enc.fit(df_train, df_train['target'])  # learn the encoding from the training data only
df_train = enc.transform(df_train, df_train['target'])

In [7]:
# Target encoders on Test
df_test['target'] = np.nan # Dummy 'target' column so that df_test has the same columns the encoder was fitted on
df_test = enc.transform(df_test) # applying the already trained encoder
df_test.drop(columns=['target'], inplace=True) # Dropping dummy 'target' variable

Running the H2O cluster

In [11]:
# h2o.cluster().shutdown()

AttributeError: 'NoneType' object has no attribute 'shutdown'

In [8]:
h2o.init(port=54321, max_mem_size = "14g") # Ask H2O to use up to 14 GB of RAM
h2o.connect()

Checking whether there is an H2O instance running at http://localhost:54321 ..... not found.
Attempting to start a local H2O server...
  Java Version: openjdk version "1.8.0_212"; OpenJDK Runtime Environment (build 1.8.0_212-8u212-b03-0ubuntu1.18.04.1-b03); OpenJDK 64-Bit Server VM (build 25.212-b03, mixed mode)
  Starting server from /opt/conda/lib/python3.6/site-packages/h2o/backend/bin/h2o.jar
  Ice root: /tmp/tmp17mtc_0v
  JVM stdout: /tmp/tmp17mtc_0v/h2o_jovyan_started_from_python.out
  JVM stderr: /tmp/tmp17mtc_0v/h2o_jovyan_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321 ... successful.


| | |
|---|---|
| H2O cluster uptime: | 01 secs |
| H2O cluster timezone: | Etc/UTC |
| H2O data parsing timezone: | UTC |
| H2O cluster version: | 3.24.0.3 |
| H2O cluster version age: | 1 year, 2 months and 8 days !!! |
| H2O cluster name: | H2O_from_python_jovyan_s0rtq1 |
| H2O cluster total nodes: | 1 |
| H2O cluster free memory: | 12.44 Gb |
| H2O cluster total cores: | 4 |
| H2O cluster allowed cores: | 4 |


Connecting to H2O server at http://localhost:54321 ... successful.


| | |
|---|---|
| H2O cluster uptime: | 02 secs |
| H2O cluster timezone: | Etc/UTC |
| H2O data parsing timezone: | UTC |
| H2O cluster version: | 3.24.0.3 |
| H2O cluster version age: | 1 year, 2 months and 8 days !!! |
| H2O cluster name: | H2O_from_python_jovyan_s0rtq1 |
| H2O cluster total nodes: | 1 |
| H2O cluster free memory: | 12.44 Gb |
| H2O cluster total cores: | 4 |
| H2O cluster allowed cores: | 4 |


<H2OConnection to http://localhost:54321, no session>

We removed 'unique_id' from the frames because it is not useful for prediction.


In [9]:
vars_to_use = vars_ind_numeric + vars_ind_categorical
vars_to_use.remove('unique_id')
vars_ind_numeric.remove('unique_id')


h2o_df_train = h2o.H2OFrame(df_train[vars_to_use + var_dep], destination_frame = 'df_train') # Train Frame
h2o_df_test  = h2o.H2OFrame(df_test[vars_to_use], destination_frame = 'df_test') # Test Frame

Parse progress: |█████████████████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%


Converting -99 to NAs. We did not do this earlier because, as discussed earlier, converting pandas NAs into an H2OFrame was not consistent and was giving an error (maybe due to the older version of H2O in the image).

In [10]:
# Converting -99 to NA in train
for var in vars_ind_numeric:
    h2o_df_train[h2o_df_train[var] == -99.0 , var] = None
    
# Converting -99 to NA in test
for var in vars_ind_numeric:
    h2o_df_test[h2o_df_test[var] == -99.0 , var] = None

In [11]:
# The H2O documentation suggests converting the dependent variable to a factor for classification tasks
h2o_df_train[var_dep] = h2o_df_train[var_dep].asfactor()

### 2.4 Interactions
The documentation on interactions in XGBoost was not clear. After searching online, especially [here](https://www.kaggle.com/c/bosch-production-line-performance/discussion/24418), adding an interaction seemed a logical choice, given that it had drastically improved the performance of the previous model (GLM).

I have not used interactions on all variables; instead I used one on only 2 variables ('f03' and 'e11'). These two variables were chosen by running a GLM model and then calculating variable importance.
I did not interact all categorical variables, because doing so did not improve the performance significantly but made the model quite complex to comprehend. It was therefore a trade-off between accuracy and complexity.


In [44]:
# Code for running a GLM to see which variables are important

# Idea has been taken from: https://aichamp.wordpress.com/2017/09/29/python-example-of-building-glm-gbm-and-random-forest-binomial-model-with-h2o/


# from h2o.estimators.glm import H2OGeneralizedLinearEstimator
# glm_logistic = H2OGeneralizedLinearEstimator(family = "binomial")
# glm_logistic.train(x=vars_to_use , y= 'target', training_frame=h2o_df_train, model_id="glm_logistic")
# preds = glm_logistic.predict(h2o_df_test)
# df_test['Predicted'] = np.round(preds[2].as_data_frame(), 5)
# df_preds_dt = df_test[['unique_id', 'Predicted']].copy()
# df_test[['unique_id', 'Predicted']].to_csv(dirPOutput + '1st.csv', index=False)
# log_var_imp = glm_logistic.varimp(use_pandas=True).head()
# log_var_imp.loc[0:5, 'variable'].tolist()

glm Model Build progress: |███████████████████████████████████████████████| 100%
glm prediction progress: |████████████████████████████████████████████████| 100%




['f10', 'f03.F', 'e19', 'e11.A', 'f03.E']

I chose min_occurrence = int(len(h2o_df_train)/40) after much trial and error.

With int(len(h2o_df_train)/40) on the 250K training rows I was getting 5 factors for each variable. I believe performing interactions on the top 4-5 factors, rather than on all factors with more than 10 or 20 occurrences, makes the model faster and more interpretable.
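The threshold arithmetic and the effect of an occurrence cut-off can be sketched in plain Python. This is only an illustration of the idea, not H2O's implementation: the helper `interact` and the choice to pool rare combinations under an "other" label are assumptions made for the example, and H2O's handling of levels below `min_occurrence` may differ.

```python
from collections import Counter


def interact(col_a, col_b, min_occurrence, other_label="other"):
    """Combine two categorical columns; pool combined levels seen fewer than min_occurrence times."""
    combined = [f"{a}_{b}" for a, b in zip(col_a, col_b)]
    counts = Counter(combined)
    return [level if counts[level] >= min_occurrence else other_label
            for level in combined]


# Threshold used in the notebook: one fortieth of the 250K training rows
n_rows = 250_000
min_occurrence = int(n_rows / 40)  # 6250 rows, i.e. 2.5% of the training data

# Tiny toy columns with a much smaller threshold so the effect is visible
f03 = ["E", "E", "E", "F", "F", "G"]
e11 = ["A", "A", "A", "A", "A", "B"]
result = interact(f03, e11, min_occurrence=2)
print(min_occurrence, result)
```

A high threshold such as 6250 keeps only the handful of combinations that cover a meaningful share of the data, which is why only about 5 factors survive.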

In [12]:
# Train Frame
interaction_frame_train = h2o_df_train.interaction(['f03', 'e11'], pairwise = False, max_factors = 100,
                                                   min_occurrence = int(len(h2o_df_train)/40))

# Test Frame
interaction_frame_test = h2o_df_test.interaction(['f03', 'e11'], pairwise = False, max_factors = 100, 
                                                 min_occurrence = int(len(h2o_df_train)/40))

# Cbinding interaction frame to train and test
h2o_df_train = h2o_df_train.cbind(interaction_frame_train)
h2o_df_test = h2o_df_test.cbind(interaction_frame_test)

Interactions progress: |██████████████████████████████████████████████████| 100%
Interactions progress: |██████████████████████████████████████████████████| 100%


In [13]:
# Including the interaction frame's variable in the variable lists
vars_to_use = vars_to_use + ['f03_e11']
vars_ind_numeric = vars_ind_numeric + ['f03_e11']

## 3. Implementing XGBoost
* **My understanding of the algorithm**

XGBoost (Extreme Gradient Boosting) is a boosting algorithm. Boosting is an ensemble technique in which several models (here, tree models) are combined to make the final model.

Boosting models are built sequentially, each one minimising the errors of the previous models. Gradient Boosting is an advanced version of boosting in which the algorithm uses gradient descent to minimise the error. The main difference between Bagging and Boosting is that Bagging trains its models independently and then combines them, while Boosting uses the previous models to build a new model with fewer errors.
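The sequential idea can be shown with a very small sketch of gradient boosting for squared-error loss, using depth-1 trees (stumps) on a single feature. It is a teaching example under simplifying assumptions, not XGBoost: there is no regularisation, no second-order information and no sampling, and the names `fit_stump` and `gradient_boost` are made up for this illustration.

```python
def fit_stump(xs, residuals):
    """Find the single split on xs that best fits the residuals (least squares)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the max so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def gradient_boost(xs, ys, n_trees=50, learn_rate=0.1):
    """Fit stumps one after another, each on the residuals of the current ensemble."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps, mse_history = [], []
    for _ in range(n_trees):
        # For squared error, the negative gradient is simply the residual
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learn_rate * stump(x) for x, p in zip(xs, preds)]
        mse_history.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))

    def predict(x):
        return base + learn_rate * sum(s(x) for s in stumps)

    return predict, mse_history


xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [x * x for x in xs]
model, mse_history = gradient_boost(xs, ys)
print(mse_history[0], mse_history[-1])
```

Each new stump corrects a fraction (`learn_rate`) of the remaining error, which is why a small learning rate needs many trees.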

* **One_hot encoding**

Since XGBoost is tree-based, I initially assumed that it does not require one-hot encoding. But after going through the paper [Chen T., Guestrin C., XGBoost: A Scalable Tree Boosting System](https://arxiv.org/pdf/1603.02754.pdf) I realised that one-hot encoding the categorical variables is important for XGBoost to perform better.

After going through the H2O documentation, I realised that H2O, as it does for GLM, automatically performs one-hot encoding for XGBoost, so we do not have to do it manually.

* **Spline**

Since XGBoost is tree-based, we have not used splines.

### 3.1 **Grid Search Vs HyperOpt**

For my own understanding, I tried both Grid Search and HyperOpt.

Grid Search built only 2 models in 2 hours while optimising 5 hyper-parameters with 5-fold cross-validation. Its train AUC was 0.889 and its validation AUC was 0.879. Its AUC on the Kaggle test data was 0.865.

HyperOpt, on the other hand, explored 10 models within a budget of 5 hours (a limit of 30 minutes per model) and returned a train AUC of 0.886, a validation AUC of 0.878 and a Kaggle test AUC of 0.865. Even though the difference in AUC was not significant, I still decided to go with HyperOpt to get my hands on it.

### 3.2 Tuning Hyper-Parameters
**Hyper-parameters to optimise**
* max_depth: Sets the maximum tree depth. Increasing the depth makes the model more complex and may lead to overfitting. The default value is 6. After going through a few tutorials, I kept the values between 4 and 8.
* sample_rate: Tells XGBoost what fraction of the data to sample when growing each tree. The default is 1, which means 100% of the data is used. Higher values may increase the training accuracy, but sampling is suggested for better accuracy on test data [Friedman, 1999](https://statweb.stanford.edu/~jhf/ftp/stobst.pdf). I tried the values 0.4, 0.6, 0.8 and 0.9.
* min_rows: Specifies the minimum number of observations in a leaf. The default value is 1, but in our model a value of 1 may make the algorithm time-consuming without significantly improving accuracy. Therefore, I chose the values 10, 20, 30 and 40.
* ntrees: Tells XGBoost the number of trees to build. I tried 200 trees and observed that more trees were needed for the model to converge. Also, since we use small learn_rate values, we need more trees. Therefore, I set the values to 1000, 2000, 3000 and 4000.
* learn_rate: As per my understanding, learn_rate here is the same idea as the learning rate in gradient boosting. I couldn't get much information from the H2O documentation, so I reviewed some online blogs to see how others have understood and used this value. After researching for a while I selected 0.01, 0.005 and 0.001, because others have used these values together with numbers of trees similar to ours.

**Fixed Hyper-parameters** have been discussed in the code.
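To see why an exhaustive search over these candidate values is impractical, the sketch below counts the combinations and draws a small random sample from them. It uses only the standard library and the value lists given above; the random sample merely mimics the idea of a random discrete search and is not how H2O selects its models.

```python
import itertools
import random

grid = {
    "max_depth": [4, 5, 6, 7, 8],
    "sample_rate": [0.4, 0.6, 0.8, 0.9],
    "min_rows": [10, 20, 30, 40],
    "ntrees": [1000, 2000, 3000, 4000],
    "learn_rate": [0.01, 0.005, 0.001],
}

# Every combination an exhaustive (Cartesian) search would have to train
all_combos = [dict(zip(grid, values)) for values in itertools.product(*grid.values())]
n_cartesian = len(all_combos)  # 5 * 4 * 4 * 4 * 3

# A random search trains only a small sample of them
rng = random.Random(2020)
sampled = rng.sample(all_combos, 10)

print(n_cartesian)
print(sampled[0])
```

With 960 combinations and models that can take tens of minutes each, sampling a handful is the only option within a few hours.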

### Grid Search

In [87]:
max_depth = [4, 5, 6, 7, 8]
sample_rate = [0.4, 0.6, 0.8, 0.9]
min_rows = [10, 20, 30, 40]
ntrees = [1000, 2000, 3000, 4000]
learn_rate = [0.01, 0.005, 0.001]

search_crit = {'strategy': "RandomDiscrete", # The default strategy, "Cartesian", covers the entire space of
               # hyper-parameter combinations but takes too much time, so we use the "RandomDiscrete" strategy
               "max_runtime_secs": 7200, # maximum time allowed for the search to run
               "max_models": 10, # maximum number of models to build
               "seed": 2020,
               "stopping_rounds" : 4, # Stop when the stopping_metric doesn't improve for the specified number of
               # rounds, based on a simple moving average
               "stopping_metric" : "AUC", # The Kaggle score is based on AUC, so AUC is the stopping metric
               "stopping_tolerance" : 1e-3} # Stop if the improvement is less than this value; 1e-3 is the default

hyper_params = {
      "ntrees" : ntrees 
    , "max_depth" : max_depth 
    , "learn_rate" : learn_rate
    , "sample_rate" : sample_rate
    , "min_rows" : min_rows 
}
XGB_grid = H2OGridSearch(H2OXGBoostEstimator(col_sample_rate_per_tree = .9, # Column subsampling rate per tree. It
                                             # is multiplicative with col_sample_rate, so setting both parameters to 0.8,
                                             # for example, results in 64% of columns being considered at any given node
                                             # to split. 0.9 was chosen because it seemed the most common choice among
                                             # the online blogs
                                             score_tree_interval = 100, # Score the model after every 100 trees
                                             nfolds = 5 # Cross-validation takes more time, but I believe it is a better
                                             # approach than holding out a single validation set
                                            ),
                         hyper_params=hyper_params,
                         search_criteria=search_crit,
                         grid_id='g2') # name of the grid

XGB_grid.train(x=vars_to_use,
               y='target',
               training_frame=h2o_df_train) 
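The `stopping_rounds`, `stopping_metric` and `stopping_tolerance` settings describe a moving-average stopping rule. The sketch below shows one simple way such a rule can work for a higher-is-better metric such as AUC. It is a simplified assumption of the idea, not H2O's exact logic, and the function `should_stop` is hypothetical.

```python
def should_stop(scores, stopping_rounds=4, stopping_tolerance=1e-3):
    """Return True when the moving average of a higher-is-better metric stalls.

    Compares the mean of the latest `stopping_rounds` scores with the mean of
    the `stopping_rounds` scores before them.
    """
    k = stopping_rounds
    if len(scores) < 2 * k:
        return False  # not enough history yet
    recent = sum(scores[-k:]) / k
    previous = sum(scores[-2 * k:-k]) / k
    # Relative improvement of the recent window over the previous one
    return (recent - previous) / abs(previous) < stopping_tolerance


improving = [0.80, 0.82, 0.84, 0.85, 0.86, 0.87, 0.875, 0.88]
stalled = [0.8788, 0.8789, 0.8790, 0.8790, 0.8790, 0.8791, 0.8791, 0.8792]

print(should_stop(improving))
print(should_stop(stalled))
```

Averaging over several scoring rounds makes the rule less sensitive to a single noisy score than comparing consecutive values would be.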

In [89]:
# Exploring the models
XGB_grid.get_grid()

    learn_rate max_depth min_rows ntrees sample_rate   model_ids  accuracy
0        0.005         8     10.0   1360         0.8  g2_model_1  0.782824
1        0.001         8     20.0     46         0.6  g2_model_2  0.773228




In [122]:
# Best model
best_XG_Grid 

Model Details
H2OXGBoostEstimator :  XGBoost
Model Key:  g2_model_1


ModelMetricsBinomial: xgboost
** Reported on train data. **

MSE: 0.13601975447423742
RMSE: 0.3688085607388167
LogLoss: 0.4166825112634517
Mean Per-Class Error: 0.2001375184104801
AUC: 0.8892462043527933
pr_auc: 0.8805401159906923
Gini: 0.7784924087055867
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.41581707225976355: 


| | 0.0 | 1.0 | Error | Rate |
|---|---|---|---|---|
| 0 | 90664.0 | 36505.0 | 0.2871 | (36505.0/127169.0) |
| 1 | 15199.0 | 107632.0 | 0.1237 | (15199.0/122831.0) |
| Total | 105863.0 | 144137.0 | 0.2068 | (51704.0/250000.0) |


Maximum Metrics: Maximum metrics at their respective thresholds



| metric | threshold | value | idx |
|---|---|---|---|
| max f1 | 0.4158171 | 0.8063288 | 232.0 |
| max f2 | 0.1889161 | 0.8824624 | 316.0 |
| max f0point5 | 0.6294731 | 0.8092707 | 152.0 |
| max accuracy | 0.5087246 | 0.799644 | 198.0 |
| max precision | 0.9910524 | 1.0 | 0.0 |
| max recall | 0.0077728 | 1.0 | 395.0 |
| max specificity | 0.9910524 | 1.0 | 0.0 |
| max absolute_mcc | 0.4889501 | 0.5998570 | 205.0 |
| max min_per_class_accuracy | 0.5087246 | 0.7994155 | 198.0 |


Gains/Lift Table: Avg response rate: 49.13 %, avg score: 49.13 %



| group | cumulative_data_fraction | lower_threshold | lift | cumulative_lift | response_rate | score | cumulative_response_rate | cumulative_score | capture_rate | cumulative_capture_rate | gain | cumulative_gain |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.01 | 0.9805386 | 2.0336886 | 2.0336886 | 0.9992 | 0.9861304 | 0.9992 | 0.9861304 | 0.0203369 | 0.0203369 | 103.3688564 | 103.3688564 |
| 2 | 0.02 | 0.9694701 | 2.0190343 | 2.0263614 | 0.992 | 0.9748566 | 0.9956 | 0.9804935 | 0.0201903 | 0.0405272 | 101.9034283 | 102.6361423 |
| 3 | 0.03 | 0.9614959 | 2.0198484 | 2.0241904 | 0.9924 | 0.9652416 | 0.9945333 | 0.9754095 | 0.0201985 | 0.0607257 | 101.9848410 | 102.4190419 |
| 4 | 0.04 | 0.9552307 | 2.0125213 | 2.0212731 | 0.9888 | 0.9583069 | 0.9931 | 0.9711339 | 0.0201252 | 0.0808509 | 101.2521269 | 102.1273131 |
| 5 | 0.05 | 0.9494460 | 1.9994952 | 2.0169176 | 0.9824 | 0.9523305 | 0.99096 | 0.9673732 | 0.0199950 | 0.1008459 | 99.9495241 | 101.6917553 |
| 6 | 0.1 | 0.9206736 | 1.9682328 | 1.9925752 | 0.96704 | 0.9352837 | 0.979 | 0.9513284 | 0.0984116 | 0.1992575 | 96.8232775 | 99.2575164 |
| 7 | 0.15 | 0.8828622 | 1.8884484 | 1.9578662 | 0.92784 | 0.9025077 | 0.9619467 | 0.9350548 | 0.0944224 | 0.2936799 | 88.8448356 | 95.7866228 |
| 8 | 0.2 | 0.8402481 | 1.8085011 | 1.9205249 | 0.88856 | 0.8617819 | 0.9436 | 0.9167366 | 0.0904251 | 0.3841050 | 80.8501111 | 92.0524949 |
| 9 | 0.3 | 0.7369484 | 1.6486066 | 1.8298855 | 0.81 | 0.7902756 | 0.8990667 | 0.8745829 | 0.1648607 | 0.5489657 | 64.8606622 | 82.9885507 |




ModelMetricsBinomial: xgboost
** Reported on cross-validation data. **

MSE: 0.1419487460194194
RMSE: 0.3767608605195335
LogLoss: 0.433612396356896
Mean Per-Class Error: 0.20900672194385894
AUC: 0.8787883643498352
pr_auc: 0.8688422794421169
Gini: 0.7575767286996704
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.4055378807417919: 


| | 0.0 | 1.0 | Error | Rate |
|---|---|---|---|---|
| 0 | 88100.0 | 39069.0 | 0.3072 | (39069.0/127169.0) |
| 1 | 15225.0 | 107606.0 | 0.124 | (15225.0/122831.0) |
| Total | 103325.0 | 146675.0 | 0.2172 | (54294.0/250000.0) |


Maximum Metrics: Maximum metrics at their respective thresholds



| metric | threshold | value | idx |
|---|---|---|---|
| max f1 | 0.4055379 | 0.7985425 | 240.0 |
| max f2 | 0.1918502 | 0.8776325 | 320.0 |
| max f0point5 | 0.6204113 | 0.7986662 | 156.0 |
| max accuracy | 0.4913133 | 0.79068 | 206.0 |
| max precision | 0.9923445 | 1.0 | 0.0 |
| max recall | 0.0035599 | 1.0 | 399.0 |
| max specificity | 0.9923445 | 1.0 | 0.0 |
| max absolute_mcc | 0.4826967 | 0.5823093 | 209.0 |
| max min_per_class_accuracy | 0.5060866 | 0.7896343 | 200.0 |


Gains/Lift Table: Avg response rate: 49.13 %, avg score: 49.13 %



| group | cumulative_data_fraction | lower_threshold | lift | cumulative_lift | response_rate | score | cumulative_response_rate | cumulative_score | capture_rate | cumulative_capture_rate | gain | cumulative_gain |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.01 | 0.9796461 | 2.0214767 | 2.0214767 | 0.9932 | 0.9855846 | 0.9932 | 0.9855846 | 0.0202148 | 0.0202148 | 102.1476663 | 102.1476663 |
| 2 | 0.02 | 0.9688479 | 2.0108930 | 2.0161848 | 0.988 | 0.9741055 | 0.9906 | 0.9798450 | 0.0201089 | 0.0403237 | 101.0893016 | 101.6184839 |
| 3 | 0.03 | 0.9609281 | 2.0027517 | 2.0117071 | 0.984 | 0.9646940 | 0.9884 | 0.9747947 | 0.0200275 | 0.0603512 | 100.2751748 | 101.1707142 |
| 4 | 0.04 | 0.9544154 | 1.9889116 | 2.0060083 | 0.9772 | 0.9576528 | 0.9856 | 0.9705092 | 0.0198891 | 0.0802403 | 98.8911594 | 100.6008255 |
| 5 | 0.05 | 0.9483850 | 1.9872833 | 2.0022633 | 0.9764 | 0.9514450 | 0.98376 | 0.9666964 | 0.0198728 | 0.1001132 | 98.7283341 | 100.2263272 |
| 6 | 0.1 | 0.9190712 | 1.9372960 | 1.9697796 | 0.95184 | 0.9340544 | 0.9678 | 0.9503754 | 0.0968648 | 0.1969780 | 93.7295959 | 96.9779616 |
| 7 | 0.15 | 0.8815117 | 1.8614193 | 1.9336595 | 0.91456 | 0.9009667 | 0.9500533 | 0.9339058 | 0.0930710 | 0.2900489 | 86.1419349 | 93.3659527 |
| 8 | 0.2 | 0.8380019 | 1.7754476 | 1.8941065 | 0.87232 | 0.8601586 | 0.93062 | 0.9154690 | 0.0887724 | 0.3788213 | 77.5447566 | 89.4106537 |
| 9 | 0.3 | 0.7360088 | 1.6274393 | 1.8052175 | 0.7996 | 0.7885593 | 0.8869467 | 0.8731658 | 0.1627439 | 0.5415652 | 62.7439327 | 80.5217467 |



Cross-Validation Metrics Summary: 


| | mean | sd | cv_1_valid | cv_2_valid | cv_3_valid | cv_4_valid | cv_5_valid |
|---|---|---|---|---|---|---|---|
| accuracy | 0.7834026 | 0.0027226 | 0.7775557 | 0.7822458 | 0.7848795 | 0.7829144 | 0.7894177 |
| auc | 0.8788035 | 0.0016760 | 0.8773696 | 0.8765165 | 0.8774911 | 0.8795294 | 0.8831111 |
| err | 0.2165974 | 0.0027226 | 0.2224443 | 0.2177542 | 0.2151206 | 0.2170856 | 0.2105823 |
| err_count | 10830.0 | 142.37346 | 11130.0 | 10879.0 | 10804.0 | 10838.0 | 10499.0 |
| f0point5 | 0.7590598 | 0.0038358 | 0.7505828 | 0.7582756 | 0.7635419 | 0.7568046 | 0.7660941 |
| f1 | 0.7989224 | 0.0015456 | 0.7982818 | 0.7951263 | 0.7993165 | 0.8015527 | 0.8003347 |
| f2 | 0.8433018 | 0.0051752 | 0.8524548 | 0.8357416 | 0.8386081 | 0.8519251 | 0.8377793 |
| lift_top_group | 2.0216014 | 0.0111948 | 2.0042074 | 2.0352557 | 2.0077214 | 2.0159836 | 2.0448399 |
| logloss | 0.4336052 | 0.0029183 | 0.4356759 | 0.4371350 | 0.4367163 | 0.4324740 | 0.4260249 |


Scoring History: 


| timestamp | duration | number_of_trees | training_rmse | training_logloss | training_auc | training_pr_auc | training_lift | training_classification_error |
|---|---|---|---|---|---|---|---|---|
| 2020-07-11 20:16:51 | 1:38:02.233 | 0.0 | 0.5 | 0.6931472 | 0.5 | 0.0 | 1.0 | 0.508676 |
| 2020-07-11 20:18:05 | 1:39:15.936 | 100.0 | 0.4349465 | 0.5680203 | 0.8718468 | 0.8619663 | 2.0117071 | 0.225152 |
| 2020-07-11 20:19:15 | 1:40:26.730 | 200.0 | 0.4040003 | 0.5063546 | 0.8743722 | 0.8654553 | 2.0198484 | 0.2203 |
| 2020-07-11 20:20:28 | 1:41:39.228 | 300.0 | 0.3896965 | 0.4736329 | 0.8763788 | 0.8677672 | 2.0214767 | 0.21852 |
| 2020-07-11 20:21:42 | 1:42:53.173 | 400.0 | 0.3829362 | 0.4555570 | 0.8780196 | 0.8699982 | 2.0247332 | 0.218492 |
| 2020-07-11 20:22:57 | 1:44:08.310 | 500.0 | 0.3794422 | 0.4448940 | 0.8792473 | 0.8705343 | 2.0271756 | 0.214904 |
| 2020-07-11 20:24:14 | 1:45:25.659 | 600.0 | 0.3774516 | 0.4383575 | 0.8802815 | 0.8716622 | 2.0296179 | 0.214676 |
| 2020-07-11 20:25:32 | 1:46:43.684 | 700.0 | 0.3760536 | 0.4338692 | 0.8813051 | 0.8713452 | 2.0312462 | 0.214808 |
| 2020-07-11 20:26:52 | 1:48:02.985 | 800.0 | 0.3748880 | 0.4304762 | 0.8823938 | 0.8727218 | 2.0328744 | 0.21252 |


Variable Importances: 


| variable | relative_importance | scaled_importance | percentage |
|---|---|---|---|
| f10 | 7722454.0 | 1.0 | 0.6846953 |
| e19 | 871827.1875000 | 0.1128951 | 0.0772987 |
| e18 | 240750.7656250 | 0.0311754 | 0.0213457 |
| f02 | 227124.7812500 | 0.0294110 | 0.0201375 |
| f27.F | 186000.6406250 | 0.0240857 | 0.0164914 |
| ... | ... | ... | ... |
| f09.K | 8.9732685 | 0.0000012 | 0.0000008 |
| c05.C | 7.6248779 | 0.0000010 | 0.0000007 |
| b03.N | 6.3360138 | 0.0000008 | 0.0000006 |



See the whole table with table.as_data_frame()




In [108]:
# Applying the best model to the test data
grid_XGBoost = XGB_grid.get_grid(sort_by='accuracy', decreasing=True)
best_XG_Grid = grid_XGBoost.models[0]


preds = best_XG_Grid.predict(h2o_df_test)
df_test['Predicted'] = np.round(preds[2].as_data_frame(), 5)
df_preds_dt = df_test[['unique_id', 'Predicted']].copy()
df_test[['unique_id', 'Predicted']].to_csv(dirPOutput + '250k_XGBoost_Grid.csv', index=False)

# Kaggle score (AUC)
# AUC = 0.86532
# From 0.8381



### HyperOpt

In [15]:
# Splitting the train data into train and validation sets rather than using n-folds, for speed. Even though I would prefer
# n-folds, I wanted to try this experiment since the majority of people have done it this way
train, valid = h2o_df_train.split_frame(ratios=[.7], seed = 2020)

In [16]:
# The H2O documentation says that the dependent variable has to be a factor
train[var_dep] = train[var_dep].asfactor()
valid[var_dep]   = valid[var_dep].asfactor()

In [27]:
# Objective
def hyperopt_objective(params):
    XGB_hpropt = H2OXGBoostEstimator(col_sample_rate_per_tree = .9, # Column subsampling rate per tree. It
                                    # is multiplicative with col_sample_rate, so setting both parameters to 0.8, for example,
                                    # results in 64% of columns being considered at any given node to split.
                                    # 0.9 was chosen because it seemed the most common choice among the online blogs
                                   score_tree_interval = 100, # Score the model after every 100 trees
                                   seed = 2020, max_runtime_secs= 1800, # Each XGBoost model runs for at most 30 minutes
                                   stopping_rounds = 4,  # Stop training when stopping_metric doesn't improve for
                                    # the specified number of scoring rounds, based on a simple moving average
                                   stopping_metric = "AUC", # The Kaggle score is based on AUC, so AUC is the metric
                                   stopping_tolerance = 1e-3, # Stop training if the improvement is less than this value;
                                     # 1e-3 is the default
                                   ntrees = int(params['ntrees']), # Discussed under "Hyper-parameters to optimise"
                                   max_depth = int(params['max_depth']), # Discussed under "Hyper-parameters to optimise"
                                   learn_rate = params['learn_rate'], # Discussed under "Hyper-parameters to optimise"
                                   sample_rate = params['sample_rate'], # Discussed under "Hyper-parameters to optimise"
                                   min_rows = params['min_rows'] # Discussed under "Hyper-parameters to optimise"
                                  )
    
    
    # training the model
    XGB_hpropt.train(x=vars_to_use,
               y='target',
               training_frame=train)
    

    # validating the model on validation frame
    preds = XGB_hpropt.predict(valid)
    preds = preds[2].as_data_frame()['p1'].values
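    # .as_matrix() was removed from newer pandas versions; .values.ravel() gives the same 1-D array of labels
    score = roc_auc_score(valid[var_dep].as_data_frame().values.ravel(), preds)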
    
    return -score 
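The objective returns the negative AUC because hyperopt's `fmin` minimises its objective. As a reminder of what is being optimised, the sketch below computes AUC directly from its pairwise definition: the fraction of (positive, negative) pairs in which the positive row gets the higher score. The function `auc_from_scores` is a hypothetical helper for illustration; the notebook itself uses `roc_auc_score` from scikit-learn.

```python
def auc_from_scores(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(y_true, y_score) if y == 1]
    negatives = [s for y, s in zip(y_true, y_score) if y == 0]
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
auc = auc_from_scores(y_true, y_score)

# A minimiser needs a loss, so a higher AUC must map to a lower value
loss = -auc
print(auc, loss)
```

The pairwise loop is quadratic in the number of rows, so it is only suitable for small examples like this one.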

In [28]:
# Search Space
search_space = {
    'ntrees': hyperopt.hp.quniform('ntrees', 1200, 4000, 1)
    , 'max_depth': hyperopt.hp.quniform('max_depth', 4,8, 1)
    , 'learn_rate': hyperopt.hp.uniform('learn_rate', 0.001, 0.01)
    , 'sample_rate': hyperopt.hp.uniform('sample_rate', 0.4, 0.9)
    , 'min_rows': hyperopt.hp.quniform('min_rows', 10 , 40, 1)
}


In [29]:
# Optimisation
trials = hyperopt.Trials()

h2o.no_progress()
best = hyperopt.fmin(hyperopt_objective,
                     space=search_space,
                     algo=hyperopt.tpe.suggest,
                     max_evals=10, # Each model takes at most 30 minutes and there are 10 models, so at most 5 hours
                     trials=trials,
                     rstate=RandomState(2020)
)

print(best)

  0%|          | 0/10 [00:00<?, ?it/s, best loss: ?]
 10%|█         | 1/10 [30:57<4:38:35, 1857.28s/it, best loss: -0.8810321924378307]
 20%|██        | 2/10 [58:04<3:58:25, 1788.15s/it, best loss: -0.8811926515395831]
 30%|███       | 3/10 [1:04:12<2:38:54, 1362.09s/it, best loss: -0.8811926515395831]
 40%|████      | 4/10 [1:33:56<2:28:53, 1488.83s/it, best loss: -0.8820504208614022]
 50%|█████     | 5/10 [2:00:04<2:06:02, 1512.51s/it, best loss: -0.8820504208614022]
 60%|██████    | 6/10 [2:31:02<1:47:45, 1616.27s/it, best loss: -0.8820643226233149]
 70%|███████   | 7/10 [2:45:01<1:09:09, 1383.06s/it, best loss: -0.8820643226233149]
 80%|████████  | 8/10 [2:55:17<38:25, 1152.86s/it, best loss: -0.8820643226233149]
 90%|█████████ | 9/10 [3:26:04<22:41, 1361.22s/it, best loss: -0.8820643226233149]
100%|██████████| 10/10 [3:40:01<00:00, 1203.76s/it, best loss: -0.8820643226233149]
{'learn_rate': 0.00632500998161564, 'max_depth': 8.0, 'min_rows': 37.0, 'ntrees': 3785.0, 'sample_rate': 0.7140605409926624}
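Note that `hp.quniform` returns floats, so the integer-valued parameters come back as `8.0`, `37.0` and `3785.0`. A small sketch of casting them before building the final model is shown below; the dictionary is copied from the output above, while the `INTEGER_PARAMS` set and the `best_params` name are hypothetical helpers introduced for this example.

```python
# Result printed by hyperopt.fmin above
best = {'learn_rate': 0.00632500998161564, 'max_depth': 8.0, 'min_rows': 37.0,
        'ntrees': 3785.0, 'sample_rate': 0.7140605409926624}

# Parameters that should be whole numbers (assumption for this sketch)
INTEGER_PARAMS = {'max_depth', 'min_rows', 'ntrees'}

best_params = {name: int(value) if name in INTEGER_PARAMS else value
               for name, value in best.items()}
print(best_params)
```

Keeping the result in a dictionary like this avoids retyping the values by hand when the optimisation is re-run.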


In [25]:
# The best model returned the value of 
#     {'learn_rate': 0.00632500998161564,
#      'max_depth': 8.0,
#      'min_rows': 37.0,
#      'ntrees': 3785.0,
#      'sample_rate': 0.7140605409926624}

# I did not re-run the optimisation in this session but ran this cell by mistake, hence the error below
best

NameError: name 'best' is not defined

In [19]:
# Implementing the best model returned by Hyperopt
XGB_hpropt_bst = H2OXGBoostEstimator(col_sample_rate_per_tree = .9,
                               score_tree_interval = 100,
                               seed = 2020, max_runtime_secs= 3600,
                               stopping_rounds = 4, 
                               stopping_metric = "AUC", 
                               stopping_tolerance = 1e-3,
                               ntrees = 3785,
                               max_depth = 8,
                               learn_rate = 0.00632500998161564,
                               sample_rate = 0.7140605409926624,
                               min_rows = 37, nfolds = 5
                               # ,max_runtime_secs = 18000, 
                              )


XGB_hpropt_bst.train(x=vars_to_use,
           y='target',
           training_frame=train)

xgboost Model Build progress: |███████████████████████████████████████████| 100%


In [21]:
preds = XGB_hpropt_bst.predict(h2o_df_test)
df_test['Predicted'] = np.round(preds[2].as_data_frame(), 5)
df_preds_dt = df_test[['unique_id', 'Predicted']].copy()
df_test[['unique_id', 'Predicted']].to_csv(dirPOutput + '250k_XGBoost_hyperoptt_test_2ndver.csv', index=False)



xgboost prediction progress: |████████████████████████████████████████████| 100%




In [22]:
XGB_hpropt_bst

Model Details
H2OXGBoostEstimator :  XGBoost
Model Key:  XGBoost_model_python_1594806448431_1


ModelMetricsBinomial: xgboost
** Reported on train data. **

MSE: 0.13771069972813313
RMSE: 0.37109392305470745
LogLoss: 0.42207939087711466
Mean Per-Class Error: 0.20218874513834284
AUC: 0.8863087173166594
pr_auc: 0.8771276603825596
Gini: 0.7726174346333188
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.4171403104482695: 


| | 0.0 | 1.0 | Error | Rate |
|---|---|---|---|---|
| 0 | 63288.0 | 25722.0 | 0.289 | (25722.0/89010.0) |
| 1 | 10819.0 | 75329.0 | 0.1256 | (10819.0/86148.0) |
| Total | 74107.0 | 101051.0 | 0.2086 | (36541.0/175158.0) |


Maximum Metrics: Maximum metrics at their respective thresholds



| metric | threshold | value | idx |
|---|---|---|---|
| max f1 | 0.4171403 | 0.8048013 | 231.0 |
| max f2 | 0.1880436 | 0.8816929 | 318.0 |
| max f0point5 | 0.6272633 | 0.8067679 | 150.0 |
| max accuracy | 0.4848333 | 0.7974457 | 204.0 |
| max precision | 0.9876546 | 1.0 | 0.0 |
| max recall | 0.0062810 | 1.0 | 398.0 |
| max specificity | 0.9876546 | 1.0 | 0.0 |
| max absolute_mcc | 0.4830881 | 0.5959146 | 205.0 |
| max min_per_class_accuracy | 0.5073536 | 0.7962813 | 195.0 |


Gains/Lift Table: Avg response rate: 49.18 %, avg score: 49.19 %



| group | cumulative_data_fraction | lower_threshold | lift | cumulative_lift | response_rate | score | cumulative_response_rate | cumulative_score | capture_rate | cumulative_capture_rate | gain | cumulative_gain |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.0100024 | 0.9777053 | 2.0216167 | 2.0216167 | 0.9942922 | 0.9827959 | 0.9942922 | 0.9827959 | 0.0202210 | 0.0202210 | 102.1616749 | 102.1616749 |
| 2 | 0.0200048 | 0.9677721 | 2.0158142 | 2.0187155 | 0.9914384 | 0.9724900 | 0.9928653 | 0.9776429 | 0.0201630 | 0.0403840 | 101.5814175 | 101.8715462 |
| 3 | 0.0300015 | 0.9601332 | 2.0134819 | 2.0169716 | 0.9902913 | 0.9638160 | 0.9920076 | 0.9730357 | 0.0201282 | 0.0605121 | 101.3481879 | 101.6971598 |
| 4 | 0.0400039 | 0.9536010 | 2.0111721 | 2.0155215 | 0.9891553 | 0.9568361 | 0.9912944 | 0.9689852 | 0.0201165 | 0.0806287 | 101.1172116 | 101.5521521 |
| 5 | 0.0500006 | 0.9473292 | 1.9972254 | 2.0118635 | 0.9822958 | 0.9504915 | 0.9894953 | 0.9652877 | 0.0199656 | 0.1005943 | 99.7225393 | 101.1863549 |
| 6 | 0.1000011 | 0.9174969 | 1.9535924 | 1.9827280 | 0.9608358 | 0.9327099 | 0.9751656 | 0.9489988 | 0.0976807 | 0.1982751 | 95.3592403 | 98.2727976 |
| 7 | 0.1500017 | 0.8797814 | 1.8795347 | 1.9483302 | 0.9244120 | 0.8996772 | 0.9582477 | 0.9325583 | 0.0939778 | 0.2922529 | 87.9534652 | 94.8330201 |
| 8 | 0.2000023 | 0.8372408 | 1.7892260 | 1.9085541 | 0.8799954 | 0.8589350 | 0.9386846 | 0.9141524 | 0.0894623 | 0.3817152 | 78.9225983 | 90.8554147 |
| 9 | 0.3000034 | 0.7372544 | 1.6485395 | 1.8218826 | 0.8108016 | 0.7884930 | 0.8960569 | 0.8722660 | 0.1648558 | 0.5465710 | 64.8539472 | 82.1882588 |




ModelMetricsBinomial: xgboost
** Reported on cross-validation data. **

MSE: 0.14274577098723212
RMSE: 0.37781711314766053
LogLoss: 0.4360396586479798
Mean Per-Class Error: 0.20995014317543115
AUC: 0.8774186628877368
pr_auc: 0.8667499376145853
Gini: 0.7548373257754737
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.4241522872324221: 


| | 0.0 | 1.0 | Error | Rate |
|---|---|---|---|---|
| 0 | 63112.0 | 25898.0 | 0.291 | (25898.0/89010.0) |
| 1 | 11793.0 | 74355.0 | 0.1369 | (11793.0/86148.0) |
| Total | 74905.0 | 100253.0 | 0.2152 | (37691.0/175158.0) |


Maximum Metrics: Maximum metrics at their respective thresholds



| metric | threshold | value | idx |
|---|---|---|---|
| max f1 | 0.4241523 | 0.7977961 | 230.0 |
| max f2 | 0.1978648 | 0.8775257 | 316.0 |
| max f0point5 | 0.6137050 | 0.7978300 | 154.0 |
| max accuracy | 0.4840694 | 0.7896756 | 206.0 |
| max precision | 0.9885210 | 0.9951691 | 0.0 |
| max recall | 0.0062424 | 1.0 | 398.0 |
| max specificity | 0.9885210 | 0.9999888 | 0.0 |
| max absolute_mcc | 0.4530389 | 0.5805592 | 219.0 |
| max min_per_class_accuracy | 0.5063477 | 0.7884507 | 197.0 |


Gains/Lift Table: Avg response rate: 49.18 %, avg score: 49.19 %



| group | cumulative_data_fraction | lower_threshold | lift | cumulative_lift | response_rate | score | cumulative_response_rate | cumulative_score | capture_rate | cumulative_capture_rate | gain | cumulative_gain |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.0100024 | 0.9762732 | 2.0123326 | 2.0123326 | 0.9897260 | 0.9816759 | 0.9897260 | 0.9816759 | 0.0201282 | 0.0201282 | 101.2332631 | 101.2332631 |
| 2 | 0.0200048 | 0.9664814 | 2.0100116 | 2.0111721 | 0.9885845 | 0.9712869 | 0.9891553 | 0.9764814 | 0.0201049 | 0.0402331 | 101.0011602 | 101.1172116 |
| 3 | 0.0300015 | 0.9590667 | 1.9925807 | 2.0049773 | 0.9800114 | 0.9625933 | 0.9861085 | 0.9718538 | 0.0199192 | 0.0601523 | 99.2580683 | 100.4977330 |
| 4 | 0.0400039 | 0.9524291 | 1.9868013 | 2.0004327 | 0.9771689 | 0.9556459 | 0.9838733 | 0.9678012 | 0.0198728 | 0.0800251 | 98.6801306 | 100.0432676 |
| 5 | 0.0500006 | 0.9464165 | 1.9763242 | 1.9956126 | 0.9720160 | 0.9493658 | 0.9815026 | 0.9641154 | 0.0197567 | 0.0997818 | 97.6324197 | 99.5612632 |
| 6 | 0.1000011 | 0.9165082 | 1.9322341 | 1.9639233 | 0.9503311 | 0.9315949 | 0.9659169 | 0.9478552 | 0.0966128 | 0.1963946 | 93.2234055 | 96.3923343 |
| 7 | 0.1500017 | 0.8788684 | 1.8502830 | 1.9260432 | 0.9100251 | 0.8986896 | 0.9472863 | 0.9314666 | 0.0925152 | 0.2889098 | 85.0283001 | 92.6043229 |
| 8 | 0.2000023 | 0.8347360 | 1.7676355 | 1.8864413 | 0.8693766 | 0.8570068 | 0.9278089 | 0.9128517 | 0.0883828 | 0.3772926 | 76.7635479 | 88.6441292 |
| 9 | 0.3000034 | 0.7346722 | 1.6254400 | 1.7994408 | 0.7994405 | 0.7861476 | 0.8850194 | 0.8706170 | 0.1625459 | 0.5398384 | 62.5439954 | 79.9440846 |



Cross-Validation Metrics Summary: 


| | mean | sd | cv_1_valid | cv_2_valid | cv_3_valid | cv_4_valid | cv_5_valid |
|---|---|---|---|---|---|---|---|
| accuracy | 0.7830661 | 0.0032539 | 0.7864478 | 0.7839656 | 0.7743307 | 0.7833688 | 0.7872176 |
| auc | 0.8774689 | 0.0009027 | 0.8780848 | 0.8772098 | 0.8752478 | 0.8776830 | 0.8791192 |
| err | 0.2169339 | 0.0032539 | 0.2135522 | 0.2160344 | 0.2256692 | 0.2166312 | 0.2127824 |
| err_count | 7600.0 | 131.76115 | 7507.0 | 7580.0 | 7958.0 | 7534.0 | 7421.0 |
| f0point5 | 0.7595552 | 0.0050253 | 0.7630531 | 0.7621844 | 0.7456005 | 0.7614669 | 0.7654711 |
| f1 | 0.7983068 | 0.0010096 | 0.7987561 | 0.7994709 | 0.7956238 | 0.7994997 | 0.7981833 |
| f2 | 0.8413495 | 0.0044821 | 0.8379643 | 0.8405933 | 0.8528421 | 0.8415315 | 0.8338163 |
| lift_top_group | 2.0100374 | 0.0042276 | 2.0150995 | 2.0039814 | 2.016216 | 2.0017529 | 2.013138 |
| logloss | 0.4360315 | 0.0018154 | 0.4356689 | 0.4365759 | 0.4397690 | 0.4364041 | 0.4317396 |


Scoring History: 


| timestamp | duration | number_of_trees | training_rmse | training_logloss | training_auc | training_pr_auc | training_lift | training_classification_error |
|---|---|---|---|---|---|---|---|---|
| 2020-07-15 10:45:12 | 49 min 10.086 sec | 0.0 | 0.5 | 0.6931472 | 0.5 | 0.0 | 1.0 | 0.5081698 |
| 2020-07-15 10:46:06 | 50 min 4.280 sec | 100.0 | 0.4270245 | 0.5526871 | 0.8714396 | 0.8616882 | 2.0100116 | 0.2247685 |
| 2020-07-15 10:46:59 | 50 min 57.128 sec | 200.0 | 0.3972490 | 0.4913000 | 0.8745307 | 0.8655151 | 2.0158142 | 0.2200756 |
| 2020-07-15 10:47:54 | 51 min 52.382 sec | 300.0 | 0.3856947 | 0.4627185 | 0.8764400 | 0.8669533 | 2.0192957 | 0.2164160 |
| 2020-07-15 10:48:47 | 52 min 45.473 sec | 400.0 | 0.3809107 | 0.4485616 | 0.8777452 | 0.8679981 | 2.0192957 | 0.2169641 |
| 2020-07-15 10:49:41 | 53 min 38.838 sec | 500.0 | 0.3784605 | 0.4406499 | 0.8789008 | 0.8697821 | 2.0204562 | 0.2157252 |
| 2020-07-15 10:50:35 | 54 min 33.051 sec | 600.0 | 0.3768100 | 0.4355581 | 0.8801161 | 0.8709247 | 2.0204562 | 0.2140068 |
| 2020-07-15 10:51:30 | 55 min 27.952 sec | 700.0 | 0.3755155 | 0.4319546 | 0.8813186 | 0.8710628 | 2.0204562 | 0.2131504 |
| 2020-07-15 10:52:26 | 56 min 23.782 sec | 800.0 | 0.3743740 | 0.4291463 | 0.8825444 | 0.8727993 | 2.0192957 | 0.2125053 |


Variable Importances: 


| variable | relative_importance | scaled_importance | percentage |
|---|---|---|---|
| f10 | 4053835.7500000 | 1.0 | 0.6988023 |
| e19 | 446379.3437500 | 0.1101128 | 0.0769471 |
| e18 | 121911.7187500 | 0.0300732 | 0.0210152 |
| f02 | 121454.7187500 | 0.0299604 | 0.0209364 |
| f27.F | 93993.9062500 | 0.0231864 | 0.0162027 |
| --- | --- | --- | --- |
| e21.P | 10.0393209 | 0.0000025 | 0.0000017 |
| e21.J | 8.7862110 | 0.0000022 | 0.0000015 |
| c05.A | 8.7393188 | 0.0000022 | 0.0000015 |



See the whole table with table.as_data_frame()
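
Several of the training metrics reported above can be reproduced by hand, which is a useful sanity check when reading H2O output. The sketch below is plain Python with values copied from the training-data confusion matrix, the AUC line, and the first row of the gains/lift table; the helper names `gini_from_auc` and `f1_from_counts` are illustrative and are not part of H2O.

```python
# Values copied from the training-data output above.
tn, fp = 63288, 25722   # actual 0: predicted 0 / predicted 1
fn, tp = 10819, 75329   # actual 1: predicted 0 / predicted 1
auc = 0.8863087173166594


def gini_from_auc(auc):
    """Gini coefficient of a classifier, defined as 2 * AUC - 1."""
    return 2.0 * auc - 1.0


def f1_from_counts(tp, fp, fn):
    """F1 score: the harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2.0 * precision * recall / (precision + recall)


total = tn + fp + fn + tp
error_rate = (fp + fn) / total   # overall misclassification rate
base_rate = (fn + tp) / total    # average response rate (share of 1s)

# First gains/lift group: response rate among the top ~1% of scores.
top_group_response_rate = 0.9942922
lift = top_group_response_rate / base_rate   # how much better than random
gain = (lift - 1.0) * 100.0                  # the same figure as a percentage

print(f"Gini       : {gini_from_auc(auc):.7f}")
print(f"F1         : {f1_from_counts(tp, fp, fn):.7f}")
print(f"Error rate : {error_rate:.4f}")
print(f"Base rate  : {base_rate:.4f}")
print(f"Lift, gain : {lift:.4f}, {gain:.2f}")
```

The printed values should agree, to the displayed precision, with the reported Gini (0.7726), max F1 (0.8048), total error (0.2086), average response rate (49.18 %) and the lift and gain of the first group (2.0216 and 102.16).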




##### Old model (not so important)
I ran 40 models for 2 minutes each and did not expect better results, since each model had a time budget of only 2 minutes. To my surprise, this model achieved a better Kaggle score of 0.867, compared with 0.865 for the hyperopt model above.

In [86]:
# Optimisation - OLD
trials = hyperopt.Trials()

h2o.no_progress()
best = hyperopt.fmin(hyperopt_objective,
                     space=search_space,
                     algo=hyperopt.tpe.suggest,
                     max_evals=40,
                     trials=trials,
                     rstate=RandomState(2020)
)

print(best)

  0%|          | 0/40 [00:00<?, ?it/s, best loss: ?]
  2%|▎         | 1/40 [02:04<1:20:39, 124.10s/it, best loss: -0.8674585083914759]
  5%|▌         | 2/40 [04:08<1:18:42, 124.28s/it, best loss: -0.870175121415311]
  8%|▊         | 3/40 [06:14<1:16:54, 124.73s/it, best loss: -0.870175121415311]
 10%|█         | 4/40 [08:18<1:14:44, 124.56s/it, best loss: -0.871667523546091]
 12%|█▎        | 5/40 [10:24<1:12:52, 124.93s/it, best loss: -0.871667523546091]
 15%|█▌        | 6/40 [12:29<1:10:45, 124.87s/it, best loss: -0.871667523546091]
 18%|█▊        | 7/40 [14:33<1:08:31, 124.58s/it, best loss: -0.871667523546091]
 20%|██        | 8/40 [16:38<1:06:37, 124.93s/it, best loss: -0.871667523546091]
 22%|██▎       | 9/40 [18:43<1:04:25, 124.69s/it, best loss: -0.871667523546091]
 25%|██▌       | 10/40 [20:47<1:02:16, 124.55s/it, best loss: -0.871667523546091]
 28%|██▊       | 11/40 [22:51<1:00:13, 124.60s/it, best loss: -0.871667523546091]
 30%|███       | 12/40 [24:57<58:12, 124.72s/it, best loss: -0.871667523546091]
 32%|███▎      | 13/40 [27:01<56:02, 124.55s/it, best loss: -0.871667523546091]
 35%|███▌      | 14/40 [29:05<53:54, 124.42s/it, best loss: -0.871667523546091]
 38%|███▊      | 15/40 [31:09<51:46, 124.27s/it, best loss: -0.871667523546091]
 40%|████      | 16/40 [33:13<49:39, 124.15s/it, best loss: -0.871667523546091]
 42%|████▎     | 17/40 [35:17<47:37, 124.24s/it, best loss: -0.871667523546091]
 45%|████▌     | 18/40 [37:21<45:31, 124.15s/it, best loss: -0.871667523546091]
 48%|████▊     | 19/40 [39:25<43:26, 124.10s/it, best loss: -0.871667523546091]
 50%|█████     | 20/40 [41:29<41:22, 124.13s/it, best loss: -0.871667523546091]
 52%|█████▎    | 21/40 [43:34<39:22, 124.32s/it, best loss: -0.871667523546091]
 55%|█████▌    | 22/40 [45:39<37:20, 124.46s/it, best loss: -0.871667523546091]
 57%|█████▊    | 23/40 [47:43<35:17, 124.56s/it, best loss: -0.8718621667871277]
 60%|██████    | 24/40 [49:48<33:10, 124.44s/it, best loss: -0.8718621667871277]
 62%|██████▎   | 25/40 [51:52<31:08, 124.54s/it, best loss: -0.8718621667871277]
 65%|██████▌   | 26/40 [53:57<29:01, 124.42s/it, best loss: -0.8718621667871277]
 68%|██████▊   | 27/40 [56:01<26:59, 124.56s/it, best loss: -0.8718621667871277]
 70%|███████   | 28/40 [58:06<24:55, 124.61s/it, best loss: -0.8718621667871277]
 72%|███████▎  | 29/40 [1:00:11<22:51, 124.66s/it, best loss: -0.8718621667871277]
 75%|███████▌  | 30/40 [1:02:15<20:45, 124.53s/it, best loss: -0.8718621667871277]
 78%|███████▊  | 31/40 [1:04:20<18:41, 124.61s/it, best loss: -0.8718621667871277]
 80%|████████  | 32/40 [1:06:25<16:36, 124.59s/it, best loss: -0.8718621667871277]
 82%|████████▎ | 33/40 [1:08:29<14:32, 124.64s/it, best loss: -0.8718621667871277]
 85%|████████▌ | 34/40 [1:10:34<12:27, 124.64s/it, best loss: -0.8718621667871277]
 88%|████████▊ | 35/40 [1:12:38<10:22, 124.59s/it, best loss: -0.8718621667871277]
 90%|█████████ | 36/40 [1:14:43<08:17, 124.47s/it, best loss: -0.8718621667871277]
 92%|█████████▎| 37/40 [1:16:47<06:13, 124.55s/it, best loss: -0.8718621667871277]
 95%|█████████▌| 38/40 [1:18:51<04:08, 124.42s/it, best loss: -0.8718621667871277]
 98%|█████████▊| 39/40 [1:20:56<02:04, 124.37s/it, best loss: -0.8718621667871277]
100%|██████████| 40/40 [1:23:00<00:00, 124.49s/it, best loss: -0.8719438506668313]
{'learn_rate': 0.008933646880323847, 'max_depth': 8.0, 'min_rows': 10.0, 'ntrees': 2908.0, 'sample_rate': 0.5396549266120586}
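
The `best loss` values in the progress bar are negative because `fmin` minimises its objective. The magnitudes are consistent with the objective returning the negated AUC, although that is an assumption here, since the objective function is defined earlier in the notebook. Under that assumption, the sketch below recovers the best AUC and shows how little the search improved after the first few evaluations; `loss_to_auc` is a hypothetical helper, and the numbers are copied from the progress bar above.

```python
# Running "best loss" values from the progress bar above, keyed by the
# evaluation at which a new best was reached.
best_loss_by_eval = {
    1: -0.8674585083914759,
    2: -0.870175121415311,
    4: -0.871667523546091,
    23: -0.8718621667871277,
    40: -0.8719438506668313,
}


def loss_to_auc(loss):
    """Undo the sign flip, assuming the objective returned -AUC."""
    return -loss


final_auc = loss_to_auc(min(best_loss_by_eval.values()))
first_auc = loss_to_auc(best_loss_by_eval[1])
auc_after_4 = loss_to_auc(best_loss_by_eval[4])

print(f"best AUC after 40 evaluations : {final_auc:.4f}")
print(f"gain over the first evaluation: {final_auc - first_auc:.4f}")
print(f"gain after evaluation 4       : {final_auc - auc_after_4:.4f}")
```

Most of the improvement arrives within the first four evaluations; the remaining 36 add less than 0.0003 AUC.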


In [109]:
best

{'learn_rate': 0.008933646880323847,
 'max_depth': 8.0,
 'min_rows': 10.0,
 'ntrees': 2908.0,
 'sample_rate': 0.5396549266120586}
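
In the output above, hyperopt reports whole-number settings as floats (`max_depth: 8.0`, `ntrees: 2908.0`), whereas the final model earlier in this section was built with integer values for `ntrees`, `max_depth` and `min_rows`. A minimal sketch of the conversion follows, assuming those three keys are the ones that should be integers; `cast_integer_params` is a hypothetical helper, not part of hyperopt or H2O.

```python
# Best parameters as printed by hyperopt in the cell above.
best = {
    'learn_rate': 0.008933646880323847,
    'max_depth': 8.0,
    'min_rows': 10.0,
    'ntrees': 2908.0,
    'sample_rate': 0.5396549266120586,
}

# Assumption: these are the settings that should be whole numbers.
INTEGER_PARAMS = ('max_depth', 'min_rows', 'ntrees')


def cast_integer_params(params, integer_keys=INTEGER_PARAMS):
    """Return a copy of params with the listed keys converted to int."""
    cleaned = dict(params)  # copy, so the original dict is left untouched
    for key in integer_keys:
        value = cleaned[key]
        if float(value) != int(value):
            raise ValueError(f"{key}={value} is not a whole number")
        cleaned[key] = int(value)
    return cleaned


best_params = cast_integer_params(best)
print(best_params)
```

Because the keys match the argument names used for the estimator earlier in this section, the cleaned dictionary could then be unpacked into it, for example `H2OXGBoostEstimator(**best_params, seed=2020)`.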