# Using XGBoost


 - https://cambridgespark.com/content/tutorials/hyperparameter-tuning-in-xgboost/index.html
 - http://xgboost.readthedocs.io/en/latest//parameter.html
 - https://www.analyticsvidhya.com/blog/2016/02/complete-guide-parameter-tuning-gradient-boosting-gbm-python/


In [1]:
import xgboost as xgb
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

from sklearn.model_selection import train_test_split

## Facebook Comment Volume Dataset

 - https://archive.ics.uci.edu/ml/machine-learning-databases/00363/
 
This dataset is composed of 53 features describing a post on Facebook: the number of likes on the page where it was posted, the category of the page, the time and day it was posted, etc. The last column is the target: the number of comments the post received.

In [2]:
file = "Dataset/Training/Features_Variant_1.csv"
df = pd.read_csv(file,header=None)


In [3]:
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,44,45,46,47,48,49,50,51,52,53
0,634995,0,463,1,0.0,806.0,11.291045,1.0,70.495138,0.0,...,0,0,0,0,0,0,0,0,1,0
1,634995,0,463,1,0.0,806.0,11.291045,1.0,70.495138,0.0,...,0,0,0,0,0,0,0,1,0,0
2,634995,0,463,1,0.0,806.0,11.291045,1.0,70.495138,0.0,...,1,0,0,0,0,0,0,0,1,0
3,634995,0,463,1,0.0,806.0,11.291045,1.0,70.495138,0.0,...,1,0,0,1,0,0,0,0,0,0
4,634995,0,463,1,0.0,806.0,11.291045,1.0,70.495138,0.0,...,0,0,0,0,0,1,0,0,0,0


In [4]:
print("Dataset has {} entries and {} columns".format(*df.shape))

Dataset has 40949 entries and 54 columns


In [5]:
X = df.loc[:,:52].values
y = df.loc[:,53].values

In [6]:
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.1,random_state = 42)
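
As a rough illustration of what a 10% hold-out split does, the sketch below shuffles row indices with a fixed seed and slices off the test portion. It is not scikit-learn's implementation: `shuffle_split` is a hypothetical helper, and rounding the test count up is an assumption made here.

```python
import math
import random


def shuffle_split(n_samples, test_size, seed):
    """Return (train_idx, test_idx) for a seeded random hold-out split."""
    # Assumption: the number of test rows is rounded up.
    n_test = math.ceil(test_size * n_samples)
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return indices[n_test:], indices[:n_test]


# 40949 is the row count printed for this dataset.
train_idx, test_idx = shuffle_split(40949, test_size=0.1, seed=42)
print(len(train_idx), len(test_idx))
```

Fixing the seed keeps the split identical between runs, so scores obtained later stay comparable.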

In [7]:
#loading into DMatrices
dtrain = xgb.DMatrix(X_train,label=y_train)
dtest = xgb.DMatrix(X_test,label=y_test)

In [8]:
#Build a simple baseline model using MAE
from sklearn.metrics import mean_absolute_error

#get mean value
mean_train = np.mean(y_train)

#get predictions on test set
baseline_pred = np.ones(y_test.shape)*mean_train

#compute MAE
mae_base = mean_absolute_error(y_test,baseline_pred)

print("Baseline MAE is {:.2f}".format(mae_base))


Baseline MAE is 11.31
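
To make the baseline concrete, here is a small worked example in plain Python on made-up numbers (the `toy_*` lists are hypothetical, not taken from the dataset). It scores a constant mean predictor with MAE, the same idea as the cell above. It also scores a constant median predictor, since the median is the constant that minimizes MAE on the data it is computed from and may be a stronger baseline on skewed targets.

```python
from statistics import mean, median


def mae(y_true, y_pred):
    """Mean absolute error between two equally long sequences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


toy_train = [0, 0, 1, 2, 47]  # hypothetical, skewed comment counts
toy_test = [0, 1, 3, 30]

mean_pred = mean(toy_train)      # 10
median_pred = median(toy_train)  # 1

mae_mean = mae(toy_test, [mean_pred] * len(toy_test))
mae_median = mae(toy_test, [median_pred] * len(toy_test))
print(mae_mean, mae_median)  # 11.5 8.0
```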


## Training and Tuning XGBoost Model

We will tune six hyperparameters that usually have a large impact on performance.

In [9]:
#params dictionary
params = {
        'max_depth':6,
        'min_child_weight':1,
        'eta':0.3,
        'subsample':1,
        'colsample_bytree':1,
        #other parameters
        'objective':'reg:linear',  # deprecated alias; newer XGBoost releases use 'reg:squarederror'
}

XGBoost provides a nice way to find the best number of rounds whilst training. Since trees are built sequentially, instead of fixing the number of rounds at the beginning, we can test our model at each step and see if adding a new tree/round improves performance.

To do so, we define a test dataset and a metric that is used to assess performance at each round. If performance hasn't improved for N rounds (N is defined by the early_stopping_rounds argument), we stop the training and keep the best number of boosting rounds. Let's see how to use it.

First, we need to add the evaluation metric we are interested in to our params dictionary.

In [10]:
params['eval_metric'] = "mae"

We still need to pass num_boost_round, the maximum number of boosting rounds that we allow. We set it to a large value and rely on early stopping to halt training well before that limit, as soon as performance on the test dataset has not improved for early_stopping_rounds rounds.

In [11]:
num_boost_round = 999

In order to automatically find the best number of boosting rounds, we need to pass extra parameters on top of the params dictionary, the training DMatrix and num_boost_round:

 - evals: a list of pairs (test_dmatrix, name_of_test). Here we will use our dtest DMatrix.
 - early_stopping_rounds: The number of rounds without improvements after which we should stop, here we set it to 10.

In [12]:
model = xgb.train(
    params,
    dtrain,
    num_boost_round=num_boost_round,
    evals=[(dtest, "Test")],
    early_stopping_rounds=10
)

[0]	Test-mae:5.97478
Will train until Test-mae hasn't improved in 10 rounds.
[1]	Test-mae:5.03359
[2]	Test-mae:4.64572
[3]	Test-mae:4.42331
[4]	Test-mae:4.39328
[5]	Test-mae:4.35544
[6]	Test-mae:4.31315
[7]	Test-mae:4.33087
[8]	Test-mae:4.37164
[9]	Test-mae:4.38774
[10]	Test-mae:4.39443
[11]	Test-mae:4.40661
[12]	Test-mae:4.39124
[13]	Test-mae:4.39088
[14]	Test-mae:4.39827
[15]	Test-mae:4.39104
[16]	Test-mae:4.40307
Stopping. Best iteration:
[6]	Test-mae:4.31315
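
The stopping rule can be replayed by hand on the scores printed in the log above. The function below is a minimal sketch of the patience logic for a lower-is-better metric, not XGBoost's own code; `early_stopping` and its arguments are names chosen for this illustration.

```python
def early_stopping(scores, patience):
    """Return (best_round, last_round) for a lower-is-better metric."""
    best_score = float("inf")
    best_round = -1
    for round_index, score in enumerate(scores):
        if score < best_score:
            # New best: remember it and keep going.
            best_score, best_round = score, round_index
        elif round_index - best_round >= patience:
            # No improvement for `patience` rounds: stop here.
            return best_round, round_index
    return best_round, len(scores) - 1


# Test-mae values copied from the training log above (rounds 0 to 16).
test_mae = [
    5.97478, 5.03359, 4.64572, 4.42331, 4.39328, 4.35544, 4.31315,
    4.33087, 4.37164, 4.38774, 4.39443, 4.40661, 4.39124, 4.39088,
    4.39827, 4.39104, 4.40307,
]
best_round, last_round = early_stopping(test_mae, patience=10)
print(best_round, last_round)  # 6 16
```

Round 6 holds the best score, and round 16 is the tenth round without improvement, which matches where the log stops.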



## Using XGBoost's CV


In order to tune the other hyperparameters, we will use the cv function from XGBoost. It runs cross-validation on our training dataset and returns, for each boosting round, the mean and standard deviation of the MAE across folds.

We need to pass it:

 - params: our dictionary of parameters.
 - our dtrain matrix.
 - num_boost_round: number of boosting rounds. Here we will use a large number again and count on early_stopping_rounds to find the optimal number of rounds before reaching the maximum.
 - seed: random seed. It's important to set a seed here, to ensure we are using the same folds for each step so we can properly compare the scores with different parameters.
 - nfold: the number of folds to use for cross-validation.
 - metrics: the metrics to use to evaluate our model, here we use MAE.
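
For intuition about what a cross-validated score is, the sketch below runs k-fold cross-validation by hand in plain Python. It is only an illustration under simplifying assumptions: the "model" is a mean predictor, the folds are contiguous with no shuffling, and `kfold_indices`, `cross_val_mae` and `toy_y` are hypothetical names and data.

```python
from statistics import mean, pstdev


def kfold_indices(n_samples, nfold):
    """Split range(n_samples) into nfold contiguous validation folds."""
    sizes = [n_samples // nfold + (1 if i < n_samples % nfold else 0)
             for i in range(nfold)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_val_mae(y, nfold):
    """MAE of a mean predictor, evaluated on each held-out fold."""
    scores = []
    for val_idx in kfold_indices(len(y), nfold):
        held_out = set(val_idx)
        train = [v for i, v in enumerate(y) if i not in held_out]
        pred = mean(train)  # "fit" on the other folds
        scores.append(mean(abs(y[i] - pred) for i in val_idx))
    return scores


toy_y = [0, 2, 4, 6, 8, 10]  # hypothetical targets
fold_scores = cross_val_mae(toy_y, nfold=3)
print(fold_scores)                             # [6, 1, 6]
print(mean(fold_scores), pstdev(fold_scores))  # mean and spread across folds
```

Each fold is held out once while the model is fit on the remaining folds, and the per-fold scores are then summarized by their mean and spread.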

In [13]:
cv_results = xgb.cv(
    params,
    dtrain,
    num_boost_round=num_boost_round,
    seed=42,
    nfold=5,
    metrics={'mae'},
    early_stopping_rounds=10
)
cv_results

Unnamed: 0,test-mae-mean,test-mae-std,train-mae-mean,train-mae-std
0,5.689189,0.270149,5.604765,0.064495
1,4.849525,0.271883,4.622477,0.065106
2,4.468342,0.239475,4.05971,0.065772
3,4.268584,0.224462,3.722983,0.06082
4,4.192448,0.189762,3.510303,0.061203
5,4.172856,0.189612,3.367213,0.061021
6,4.15786,0.192572,3.245549,0.060276
7,4.143254,0.19444,3.151495,0.062612
8,4.147843,0.196197,3.082321,0.05902
9,4.144657,0.189785,3.016803,0.057321


In [14]:
cv_results['test-mae-mean'].min()

4.0827876000000005

## Parameters max_depth and min_child_weight

Those parameters add constraints on the architecture of the trees.

 - max_depth is the maximum number of nodes allowed from the root to the farthest leaf of a tree. Deeper trees can model more complex relationships by adding more nodes, but as we go deeper, splits become less relevant and are sometimes only due to noise, causing the model to overfit.
 - min_child_weight is the minimum weight (or number of samples if all samples have a weight of 1) required in order to create a new node in the tree. A smaller min_child_weight allows the algorithm to create children that correspond to fewer samples, thus allowing for more complex trees, but again, more likely to overfit.

Thus, those parameters can be used to control the complexity of the trees. It is important to tune them together in order to find a good trade-off between model bias and variance.
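
The min_child_weight constraint can be pictured as a check on a candidate split. The sketch below is an illustration, not XGBoost's internal code: `split_allowed` is a hypothetical helper, and each sample is assumed to contribute a weight (hessian) of 1, as is the case for a squared-error objective with unit sample weights.

```python
def split_allowed(left_weights, right_weights, min_child_weight):
    """A split is kept only if both children carry enough total weight."""
    return (sum(left_weights) >= min_child_weight
            and sum(right_weights) >= min_child_weight)


# Hypothetical split: 4 samples go left, 7 go right, each with weight 1.
left = [1.0] * 4
right = [1.0] * 7

print(split_allowed(left, right, min_child_weight=4))  # True
print(split_allowed(left, right, min_child_weight=6))  # False, left child too small
```

Raising min_child_weight therefore rejects splits that isolate small groups of samples, which limits tree complexity.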

In [17]:
gridsearch_params = [
    (max_depth, min_child_weight)
    for max_depth in range(9,12)
    for min_child_weight in range(5,8)
]

In [18]:
gridsearch_params

[(9, 5), (9, 6), (9, 7), (10, 5), (10, 6), (10, 7), (11, 5), (11, 6), (11, 7)]
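
The same grid can be built with `itertools.product` from the standard library, which may read more easily once more than two parameters are involved. This is an equivalent sketch rather than a change to the notebook; `grid` is a name used only here.

```python
from itertools import product

# Cartesian product of the two candidate ranges, in the same order as above.
grid = list(product(range(9, 12), range(5, 8)))
print(grid)
```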

In [19]:
# Define initial best params and MAE
min_mae = float("Inf")
best_params = None
for max_depth, min_child_weight in gridsearch_params:
    print("CV with max_depth={}, min_child_weight={}".format(
                             max_depth,
                             min_child_weight))

    # Update our parameters
    params['max_depth'] = max_depth
    params['min_child_weight'] = min_child_weight

    # Run CV
    cv_results = xgb.cv(
        params,
        dtrain,
        num_boost_round=num_boost_round,
        seed=42,
        nfold=5,
        metrics={'mae'},
        early_stopping_rounds=10
    )

    # Update best MAE
    mean_mae = cv_results['test-mae-mean'].min()
    boost_rounds = cv_results['test-mae-mean'].argmin()
    print("\tMAE {} for {} rounds".format(mean_mae, boost_rounds))
    if mean_mae < min_mae:
        min_mae = mean_mae
        best_params = (max_depth,min_child_weight)

print("Best params: {}, {}, MAE: {}".format(best_params[0], best_params[1], min_mae))

CV with max_depth=9, min_child_weight=5
	MAE 4.04524 for 6 rounds
CV with max_depth=9, min_child_weight=6
	MAE 4.0764622 for 5 rounds
CV with max_depth=9, min_child_weight=7
	MAE 4.0753928 for 5 rounds
CV with max_depth=10, min_child_weight=5
	MAE 4.0805826000000005 for 5 rounds
CV with max_depth=10, min_child_weight=6
	MAE 4.035100600000001 for 5 rounds
CV with max_depth=10, min_child_weight=7
	MAE 4.0872416000000005 for 5 rounds
CV with max_depth=11, min_child_weight=5
	MAE 4.062633 for 5 rounds
CV with max_depth=11, min_child_weight=6
	MAE 4.054831999999999 for 5 rounds
CV with max_depth=11, min_child_weight=7
	MAE 4.0581036 for 5 rounds
Best params: 10, 6, MAE: 4.035100600000001
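
Two details of the loop above can be checked in isolation. First, picking the winner is a plain minimum over the recorded scores, shown here with the values from the log above. Second, `argmin()` returns a 0-based row index, so the corresponding number of boosting rounds is that index plus one. `rounds_from_history` and `toy_history` are hypothetical names and data for this sketch.

```python
# Mean CV MAE per (max_depth, min_child_weight), copied from the log above.
cv_mae = {
    (9, 5): 4.04524, (9, 6): 4.0764622, (9, 7): 4.0753928,
    (10, 5): 4.0805826, (10, 6): 4.0351006, (10, 7): 4.0872416,
    (11, 5): 4.062633, (11, 6): 4.054832, (11, 7): 4.0581036,
}
best_pair = min(cv_mae, key=cv_mae.get)
print(best_pair, cv_mae[best_pair])  # (10, 6) 4.0351006


def rounds_from_history(test_mae_means):
    """Return (best_score, n_rounds); the row index is 0-based, rounds count from 1."""
    best_index = min(range(len(test_mae_means)), key=test_mae_means.__getitem__)
    return test_mae_means[best_index], best_index + 1


toy_history = [5.7, 4.8, 4.4, 4.5]  # hypothetical per-round mean MAE
print(rounds_from_history(toy_history))  # (4.4, 3)
```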


In [20]:
params['max_depth'] = 10
params['min_child_weight'] = 6

In [21]:
params

{'colsample_bytree': 1,
 'eta': 0.3,
 'eval_metric': 'mae',
 'max_depth': 10,
 'min_child_weight': 6,
 'objective': 'reg:linear',
 'subsample': 1}

## Parameters subsample and colsample_bytree

Those parameters control the sampling of the dataset that is done at each boosting round.

Instead of using the whole training set every time, we can build a tree on slightly different data at each step, which makes it less likely to overfit to a single sample or feature.

 - subsample corresponds to the fraction of observations (the rows) to subsample at each step. By default it is set to 1 meaning that we use all rows.
 - colsample_bytree corresponds to the fraction of features (the columns) to use. By default it is set to 1 meaning that we will use all features.
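
The idea behind both parameters can be sketched with the standard library: before building a tree, draw a random subset of row indices and a random subset of column indices. This mimics the concept only and is not XGBoost's sampler; `sample_rows_and_columns` and the rounding rule are assumptions made for the illustration.

```python
import random


def sample_rows_and_columns(n_rows, n_cols, subsample, colsample_bytree, seed):
    """Pick the rows and columns one tree would see, without replacement."""
    rng = random.Random(seed)
    # Assumption: fractions are converted to counts by rounding, at least 1.
    n_row_keep = max(1, round(subsample * n_rows))
    n_col_keep = max(1, round(colsample_bytree * n_cols))
    rows = sorted(rng.sample(range(n_rows), n_row_keep))
    cols = sorted(rng.sample(range(n_cols), n_col_keep))
    return rows, cols


rows, cols = sample_rows_and_columns(10, 5, subsample=0.8,
                                     colsample_bytree=0.6, seed=42)
print(len(rows), len(cols))  # 8 3
```

With both fractions set to 1, every row and column is kept, which corresponds to the defaults described above.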

In [22]:
gridsearch_params = [
    (subsample, colsample)
    for subsample in [i/10. for i in range(7,11)]
    for colsample in [i/10. for i in range(7,11)]
]

In [24]:
min_mae = float("Inf")
best_params = None

# Start with the largest values and work down to the smallest
for subsample, colsample in reversed(gridsearch_params):
    print("CV with subsample={}, colsample={}".format(
                             subsample,
                             colsample))

    # We update our parameters
    params['subsample'] = subsample
    params['colsample_bytree'] = colsample

    # Run CV
    cv_results = xgb.cv(
        params,
        dtrain,
        num_boost_round=num_boost_round,
        seed=42,
        nfold=5,
        metrics={'mae'},
        early_stopping_rounds=10
    )

    # Update best score
    mean_mae = cv_results['test-mae-mean'].min()
    boost_rounds = cv_results['test-mae-mean'].argmin()
    print("\tMAE {} for {} rounds".format(mean_mae, boost_rounds))
    if mean_mae < min_mae:
        min_mae = mean_mae
        best_params = (subsample,colsample)

print("Best params: {}, {}, MAE: {}".format(best_params[0], best_params[1], min_mae))

CV with subsample=1.0, colsample=1.0
	MAE 4.035100600000001 for 5 rounds
CV with subsample=1.0, colsample=0.9
	MAE 4.1112535999999995 for 6 rounds
CV with subsample=1.0, colsample=0.8
	MAE 4.15728 for 7 rounds
CV with subsample=1.0, colsample=0.7
	MAE 4.327159 for 6 rounds
CV with subsample=0.9, colsample=1.0
	MAE 4.115240200000001 for 5 rounds
CV with subsample=0.9, colsample=0.9
	MAE 4.1338758 for 5 rounds
CV with subsample=0.9, colsample=0.8
	MAE 4.173397 for 7 rounds
CV with subsample=0.9, colsample=0.7
	MAE 4.2954372 for 6 rounds
CV with subsample=0.8, colsample=1.0
	MAE 4.0704557999999995 for 6 rounds
CV with subsample=0.8, colsample=0.9
	MAE 4.083678000000001 for 5 rounds
CV with subsample=0.8, colsample=0.8
	MAE 4.1932928 for 5 rounds
CV with subsample=0.8, colsample=0.7
	MAE 4.309526199999999 for 7 rounds
CV with subsample=0.7, colsample=1.0
	MAE 4.0939432 for 4 rounds
CV with subsample=0.7, colsample=0.9
	MAE 4.1591404 for 5 rounds
CV with subsample=0.7, colsample=0.8
	MAE 4.

In [27]:
#update dictionary
params['subsample'] = 1
params['colsample_bytree'] = 1.
params

{'colsample_bytree': 1.0,
 'eta': 0.3,
 'eval_metric': 'mae',
 'max_depth': 10,
 'min_child_weight': 6,
 'objective': 'reg:linear',
 'subsample': 1}

## Parameter ETA

The eta parameter controls the learning rate. It corresponds to the shrinkage of the weights associated with features after each round; in other words, it defines the amount of "correction" we make at each step (remember how each boosting round corrects the errors of the previous one?).

In practice, a lower eta makes the model more robust to overfitting, so a lower learning rate is usually better. But with a lower eta we need more boosting rounds, which takes more time to train, sometimes for only marginal improvements.
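
The trade-off between eta and the number of rounds shows up even in a toy model. In the sketch below, each round removes a fraction eta of the remaining error, which is a deliberate simplification of boosting and not how XGBoost fits trees. `rounds_to_converge` and the 1% tolerance are assumptions chosen for this illustration.

```python
def rounds_to_converge(eta, tolerance=0.01):
    """Count rounds until the remaining residual falls below tolerance."""
    residual = 1.0  # start with 100% of the error left to correct
    rounds = 0
    while residual >= tolerance:
        residual -= eta * residual  # each round corrects a fraction eta
        rounds += 1
    return rounds


for eta in [0.3, 0.1, 0.01]:
    print(eta, rounds_to_converge(eta))
# 0.3 13
# 0.1 44
# 0.01 459
```

The smaller the step, the more rounds are needed to reach the same residual, which is why a lower eta goes together with a larger number of boosting rounds.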

In [29]:
# This can take some time…
min_mae = float("Inf")
best_params = None

for eta in [.3, .2, .1, .05, .01]:
    print("CV with eta={}".format(eta))

    # We update our parameters
    params['eta'] = eta

    # Run CV
    cv_results = xgb.cv(
            params,
            dtrain,
            num_boost_round=num_boost_round,
            seed=42,
            nfold=5,
            metrics=['mae'],
            early_stopping_rounds=10
          )

    # Update best score
    mean_mae = cv_results['test-mae-mean'].min()
    boost_rounds = cv_results['test-mae-mean'].argmin()
    print("\tMAE {} for {} rounds\n".format(mean_mae, boost_rounds))
    if mean_mae < min_mae:
        min_mae = mean_mae
        best_params = eta

print("Best params: {}, MAE: {}".format(best_params, min_mae))

CV with eta=0.3
	MAE 4.035100600000001 for 5 rounds

CV with eta=0.2
	MAE 4.0020152 for 9 rounds

CV with eta=0.1
	MAE 3.9625017999999996 for 21 rounds

CV with eta=0.05
	MAE 3.9589836 for 42 rounds

CV with eta=0.01
	MAE 3.9502756000000003 for 218 rounds

Best params: 0.01, MAE: 3.9502756000000003


## Results

In [31]:
params['eta'] = 0.01
params

{'colsample_bytree': 1.0,
 'eta': 0.01,
 'eval_metric': 'mae',
 'max_depth': 10,
 'min_child_weight': 6,
 'objective': 'reg:linear',
 'subsample': 1}

In [32]:
#try on test set
model = xgb.train(
    params,
    dtrain,
    num_boost_round=num_boost_round,
    evals=[(dtest, "Test")],
    early_stopping_rounds=10
)

[0]	Test-mae:7.69091
Will train until Test-mae hasn't improved in 10 rounds.
[1]	Test-mae:7.62044
[2]	Test-mae:7.55171
[3]	Test-mae:7.48296
[4]	Test-mae:7.41607
[5]	Test-mae:7.35005
[6]	Test-mae:7.28594
[7]	Test-mae:7.22193
[8]	Test-mae:7.15977
[9]	Test-mae:7.09845
[10]	Test-mae:7.03766
[11]	Test-mae:6.97696
[12]	Test-mae:6.91763
[13]	Test-mae:6.85842
[14]	Test-mae:6.8029
[15]	Test-mae:6.74587
[16]	Test-mae:6.69068
[17]	Test-mae:6.63611
[18]	Test-mae:6.58169
[19]	Test-mae:6.52847
[20]	Test-mae:6.47689
[21]	Test-mae:6.42755
[22]	Test-mae:6.37666
[23]	Test-mae:6.32902
[24]	Test-mae:6.28125
[25]	Test-mae:6.23493
[26]	Test-mae:6.18968
[27]	Test-mae:6.14404
[28]	Test-mae:6.10002
[29]	Test-mae:6.05626
[30]	Test-mae:6.01302
[31]	Test-mae:5.96975
[32]	Test-mae:5.92706
[33]	Test-mae:5.88582
[34]	Test-mae:5.84803
[35]	Test-mae:5.80888
[36]	Test-mae:5.77119
[37]	Test-mae:5.73382
[38]	Test-mae:5.69565
[39]	Test-mae:5.65948
[40]	Test-mae:5.62329
[41]	Test-mae:5.58676
[42]	Test-mae:5.55166
[43]	Test

In [33]:
print("Best MAE: {:.2f} in {} rounds".format(model.best_score, model.best_iteration+1))

Best MAE: 4.12 in 149 rounds


In [35]:
num_boost_round = model.best_iteration + 1

best_model = xgb.train(
    params,
    dtrain,
    num_boost_round=num_boost_round,
    evals=[(dtest, "Test")]
)

[0]	Test-mae:7.69091
[1]	Test-mae:7.62044
[2]	Test-mae:7.55171
[3]	Test-mae:7.48296
[4]	Test-mae:7.41607
[5]	Test-mae:7.35005
[6]	Test-mae:7.28594
[7]	Test-mae:7.22193
[8]	Test-mae:7.15977
[9]	Test-mae:7.09845
[10]	Test-mae:7.03766
[11]	Test-mae:6.97696
[12]	Test-mae:6.91763
[13]	Test-mae:6.85842
[14]	Test-mae:6.8029
[15]	Test-mae:6.74587
[16]	Test-mae:6.69068
[17]	Test-mae:6.63611
[18]	Test-mae:6.58169
[19]	Test-mae:6.52847
[20]	Test-mae:6.47689
[21]	Test-mae:6.42755
[22]	Test-mae:6.37666
[23]	Test-mae:6.32902
[24]	Test-mae:6.28125
[25]	Test-mae:6.23493
[26]	Test-mae:6.18968
[27]	Test-mae:6.14404
[28]	Test-mae:6.10002
[29]	Test-mae:6.05626
[30]	Test-mae:6.01302
[31]	Test-mae:5.96975
[32]	Test-mae:5.92706
[33]	Test-mae:5.88582
[34]	Test-mae:5.84803
[35]	Test-mae:5.80888
[36]	Test-mae:5.77119
[37]	Test-mae:5.73382
[38]	Test-mae:5.69565
[39]	Test-mae:5.65948
[40]	Test-mae:5.62329
[41]	Test-mae:5.58676
[42]	Test-mae:5.55166
[43]	Test-mae:5.51758
[44]	Test-mae:5.48502
[45]	Test-mae:5.45151

In [36]:
mean_absolute_error(y_test, best_model.predict(dtest))

4.121239465908108
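
To put this number in context, the short calculation below compares it with the mean-prediction baseline reported earlier in the notebook. Both values are copied from the outputs above. The variable names are chosen for this sketch.

```python
baseline_mae = 11.31           # "Baseline MAE is 11.31" from the baseline cell
tuned_mae = 4.121239465908108  # test MAE of the tuned model, from the cell above

relative_improvement = (baseline_mae - tuned_mae) / baseline_mae
print("{:.1%}".format(relative_improvement))  # 63.6%
```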

In [37]:
best_model.save_model("my_model.model")

In [38]:
loaded_model = xgb.Booster()
loaded_model.load_model("my_model.model")

# And use it for predictions.
loaded_model.predict(dtest)

array([3.6986032 , 0.33287936, 1.8307076 , ..., 3.327849  , 0.16464412,
       3.212406  ], dtype=float32)

In [39]:
loaded_model

<xgboost.core.Booster at 0x10870acc0>