# Tuning with XGBoost

In [4]:
import xgboost as xgb
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np


categories = ['alt.atheism', 'talk.religion.misc']
newsgroups_train = fetch_20newsgroups(subset='train', categories=categories)
newsgroups_test = fetch_20newsgroups(subset='test', categories=categories)

vect = TfidfVectorizer(min_df=5)

X_train = vect.fit_transform(newsgroups_train.data)
y_train = newsgroups_train.target
X_test = vect.transform(newsgroups_test.data)
y_test = newsgroups_test.target

feature_names = np.asarray(vect.get_feature_names())

## General Parameters
These define the overall functionality of XGBoost.

- booster [default=gbtree]: Selects the type of model to run at each iteration. The two options covered here are gbtree (tree-based models) and gblinear (linear models).
- silent [default=0]: Silent mode is activated if this is set to 1, i.e. no running messages will be printed. It’s generally good to keep it at 0, as the messages can help in understanding the model.
- nthread [defaults to the maximum number of threads available if not set]: Used for parallel processing; enter the number of cores to use. If you wish to run on all cores, leave it unset and the algorithm will detect the number automatically.

There are 2 more parameters which are set automatically by XGBoost and you need not worry about them. Let’s move on to Booster parameters.
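
As a minimal sketch, the general parameters above could be collected into the parameter dictionary that is later passed to `xgb.train`. The helper name `make_general_params` is hypothetical, and `verbosity` is used instead of `silent` on the assumption that the installed XGBoost release is one where `verbosity` superseded it (the estimator output further down lists both).

```python
def make_general_params(booster="gbtree", nthread=None, verbosity=1):
    """Build the general-parameter part of an XGBoost params dict (hypothetical helper)."""
    if booster not in ("gbtree", "gblinear"):
        raise ValueError("booster must be 'gbtree' or 'gblinear'")
    params = {"booster": booster, "verbosity": verbosity}
    # Only pin the thread count when asked; leaving nthread out lets
    # XGBoost detect the number of available cores itself.
    if nthread is not None:
        params["nthread"] = int(nthread)
    return params


general_params = make_general_params()          # tree booster, all cores
pinned_params = make_general_params(nthread=4)  # restrict to 4 threads
print(general_params)
print(pinned_params)
```

The resulting dictionary can be merged with the booster and learning task parameters described next.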

 

## Booster Parameters
Though there are 2 types of boosters, I’ll consider only the tree booster here because it usually outperforms the linear booster, and thus the latter is rarely used.

- eta [default=0.3] (alias: learning_rate)
 Analogous to the learning rate in GBM.
 Makes the model more robust by shrinking the weights on each step.
 Typical final values to be used: 0.01-0.2
- min_child_weight [default=1]
 Defines the minimum sum of weights of all observations required in a child.
 This is similar to min_samples_leaf in GBM but not exactly the same: it refers to the minimum “sum of weights” of observations, while GBM uses the minimum “number of observations”.
 Used to control over-fitting. Higher values prevent a model from learning relations which might be highly specific to the particular sample selected for a tree.
 Values that are too high can lead to under-fitting; hence, it should be tuned using CV.
- max_depth [default=6]
 The maximum depth of a tree, same as GBM.
 Used to control over-fitting as higher depth will allow model to learn relations very specific to a particular sample.
 Should be tuned using CV.
 Typical values: 3-10
- max_leaf_nodes
 The maximum number of terminal nodes or leaves in a tree.
 Can be defined in place of max_depth. Since binary trees are created, a depth of n produces at most 2^n leaves.
 If this is defined, GBM will ignore max_depth. (XGBoost’s own name for this parameter is max_leaves.)
- gamma [default=0]
 A node is split only when the resulting split gives a positive reduction in the loss function. Gamma specifies the minimum loss reduction required to make a split.
 Makes the algorithm conservative. The values can vary depending on the loss function and should be tuned.
- max_delta_step [default=0]
 The maximum delta step we allow each tree’s weight estimation to be. If the value is set to 0, there is no constraint. If it is set to a positive value, it can help make the update step more conservative.
 Usually this parameter is not needed, but it might help in logistic regression when the classes are extremely imbalanced.
- subsample [default=1]
 Same as the subsample of GBM. Denotes the fraction of observations to be randomly sampled for each tree.
 Lower values make the algorithm more conservative and prevent overfitting, but values that are too small might lead to under-fitting.
 Typical values: 0.5-1
- colsample_bytree [default=1]
 Similar to max_features in GBM. Denotes the fraction of columns to be randomly sampled for each tree.
 Typical values: 0.5-1
- colsample_bylevel [default=1]
 Denotes the subsample ratio of columns for each split, in each level.
 I don’t use this often because subsample and colsample_bytree will do the job for you, but you can explore it further if you wish.
- lambda [default=1] (reg_lambda in the scikit-learn wrapper)
 L2 regularization term on weights (analogous to Ridge regression).
 This is used to handle the regularization part of XGBoost. Though many data scientists don’t use it often, it should be explored to reduce overfitting.
- alpha [default=0] (reg_alpha in the scikit-learn wrapper)
 L1 regularization term on weights (analogous to Lasso regression).
 Can be used in case of very high dimensionality so that the algorithm runs faster.
- scale_pos_weight [default=1]
 A value greater than 0 should be used in case of high class imbalance as it helps in faster convergence.
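
To make gamma, lambda and min_child_weight more concrete, here is a minimal sketch of the split-gain formula given in the XGBoost documentation, written in plain Python rather than taken from the library's code. `G` and `H` stand for the sums of first- and second-order gradients of the loss over the observations in a node. The function names are illustrative only, the L1 term alpha is left out for brevity, and the min_child_weight check assumes the "sum of weights" is the hessian sum `H`, as it is for the tree booster.

```python
def leaf_score(G, H, reg_lambda):
    """Quality of a leaf holding gradient sum G and hessian sum H."""
    return G * G / (H + reg_lambda)


def split_gain(G_left, H_left, G_right, H_right, reg_lambda=1.0, gamma=0.0):
    """Loss reduction from splitting a node into two children, minus gamma."""
    parent = leaf_score(G_left + G_right, H_left + H_right, reg_lambda)
    children = (leaf_score(G_left, H_left, reg_lambda)
                + leaf_score(G_right, H_right, reg_lambda))
    return 0.5 * (children - parent) - gamma


def split_allowed(G_left, H_left, G_right, H_right,
                  reg_lambda=1.0, gamma=0.0, min_child_weight=1.0):
    """A split is kept only if both children are heavy enough and the gain is positive."""
    if min(H_left, H_right) < min_child_weight:
        return False
    return split_gain(G_left, H_left, G_right, H_right, reg_lambda, gamma) > 0


# Two children with opposite gradients: a clearly useful split.
print(split_gain(-4.0, 3.0, 4.0, 3.0))                          # gain with gamma=0
print(split_allowed(-4.0, 3.0, 4.0, 3.0, gamma=5.0))            # gamma too high
print(split_allowed(-4.0, 3.0, 4.0, 3.0, min_child_weight=4.0)) # children too light
```

Raising gamma, lambda or min_child_weight each makes it harder for a split to pass, which is why all three act as brakes on over-fitting.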
 

## Learning Task Parameters
These parameters are used to define the optimization objective and the metric to be calculated at each step.

- objective [default=reg:linear] (renamed reg:squarederror in later XGBoost releases)
 This defines the loss function to be minimized. The most commonly used values are:
 binary:logistic – logistic regression for binary classification; returns predicted probability (not class)
 multi:softmax – multiclass classification using the softmax objective; returns predicted class (not probabilities). You also need to set an additional num_class parameter defining the number of unique classes.
 multi:softprob – same as softmax, but returns the predicted probability of each data point belonging to each class.
- eval_metric [default according to objective]
 The metric to be used for validation data.
 The default values are rmse for regression and error for classification.
 Typical values are:
 rmse – root mean square error
 mae – mean absolute error
 logloss – negative log-likelihood
 error – binary classification error rate (0.5 threshold)
 merror – multiclass classification error rate
 mlogloss – multiclass logloss
 auc – area under the curve
- seed [default=0]
 The random number seed.
 Can be used for generating reproducible results and also for parameter tuning.
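
Since binary:logistic returns probabilities rather than classes, it can help to see how the error and logloss metrics are derived from them. The following is a plain-Python sketch of the definitions listed above, not XGBoost's own implementation; the function names and the example numbers are made up for illustration.

```python
import math


def error_rate(y_true, p_pred, threshold=0.5):
    """Fraction of rows misclassified when p > threshold counts as positive."""
    wrong = sum((p > threshold) != (y == 1) for y, p in zip(y_true, p_pred))
    return wrong / len(y_true)


def logloss(y_true, p_pred, eps=1e-15):
    """Negative mean log-likelihood of the true labels."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)


def softprob_to_class(probabilities):
    """Turn one row of multi:softprob output into the class multi:softmax would report."""
    return max(range(len(probabilities)), key=probabilities.__getitem__)


labels = [1, 0, 1, 0]
probs = [0.9, 0.2, 0.4, 0.6]
print(error_rate(labels, probs))
print(logloss(labels, probs))
print(softprob_to_class([0.1, 0.7, 0.2]))
```

The error metric only sees which side of the threshold a prediction falls on, while logloss also penalises confident mistakes, so the two can rank models differently.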

### Step 1: Fix learning rate and number of estimators for tuning tree-based parameters

Run the model once with initial parameter values and early_stopping_rounds in place. Set num_boost_round really high so that early stopping, not the round limit, decides how many trees are built.

In [12]:
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

params = {"learning_rate":0.1,
         "max_depth":5,
         "min_child_weight":1,
         "gamma":0,
         "subsample":1,
         "objective":"binary:logistic"}

bst = xgb.train(params,
                dtrain,
                num_boost_round=1000,
                verbose_eval=5,
                evals=[(dtest,"test")],
                early_stopping_rounds=20)

[0]	test-error:0.335088
Will train until test-error hasn't improved in 20 rounds.
[5]	test-error:0.310526
[10]	test-error:0.289474
[15]	test-error:0.273684
[20]	test-error:0.25614
[25]	test-error:0.25614
[30]	test-error:0.249123
[35]	test-error:0.236842
[40]	test-error:0.226316
[45]	test-error:0.240351
[50]	test-error:0.235088
[55]	test-error:0.240351
[60]	test-error:0.236842
Stopping. Best iteration:
[40]	test-error:0.226316
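
The stopping behaviour in the log above can be sketched in plain Python. This illustrates the rule only and is not XGBoost's implementation; `early_stop` is a hypothetical helper and the scores in the usage lines are invented for the example, not taken from the run.

```python
def early_stop(scores, patience):
    """Return (best_round, last_round) for a metric where lower is better.

    Training stops at the first round that is `patience` rounds past the
    best round seen so far without having improved on it.
    """
    best_round, best_score = 0, scores[0]
    for i, score in enumerate(scores):
        if score < best_score:
            best_round, best_score = i, score
        elif i - best_round >= patience:
            return best_round, i
    return best_round, len(scores) - 1


toy_errors = [0.5, 0.4, 0.45, 0.42, 0.3]
print(early_stop(toy_errors, patience=2))  # stops before ever seeing the 0.3
```

With early stopping enabled, the trained booster records the winning round in `bst.best_iteration`, which is where the n_estimators=40 used in the following steps comes from.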



### Step 2: Tune max_depth and min_child_weight

We tune these first as they will have the highest impact on model outcome. To start with, let’s set wider ranges and then we will perform another iteration for smaller ranges.
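
Before running a search it can help to see how many models the wide grid implies. This sketch enumerates the combinations with the standard library only; the `n_folds = 3` value is an assumption for illustration, since the real number depends on the `cv` argument and the scikit-learn version.

```python
from itertools import product

param_grid = {
    "max_depth": range(3, 10, 2),        # 3, 5, 7, 9
    "min_child_weight": range(1, 6, 2),  # 1, 3, 5
}

names = list(param_grid)
# One dict per combination, in the same nested order a grid search walks them.
candidates = [dict(zip(names, values))
              for values in product(*param_grid.values())]

n_folds = 3  # assumed number of cross-validation folds
n_fits = len(candidates) * n_folds

print(len(candidates), "candidates,", n_fits, "fits")
print(candidates[0], candidates[-1])
```

The count grows multiplicatively with every parameter added, which is the practical reason for tuning only one or two parameters per step.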

In [20]:
from sklearn.model_selection import GridSearchCV

params = {'max_depth': range(3,10,2),
          'min_child_weight':range(1,6,2)
         }

rs = GridSearchCV(xgb.XGBClassifier(n_estimators=40),
                       params)
rs.fit(X_train,y_train)

GridSearchCV(cv=None, error_score='raise',
       estimator=XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bynode=1, colsample_bytree=1, gamma=0, learning_rate=0.1,
       max_delta_step=0, max_depth=3, min_child_weight=1, missing=None,
       n_estimators=40, n_jobs=1, nthread=None,
       objective='binary:logistic', random_state=0, reg_alpha=0,
       reg_lambda=1, scale_pos_weight=1, seed=None, silent=None,
       subsample=1, verbosity=1),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'max_depth': range(3, 10, 2), 'min_child_weight': range(1, 6, 2)},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [21]:
rs.best_params_

{'max_depth': 9, 'min_child_weight': 1}

In [22]:
rs.grid_scores_



[mean: 0.87281, std: 0.02520, params: {'max_depth': 3, 'min_child_weight': 1},
 mean: 0.86231, std: 0.03413, params: {'max_depth': 3, 'min_child_weight': 3},
 mean: 0.82497, std: 0.02032, params: {'max_depth': 3, 'min_child_weight': 5},
 mean: 0.87865, std: 0.03987, params: {'max_depth': 5, 'min_child_weight': 1},
 mean: 0.87048, std: 0.03393, params: {'max_depth': 5, 'min_child_weight': 3},
 mean: 0.83547, std: 0.01274, params: {'max_depth': 5, 'min_child_weight': 5},
 mean: 0.87748, std: 0.02877, params: {'max_depth': 7, 'min_child_weight': 1},
 mean: 0.86231, std: 0.03169, params: {'max_depth': 7, 'min_child_weight': 3},
 mean: 0.83431, std: 0.01704, params: {'max_depth': 7, 'min_child_weight': 5},
 mean: 0.88681, std: 0.02723, params: {'max_depth': 9, 'min_child_weight': 1},
 mean: 0.85298, std: 0.01765, params: {'max_depth': 9, 'min_child_weight': 3},
 mean: 0.82730, std: 0.01440, params: {'max_depth': 9, 'min_child_weight': 5}]

In [23]:
rs.best_score_

0.8868144690781797
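
The mean scores above are close to each other relative to their fold-to-fold standard deviations, so the winner is not clear-cut. As a rough check, this sketch lists every setting whose mean lies within one standard deviation of the best one. The numbers are copied from the output above; the one-standard-deviation cut-off is a rule of thumb chosen for this illustration, not something the search itself applies.

```python
# (mean, std, params) copied from the grid search output above
grid_scores = [
    (0.87281, 0.02520, {"max_depth": 3, "min_child_weight": 1}),
    (0.86231, 0.03413, {"max_depth": 3, "min_child_weight": 3}),
    (0.82497, 0.02032, {"max_depth": 3, "min_child_weight": 5}),
    (0.87865, 0.03987, {"max_depth": 5, "min_child_weight": 1}),
    (0.87048, 0.03393, {"max_depth": 5, "min_child_weight": 3}),
    (0.83547, 0.01274, {"max_depth": 5, "min_child_weight": 5}),
    (0.87748, 0.02877, {"max_depth": 7, "min_child_weight": 1}),
    (0.86231, 0.03169, {"max_depth": 7, "min_child_weight": 3}),
    (0.83431, 0.01704, {"max_depth": 7, "min_child_weight": 5}),
    (0.88681, 0.02723, {"max_depth": 9, "min_child_weight": 1}),
    (0.85298, 0.01765, {"max_depth": 9, "min_child_weight": 3}),
    (0.82730, 0.01440, {"max_depth": 9, "min_child_weight": 5}),
]

best_mean, best_std, best_params = max(grid_scores, key=lambda row: row[0])
threshold = best_mean - best_std

# Settings that are statistically hard to tell apart from the winner.
close = [params for mean, std, params in grid_scores if mean >= threshold]

print(best_params)
for params in close:
    print(params)
```

Every setting with min_child_weight=5 drops out, while several shallower trees remain in contention, so the choice of max_depth=9 is worth re-checking once the other parameters are tuned.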

In [26]:
params = {'max_depth': [8,9,10,11,12],
         }

rs = GridSearchCV(xgb.XGBClassifier(min_child_weight=1,
                                    n_estimators=40),
                       params)
rs.fit(X_train,y_train)

GridSearchCV(cv=None, error_score='raise',
       estimator=XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bynode=1, colsample_bytree=1, gamma=0, learning_rate=0.1,
       max_delta_step=0, max_depth=3, min_child_weight=1,
       missing=None, n_estimators=40, n_jobs=1,
       nthread=None, objective='binary:logistic', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=None, subsample=1, verbosity=1),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'max_depth': [8, 9, 10, 11, 12]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [27]:
rs.best_params_

{'max_depth': 9}

### Step 3: Tune gamma

In [30]:
params = {'gamma': [0,0.1,0.15,0.2],
         }

rs = GridSearchCV(xgb.XGBClassifier(min_child_weight=1,
                                    max_depth=9,
                                    n_estimators=40),
                       params)
rs.fit(X_train,y_train)

GridSearchCV(cv=None, error_score='raise',
       estimator=XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bynode=1, colsample_bytree=1, gamma=0, learning_rate=0.1,
       max_delta_step=0, max_depth=9, min_child_weight=1,
       missing=None, n_estimators=40, n_jobs=1,
       nthread=None, objective='binary:logistic', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=None, subsample=1, verbosity=1),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'gamma': [0, 0.1, 0.15, 0.2]}, pre_dispatch='2*n_jobs',
       refit=True, return_train_score='warn', scoring=None, verbose=0)

In [29]:
rs.best_params_

{'gamma': 0.1}

### Step 4: Tune subsample and colsample_bytree

In [31]:
params = {'subsample': [0.6,0.7,0.8,0.9,1],
          'colsample_bytree':[0.6,0.7,0.8,0.9,1]
         }

rs = GridSearchCV(xgb.XGBClassifier(min_child_weight=1,
                                    max_depth=9,
                                    gamma=0.1,
                                    n_estimators=40),
                       params)
rs.fit(X_train,y_train)

GridSearchCV(cv=None, error_score='raise',
       estimator=XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bynode=1, colsample_bytree=1, gamma=0.1,
       learning_rate=0.1, max_delta_step=0, max_depth=9,
       min_child_weight=1, missing=None, n_estimators=40, n_jobs=1,
       nthread=None, objective='binary:logistic', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=None, subsample=1, verbosity=1),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'subsample': [0.6, 0.7, 0.8, 0.9, 1], 'colsample_bytree': [0.6, 0.7, 0.8, 0.9, 1]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [33]:
rs.best_params_

{'colsample_bytree': 1, 'subsample': 1}
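
Steps 2 to 4 follow one pattern: search a small grid for one group of parameters, fix the winners, then move to the next group. A minimal sketch of that pattern is shown below. `tune_in_stages` and `toy_score` are hypothetical names; `toy_score` is an invented stand-in for cross-validated accuracy and its optimum is arbitrary, so the result says nothing about the newsgroups data.

```python
from itertools import product


def tune_in_stages(score_fn, start, stages):
    """Tune one group of parameters at a time, keeping earlier winners fixed."""
    current = dict(start)
    for grid in stages:
        names = list(grid)
        best_score, best_values = None, None
        for values in product(*(grid[name] for name in names)):
            trial = dict(current, **dict(zip(names, values)))
            score = score_fn(trial)
            if best_score is None or score > best_score:
                best_score, best_values = score, values
        # Freeze this stage's winners before moving on.
        current.update(zip(names, best_values))
    return current


def toy_score(p):
    """Invented score with an arbitrary optimum; higher is better."""
    return -(abs(p["max_depth"] - 5) + abs(p["min_child_weight"] - 1)
             + abs(p["gamma"] - 0.2) + abs(p["subsample"] - 0.8))


start = {"max_depth": 6, "min_child_weight": 1, "gamma": 0, "subsample": 1}
stages = [
    {"max_depth": [3, 5, 7, 9], "min_child_weight": [1, 3, 5]},
    {"gamma": [0, 0.1, 0.2]},
    {"subsample": [0.6, 0.8, 1.0]},
]
print(tune_in_stages(toy_score, start, stages))
```

Staged tuning is much cheaper than one joint grid, at the cost of possibly missing interactions between parameters tuned in different stages.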

In [40]:
params = {"learning_rate":0.05,
         "max_depth":9,
         "min_child_weight":1,
         "gamma":0.1,
         "subsample":1,
         "objective":"binary:logistic"}

bst = xgb.train(params,
                dtrain,
                num_boost_round=1000,
                verbose_eval=5,
                evals=[(dtest,"test")],
                early_stopping_rounds=20)

[0]	test-error:0.278947
Will train until test-error hasn't improved in 20 rounds.
[5]	test-error:0.284211
[10]	test-error:0.257895
[15]	test-error:0.259649
[20]	test-error:0.249123
[25]	test-error:0.250877
[30]	test-error:0.252632
[35]	test-error:0.24386
[40]	test-error:0.235088
[45]	test-error:0.233333
[50]	test-error:0.231579
[55]	test-error:0.235088
[60]	test-error:0.22807
[65]	test-error:0.224561
[70]	test-error:0.229825
[75]	test-error:0.226316
[80]	test-error:0.229825
[85]	test-error:0.229825
Stopping. Best iteration:
[67]	test-error:0.221053
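
Halving the learning rate from 0.1 to 0.05 moved the best iteration from round 40 to round 67 in the logs above. A rough way to see why smaller steps need more rounds is the sketch below, which makes the idealised assumption (for illustration only) that every tree removes a fraction eta of whatever error remains. `rounds_needed` is a hypothetical helper.

```python
import math


def rounds_needed(eta, remaining=0.01):
    """Rounds until the leftover error fraction (1 - eta)**n drops to `remaining`."""
    return math.ceil(math.log(remaining) / math.log(1.0 - eta))


for eta in (0.3, 0.1, 0.05):
    print(eta, rounds_needed(eta))
```

Real boosting does not follow this curve exactly, which is why the number of rounds is chosen with early stopping rather than a formula, and why it should be re-estimated whenever the learning rate changes.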


