# Parameter tuning in XGBoost

[XGBoost](http://xgboost.readthedocs.io/en/latest/) is an optimized distributed gradient boosting library which provides a parallel tree boosting that solve many data science problems in a fast and accurate way. XGBoost was the algorithm of choice for many winning teams of a number of machine learning competitions.

I'm going to use [BNP Paribas Cardif Claims Management](https://www.kaggle.com/c/bnp-paribas-cardif-claims-management/leaderboard) dataset to tune the parameters of XGBoost.

First of all, let's load the dataset:

In [124]:
import xgboost as xgb
import numpy as np
import pandas as pd
from xgboost.sklearn import XGBClassifier
from sklearn import cross_validation, metrics   
from sklearn.grid_search import GridSearchCV
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
import warnings
warnings.simplefilter("ignore")

data = pd.read_csv("bnp.csv")
print(data.shape)
data.head()

(114321, 133)


Unnamed: 0,ID,target,v1,v2,v3,v4,v5,v6,v7,v8,...,v122,v123,v124,v125,v126,v127,v128,v129,v130,v131
0,3,1,1.335739,8.727474,C,3.921026,7.915266,2.599278,3.176895,0.012941,...,8.0,1.98978,0.035754,AU,1.804126,3.113719,2.024285,0,0.636365,2.857144
1,4,1,,,C,,9.191265,,,2.30163,...,,,0.598896,AF,,,1.957825,0,,
2,5,1,0.943877,5.310079,C,4.410969,5.326159,3.979592,3.928571,0.019645,...,9.333333,2.477596,0.013452,AE,1.773709,3.922193,1.120468,2,0.883118,1.176472
3,6,1,0.797415,8.304757,C,4.22593,11.627438,2.0977,1.987549,0.171947,...,7.018256,1.812795,0.002267,CJ,1.41523,2.954381,1.990847,1,1.677108,1.034483
4,8,1,,,C,,,,,,...,,,,Z,,,,0,,


I'm going to use first 20000 rows of the dataset. Let's do some data preprocessing: change all NaNs to zeros and delete columns which contain strings. Also let's separate target column from dataset:

In [125]:
data = data[:20000]
data = data.fillna(0)
strings = data.select_dtypes(include='object')
data = data.drop(strings.columns.values.tolist(), axis=1)
target = data[['target']]
data = data.drop(['ID', 'target'], axis=1)
print(data.shape)
data.head()

(20000, 112)


Unnamed: 0,v1,v2,v4,v5,v6,v7,v8,v9,v10,v11,...,v121,v122,v123,v124,v126,v127,v128,v129,v130,v131
0,1.335739,8.727474,3.921026,7.915266,2.599278,3.176895,0.012941,9.999999,0.503281,16.434108,...,0.803572,8.0,1.98978,0.035754,1.804126,3.113719,2.024285,0,0.636365,2.857144
1,0.0,0.0,0.0,9.191265,0.0,0.0,2.30163,0.0,1.31291,0.0,...,0.0,0.0,0.0,0.598896,0.0,0.0,1.957825,0,0.0,0.0
2,0.943877,5.310079,4.410969,5.326159,3.979592,3.928571,0.019645,12.666667,0.765864,14.756098,...,2.238806,9.333333,2.477596,0.013452,1.773709,3.922193,1.120468,2,0.883118,1.176472
3,0.797415,8.304757,4.22593,11.627438,2.0977,1.987549,0.171947,8.965516,6.542669,16.347483,...,1.956521,7.018256,1.812795,0.002267,1.41523,2.954381,1.990847,1,1.677108,1.034483
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.050328,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0.0,0.0


In [126]:
target.head()

Unnamed: 0,target
0,1
1,1
2,1
3,1
4,1


Now let's divide our dataset into train and test samples:

In [127]:
X_train, X_test, y_train, y_test = train_test_split(data, target, train_size=0.7, random_state=42)

Now we should begin. Let's choose some initial parameters of XGBoost and see what is the algorithm's quality:

In [128]:
xgb = XGBClassifier(
    n_estimators = 20,
    learning_rage = 0.5,
    max_depth = 3,
    subsample = 0.6,
    min_child_weight = 1,
    colsample_bytree = 0.8,
    gamma = 1,
    scale_pos_weight = 1,
    n_jobs = 4
)

xgb.fit(X_train, y_train)

old_accuracy = xgb.score(X_test, y_test)
old_roc_train = roc_auc_score(y_train, xgb.predict(X_train))
old_roc_test = roc_auc_score(y_test, xgb.predict(X_test))

print('Accuracy:', old_accuracy)
print('Train AUC-ROC:', old_roc_train)
print('Test AUC-ROC:', old_roc_test)

Accuracy: 0.7578333333333334
Train AUC-ROC: 0.5362336888086562
Test AUC-ROC: 0.5224054335048152


Our goal is to improve these scores. We are going to do this selecting different values of parameters of XGBoost. For each parameter we are going to choose its possible values and find best values using the [GridSearchCV](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html). After we found best parameters we will save them and fit XGBoost again using them.

In [129]:
# find best value for n_estimators and learning_rate

params1 = {
    'n_estimators' : np.arange(20, 201, 20),
    'learning_rate' : np.arange(0.01, 0.51, 0.1)
}

clf = GridSearchCV(xgb, param_grid=params1)
clf.fit(X_train, np.array(y_train).ravel())

best_params1 = clf.best_params_

xgb.set_params(**best_params1)

print('Best value for n_estimators:', best_params1['n_estimators'])
print('Best value for learning_rate:', best_params1['learning_rate'])

Best value for n_estimators: 40
Best value for learning_rate: 0.11


In [130]:
# find best value for max_depth and min_child_weight

params2 = {
    'max_depth' : np.arange(2,8),
    'min_child_weight' : np.arange(1,6,2)
}

clf = GridSearchCV(xgb, param_grid=params2)
clf.fit(X_train, np.array(y_train).ravel())

best_params2 = clf.best_params_

xgb.set_params(**best_params2)

print('Best value for max_depth:', best_params2['max_depth'])
print('Best value for min_child_weight:', best_params2['min_child_weight'])

Best value for max_depth: 5
Best value for min_child_weight: 5


In [131]:
# find best value for max_depth and min_child_weight

params3 = {
    'gamma' : np.arange(0, 0.5, 5)
}

clf = GridSearchCV(xgb, param_grid=params3)
clf.fit(X_train, np.array(y_train).ravel())

best_params3 = clf.best_params_

xgb.set_params(**best_params3)

print('Best value for gamma:', best_params3['gamma'])

Best value for gamma: 0.0


In [132]:
# find best value for subsample and colsample_bytree

params4 = {
    'subsample' : [0.6, 0.7, 0.8, 0.9],
    'colsample_bytree' : [0.6, 0.7, 0.8, 0.9]
}

clf = GridSearchCV(xgb, param_grid=params4)
clf.fit(X_train, np.array(y_train).ravel())

best_params4 = clf.best_params_

xgb.set_params(**best_params4)

print('Best value for subsample:', best_params4['subsample'])
print('Best value for colsample_bytree:', best_params4['colsample_bytree'])

Best value for subsample: 0.6
Best value for colsample_bytree: 0.8


In [133]:
# find best value for reg_alpha

params5 = {
    'reg_alpha' : np.logspace(-5, 2, 10)
}

clf = GridSearchCV(xgb, param_grid=params5)
clf.fit(X_train, np.array(y_train).ravel())

best_params5 = clf.best_params_

xgb.set_params(**best_params5)

print('Best value for reg_alpha:', best_params5['reg_alpha'])

Best value for reg_alpha: 1e-05


Now lets fit XGBoost again and see what is the difference:

In [136]:
xgb.fit(X_train, y_train)

print('Accuracy: old value: {}, new value: {}'.format(old_accuracy, xgb.score(X_test, y_test)))
print('Train AUC-ROC: {}, new value: {}'.format(old_roc_train, roc_auc_score(y_train, xgb.predict(X_train))))
print('Test AUC-ROC: {}, new value: {}'.format(old_roc_test, roc_auc_score(y_test, xgb.predict(X_test))))

Accuracy: old value: 0.7578333333333334, new value: 0.7655
Train AUC-ROC: 0.5362336888086562, new value: 0.574835634451019
Test AUC-ROC: 0.5224054335048152, new value: 0.5448417515224935


As we can see, after tuning the parameters of XGBoost, accuracy score and AUC-ROC score has improved. 