### XGBoost. Measuring accuracy

Outline:
<ul>
<li>XGB fit/predict</li>
<li>Measuring accuracy with cross-validation</li>
<li>Measuring AUC</li>
</ul>

It's time to create your first XGBoost model! You can use the scikit-learn .fit() / .predict() paradigm that you are already familiar with to build your XGBoost models, because the xgboost library has a scikit-learn compatible API.

Here, you'll be working with the 'Pima Indians Diabetes Database'. The dataset consists of several medical predictor (independent) variables and one target (dependent) variable, Outcome. The independent variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and so on.

Your goal is to predict the onset of diabetes based on diagnostic measures. To do this, you'll split the data into training and test sets, fit a small xgboost model on the training set, and evaluate its performance on the test set by computing its accuracy.

As usual, import pandas and numpy as pd and np, and train_test_split from sklearn.model_selection.

<b>1. XGB fit/predict</b>

In [1]:
import os
os.environ['KMP_DUPLICATE_LIB_OK']='True'

Import the libs

In [2]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import xgboost as xgb



Read the dataset

In [5]:
dataset = pd.read_csv('pima-indians-diabetes.data.csv')

In [6]:
dataset.head()

Unnamed: 0,x1,x2,x3,x4,x5,x6,x7,x8,y
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


Create arrays for the features and the target: X, y

In [11]:
X = dataset.iloc[:, :8]
y = dataset.iloc[:, 8]

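Create the training and test sets, setting random_state to 123. Because test_size is not given, train_test_split holds out its default 25% of the rows for testing.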

In [12]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 123)

Instantiate the XGBClassifier as <b>xg_cl</b>.

Set the following parameters on the object (please refer to the <a href='https://xgboost.readthedocs.io/en/latest/python/python_api.html'>documentation</a>):
<ul>
<li>objective: binary logistic ('binary:logistic')</li>
<li>n_estimators: 10</li>
<li>seed: 123</li>
<li>n_jobs: optional</li>
</ul>

In [13]:
xg_cl = xgb.XGBClassifier(objective = 'binary:logistic', n_estimators=10, seed=123)

<b>Fit</b> the classifier to the training set

In [14]:
xg_cl.fit(X_train, y_train)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=10,
       n_jobs=1, nthread=None, objective='binary:logistic', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=123,
       silent=True, subsample=1)

Make the <b>predictions</b> on the test set

In [15]:
y_pred = xg_cl.predict(X_test)
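
For intuition about what predict() returns under the binary:logistic objective: the model produces a raw score (margin) for each row, the logistic function maps that score to a probability, and the predicted class is 1 when the probability exceeds 0.5. The sketch below reproduces only that last step with NumPy. The margin values are made up for illustration and do not come from xg_cl.

```python
import numpy as np

# Hypothetical raw scores (margins) for three patients; not real model output.
margins = np.array([-2.0, 0.3, 1.5])

# The logistic (sigmoid) function maps a margin to a probability in (0, 1).
probabilities = 1.0 / (1.0 + np.exp(-margins))

# Class 1 is predicted when the probability is above 0.5.
predicted_labels = (probabilities > 0.5).astype(int)

print(probabilities.round(3))
print(predicted_labels)
```

On the fitted classifier, xg_cl.predict_proba(X_test)[:, 1] gives the corresponding probabilities for the positive class.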


Compute the accuracy as follows:

count the predictions in y_pred that equal the corresponding labels in y_test, and divide that count by the number of samples in y_test.

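The reference accuracy of 0.759740 appears to correspond to a 20% test split (test_size=0.2). With the default 25% split used above, the result differs slightly.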

In [16]:
accuracy = float(np.sum(y_pred==y_test))/y_test.shape[0]
print("accuracy: %f" % (accuracy))

accuracy: 0.770833
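
The accuracy formula can be checked on a small example. The sketch below uses made-up labels and predictions (the names y_true and y_hat are chosen so they do not overwrite y_test or y_pred) and shows that the explicit sum-and-divide form equals the mean of the boolean comparison.

```python
import numpy as np

# Hypothetical labels and predictions for eight samples (illustrative only).
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_hat = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# Explicit form: number of matches divided by number of samples.
accuracy_sum = float(np.sum(y_hat == y_true)) / y_true.shape[0]

# Equivalent shorthand: the mean of a boolean array is the fraction of True values.
accuracy_mean = float(np.mean(y_hat == y_true))

print("accuracy: %f" % accuracy_mean)
```

Six of the eight predictions match, so both forms give 0.75.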


<b>2. Measuring accuracy</b>

You'll now practice using XGBoost's learning API through its built-in <b>cross-validation</b> capabilities. XGBoost gets its lauded performance and efficiency gains partly by using its own optimized data structure for datasets, called a DMatrix.

Up to this point in the lab we didn't use <b>DMatrix</b> directly, because the scikit-learn wrapper converts the data internally. When you use the xgb.cv() (cross-validation) function, however, you have to <b>first explicitly convert</b> your data into a DMatrix. So, that's what you will do here before running cross-validation on your dataset.

Create a DMatrix called <b>dataset_dmatrix</b> from dataset using xgb.DMatrix(). The features are available in X and the labels in y.

In [17]:
dataset_dmatrix = xgb.DMatrix(X, y)

Create the parameter dictionary: params

the objective is reg:logistic

max_depth is 3

You can refer to the <i>Learning Task Parameters</i> section of <a href='https://xgboost.readthedocs.io/en/latest/parameter.html'>documentation</a>

In [31]:
params = {'objective':'reg:logistic', 'max_depth':3}

Perform 3-fold cross-validation by calling xgb.cv(). 

dtrain is your dataset_dmatrix, <br />
params is your parameter dictionary,  <br />
nfold is the number of cross-validation folds (3),  <br />
num_boost_round is the number of trees we want to build (5),  <br />
metrics is the metric you want to compute (this will be "error", which we will convert to an accuracy),  <br />
as_pandas = True returns the results as a pandas DataFrame.

You can find more about cv in the <i>Learning API</i> section of <a href='https://xgboost.readthedocs.io/en/latest/python/python_api.html'>documentation</a>
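
To make the meaning of nfold concrete, here is a minimal sketch of how k-fold splitting works in general. It is not XGBoost's internal code (xgb.cv does its own splitting), and the row count of 768 is an assumption about the size of this dataset.

```python
import numpy as np

n_samples = 768  # assumed number of rows in the dataset
n_folds = 3

# Shuffle the row indices once, then cut them into n_folds nearly equal parts.
rng = np.random.default_rng(123)
folds = np.array_split(rng.permutation(n_samples), n_folds)

for k, test_idx in enumerate(folds):
    # Fold k is held out for evaluation; the remaining folds are used for training.
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != k])
    print("fold %d: %d train rows, %d test rows" % (k, len(train_idx), len(test_idx)))
```

Every row is used for testing exactly once, and the reported metric is the mean (and standard deviation) across the folds.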

In [33]:
# Perform cross-validation: cv_results
cv_results = xgb.cv(dtrain=dataset_dmatrix, params=params, nfold=3, num_boost_round=5, metrics='error', as_pandas=True)

[00:45:18] /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 14 extra nodes, 0 pruned nodes, max_depth=3
[00:45:18] /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 14 extra nodes, 0 pruned nodes, max_depth=3
[00:45:18] /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 14 extra nodes, 0 pruned nodes, max_depth=3
[00:45:18] /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 14 extra nodes, 0 pruned nodes, max_depth=3
[00:45:18] /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 14 extra nodes, 0 pruned nodes, max_depth=3
[00:45:18] /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 12 extra nodes, 0 pruned nodes, max_depth=3
[00:45:18] /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 14 extra nodes, 0 pruned nodes, max_depth=3
[00:45:18] /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 14 extra nodes, 0 pruned nodes, max_depth=3
[00:45:18] /work

Print cv_results

In [34]:
print(cv_results)

   train-error-mean  train-error-std  test-error-mean  test-error-std
0          0.219401         0.017052         0.248698        0.012890
1          0.207682         0.013563         0.251302        0.014382
2          0.193359         0.002762         0.246094        0.003189
3          0.186198         0.004603         0.239583        0.011201
4          0.180338         0.011755         0.235677        0.008027


Compute and print the accuracy for the fifth boosting round (iteration) using this formula:

1 - test-error-mean[i]

where i is the index of the boosting round (the fifth round is the last row, i = 4)

In [43]:
print(1 - cv_results['test-error-mean'].iloc[-1])

0.764323
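
The same conversion can be applied to every boosting round at once. The sketch below copies the test-error-mean values from the table printed earlier into a standalone pandas Series, so it runs without the cv_results object.

```python
import pandas as pd

# Test-error means copied from the printed cv_results table.
test_error_mean = pd.Series(
    [0.248698, 0.251302, 0.246094, 0.239583, 0.235677],
    name="test-error-mean",
)

# Accuracy is the complement of the classification error, row by row.
test_accuracy = 1 - test_error_mean

print(test_accuracy.round(6).tolist())
print("best round (0-based):", test_accuracy.idxmax())
```

With the real results, the equivalent expression is 1 - cv_results['test-error-mean'].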


<b>3. Measuring AUC</b>

Now that you've used cross-validation to compute average out-of-sample accuracy (after converting from an error), it's very easy to compute any other metric you might be interested in. All you have to do is pass it (or a list of metrics) in as an argument to the metrics parameter of xgb.cv().

Your job in this exercise is to compute another common metric used in binary classification: the area under the ROC curve ("auc"). As before, dataset is available in your workspace, along with the DMatrix dataset_dmatrix and parameter dictionary params.

Perform 3-fold cross-validation with 5 boosting rounds and <b>"auc"</b> as your metric.

In [45]:
# Perform cross_validation: cv_results
cv_results = xgb.cv(dtrain=dataset_dmatrix, params=params, nfold=3, num_boost_round=5, metrics='auc')

[01:11:55] /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 14 extra nodes, 0 pruned nodes, max_depth=3
[01:11:55] /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 14 extra nodes, 0 pruned nodes, max_depth=3
[01:11:55] /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 14 extra nodes, 0 pruned nodes, max_depth=3
[01:11:55] /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 14 extra nodes, 0 pruned nodes, max_depth=3
[01:11:55] /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 14 extra nodes, 0 pruned nodes, max_depth=3
[01:11:55] /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 12 extra nodes, 0 pruned nodes, max_depth=3
[01:11:55] /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 14 extra nodes, 0 pruned nodes, max_depth=3
[01:11:55] /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 14 extra nodes, 0 pruned nodes, max_depth=3
[01:11:55] /work

Print cv_results

In [46]:
print(cv_results)

   train-auc-mean  train-auc-std  test-auc-mean  test-auc-std
0        0.834422       0.007203       0.777795      0.038891
1        0.870386       0.009238       0.816690      0.006971
2        0.881643       0.007172       0.819141      0.009037
3        0.890653       0.007123       0.824610      0.008893
4        0.896990       0.006331       0.825160      0.009273


Print the AUC (for the last boosting round)

In [48]:
print(cv_results['test-auc-mean'].iloc[-1])

0.8251603333333334
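
One way to read an AUC value: it is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it directly from that pairwise definition on made-up labels and scores. It illustrates the metric and is not how XGBoost computes it internally.

```python
import numpy as np

# Hypothetical labels and predicted scores (illustrative only).
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

pos = scores[y_true == 1]
neg = scores[y_true == 0]

# Compare every positive score with every negative score.
# A correctly ordered pair counts 1, a tie counts 0.5.
wins = (pos[:, None] > neg[None, :]).sum()
ties = (pos[:, None] == neg[None, :]).sum()
auc = (wins + 0.5 * ties) / (len(pos) * len(neg))

print(auc)
```

Three of the four positive-negative pairs are ordered correctly here, so the AUC is 0.75. A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking.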
