XGBoost is an implementation of gradient boosted decision trees designed for speed and performance, and it is a dominant tool in competitive machine learning.

# 1. Install XGBoost for Use in Python

```bash
git clone --recursive https://github.com/dmlc/xgboost
cd xgboost
cp make/minimum.mk ./config.mk
make -j4
cd python-package
sudo python setup.py install
```

# 2. Predict Onset of Diabetes

The [Pima Indians onset of diabetes dataset](https://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes) comprises 8 input variables that describe patients' medical details and one output variable that indicates whether the patient develops diabetes within 5 years.

In [1]:
import xgboost
from numpy import loadtxt
try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib2 import urlopen  # Python 2
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [2]:
# load data: the file is a headerless CSV with 9 numeric columns
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data'
dat = urlopen(url)
dataset = loadtxt(dat, delimiter=",")

In [3]:
# split data into input features X (first 8 columns) and labels Y (last column)
X = dataset[:,0:8]
Y = dataset[:,8]
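
To make the slicing concrete, here is a small sketch on three made-up rows laid out like the dataset (8 input columns followed by a 0/1 label). The values and the `toy_*` names are illustrative only, not real records.

```python
from io import StringIO

import numpy as np

# Three made-up rows in the same layout as the dataset:
# 8 input columns followed by one 0/1 output column.
toy_csv = StringIO(
    "1,2,3,4,5,6,7,8,0\n"
    "2,3,4,5,6,7,8,9,1\n"
    "3,4,5,6,7,8,9,10,0\n"
)
toy = np.loadtxt(toy_csv, delimiter=",")

toy_X = toy[:, 0:8]  # columns 0..7: the input variables
toy_Y = toy[:, 8]    # column 8: the class label

print(toy_X.shape)  # (3, 8)
print(toy_Y)        # [0. 1. 0.]
```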

In [4]:
# split data into train and test sets
seed = 999
test_size = 0.33
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=test_size, random_state=seed)
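
Conceptually, the split shuffles the rows with a seeded random generator and holds out a fraction of them for testing. The sketch below mimics that idea with NumPy on a toy array. It is an illustration under that assumption, not scikit-learn's actual implementation, and `toy_split` is a hypothetical helper.

```python
import math

import numpy as np


def toy_split(X, Y, test_size, seed):
    """Shuffle rows with a fixed seed and hold out a fraction for testing.

    Hypothetical helper for illustration; not scikit-learn's implementation.
    """
    rng = np.random.RandomState(seed)
    order = rng.permutation(len(X))
    n_test = int(math.ceil(test_size * len(X)))  # round the test share up
    test_idx, train_idx = order[:n_test], order[n_test:]
    return X[train_idx], X[test_idx], Y[train_idx], Y[test_idx]


toy_X = np.arange(20).reshape(10, 2)  # 10 rows, 2 features
toy_Y = np.arange(10) % 2             # alternating 0/1 labels

X_tr, X_te, y_tr, y_te = toy_split(toy_X, toy_Y, test_size=0.33, seed=999)
print(X_tr.shape, X_te.shape)  # (6, 2) (4, 2)
```

Fixing the seed makes the split reproducible: the same rows land in the test set on every run.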

In [5]:
# fit model on training data
model = xgboost.XGBClassifier()
model.fit(X_train, y_train)

XGBClassifier(base_score=0.5, colsample_bylevel=1, colsample_bytree=1,
       gamma=0, learning_rate=0.1, max_delta_step=0, max_depth=3,
       min_child_weight=1, missing=None, n_estimators=100, nthread=-1,
       objective='binary:logistic', reg_alpha=0, reg_lambda=1,
       scale_pos_weight=1, seed=0, silent=True, subsample=1)

XGBoost's `binary:logistic` objective produces, for each input, the probability that it belongs to the positive class (label 1). The scikit-learn wrapper's `predict` method already thresholds these probabilities and returns class labels (use `predict_proba` to get the probabilities themselves), so rounding the predictions to 0 or 1 below leaves them unchanged and only guarantees binary values.
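
The rounding step amounts to thresholding at 0.5. Here is a minimal sketch with made-up probabilities and no model involved; `to_labels` is a hypothetical helper. It uses an explicit comparison because Python 3's `round` sends exact ties such as 0.5 to the nearest even integer.

```python
def to_labels(probabilities, threshold=0.5):
    """Map probabilities to 0/1 labels (hypothetical helper)."""
    return [1 if p >= threshold else 0 for p in probabilities]


probs = [0.08, 0.50, 0.73, 0.49]  # made-up values for illustration
print(to_labels(probs))  # [0, 1, 1, 0]
```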

In [6]:
# make predictions for test data
y_pred = model.predict(X_test)
predictions = [round(value) for value in y_pred]

In [7]:
# evaluate predictions
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: {:.2f}".format(accuracy))

Accuracy: 0.73
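
Accuracy here is the fraction of test rows whose predicted label matches the true label. A minimal sketch of that calculation on made-up labels follows; `toy_accuracy` is a hypothetical helper, not the scikit-learn function.

```python
def toy_accuracy(y_true, y_pred):
    """Fraction of positions where prediction equals truth (hypothetical helper)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / float(len(y_true))


truth = [0, 1, 1, 0]  # made-up true labels
guess = [0, 1, 0, 0]  # made-up predictions
print("Accuracy: {:.2f}".format(toy_accuracy(truth, guess)))  # Accuracy: 0.75
```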


# Reference

- http://machinelearningmastery.com/develop-first-xgboost-model-python-scikit-learn/