# First XGBoost Model with scikit-learn

XGBoost is an implementation of gradient-boosted decision trees designed for speed and performance, and it is widely used in machine learning competitions.

For up-to-date instructions on installing XGBoost for Python, see the [XGBoost build instructions](http://xgboost.readthedocs.io/en/latest/build.html#building-on-osx).

For details on the Python interface, see the XGBoost Python [API reference](http://xgboost.readthedocs.io/en/latest/python/python_api.html).

## Problem Description: Predict Onset of Diabetes

We are going to use the Pima Indians onset of diabetes [dataset](https://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes).

This dataset consists of 8 input variables that describe medical details of patients and one output variable that indicates whether the patient will have an onset of diabetes within 5 years.

This is a good dataset for a first XGBoost model because all of the input variables are numeric and the problem is a simple binary classification problem. It is not necessarily a good problem for the XGBoost algorithm because it is a relatively small dataset and an easy problem to model.

## Train the XGBoost Model

XGBoost provides a wrapper class to allow models to be treated like classifiers or regressors in the scikit-learn framework.

This means we can use the full scikit-learn library with XGBoost models.

The XGBoost model for classification is called **XGBClassifier**. We can create it and fit it to our training dataset. Models are fit using the scikit-learn API and the model.fit() function.

Parameters for training the model can be passed to the model in the constructor. Here, we use the sensible defaults.

You can learn more about the meaning of each parameter and how to configure them on the XGBoost [parameters](http://xgboost.readthedocs.io/en/latest//parameter.html) page.

You can see the parameters used in a trained model by printing the model, for example:

<pre>
print(model)
</pre>

## Make Predictions with XGBoost Model

We can make predictions using the fit model on the test dataset.

To make predictions we use the scikit-learn function model.predict().

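With the scikit-learn wrapper, model.predict() returns class labels (0 or 1) rather than probabilities. The underlying binary:logistic objective produces the probability of the input pattern belonging to the positive class (class 1), and these probabilities are available through model.predict_proba(). The listing below still rounds each prediction to 0 or 1, which is harmless here and would be required if the predictions were probabilities.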
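
To illustrate the conversion from probabilities to binary class values, here is a minimal sketch in plain Python. The probabilities are made up for illustration (they are not output from the model), and the helper name to_class_labels and the 0.5 threshold are assumptions.

```python
# Hypothetical probabilities of the positive class (made-up values)
probabilities = [0.12, 0.87, 0.50, 0.63, 0.49]

def to_class_labels(probs, threshold=0.5):
    """Map each probability to 1 if it exceeds the threshold, else 0."""
    return [1 if p > threshold else 0 for p in probs]

labels = to_class_labels(probabilities)
print(labels)
```

For values between 0 and 1 this gives the same result as Python's built-in round(), which rounds exactly 0.5 down to 0.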

Now that we have used the fit model to make predictions on new data, we can evaluate the performance of the predictions by comparing them to the expected values. For this we will use the built-in accuracy_score() function in scikit-learn.
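
Classification accuracy is the fraction of predictions that match the expected values. The sketch below computes it by hand on made-up labels to show what accuracy_score() measures. The helper name manual_accuracy is hypothetical, and in practice you would call accuracy_score() directly.

```python
# Made-up expected and predicted labels, for illustration only
expected = [0, 1, 1, 0, 1, 0, 0, 1]
predicted = [0, 1, 0, 0, 1, 1, 0, 1]

def manual_accuracy(y_true, y_pred):
    """Return the fraction of positions where the two label lists agree."""
    if len(y_true) != len(y_pred):
        raise ValueError("label lists must have the same length")
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)

accuracy = manual_accuracy(expected, predicted)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
```

Six of the eight made-up predictions match, so the sketch reports 75.00%.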

We can tie all of these pieces together. Below is the full code listing.

<pre>
# First XGBoost model for Pima Indians dataset
import numpy
import xgboost
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# load data
dataset = numpy.loadtxt('pima-indians-diabetes.csv', delimiter=",")

# split data into input features (X) and class labels (y)
X = dataset[:,0:8]
y = dataset[:,8]

# split data into train and test sets
seed = 7
test_size = 0.33
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=seed)

# fit model on training data
model = xgboost.XGBClassifier()
print(model)
model.fit(X_train, y_train)

# make predictions for test data
y_pred = model.predict(X_test)
# rounding guards against fractional values; class labels pass through unchanged
predictions = [round(value) for value in y_pred]

# evaluate predictions
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
</pre>

Running the example prints the model configuration followed by the accuracy:

<pre>
XGBClassifier(base_score=0.5, colsample_bylevel=1, colsample_bytree=1,
       gamma=0, learning_rate=0.1, max_delta_step=0, max_depth=3,
       min_child_weight=1, missing=None, n_estimators=100, nthread=-1,
       objective='binary:logistic', reg_alpha=0, reg_lambda=1,
       scale_pos_weight=1, seed=0, silent=True, subsample=1)
Accuracy: 77.95%
</pre>


This is a good accuracy score on this problem, which we would expect, given the capabilities of the model and the modest complexity of the problem.