# Dask and XGBoost


Dask and XGBoost can work together to train gradient boosted trees in parallel. This notebook shows how.

XGBoost is a widely used gradient boosting library. It has been used in many winning Kaggle entries and is popular in industry because its models tend to be accurate and are reasonably easy to interpret (for example, it is straightforward to find the important features of an XGBoost model).

The goals of this notebook are to show Dask and XGBoost working together and to explain a little of what each one contributes. Briefly:

1.  Dask provides distributed dataframes for data access, cleaning, and pre-processing
2.  XGBoost provides distributed training and prediction
3.  Dask is able to set up XGBoost in distributed mode, hand it data, and let it do the training easily


<img src="http://dask.readthedocs.io/en/latest/_images/dask_horizontal.svg" width="30%" alt="Dask logo"> <img src="https://raw.githubusercontent.com/dmlc/dmlc.github.io/master/img/logo-m/xgboost.png" width="25%" alt="XGBoost logo">

In [None]:
%matplotlib inline

In [None]:
from dask.distributed import Client

client = Client(n_workers=4, threads_per_worker=1)
client

## Create data

First we create synthetic data of a realistic size:

In [None]:
import dask.array as da
from dask_ml.model_selection import train_test_split
import numpy as np

num_samples, num_features = 100000, 20

X = da.random.normal(size=(num_samples, num_features),
                     chunks=num_samples // 100)
w_star = da.random.uniform(size=num_features,
                           chunks=num_features)**3
y = da.sign(X @ w_star)

# for binary:logistic objective, only [0, 1] labels allowed
y = (y + 1) / 2
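
To make the array layout and the label transformation above concrete, here is a small sketch in plain Python (no Dask required). The helper names `chunk_sizes` and `to_binary_label` are made up for this illustration. It assumes that an integer `chunks` value is used as the block size along every axis, with the last block on each axis holding whatever remains.

```python
import math


def chunk_sizes(dim_length, chunk):
    """Split one axis of length `dim_length` into blocks of at most `chunk`."""
    full, rest = divmod(dim_length, chunk)
    return (chunk,) * full + ((rest,) if rest else ())


num_samples, num_features = 100000, 20
chunk = num_samples // 100  # 1000, assumed to apply to every axis

row_chunks = chunk_sizes(num_samples, chunk)   # 100 blocks of 1000 rows
col_chunks = chunk_sizes(num_features, chunk)  # one block of 20 columns
n_blocks = len(row_chunks) * len(col_chunks)


def to_binary_label(score):
    """Map the sign of a score to a 0/1 label, mirroring (sign + 1) / 2."""
    sign = math.copysign(1.0, score) if score != 0 else 0.0
    return (sign + 1) / 2


print(n_blocks, to_binary_label(2.5), to_binary_label(-0.3))
```

Under that assumption, `X` is made of 100 blocks of 1000 rows by 20 columns. A score of exactly zero would map to 0.5, but with continuous random data this is very unlikely in practice.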

Now we separate our data into a training set and a testing set, so that we can evaluate the model on data it has not seen during training.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15)
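
Conceptually, a split with `test_size=0.15` sets aside 15% of the rows for testing. The following plain-Python sketch illustrates the idea on row indices only; `split_indices` is a hypothetical helper, and `dask_ml`'s implementation works on Dask collections and may differ in its details.

```python
import random


def split_indices(n, test_size, seed=0):
    """Shuffle range(n) and split it into (train, test) index lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_test = round(n * test_size)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(100000, test_size=0.15)
print(len(train_idx), len(test_idx))  # 85000 15000
```

No row ends up in both sets, which is what makes the test score an honest estimate of performance on unseen data.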

## Train with XGBoost

Now, let's try to do something with this data using [dask-xgboost][dxgb].

[dxgb]:https://github.com/dask/dask-xgboost

In [None]:
import dask_xgboost
import xgboost

dask-xgboost is a small wrapper around xgboost, and will behave the same as xgboost.

During training, Dask takes care of loading, cleaning, and pre-processing the data. It then sets XGBoost up on the cluster and hands it the data, and XGBoost runs its own distributed training system in the background using all the workers Dask has available.

Let's do some training:

In [None]:
%%time
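# The number of boosting rounds is passed as the `num_boost_round`
# keyword (forwarded to `xgboost.train`), not as an entry in `params`.
params = {'objective': 'binary:logistic',
          'max_depth': 5, 'eta': 0.01, 'subsample': 0.5,
          'min_child_weight': 0.5}

bst = dask_xgboost.train(client, params, X_train, y_train,
                         num_boost_round=1000)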
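
As a rough illustration of two of the parameters above, the sketch below shows how a boosted model with the `binary:logistic` objective turns per-round tree outputs into a probability, and how `eta` shrinks each round's contribution. This is a conceptual simplification, not XGBoost's implementation: `predict_proba` and the tree outputs are hypothetical, and details such as the base score are ignored.

```python
import math


def predict_proba(tree_outputs, eta):
    """Sum the per-round outputs, each shrunk by eta, then apply a sigmoid."""
    margin = sum(eta * out for out in tree_outputs)
    return 1.0 / (1.0 + math.exp(-margin))


# Hypothetical case: 100 rounds that each vote +1.0 for the positive class.
few_rounds = predict_proba([1.0] * 10, eta=0.01)
many_rounds = predict_proba([1.0] * 100, eta=0.01)
print(few_rounds, many_rounds)
```

With a small `eta` such as 0.01, each round moves the prediction only slightly, which is why a small learning rate is usually paired with a large number of rounds.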

## Analyze Results

The `bst` object is a regular `xgboost.Booster` object. 

In [None]:
bst

This means all the methods mentioned in the [XGBoost documentation][2] are available. 

### Plot feature importance

For example, we can use XGBoost's `plot_importance` function to plot the importance of each feature.

[2]:https://xgboost.readthedocs.io/en/latest/python/python_intro.html#

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import dask

w_star = dask.compute(w_star)[0]
idx_top = np.argsort(w_star)[-7:]
top = w_star[idx_top]

fig, axs = plt.subplots(figsize=(12, 8), ncols=2)

axs[0] = xgboost.plot_importance(bst, ax=axs[0], height=0.8, max_num_features=9)

axs[0].grid(False, axis="y")
axs[0].set_title('Estimated feature importance')

df = pd.DataFrame({'feature_importance': top}, index=idx_top)
df.plot.bar(ax=axs[1])
axs[1].set_title('Ground truth feature importance')
plt.show()

We can see from this example that XGBoost does a good job at feature *support recovery*, that is, figuring out which features are important. In the run shown here, feature `9` appears in both plots (in the left plot as `f9`), and the top 5 estimated most important features are *actually* the top 5 most important features.
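
The visual comparison can also be checked numerically by intersecting the top-k feature sets from both sources. The sketch below uses made-up numbers purely for illustration: `true_weights` stands in for `w_star`, and `estimated` is shaped like the dictionary returned by `Booster.get_score()`, with keys such as `f9`. The helper `top_k_features` is hypothetical.

```python
def top_k_features(scores, k):
    """Return the set of the k feature indices with the largest scores."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:k])


# Hypothetical values, for illustration only.
true_weights = {0: 0.02, 3: 0.61, 5: 0.10, 9: 0.93, 12: 0.45, 17: 0.30}
estimated = {"f9": 410, "f3": 295, "f12": 180, "f17": 90, "f5": 40, "f0": 12}

# Convert names like "f9" to the integer index 9.
estimated_by_index = {int(name[1:]): score for name, score in estimated.items()}

k = 3
overlap = top_k_features(true_weights, k) & top_k_features(estimated_by_index, k)
print(sorted(overlap))  # [3, 9, 12]
```

An overlap of size `k` means the model recovered exactly the `k` most important features, regardless of their order.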


### Plot ROC Curve


We can also use Scikit-Learn to plot the [Receiver Operating Characteristic (ROC) curve](https://en.wikipedia.org/wiki/Receiver_operating_characteristic) to see how well our model performs.

In [None]:
y_hat = dask_xgboost.predict(client, bst, X_test).persist()
y_hat

In [None]:
from sklearn.metrics import roc_curve, auc
fig, ax = plt.subplots(figsize=(5, 5))
fpr, tpr, _ = roc_curve(y_test, y_hat)
ax.plot(fpr, tpr, lw=3,
        label='ROC Curve (area = {:.2f})'.format(auc(fpr, tpr)))
ax.plot([0, 1], [0, 1], 'k--', lw=2)

ax.set(
    xlim=(0, 1),
    ylim=(0, 1),
    title="ROC Curve",
    xlabel="False Positive Rate",
    ylabel="True Positive Rate",
)
ax.legend()
plt.show()

This Receiver Operating Characteristic (ROC) curve tells how well our classifier is doing. We can tell it's doing well by how far it bends toward the upper-left. A perfect classifier would pass through the upper-left corner, and a random classifier would follow the dashed diagonal line.

The area under this curve (AUC) is shown in the legend (`area = 0.89` in the run this notebook was written from; the exact value varies between runs because the data is random). It is the probability that our classifier assigns a higher score to a randomly chosen positive instance than to a randomly chosen negative one.
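
One common way to read the area under the ROC curve is as a ranking probability: the chance that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The plain-Python sketch below computes it directly from that definition on a tiny hand-made example; `pairwise_auc` is a hypothetical helper, not part of Scikit-Learn.

```python
def pairwise_auc(y_true, y_score):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


print(pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

Here three of the four positive/negative pairs are ordered correctly, giving 0.75. This loop is quadratic in the number of samples, so for real data a library routine such as `sklearn.metrics.roc_auc_score` is the better choice.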

## Learn more
* XGBoost documentation: https://xgboost.readthedocs.io/en/latest/python/python_intro.html#
* Dask-XGBoost documentation: http://dask-ml.readthedocs.io/en/latest/xgboost.html
* A blogpost on dask-xgboost: http://matthewrocklin.com/blog/work/2017/03/28/dask-xgboost
* Similar example with real world dataset: https://dask-ml.readthedocs.io/en/latest/examples/xgboost.html
* Recorded screencast stepping through a similar example: https://www.youtube.com/watch?v=Cc4E-PdDSro 