This example is based on the one at https://dask-ml.readthedocs.io/en/latest/examples/xgboost.html. That example takes a while to run because it uses a real dataset. A recorded screencast at https://www.youtube.com/watch?v=Cc4E-PdDSro walks through it, with comments from the lead Dask developer, Matthew Rocklin.

The example here is similar but uses synthetic data instead, so that it is easy to run. It uses dask-xgboost, a Dask wrapper around XGBoost.

<img src="http://dask.readthedocs.io/en/latest/_images/dask_horizontal.svg" width="30%" alt="Dask logo"> <img src="https://raw.githubusercontent.com/dmlc/dmlc.github.io/master/img/logo-m/xgboost.png" width="25%" alt="XGBoost logo">

In [None]:
%matplotlib inline

In [None]:
import os
from dask import compute, persist
from dask.distributed import Client, progress

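# Connect to the cluster named by DISTRIBUTED_ADDRESS; if the variable is
# unset, Client(None) starts a local cluster instead
client = Client(os.environ.get("DISTRIBUTED_ADDRESS"))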
client

First we create a bunch of synthetic data:

In [None]:
import dask.array as da
from dask_ml.model_selection import train_test_split
import numpy as np

n, d = int(100e3), 20
da.random.seed(42)
X = da.random.normal(size=(n, d), chunks=n // 100)
w_star = da.random.uniform(size=d, chunks=d)
w_star = w_star**3
y = da.sign(X @ w_star)

# for binary:logistic objective, only [0, 1] labels allowed
y = (y + 1) / 2
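
To see what this construction produces, here is a small NumPy-only sketch of the same recipe. It is an illustration rather than part of the original example: the names `X_small`, `w_small`, and `y_small` and the reduced size are assumptions, chosen so the sketch runs instantly and does not overwrite the Dask arrays above.

```python
import numpy as np

rng = np.random.default_rng(42)   # seed chosen only for reproducibility
n_small, d_small = 1000, 20       # far fewer rows than the 100,000 used above

X_small = rng.normal(size=(n_small, d_small))
w_small = rng.uniform(size=d_small) ** 3   # cubing pushes most weights toward 0

# sign() gives -1 or +1; (s + 1) / 2 maps those to 0 or 1
y_small = (np.sign(X_small @ w_small) + 1) / 2

print(np.unique(y_small))   # distinct label values
# Share of the total weight carried by the 7 largest weights
print(np.sort(w_small)[-7:].sum() / w_small.sum())
```

Because the weights are cubed, a handful of features should carry most of the signal, which is what the feature-importance plot later tries to recover.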

Now let's split the data into a training set and a test set, so that the model can be evaluated on data it did not see during training.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15)
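
For intuition, a shuffled 15% hold-out can be sketched in plain NumPy. This is not how dask-ml implements `train_test_split`, which operates on Dask collections rather than in-memory arrays. The helper `holdout_split` and the `*_demo` names are hypothetical and exist only for this illustration.

```python
import numpy as np

def holdout_split(X, y, test_size=0.15, seed=0):
    """Shuffle row indices and hold out a fraction of rows as the test set."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_test = int(round(test_size * len(X)))
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

X_demo = np.arange(200).reshape(100, 2)   # 100 rows, 2 features
y_demo = np.arange(100) % 2               # alternating 0/1 labels

Xd_train, Xd_test, yd_train, yd_test = holdout_split(X_demo, y_demo)
print(Xd_train.shape, Xd_test.shape)   # (85, 2) (15, 2)
```

Every row lands in exactly one of the two sets, so the test rows play no part in fitting the model.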

Now, let's try to predict from this data with [dask-xgboost][dxgb]:

[dxgb]:https://github.com/dask/dask-xgboost

In [None]:
import dask_xgboost as dxgb
import xgboost as xgb

dask-xgboost is a small wrapper around xgboost, and will behave the same as xgboost.

During training, it takes care of handling the data and allows xgboost to use its own distributed scheduler.

In [None]:
%%time
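params = {'objective': 'binary:logistic',
          'max_depth': 5, 'eta': 0.01, 'subsample': 0.5,
          'min_child_weight': 0.5}

# The number of boosting rounds is an argument of xgboost.train rather than a
# key in params; dask-xgboost forwards extra keyword arguments to xgboost.train
bst = dxgb.train(client, params, X_train, y_train, num_boost_round=1000)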
bst

The `bst` object is a regular `xgboost.Booster` object, and has all the familiar methods available. For example, we can plot the estimated feature importances next to the ground truth:

In [None]:
import matplotlib.pyplot as plt
import pandas as pd

if isinstance(w_star, da.Array):
    w_star = w_star.compute()
idx_top = np.argsort(w_star)[-7:]
top = w_star[idx_top]

fig, axs = plt.subplots(figsize=(12, 8), ncols=2)
axs[0] = xgb.plot_importance(bst, ax=axs[0], height=0.8, max_num_features=7)
axs[0].grid(False, axis="y")
axs[0].set_title('Estimated feature importance')

df = pd.DataFrame({'feature_importance': top}, index=idx_top)
df.plot.bar(ax=axs[1])
axs[1].set_title('Ground truth feature importance')
plt.show()
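
The ground-truth panel relies on `np.argsort(w_star)[-7:]` to pick the seven largest weights. The short sketch below shows that idiom on a made-up five-element array (the values and the `*_demo` names are assumptions for illustration only).

```python
import numpy as np

w_demo = np.array([0.02, 0.90, 0.10, 0.55, 0.01])   # made-up weights

# argsort returns indices in ascending order of value, so the last k
# entries are the indices of the k largest values
idx_top_demo = np.argsort(w_demo)[-3:]
print(idx_top_demo)   # [2 3 1]
```

When no feature names are supplied, xgboost is expected to label features `f0`, `f1`, and so on by column position, so index `i` in the right-hand plot should correspond to `f{i}` in the left-hand one.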


Or we can plot the ROC curve after we do some predictions:

In [None]:
y_hat = dxgb.predict(client, bst, X_test)
y_hat = client.persist(y_hat)
y_hat

In [None]:
from sklearn.metrics import roc_curve, auc
fig, ax = plt.subplots(figsize=(8, 8))
fpr, tpr, _ = roc_curve(y_test, y_hat)
ax.plot(fpr, tpr, lw=3,
        label='ROC Curve (area = {:.2f})'.format(auc(fpr, tpr)))
ax.plot([0, 1], [0, 1], 'k--', lw=2)

ax.set(
    xlim=(0, 1),
    ylim=(0, 1),
    title="ROC Curve",
    xlabel="False Positive Rate",
    ylabel="True Positive Rate",
)
ax.legend();
plt.show()
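
To make the plot less of a black box, here is a minimal NumPy sketch of what an ROC curve and its area measure: sweep a threshold from the highest score down, and track the true-positive and false-positive rates along the way. It is a simplified illustration that assumes no tied scores, not scikit-learn's implementation. The helper `roc_points` and the `*_demo` names are hypothetical.

```python
import numpy as np

def roc_points(y_true, scores):
    """Return (fpr, tpr) from sweeping a threshold over the scores.

    Simplified: assumes every score is distinct.
    """
    order = np.argsort(-scores, kind="stable")   # highest score first
    y_sorted = y_true[order]
    tps = np.cumsum(y_sorted)        # true positives as the threshold drops
    fps = np.cumsum(1 - y_sorted)    # false positives as the threshold drops
    tpr = np.concatenate([[0.0], tps / tps[-1]])
    fpr = np.concatenate([[0.0], fps / fps[-1]])
    return fpr, tpr

y_roc_demo = np.array([0, 0, 1, 1])
s_roc_demo = np.array([0.1, 0.4, 0.35, 0.8])

fpr_demo, tpr_demo = roc_points(y_roc_demo, s_roc_demo)

# Trapezoidal rule: area under the piecewise-linear (fpr, tpr) curve
area_demo = np.sum(np.diff(fpr_demo) * (tpr_demo[1:] + tpr_demo[:-1]) / 2)
print(area_demo)   # 0.75
```

An area of 0.5 corresponds to the dashed diagonal in the plot (random guessing), and an area of 1.0 to a perfect ranking of positives above negatives.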