# Distributed XGBoost with Dask on CML

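This notebook trains an XGBoost model on a Dask cluster running in Cloudera Machine Learning (CML). We launch a Dask scheduler and workers with the `cdsw` library, load a large dataset into a Dask DataFrame, and train and evaluate the model across the cluster.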


Code examples are drawn in part from the [XGBoost documentation](https://xgboost.readthedocs.io/en/stable/tutorials/dask.html) and [this blog post](https://medium.com/rapids-ai/a-new-official-dask-api-for-xgboost-e8b10f3d1eb7) published by RAPIDS AI. Both are excellent sources for further detail.



In [None]:
!pip3 install xgboost

In [1]:
import os
import time

import cdsw
import xgboost as xgb
import dask.array as da
from dask import dataframe as dd
import dask_ml.model_selection  # import the submodule explicitly; train_test_split is used below

from dask.distributed import Client

## Start up Dask Cluster

In [2]:
dask_scheduler = cdsw.launch_workers(
    n=1,
    cpu=1,
    memory=2,
    code=f"!dask-scheduler --host 0.0.0.0 --dashboard-address 127.0.0.1:8090 --scheduler-file /home/cdsw/_scheduler_/dask.log",
)

# Wait for the scheduler to start.
time.sleep(10)

In [3]:
def get_scheduler_url(dask_scheduler):
    scheduler_workers = cdsw.list_workers()
    scheduler_id = dask_scheduler[0]["id"]
    scheduler_ip = [
        worker["ip_address"] for worker in scheduler_workers if worker["id"] == scheduler_id
    ][0]

    return f"tcp://{scheduler_ip}:8786"

scheduler_url = get_scheduler_url(dask_scheduler)
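The list-indexing lookup above raises a bare `IndexError` if no worker matches the scheduler ID. Below is a minimal sketch of the same lookup with a clearer failure mode. It works on plain lists of dicts shaped like the `id` and `ip_address` fields read above. The `find_worker_ip` name and the sample records are hypothetical, and 8786 is Dask's default scheduler port.

```python
def find_worker_ip(workers, worker_id):
    """Return the IP address of the worker with the given ID."""
    ip = next((w["ip_address"] for w in workers if w["id"] == worker_id), None)
    if ip is None:
        raise LookupError(f"no worker with id {worker_id!r} found")
    return ip


# Hypothetical worker records, shaped like the fields read above.
sample_workers = [
    {"id": "abc123", "ip_address": "10.0.0.5"},
    {"id": "def456", "ip_address": "10.0.0.6"},
]

sample_url = f"tcp://{find_worker_ip(sample_workers, 'def456')}:8786"
print(sample_url)
```

A descriptive error is easier to diagnose than an out-of-range index when the scheduler session has not registered yet.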

In [4]:
N_WORKERS = 3

dask_workers = cdsw.launch_workers(
    n=N_WORKERS,
    cpu=1,
    memory=4,
    code=f"!dask-worker {scheduler_url} --local-directory /home/cdsw/_worker_",
)

# Wait for the workers to start.
time.sleep(10)
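A fixed `time.sleep(10)` can be too short on a busy cluster and needlessly long on an idle one. One alternative is to poll until a readiness check passes. The sketch below is generic: `wait_until` and `fake_ready` are hypothetical names, and what counts as "ready" in your environment is an assumption you would need to supply.

```python
import time


def wait_until(ready, timeout=60.0, interval=1.0):
    """Poll ready() until it returns True or `timeout` seconds have elapsed."""
    deadline = time.monotonic() + timeout
    while True:
        if ready():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Demonstration with a stand-in check that succeeds on the third call.
attempts = {"count": 0}


def fake_ready():
    attempts["count"] += 1
    return attempts["count"] >= 3


became_ready = wait_until(fake_ready, timeout=5.0, interval=0.01)
print(became_ready, attempts["count"])
```

For the workers specifically, Dask's `Client.wait_for_workers` method offers a built-in way to block until a given number of workers have connected, once a client exists.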

In [5]:
client = Client(scheduler_url)

## Get some data

In this example we'll use the [HIGGS dataset](https://archive.ics.uci.edu/ml/datasets/HIGGS) from the UCI Machine Learning Repository. This binary classification dataset contains 11 million samples, each with 28 features. The uncompressed file takes nearly 8 GiB on disk, so it likely won't fit in memory (we've suggested running this notebook with only 2 GiB). This is a prime example of when distributed training serves us well: when local RAM or compute is not sufficient for the task at hand.
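As a rough check on the memory claim, the arithmetic below estimates the in-memory size of the full table. It assumes every value is held as an 8-byte float (the label plus 28 features, 29 columns in all) and ignores index and library overhead, so treat the result as a lower bound.

```python
n_rows = 11_000_000      # samples in the HIGGS dataset
n_cols = 1 + 28          # label column plus 28 feature columns
bytes_per_value = 8      # assumption: each value stored as a 64-bit float

total_bytes = n_rows * n_cols * bytes_per_value
total_gib = total_bytes / 2**30
print(f"{total_gib:.2f} GiB")  # prints "2.38 GiB"
```

Even this lower bound exceeds the 2 GiB suggested for the session, before counting any working copies made during training.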


In the cells below we download the dataset and decompress it (Dask DataFrames don't play nicely with gzip-compressed files).

In [13]:
# download from the UCI archives
!curl https://archive.ics.uci.edu/ml/machine-learning-databases/00280/HIGGS.csv.gz --output HIGGS.csv.gz

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 2685M  100 2685M    0     0  85.5M      0  0:00:31  0:00:31 --:--:-- 97.1M


In [16]:
!gzip -d HIGGS.csv.gz

Next we load the data into a Dask DataFrame. From the data description, we know the first column is the label and all other columns are features.

In [6]:
colnames = ['label'] + ['feature-%02d' % i for i in range(1, 29)]
dask_df = dd.read_csv("HIGGS.csv", header=None, names=colnames)

We now have a Dask DataFrame. These objects mimic much, but not all, of the functionality of a traditional pandas DataFrame. We can take a look at the object we have, but it's not very interesting yet. This is because Dask operations are lazily evaluated: no computation is performed until we explicitly ask for it by calling `.compute()`. Dask is _lazy_ in everything it does, including reading in the actual values of the dataset.
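To make the idea of lazy evaluation concrete, here is a toy sketch in plain Python. It is an analogy, not Dask's implementation: the `Lazy` class and `load_partition` function are hypothetical stand-ins that record work and only run it when `compute()` is called.

```python
class Lazy:
    """Toy deferred value: stores a function and its inputs, runs nothing until compute()."""

    def __init__(self, func, *args):
        self.func = func
        self.args = args

    def compute(self):
        # Evaluate any Lazy inputs first, then apply the stored function.
        resolved = [a.compute() if isinstance(a, Lazy) else a for a in self.args]
        return self.func(*resolved)


calls = []  # records which partitions were actually loaded


def load_partition(i):
    calls.append(i)
    return [float(i)] * 4  # pretend these are the partition's values


partition = Lazy(load_partition, 3)
total = Lazy(sum, partition)

ran_before_compute = list(calls)  # still empty: nothing has been loaded yet
result = total.compute()          # now the load and the sum both run
print(ran_before_compute, result, calls)
```

Building `total` only describes the work. The partition is loaded exactly once, at the moment `compute()` is called.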

In [7]:
dask_df

Dask DataFrame Structure:
columns: label, feature-01, feature-02, ..., feature-28 (29 columns)
npartitions=125
dtype: float64 for every column
values: ... (not yet computed)


Dask is, however, especially good at working out how to read data in once asked to do so. Specifically, Dask has determined that these 11 million samples should be chunked into 125 partitions of about 88,000 samples each. Because none of the data has been loaded into memory yet, we see only the structure of the DataFrame rather than its values.
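The per-partition figure follows from simple division, as the short sketch below shows. It assumes rows are spread evenly across partitions. Dask's CSV reader splits the file by byte ranges rather than by row counts, so real partitions will only approximately match this number.

```python
n_rows = 11_000_000   # samples in the dataset
n_partitions = 125    # partitions reported by Dask above

rows_per_partition, leftover = divmod(n_rows, n_partitions)
print(rows_per_partition, leftover)  # prints "88000 0"
```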

To see the values themselves we must force Dask to execute a computation. Calling `head()` on the DataFrame runs a `.compute()` operation under the hood, which loads data from the first partition and displays the first 5 rows.

In [28]:
dask_df.head()

Unnamed: 0,label,feature-01,feature-02,feature-03,feature-04,feature-05,feature-06,feature-07,feature-08,feature-09,...,feature-19,feature-20,feature-21,feature-22,feature-23,feature-24,feature-25,feature-26,feature-27,feature-28
0,1.0,0.869293,-0.635082,0.22569,0.32747,-0.689993,0.754202,-0.248573,-1.092064,0.0,...,-0.010455,-0.045767,3.101961,1.35376,0.979563,0.978076,0.920005,0.721657,0.988751,0.876678
1,1.0,0.907542,0.329147,0.359412,1.49797,-0.31301,1.095531,-0.557525,-1.58823,2.173076,...,-1.13893,-0.000819,0.0,0.30222,0.833048,0.9857,0.978098,0.779732,0.992356,0.798343
2,1.0,0.798835,1.470639,-1.635975,0.453773,0.425629,1.104875,1.282322,1.381664,0.0,...,1.128848,0.900461,0.0,0.909753,1.10833,0.985692,0.951331,0.803252,0.865924,0.780118
3,0.0,1.344385,-0.876626,0.935913,1.99205,0.882454,1.786066,-1.646778,-0.942383,0.0,...,-0.678379,-1.360356,0.0,0.946652,1.028704,0.998656,0.728281,0.8692,1.026736,0.957904
4,1.0,1.105009,0.321356,1.522401,0.882808,-1.205349,0.681466,-1.070464,-0.921871,0.0,...,-0.373566,0.113041,0.0,0.755856,1.361057,0.98661,0.838085,1.133295,0.872245,0.808487


### Train/Test split

Next, we need to perform a train/test split. Dask ML can help with its scikit-learn integration.

In [32]:
# Work with only the first partition to keep this example small.
dask_df_sm = dask_df.partitions[0]

In [37]:
y = dask_df_sm['label']
X = dask_df_sm[dask_df_sm.columns.difference(['label'])]

In [41]:
X_train, X_test, y_train, y_test = dask_ml.model_selection.train_test_split(X, y, shuffle=True)

The call below executes the entire chain of commands we've strung together so far: it reads the first partition of data into a Dask DataFrame, splits off the target column, performs a train/test split, and displays the resulting training labels.

In [39]:
y_train.compute()

77948    1.0
79525    0.0
80928    1.0
4024     1.0
82718    1.0
        ... 
6607     1.0
66917    0.0
10456    1.0
82585    1.0
58665    1.0
Name: label, Length: 78923, dtype: float64

In [40]:
y_test.compute()

5635     0.0
85983    0.0
13269    0.0
33459    1.0
21220    0.0
        ... 
32804    1.0
13068    0.0
2677     1.0
45363    0.0
6494     0.0
Name: label, Length: 9077, dtype: float64

Good -- these two subsets of the dataset look like they contain different pieces of information and we still haven't needed to load the entire dataset into memory! 
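A quick sanity check on the two lengths printed above: they should add up to the size of the partition we split, and the test share tells us the effective split ratio. The sketch below only redoes that arithmetic with the reported numbers. The roughly 10% figure is what those lengths imply, since no `test_size` was passed.

```python
n_train = 78_923   # length reported for y_train above
n_test = 9_077     # length reported for y_test above

n_total = n_train + n_test
test_fraction = n_test / n_total
print(n_total, f"{test_fraction:.3f}")  # prints "88000 0.103"
```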


 

## From Dask DataFrames to DMatrices

This implementation of XGBoost requires data in a specialized format called a DMatrix, an internal data structure optimized for both memory efficiency and training speed.

Below, we convert a Dask DataFrame to a `DaskDMatrix`. Under the hood, this process may place one or more DataFrame partitions onto one or more DMatrix objects.

<img src="https://miro.medium.com/max/1400/0*AX-9WEYvaCI2h86I">

This takes a while because we are literally moving the data around now, and there's a lot of data to shuffle!

In [42]:
# X and y must be Dask dataframes or arrays

dtrain = xgb.dask.DaskDMatrix(client, X_train, y_train)

## Train the model
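With the data in an ingestible format, it's time to train our XGBoost model. The training call is similar to a non-distributed call to XGBoost, with one key difference: we now pass in the Dask `Client`, which orchestrates training across the Dask cluster. The parameters below use the `reg:squarederror` objective, so the model outputs continuous scores, which we round to class labels during evaluation.


See the [XGBoost parameter documentation](https://xgboost.readthedocs.io/en/stable/parameter.html) for the full list of model parameters.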

In [86]:
output = xgb.dask.train(
    client,
    {"verbosity": 2, "tree_method": "hist", "objective": "reg:squarederror"},
    dtrain,
    num_boost_round=4,
    evals=[(dtrain, "train")],
)

In [87]:
output

{'booster': <xgboost.core.Booster at 0x7fcc1c2554d0>,
 'history': {'train': OrderedDict([('rmse',
                [0.475212, 0.461451, 0.451909, 0.445539])])}}
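The `history` entry records the evaluation metric after each boosting round. As a small illustration, the sketch below reads the per-round RMSE and computes the improvement between consecutive rounds. The values are copied from the output above. In the notebook itself you would read them from `output["history"]`, and the variable names here are illustrative only.

```python
from collections import OrderedDict

# Values copied from the training output above.
history = {"train": OrderedDict([("rmse", [0.475212, 0.461451, 0.451909, 0.445539])])}

rmse = history["train"]["rmse"]

# Improvement from each round to the next (positive means the error went down).
improvements = [prev - curr for prev, curr in zip(rmse, rmse[1:])]
is_decreasing = all(delta > 0 for delta in improvements)

for round_index, delta in enumerate(improvements, start=1):
    print(f"round {round_index - 1} -> {round_index}: RMSE improved by {delta:.6f}")
```

The shrinking improvements suggest that four rounds is far from converged, so a larger `num_boost_round` is likely worth trying.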

This is the model. Let's save it for later use.


In [None]:
# save the model to disk ("xgb_higgs.json" is a placeholder file name)
output["booster"].save_model("xgb_higgs.json")


### Evaluation

During the iterative process of ML modeling, we'll want to evaluate our model on the training and validation sets to understand where it sits in the bias-variance trade-off. Because our training set is quite large, this is still a good job for a distributed cluster.


We can score our model by passing our `DaskDMatrix` object to the `xgb.dask.predict` function. The result is another Dask Array, so we must call `.compute()` to retrieve a non-distributed object (e.g., a NumPy array).

In [2]:
# read the model back in

In [88]:
prediction = xgb.dask.predict(client, output, dtrain)

In [89]:
pred = prediction.compute()

In [99]:
labels = [round(t) for t in pred]
len(labels)

88000

In [92]:
y_true = y_train.compute()  # ground-truth labels, not predictions

In [101]:
sum(labels == y_true)/len(labels)

0.7045340909090909
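The accuracy computation above rounds each regression output to the nearest class and compares it with the true labels. Below is a small self-contained sketch of that step, using made-up scores and labels. It uses an explicit 0.5 threshold instead of `round`, because Python's built-in `round` sends exact halves to the nearest even integer (`round(0.5)` is 0 while `round(1.5)` is 2). The `accuracy_at_threshold` name is hypothetical.

```python
def accuracy_at_threshold(scores, labels, threshold=0.5):
    """Fraction of scores whose thresholded class matches the true label."""
    predicted = [1.0 if s >= threshold else 0.0 for s in scores]
    matches = sum(p == t for p, t in zip(predicted, labels))
    return matches / len(labels)


# Hypothetical scores and labels, for illustration only.
toy_scores = [0.91, 0.12, 0.50, 0.47, 0.73]
toy_labels = [1.0, 0.0, 1.0, 1.0, 0.0]

toy_accuracy = accuracy_at_threshold(toy_scores, toy_labels)
print(toy_accuracy)  # prints "0.6"
```

An explicit threshold also makes it easy to trade precision against recall later by moving the cut-off.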

### Validation

Not Implemented yet.

## Inference

Once we have an evaluated model that we're happy with, we can use it for inference like we would any model -- with or without distributed computational resources. 

Not implemented yet

In [3]:
# show how you can call the model without using Dask arrays, etc. 

## Alternative: Scikit-Learn API 

show the same steps above but with the other api

## Shut down workers

In [106]:
cdsw.stop_workers(*[worker["id"] for worker in dask_workers + dask_scheduler])

[<Response [204]>, <Response [204]>, <Response [204]>, <Response [204]>]

distributed.client - ERROR - Failed to reconnect to scheduler after 30.00 seconds, closing client
