<span style="color:#8735fb; font-size:24pt"> **Multi Node Multi-GPU example on SLURM using dask-cuda** </span>

[Dask-CUDA](https://dask-cuda.readthedocs.io/en/latest/) is a library that extends Dask's distributed single-machine [LocalCluster](https://docs.dask.org/en/latest/setup/single-distributed.html#localcluster) to workloads that can leverage multiple GPUs.  
In this notebook, we explore how to leverage multiple GPUs across multiple nodes using XGBoost. 
We will then explore how to use the ForestInference library of cuML to do inference. 

For the purposes of this demo, we will use a part of the NYC Taxi Dataset (only the files from the 2014 calendar year). The goal is to predict the fare amount for a trip given the times and coordinates of the taxi trip.
We will download the data from Google Cloud Storage (GCS), where the dataset is publicly hosted by Anaconda. 

### <span style="color:#8735fb; font-size:22pt"> **Step -1: Start two `dask-cuda-worker`s and one `dask-scheduler` in the SLURM cluster.** </span>

Steps to follow:
- Git clone `rapids-prom` into the SLURM cluster landing pad.
- `cd rapids-prom`, then start two workers and one scheduler for 30 minutes: `./deploy -w 2 -t 00:30:30`
- This starts two Dask-CUDA workers, and each worker has access to all the GPUs on its node.
- Then copy the contents of the scheduler JSON file at `$LOCAL_DIRECTORY/dask-scheduler.json`.
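
Before connecting, it can help to sanity-check the copied scheduler file. The sketch below uses only the standard library; the helper name `scheduler_address` and the sample host and port are made up for illustration. It parses the JSON and checks that the address has a scheme, a host and a port.

```python
import json
from urllib.parse import urlparse


def scheduler_address(scheduler_json: str) -> str:
    """Return the scheduler address from the JSON text, or raise ValueError."""
    info = json.loads(scheduler_json)
    address = info["address"]
    parsed = urlparse(address)
    # A usable address looks like tcp://<host>:<port>
    if not parsed.scheme or not parsed.hostname or parsed.port is None:
        raise ValueError(f"malformed scheduler address: {address!r}")
    return address


# Made-up example values; replace with the contents of your own scheduler file.
sample = '{"type": "Scheduler", "address": "tcp://10.0.0.1:8786", "workers": {}}'
print(scheduler_address(sample))
```

An address with a single slash after the scheme (for example `tcp:/host:port`) has no host component and is rejected by this check.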

### <span style="color:#8735fb; font-size:22pt"> **Step 0: Import Stuff** </span>

In [None]:
from dask.distributed import Client, WorkerPlugin, wait, progress, get_worker
from dask_cuda import LocalCUDACluster
import dask_cudf
from dask_ml.model_selection import train_test_split
from cuml.dask.common import utils as dask_utils
from cuml.metrics import mean_squared_error
from cuml import ForestInference
import cudf
import xgboost as xgb
from datetime import datetime
from dateutil import parser
import numpy as np
from timeit import default_timer as timer
import dask

<span style="color:#8735fb; font-size:22pt"> **Step 1: Set up the SLURM CUDA Cluster** </span>

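Paste the contents of the scheduler JSON file copied in Step -1 into the cell below, so that the client can connect to the workers and scheduler spawned in the cluster.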

In [None]:
json = """
{
  "type": "Scheduler",
  "id": "Scheduler-8a1d6ceb-d1cb-431f-a547-5b2212371be8",
  "address": "tcp://<host>:<port>",
  "services": {
    "dashboard": 8787
  },
  "started": 1621361462.335848,
  "workers": {}
}
"""
with open("dask-scheduler.json", "w") as f:
    f.write(json)

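If the SLURM cluster is too busy to allocate nodes, a `LocalCUDACluster` on a single machine can be used instead, as in the commented-out cell below.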

In [None]:
# CUDA_VISIBLE_DEVICES = [0,1,2,3,4,5,6,7] # this is if we were to host a localCUDA cluster.
# npartitions = len(CUDA_VISIBLE_DEVICES) #8
# clust = LocalCUDACluster(host="", 
#                          CUDA_VISIBLE_DEVICES=CUDA_VISIBLE_DEVICES,
#                          n_workers=npartitions,
#                          scheduler_port = 8786)
# client = Client(clust)
# clust

In [None]:
# Connect to the scheduler started in Step -1 and use one partition per Dask worker.
client = Client(scheduler_file="./dask-scheduler.json")
npartitions = len(client.has_what().keys())
client

In [None]:
def pretty_print(scheduler_dict):
    print(f"All workers for scheduler id: {scheduler_dict['id']}, address: {scheduler_dict['address']}")
    for worker in scheduler_dict['workers']:
        print(f"Worker: {worker} , gpu_machines: {scheduler_dict['workers'][worker]['gpu']}")

pretty_print(client.scheduler_info()) # lists every worker registered with the scheduler and its GPU info

<span style="color:#8735fb; font-size:22pt"> **Step 2: Data Setup, Cleanup and Enhancement** </span>

### <span style="color:#8735fb; font-size:18pt"> Step 2.a: Set the path for downloading the data from GCS </span>

In [None]:
model_path = './trained_model_xgb_nyctaxi_gcs.xgb'
taxi_data_csv_path = "gcs://anaconda-public-data/nyc-taxi/csv"
taxi_data_local = "./nyc.parquet"

Let's look at the data locally to see what we're dealing with. We will use the data from 2014 for the purposes of the demo. There are columns for pickup and dropoff times, distance, latitude, longitude, and so on. This is the information we'll use to estimate the trip fare amount.

### <span style="color:#8735fb; font-size:18pt"> Step 2.b: Data Cleanup, Enhancement and Persisting Scripts </span>

The data needs to be cleaned up before it can be used in a meaningful way. We first rename some columns to cleaner names (for instance, some of the years have `tpep_dropoff_datetime` instead of `dropoff_datetime`). We also define the datatype each column needs to be read as.
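
The renaming step can be sketched in plain Python, without cuDF. The helper `normalize_columns` and the sample header are hypothetical; the mapping mirrors the `remap` dictionary used by the loader in this notebook.

```python
def normalize_columns(columns, remap):
    """Strip and lower-case column names, then apply the rename mapping."""
    cleaned = [col.strip().lower() for col in columns]
    return [remap.get(col, col) for col in cleaned]


remap = {
    "tpep_pickup_datetime": "pickup_datetime",
    "tpep_dropoff_datetime": "dropoff_datetime",
    "ratecodeid": "rate_code",
}

# Hypothetical raw header with stray spaces and mixed case
raw_header = [" tpep_pickup_datetime", "tpep_dropoff_datetime", "RateCodeID", "fare_amount"]
print(normalize_columns(raw_header, remap))
```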

We'll add new features by making use of "user defined functions" on the dataframe. We'll use [apply_rows](https://docs.rapids.ai/api/cudf/stable/api.html#cudf.core.dataframe.DataFrame.apply_rows), which is similar to Pandas' apply function. The `apply_rows` operation is [JIT compiled by Numba](https://numba.pydata.org/numba-doc/dev/cuda/kernels.html) into GPU kernels. 
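
To make the calling convention concrete, here is a plain-Python stand-in that runs on lists, with no GPU and no cuDF. It assumes that kernel arguments are matched by name against the input columns, the output columns and the extra keyword arguments; `emulate_apply_rows` and `scaled_sum_kernel` are hypothetical names used only for this illustration.

```python
def emulate_apply_rows(columns, kernel, incols, outcols, kwargs):
    """Plain-Python stand-in for the apply_rows calling convention (no GPU, no cuDF)."""
    n_rows = len(columns[incols[0]])
    # One pre-allocated output column per name in outcols
    outputs = {name: [0.0] * n_rows for name in outcols}
    # Input columns, output columns and extra scalars are all passed by keyword
    kernel(**{name: columns[name] for name in incols}, **outputs, **kwargs)
    return {**columns, **outputs}


def scaled_sum_kernel(a, b, total, scale):
    # Same shape as the kernels in this notebook: loop over rows, write into the output
    for i, (x, y) in enumerate(zip(a, b)):
        total[i] = (x + y) * scale


table = {"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0]}
result = emulate_apply_rows(table, scaled_sum_kernel, incols=["a", "b"],
                            outcols=["total"], kwargs={"scale": 2.0})
print(result["total"])
```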

The kernels we define are:
1. Haversine distance: This is used for calculating the total trip distance.

2. Day of the week: This can be useful information for determining the fare cost.

The `add_features` function combines the two to produce a new dataframe that has the added features.
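
When testing GPU kernels like these, a CPU reference is handy. The sketch below is plain Python using only the standard library, and the function names are hypothetical. The weekday function uses Zeller's congruence, whose convention (0 = Saturday, 1 = Sunday) agrees with the notebook's `is_weekend = day_of_week < 2` test.

```python
from datetime import date
from math import asin, cos, radians, sin, sqrt


def haversine_km(lat1, lon1, lat2, lon2, radius=6371.0):
    """Great-circle distance in kilometers between two (latitude, longitude) points in degrees."""
    lat1, lon1, lat2, lon2 = (radians(v) for v in (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * asin(sqrt(a)) * radius


def zeller_day_of_week(year, month, day):
    """Zeller's congruence (Gregorian): 0 = Saturday, 1 = Sunday, ..., 6 = Friday."""
    if month < 3:
        # January and February are treated as months 13 and 14 of the previous year
        month += 12
        year -= 1
    k = year % 100   # year within the century
    j = year // 100  # zero-based century
    return (day + (13 * (month + 1)) // 5 + k + k // 4 + j // 4 - 2 * j) % 7


# Cross-check against the standard library: date.weekday() uses 0 = Monday
d = date(2014, 1, 1)
assert zeller_day_of_week(d.year, d.month, d.day) == (d.weekday() + 2) % 7

# One degree of latitude is roughly 111 km
print(round(haversine_km(40.0, -74.0, 41.0, -74.0), 1))
```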

#### Adding features functions

In [None]:
import math
from math import cos, sin, asin, sqrt, pi

def haversine_distance_kernel(pickup_latitude_r, pickup_longitude_r, dropoff_latitude_r, dropoff_longitude_r, h_distance, radius):
    for i, (x_1, y_1, x_2, y_2) in enumerate(zip(pickup_latitude_r, pickup_longitude_r, dropoff_latitude_r, dropoff_longitude_r,)):
        x_1 = pi/180 * x_1
        y_1 = pi/180 * y_1
        x_2 = pi/180 * x_2
        y_2 = pi/180 * y_2
        
        dlon = y_2 - y_1
        dlat = x_2 - x_1
        a = sin(dlat/2)**2 + cos(x_1) * cos(x_2) * sin(dlon/2)**2
        
        c = 2 * asin(sqrt(a)) 
        # radius = 6371 # Radius of earth in kilometers # currently passed as input arguments
        
        h_distance[i] = c * radius

def day_of_the_week_kernel(day, month, year, day_of_week):
    # Zeller's congruence: 0 = Saturday, 1 = Sunday, ..., 6 = Friday
    for i, (d_1, m_1, y_1) in enumerate(zip(day, month, year)):
        if month[i] <3:
            # January and February count as months 13 and 14 of the previous year
            shift = 12
        else:
            shift = 0
        Y = year[i] - (month[i] < 3)
        y = Y - 2000
        c = 20
        d = day[i]
        m = month[i] + shift + 1
        day_of_week[i] = (d + math.floor(m*2.6) + y + (y//4) + (c//4) -2*c)%7
        
def add_features(df):
    df['hour'] = df['pickup_datetime'].dt.hour
    df['year'] = df['pickup_datetime'].dt.year
    df['month'] = df['pickup_datetime'].dt.month
    df['day'] = df['pickup_datetime'].dt.day
    df['diff'] = (df['dropoff_datetime'] - df['pickup_datetime']).dt.seconds #convert difference between pickup and dropoff into seconds
    
    df['pickup_latitude_r'] = df['pickup_latitude']//.01*.01
    df['pickup_longitude_r'] = df['pickup_longitude']//.01*.01
    df['dropoff_latitude_r'] = df['dropoff_latitude']//.01*.01
    df['dropoff_longitude_r'] = df['dropoff_longitude']//.01*.01
    
    df = df.drop('dropoff_datetime', axis=1)
    df = df.drop('pickup_datetime', axis =1)
    
    
    df = df.apply_rows(haversine_distance_kernel,
                   incols=['pickup_latitude_r', 'pickup_longitude_r', 'dropoff_latitude_r', 'dropoff_longitude_r'],
                   outcols=dict(h_distance=np.float32),
                   kwargs=dict(radius=6371))
    
    
    df = df.apply_rows(day_of_the_week_kernel,
                      incols=['day', 'month', 'year'],
                      outcols=dict(day_of_week=np.float32),
                      kwargs=dict())
    
    
    df['is_weekend'] = (df['day_of_week']<2)
    return df

#### Functions for cleaning and persisting the data in the workers.

In [None]:
def persist_train_infer_split(client, df, response_dtype, response_id, infer_frac=1.0, random_state=42, shuffle=True):
    workers = client.has_what().keys()
    X, y = df.drop([response_id], axis=1), df[response_id].astype('float32')
    infer_frac = max(0, min(infer_frac, 1.0))
    X_train, X_infer, y_train, y_infer = train_test_split(X, y, shuffle=shuffle, random_state=random_state, test_size=infer_frac)
    
    with dask.annotate(workers=set(workers)):
        X_train, y_train = client.persist(
            collections=[X_train, y_train]) 
    
    if (infer_frac != 1.0):
        with dask.annotate(workers=set(workers)):
            X_infer, y_infer = client.persist(
                collections=[X_infer, y_infer])

        wait([X_train, y_train, X_infer, y_infer])
    else:
        X_infer = X_train
        y_infer = y_train

        wait([X_train, y_train])
    
    return X_train, y_train, X_infer, y_infer


def clean(df_part, remap, must_haves):
    """
    This function performs the various clean up tasks for the data
    and returns the cleaned dataframe.
    """
    tmp = {col:col.strip().lower() for col in list(df_part.columns)}
    df_part = df_part.rename(columns=tmp)
    
    # rename using the supplied mapping
    df_part = df_part.rename(columns=remap)
    
    # iterate through columns in this df partition
    for col in df_part.columns:
        # drop anything not in our expected list
        if col not in must_haves:
            df_part = df_part.drop(col, axis=1)
            continue

        # fixes datetime error found by Ty Mckercher and fixed by Paul Mahler
        if df_part[col].dtype == 'object' and col in ['pickup_datetime', 'dropoff_datetime']:
            df_part[col] = df_part[col].astype('datetime64[ms]')
            continue

        # if column was read as a string, recast as float
        if df_part[col].dtype == 'object':
            df_part[col] = df_part[col].astype('float32')
        else:
            # downcast from 64bit to 32bit types
            # Tesla T4 are faster on 32bit ops
            if 'int' in str(df_part[col].dtype):
                df_part[col] = df_part[col].astype('int32')
            if 'float' in str(df_part[col].dtype):
                df_part[col] = df_part[col].astype('float32')
            df_part[col] = df_part[col].fillna(-1)
            
    return df_part

def taxi_parquet_data_loader(client, data_path, response_dtype=np.float32, infer_frac=1.0, random_state=0):
    # list of column names that need to be re-mapped
    remap = {}
    remap['tpep_pickup_datetime'] = 'pickup_datetime'
    remap['tpep_dropoff_datetime'] = 'dropoff_datetime'
    remap['ratecodeid'] = 'rate_code'

    #create a list of columns & dtypes the df must have
    must_haves = {
     'pickup_datetime': 'datetime64[ms]',
     'dropoff_datetime': 'datetime64[ms]',
     'passenger_count': 'int32',
     'trip_distance': 'float32',
     'pickup_longitude': 'float32',
     'pickup_latitude': 'float32',
     'rate_code': 'int32',
     'dropoff_longitude': 'float32',
     'dropoff_latitude': 'float32',
     'fare_amount': 'float32'
    }

    # apply a list of filter conditions to throw out records with missing or outlier values
    query_fragments = [
        'fare_amount > 0 and fare_amount < 500',
        'passenger_count > 0 and passenger_count < 6',
        'pickup_longitude > -75 and pickup_longitude < -73',
        'dropoff_longitude > -75 and dropoff_longitude < -73',
        'pickup_latitude > 40 and pickup_latitude < 42',
        'dropoff_latitude > 40 and dropoff_latitude < 42'
    ]

    workers = client.has_what().keys()
    response_id = 'fare_amount'
    fields = ['passenger_count', 'trip_distance', 'pickup_longitude', 'pickup_latitude', 'rate_code',
                 'dropoff_longitude', 'dropoff_latitude', 'fare_amount']   
    taxi_data_csv = dask_cudf.read_csv(data_path, chunksize=25e6, npartitions=len(workers))
    taxi_data_csv = clean(taxi_data_csv, remap, must_haves)
    taxi_data_csv = taxi_data_csv.map_partitions(add_features)
    taxi_data_csv = taxi_data_csv.query(' and '.join(query_fragments))
    taxi_data_csv = taxi_data_csv[fields]
    
    return persist_train_infer_split(client, taxi_data_csv, response_dtype, response_id, infer_frac, random_state)
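
The outlier filters in the loader are joined into one query string. As a rough illustration, the same bounds can be written as a plain-Python table and applied to a single record; `BOUNDS`, `build_query` and `is_valid_trip` are hypothetical names, and the sample rows are made up.

```python
# (lower, upper) exclusive bounds per column, mirroring the loader's query fragments
BOUNDS = {
    "fare_amount": (0, 500),
    "passenger_count": (0, 6),
    "pickup_longitude": (-75, -73),
    "dropoff_longitude": (-75, -73),
    "pickup_latitude": (40, 42),
    "dropoff_latitude": (40, 42),
}


def build_query(bounds):
    """Build the ' and '-joined query string from the bounds table."""
    return " and ".join(f"{col} > {lo} and {col} < {hi}" for col, (lo, hi) in bounds.items())


def is_valid_trip(row, bounds=BOUNDS):
    """Return True if every bounded column of the row lies strictly inside its range."""
    return all(lo < row[col] < hi for col, (lo, hi) in bounds.items())


good = {"fare_amount": 12.5, "passenger_count": 1,
        "pickup_longitude": -73.99, "dropoff_longitude": -73.95,
        "pickup_latitude": 40.75, "dropoff_latitude": 40.78}
bad = dict(good, fare_amount=0.0)  # zero fare is treated as an invalid record
print(build_query(BOUNDS))
print(is_valid_trip(good), is_valid_trip(bad))
```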
    

### <span style="color:#8735fb; font-size:18pt"> Step 2.c: Get the Split Data and persist across workers </span>

This step takes a while, since the data has to be read from GCS, cleaned, and persisted across the SLURM cluster nodes.

In [None]:
tic = timer()
X_train, y_train, X_infer, y_infer = taxi_parquet_data_loader(client, f"{taxi_data_csv_path}/2014/yellow_*.csv", infer_frac=0.25, random_state=42)
toc = timer()
print(f"Wall clock time taken for ETL and persisting : {toc-tic} s")

<span style="color:#8735fb; font-size:22pt"> **Step 3: Train an XGBoost Model** </span>

We are now ready to fit an XGBoost model on the data to predict the fare for the trip.

It is always good to check the condition of the GPUs you have.

In [None]:
# !nvidia-smi
pretty_print(client.scheduler_info()) # lists every worker registered with the scheduler and its GPU info

### <span style="color:#8735fb; font-size:18pt"> Step 3.a: Set training Parameters </span>

We will use RMSE as the evaluation metric for this problem. Note that, ideally, we should perform hyperparameter optimization (HPO) to find the best parameters. 
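
For reference, RMSE is the square root of the mean squared difference between targets and predictions. A minimal plain-Python sketch follows; the helper `rmse` and the sample fares are made up for illustration.

```python
from math import sqrt


def rmse(y_true, y_pred):
    """Root mean squared error of two equally long, non-empty sequences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sqrt(sum(squared_errors) / len(squared_errors))


# Made-up fares: errors are 1, 0 and 2, so the mean squared error is 5/3
print(rmse([10.0, 12.0, 8.0], [11.0, 12.0, 6.0]))
```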

Refer to the notebooks in the repository for how to perform automated HPO [using RayTune](https://github.com/rapidsai/cloud-ml-examples/blob/main/ray/notebooks/Ray_RAPIDS_HPO.ipynb) and [using Optuna](https://github.com/rapidsai/cloud-ml-examples/blob/main/optuna/notebooks/optuna_rapids.ipynb).

In [None]:
params = {
    'learning_rate': 0.15,
    'max_depth': 8,
    'objective': 'reg:squarederror',
    'subsample': 0.7,
    'colsample_bytree': 0.7,
    'min_child_weight': 1,
    'gamma': 1,
    'silent': True,
    'verbose_eval': True,
    'booster' : 'gbtree', # 'gblinear' not implemented in dask
    'eval_metric': 'rmse',
    'tree_method':'gpu_hist',
    'num_boost_rounds': 100
}


### <span style="color:#8735fb; font-size:18pt"> Step 3.b: Train XGBoost Model </span>

This will be relatively fast, since the data is already persisted on the SLURM cluster.

In [None]:
data_train = xgb.dask.DaskDMatrix(client, X_train, y_train)
tic = timer()
xgboost_output = xgb.dask.train(client, params,data_train, 
                                    num_boost_round=params['num_boost_rounds'])
xgb_cuda_model = xgboost_output['booster']
toc = timer()
print(f"Wall clock time taken for this cell : {toc-tic} s")

### <span style="color:#8735fb; font-size:18pt"> Step 3.c: Save the Model to disk </span>

In [None]:
model_filename = 'trained-model_nyctaxi.xgb'
xgb_cuda_model.save_model(model_filename)

### <span style="color:#8735fb; font-size:22pt"> **Step 4: Predict & Score using vanilla XGBoost Predict** </span>

In [None]:
_y_test = y_infer.compute()  # compute() blocks until the result is on the client, so no wait() is needed

In [None]:
d_test = xgb.dask.DaskDMatrix(client, X_infer)
tic = timer()
y_pred = xgb.dask.predict(client, xgb_cuda_model, d_test)
y_pred = y_pred.compute()  # compute() blocks, so no separate wait() is needed
toc = timer()
print(f"Wall clock time taken for xgb.dask.predict : {toc-tic} s")

#### Inference with inplace predict of dask XGBoost

In [None]:
tic = timer()
y_pred = xgb.dask.inplace_predict(client, xgb_cuda_model, X_infer)
y_pred = y_pred.compute()  # compute() blocks, so no separate wait() is needed
toc = timer()
print(f"Wall clock time taken for inplace inference : {toc-tic} s")

In [None]:
tic = timer()
print("Calculating MSE")
score = mean_squared_error(y_pred, _y_test)
print("Workflow Complete - RMSE: ", np.sqrt(score))
toc = timer()
print(f"Wall clock time taken for this cell : {toc-tic} s")

### <span style="color:#8735fb; font-size:22pt"> **Step 5: Predict & Score using FIL or Forest Inference Library** </span>

[ForestInference](https://docs.rapids.ai/api/cuml/stable/api.html?highlight=forestinference#cuml.ForestInference) provides GPU accelerated inference capabilities for tree models. 
It accepts a **trained** tree model in a treelite format (currently LightGBM, XGBoost and SKLearn GBDT and random forest models are supported). 
However, you cannot use it to train anything. 

In [None]:
from cuml import ForestInference
from dask.distributed import get_worker

If the test set (`X_infer` here) is small enough, we can call `compute()` on it. In general, if the test set is large, it is better to persist it on the different workers and then call predict on each worker separately. We show both. 

### <span style="color:#8735fb; font-size:18pt"> Step 5.a: Predict using `compute` on a single worker in case the test dataset is small. </span>

In [None]:
tic = timer()
X_test_computed = X_infer.compute()  # compute() blocks until the data is on the client
loaded_model = ForestInference.load(model_filename, model_type='xgboost')
toc = timer()
print(f"Wall clock time taken for this cell : {toc-tic} s")

In [None]:
tic = timer()
fil_pred = loaded_model.predict(X_test_computed)
# the RMSE for these predictions is computed in the next cell
toc = timer()
print(f"Wall clock time taken for this cell : {toc-tic} s")

In [None]:
tic=timer()
score = mean_squared_error(fil_pred, _y_test)
print("Final - RMSE: ", np.sqrt(score))
toc = timer()
print(f"Wall clock time taken for this cell : {toc-tic} s")

Compare this RMSE with the one obtained without FIL. The two should be approximately the same.

### <span style="color:#8735fb; font-size:18pt"> Step 5.b: Predict using `persist` on multiple workers in case the test dataset is huge. </span>

We cannot load the model with FIL on the client only, since each Dask worker needs the XGBoost model file in order to load it with FIL. Therefore we send the XGBoost model to the workers. 

In [None]:
workers = client.has_what().keys()
print(workers)
n_workers = len(workers)
n_partitions = n_workers

In [None]:
def unzipFile(zipname):
    worker = get_worker()
    import zipfile
    import os
    with zipfile.ZipFile(os.path.join(worker.local_directory, zipname)) as zf:
        zf.extractall(worker.local_directory)

def checkOrMakeLocalDir():
    worker = get_worker()
    import os
    if not os.path.exists(worker.local_directory):
        os.makedirs(worker.local_directory)
    
def workerModelInit(model_file):   
    # this function will run in each worker and initialize the worker 
    import os
    worker = get_worker()
    worker.data["fil_model"] = ForestInference.load(filename=os.path.join(worker.local_directory, model_file),model_type='xgboost')
    
def predict(input_df):
    # this function will run in each worker and predict 
    worker = get_worker()
    return worker.data["fil_model"].predict(input_df)

def persistModelinWorkers(client, zip_file_name, model_file_name):
    import zipfile
    # zip the saved XGBoost model so it can be shipped with client.upload_file
    with zipfile.ZipFile(zip_file_name, mode='w') as zf:
        zf.write(f"./{model_file_name}")
    # client.run and client.upload_file block until every worker has finished
    # check to see if local directory present in workers
    # if not present make it
    client.run(checkOrMakeLocalDir)
    # upload the zip file in workers
    client.upload_file(f"./{zip_file_name}")
    # unzip file in the workers
    client.run(unzipFile, zip_file_name)
    # load model using FIL in workers
    client.run(workerModelInit, model_file_name)
    
    

Persist the `X_infer` in the workers if not already persisted. 

#### <span style="color:#8735fb; font-size:14pt"> We cannot serialize a ForestInference model directly and send it to the workers. However, the client object has a method [client.upload_file](https://distributed.dask.org/en/latest/api.html#distributed.Client.upload_file) that allows the client to send a .py/.egg/.zip file. Therefore we can send the zipped `xgboost` model, then unzip and load it in the Dask workers. The `xgboost` model will be stored in the `worker.local_directory` directory. Next, we can load the `xgboost` model using FIL individually in each worker. </span>
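
The zip-and-extract round trip can be tried locally before involving the cluster. The sketch below uses only the standard library: a temporary folder stands in for a worker's local directory, the file is a dummy rather than a real model, and the helper names `zip_model` and `unzip_model` are hypothetical.

```python
import os
import tempfile
import zipfile


def zip_model(model_path, zip_path):
    """Write the model file into a zip archive under its base name."""
    with zipfile.ZipFile(zip_path, mode="w") as zf:
        zf.write(model_path, arcname=os.path.basename(model_path))
    return zip_path


def unzip_model(zip_path, target_dir):
    """Extract the archive into target_dir and return the extracted file names."""
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(target_dir)
    return sorted(os.listdir(target_dir))


with tempfile.TemporaryDirectory() as tmp:
    # Dummy bytes standing in for a saved XGBoost model
    model_path = os.path.join(tmp, "dummy_model.xgb")
    with open(model_path, "wb") as f:
        f.write(b"not a real model")
    zip_path = zip_model(model_path, os.path.join(tmp, "model.zip"))
    # Stand-in for worker.local_directory
    worker_dir = os.path.join(tmp, "worker-local-directory")
    os.makedirs(worker_dir)
    extracted = unzip_model(zip_path, worker_dir)
    print(extracted)
```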

In [None]:
%%time
persistModelinWorkers(client, "zipfile_write.zip", "trained-model_nyctaxi.xgb")

### If local directory not present, create the directory for storage.

In [None]:
tic = timer()
predictions = X_infer.map_partitions(predict, meta="float") # each worker predicts on its own partitions with its local FIL model
y_pred = predictions.compute()  # compute() gathers the per-partition predictions on the client
toc = timer()
print(f"Wall clock time taken for this cell : {toc-tic} s")

In [None]:
rows_csv = X_infer.iloc[:,0].shape[0].compute()
print(f"It took {toc-tic} seconds to predict on {rows_csv} rows using FIL distributed across the workers")

In [None]:
tic = timer()
score = mean_squared_error(y_pred, _y_test)
toc = timer()
print("Final - RMSE: ", np.sqrt(score))

### <span style="color:#8735fb; font-size:22pt"> **Step 6: Clean up** </span>

In [None]:
client.close()
# clust.close()

https://distributed.dask.org/en/latest/limitations.html