# Environment Sanity Check #

Click the _Runtime_ dropdown at the top of the page, then _Change Runtime Type_ and confirm the instance type is _GPU_.

Check the output of `!nvidia-smi` to make sure you've been allocated a Tesla T4, P4, or P100.
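
The same check can be scripted. The sketch below is a minimal example, assuming `nvidia-smi` is on the PATH and accepts the `--query-gpu=name --format=csv,noheader` query; the helper names `is_supported_gpu` and `allocated_gpu_names` are hypothetical and not part of any RAPIDS tooling.

```python
import shutil
import subprocess

# GPU models named in the text above
SUPPORTED = ("T4", "P4", "P100")


def is_supported_gpu(name: str) -> bool:
    """Return True if the GPU name contains a supported model as a whole token."""
    tokens = name.replace("-", " ").split()
    return any(model in tokens for model in SUPPORTED)


def allocated_gpu_names() -> list:
    """Ask nvidia-smi for GPU names; return [] if the tool is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        capture_output=True, text=True, check=False,
    )
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


for gpu in allocated_gpu_names():
    print(gpu, "-> supported" if is_supported_gpu(gpu) else "-> not supported")
```

Matching on whole tokens rather than substrings keeps a name such as "Tesla P40" from being mistaken for a P4.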

In [None]:
!nvidia-smi

Tue Nov 30 13:54:23 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.44       Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   44C    P0    29W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

# Setup xgboost, dask-ml, RAPIDS
The setup script:
1. Updates gcc in Colab.
1. Installs Conda.
1. Installs the current stable version of the RAPIDS libraries, along with some external libraries, including:
  1. cuDF
  1. cuML
  1. cuGraph
  1. cuSpatial
  1. cuSignal
  1. BlazingSQL
  1. xgboost
1. Copies the RAPIDS .so files into the current working directory, a necessary workaround for RAPIDS+Colab integration.
1. Installs additional modules (dask-ml, pytorch lightning) using pip to avoid extremely slow conda environment checks. The modules will be visible in conda, as will the native Colab modules.

Overall, expect the installs to take 20-30 minutes -- **they will also require multiple restarts of the Colab instance**, so if you are running this on your own, be prepared to monitor this process pretty closely. Once the conda installs are running, you should be OK, unless you have changed something elsewhere in the environment.
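
After the restarts it is easy to lose track of what has actually been installed. A minimal sketch for checking this without importing the (heavy) libraries is below; the module names are my assumption of how the libraries listed above are imported, and `missing_modules` is a hypothetical helper.

```python
import importlib.util

# Import names assumed to correspond to the libraries listed above
EXPECTED_MODULES = ["cudf", "cuml", "cugraph", "cuspatial", "cusignal", "xgboost", "dask_ml"]


def missing_modules(names):
    """Return the subset of top-level module names not found on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(EXPECTED_MODULES)
if missing:
    print("Not installed yet:", ", ".join(missing))
else:
    print("All expected modules are importable.")
```

`find_spec` only locates a module; it does not load it, so this check is cheap to rerun after each restart.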

In [None]:
# This gets the RAPIDS-Colab install files and checks your GPU.  Run this and the next cell only.
# Please read the output of this cell.  If your Colab instance is not RAPIDS compatible, it will warn you and give you remediation steps.
!git clone https://github.com/rapidsai/rapidsai-csp-utils.git
!python rapidsai-csp-utils/colab/env-check.py

fatal: destination path 'rapidsai-csp-utils' already exists and is not an empty directory.
***********************************************************************
Woo! Your instance has the right kind of GPU, a Tesla P100-PCIE-16GB!
***********************************************************************



In [None]:
# This will update the Colab environment and restart the kernel.  Don't run the next cell until you see the session crash.
!bash rapidsai-csp-utils/colab/update_gcc.sh
import os
os._exit(00)

Updating your Colab environment.  This will restart your kernel.  Don't Panic!
Hit:1 http://archive.ubuntu.com/ubuntu bionic InRelease
Hit:2 http://ppa.launchpad.net/c2d4u.team/c2d4u4.0+/ubuntu bionic InRelease
Get:3 http://archive.ubuntu.com/ubuntu bionic-updates InRelease [88.7 kB]
Hit:4 http://ppa.launchpad.net/cran/libgit2/ubuntu bionic InRelease
Hit:5 http://ppa.launchpad.net/deadsnakes/ppa/ubuntu bionic InRelease
Get:6 http://archive.ubuntu.com/ubuntu bionic-backports InRelease [74.6 kB]
Hit:7 http://ppa.launchpad.net/graphics-drivers/ppa/ubuntu bionic InRelease
Hit:8 http://ppa.launchpad.net/ubuntu-toolchain-r/test/ubuntu bionic InRelease
Get:9 https://cloud.r-project.org/bin/linux/ubuntu bionic-cran40/ InRelease [3,626 B]
Get:10 http://security.ubuntu.com/ubuntu bionic-security InRelease [88.7 kB]
Ign:11 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  InRelease
Ign:12 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86

In [None]:
!pip install dask-ml rasterio ipython-autotime
%load_ext autotime

Collecting numpy>=1.20.0
  Using cached numpy-1.20.3-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (15.3 MB)
Installing collected packages: numpy
  Attempting uninstall: numpy
    Found existing installation: numpy 1.21.2
    Uninstalling numpy-1.21.2:
      Successfully uninstalled numpy-1.21.2
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
dask-cudf 21.10.1 requires cupy-cuda114, which is not installed.
cuspatial 21.10.0 requires cython, which is not installed.
cuml 21.10.2 requires cython, which is not installed.
cugraph 21.10.0+0.g84617024.dirty requires cython, which is not installed.
cudf 21.10.1 requires cupy-cuda112, which is not installed.
cudf 21.10.1 requires Cython<0.30,>=0.29, which is not installed.
cudf-kafka 21.10.1 requires cython, which is not installed.
dask-cudf 21.10.1 requires dask==2021.09.1, but you have dask 2021.11.2 

In [None]:
# This will install CondaColab.  This will restart your kernel one last time.  Run this cell by itself and only run the next cell once you see the session crash.
import condacolab
condacolab.install()

✨🍰✨ Everything looks OK!
time: 12.3 ms (started: 2021-11-30 20:14:36 +00:00)


In [None]:
# you can now run the rest of the cells as normal
import condacolab
condacolab.check()

✨🍰✨ Everything looks OK!
time: 3.53 ms (started: 2021-11-30 20:14:36 +00:00)


In [None]:
# Installing RAPIDS is now 'python rapidsai-csp-utils/colab/install_rapids.py <release> <packages>'
# The <release> options are 'stable' and 'nightly'.  Leaving it blank or adding any other words will default to stable.
# The <packages> option is either blank (the default) or 'core'.  By default, we install RAPIDSAI and BlazingSQL.  The 'core' option installs only RAPIDSAI, without BlazingSQL.
!python rapidsai-csp-utils/colab/install_rapids.py stable
import os
os.environ['NUMBAPRO_NVVM'] = '/usr/local/cuda/nvvm/lib64/libnvvm.so'
os.environ['NUMBAPRO_LIBDEVICE'] = '/usr/local/cuda/nvvm/libdevice/'
os.environ['CONDA_PREFIX'] = '/usr/local'

Installing RAPIDS Stable 21.10
Starting the RAPIDS install on Colab.  This will take about 15 minutes.
Collecting package metadata (current_repodata.json): ...working... done
done

# All requested packages already installed.

RAPIDS conda installation complete.  Updating Colab's libraries...
Copying /usr/local/lib/libcudf.so to /usr/lib/libcudf.so
Copying /usr/local/lib/libnccl.so to /usr/lib/libnccl.so
Copying /usr/local/lib/libcuml.so to /usr/lib/libcuml.so
Copying /usr/local/lib/libcugraph.so to /usr/lib/libcugraph.so
Copying /usr/local/lib/libxgboost.so to /usr/lib/libxgboost.so
Copying /usr/local/lib/libcuspatial.so to /usr/lib/libcuspatial.so
Copying /usr/local/lib/libgeos.so to /usr/lib/libgeos.so
Copying /usr/local/lib/libgeos_c.so to /usr/lib/libgeos_c.so
time: 59.7 s (started: 2021-11-30 20:14:36 +00:00)


In [None]:
import warnings
warnings.filterwarnings("ignore")

time: 1.98 ms (started: 2021-11-30 20:15:36 +00:00)


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
time: 2.97 ms (started: 2021-11-30 20:15:36 +00:00)


In [None]:
import sys
import os
import random
import argparse
from time import time
from typing import Tuple

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn.model_selection import cross_val_score

import cv2
import rasterio
from rasterio.plot import show
from imutils import paths
import tqdm.notebook as tq
import imutils

import dask
from dask.distributed import Client
import xgboost as xgb
from xgboost import dask as dxgb
from dask import dataframe as dd
import dask_cudf
from dask_cuda import LocalCUDACluster
from dask.distributed import wait
from dask_ml.model_selection import train_test_split
import pyarrow as pa
import cupy

time: 18.2 ms (started: 2021-11-30 20:24:38 +00:00)


# Image Classification: Ground-Level Fire Imagery

## Setup

The commented cells are for reference purposes only, to document how I created the parquet file from the original image data. We don't want these cells to run every time, because they take around 30 minutes.

In [None]:
class Config:
    # initialize the path to the fire and non-fire dataset directories
    FIRE_PATH = os.path.sep.join(["/content/drive/MyDrive/datasets/Robbery_Accident_Fire_Database2",
        "Fire"])
    NON_FIRE_PATH = "/content/drive/MyDrive/datasets/spatial_envelope_256x256_static_8outdoorcategories"
    
    # initialize the class labels in the dataset
    CLASSES = ["Non-Fire", "Fire"]

    # define the size of the training and testing split
    TRAIN_SPLIT = 0.75
    TEST_SPLIT = 0.25
    MAX_SAMPLES = 1300
    
    # define the initial learning rate, batch size, and number of epochs
    INIT_LR = .01
    BATCH_SIZE = 64
    NUM_EPOCHS = 50

    # set the path to the serialized model after training
    MODEL_PATH = os.path.sep.join(["/content/drive/MyDrive/School/NYU/Big Data Fall 2021/Project/", "fire_detection.model"])
    # set the path to the parquet file
    PARQUET_PATH = os.path.sep.join(["/content/drive/MyDrive/datasets/", "groundfire.parquet"])
    
    # define the path to the output learning rate finder plot and
    # training history plot
    LRFIND_PLOT_PATH = os.path.sep.join(["/content/drive/MyDrive/School/NYU/Big Data Fall 2021/Project/", "lrfind_plot.png"])
    TRAINING_PLOT_PATH = os.path.sep.join(["/content/drive/MyDrive/School/NYU/Big Data Fall 2021/Project/", "training_plot.png"])

    # define the path to the output directory that will store our final
    # output with labels/annotations along with the number of images to
    # sample
    OUTPUT_IMAGE_PATH = os.path.sep.join(["/content/drive/MyDrive/School/NYU/Big Data Fall 2021/Project/", "examples"])
    SAMPLE_SIZE = 50

# initialize the configuration object
config = Config()

time: 16.6 ms (started: 2021-11-30 22:23:26 +00:00)
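
A note on the path handling in `Config`: `os.path.sep.join` just concatenates its arguments with the separator, so a component that already ends in `/` produces a doubled separator. On Linux the doubled separator still resolves to the same file, so the paths above work, but `os.path.join` avoids it. The sketch below uses `posixpath` (the Linux flavour of `os.path`) so that it behaves the same on any machine.

```python
import posixpath  # POSIX flavour of os.path, which is what Colab (Linux) uses

base = "/content/drive/MyDrive/datasets/"   # trailing slash, as in Config
name = "groundfire.parquet"

joined_with_sep = posixpath.sep.join([base, name])
joined_with_join = posixpath.join(base, name)

print(joined_with_sep)   # /content/drive/MyDrive/datasets//groundfire.parquet
print(joined_with_join)  # /content/drive/MyDrive/datasets/groundfire.parquet

# normpath collapses the doubled separator, so both refer to the same file
assert posixpath.normpath(joined_with_sep) == joined_with_join
```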


In [None]:
# def load_dataset(datasetPath, shape, max_samples):
#   # grab the paths to all images in our dataset directory, then
#   # initialize our lists of images
#   imagePaths = list(paths.list_images(datasetPath))
#   data = []

#   # loop over the image paths, limiting size of dataset to 1300 samples per class
#   count = 0
#   for imagePath in tq.tqdm(imagePaths):
#     count += 1
#     if count > max_samples:
#       print("breaking at {}".format(max_samples))
#       break
#     # load the image
#     image = cv2.imread(imagePath)
#     # resize it to a fixed 128x128 pixels, ignoring aspect ratio
#     image = cv2.resize(image, (128, 128))
#     if shape == "row":
#       image = np.asarray(image).flatten()
#     # add the image to the data lists
#     data.append(image)

#   # return the data list as a NumPy array, scaled to [0,1]
#   np_data = np.array(data, dtype="float32")
#   np_data /= 255
#   return np_data

time: 5.79 ms (started: 2021-11-30 20:15:38 +00:00)


In [None]:
# fireData = load_dataset(config.FIRE_PATH, "row", config.MAX_SAMPLES)
# # print(fireData.shape)
# nonFireData = load_dataset(config.NON_FIRE_PATH, "row", config.MAX_SAMPLES)
# fireData = np.concatenate([fireData, np.ones((fireData.shape[0],1),dtype=fireData.dtype)], axis=1)
# nonFireData = np.concatenate([nonFireData, np.zeros((nonFireData.shape[0],1),dtype=nonFireData.dtype)], axis=1)
# data = np.vstack([fireData, nonFireData])
# # print(data.shape)
# #2600 rows, 49153 cols

time: 2.88 ms (started: 2021-11-30 20:15:38 +00:00)
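
The "2600 rows, 49153 cols" shape noted in the cell above follows from the loader settings. A quick worked check, assuming three colour channels per image (the default for `cv2.imread`) and at least `MAX_SAMPLES` images available in each class:

```python
# Shape bookkeeping for the flattened ground-level images
height, width, channels = 128, 128, 3   # cv2.resize target and colour channels
features_per_image = height * width * channels
columns = features_per_image + 1         # one extra column for the label
rows = 2 * 1300                          # MAX_SAMPLES per class, two classes

print(features_per_image)  # 49152
print(columns)             # 49153
print(rows)                # 2600
```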


In [None]:
# data_T = data.transpose()
# arrays = [
#   pa.array(col)  # Create one arrow array per column
#   for col in data_T
# ]
# table = pa.Table.from_arrays(
#     arrays,
#     names=['feature-{}'.format(i) for i in range(len(arrays)-1)]+["label"] # give names to each columns
# )
# pa.parquet.write_table(table, 'groundfire.parquet')

time: 2.48 ms (started: 2021-11-30 20:15:38 +00:00)
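
The transpose-then-name step above is easier to follow on a toy example. The sketch below is a plain-Python stand-in for `data.transpose()` and the column naming, not the pyarrow code itself; the values are made up.

```python
# Three toy rows: two features followed by a label (values are made up)
rows = [
    [0.1, 0.2, 1.0],
    [0.3, 0.4, 0.0],
    [0.5, 0.6, 1.0],
]

# Transpose row-major data into one sequence per column
columns = [list(col) for col in zip(*rows)]

# Name every column "feature-i" except the last, which holds the label
names = ["feature-{}".format(i) for i in range(len(columns) - 1)] + ["label"]

table = dict(zip(names, columns))
print(table["feature-0"])  # [0.1, 0.3, 0.5]
print(table["label"])      # [1.0, 0.0, 1.0]
```

Parquet is a columnar format, which is why the image rows are turned into one array per pixel feature before writing.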


In [None]:
## Debug cells

# !ls -lh groundfire.parquet
# df = dask_cudf.read_parquet('groundfire.parquet')

time: 1.32 ms (started: 2021-11-30 20:15:38 +00:00)


In [None]:
## Download the parquet file

# from google.colab import files

# files.download('groundfire.parquet')

time: 1.32 ms (started: 2021-11-30 20:15:38 +00:00)


## Train and Test

If we were running a true simulation, we would not reuse our validation data as test data. However, for the purposes of demonstrating the methodology, this is not a concern.
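
For reference, a three-way split that keeps a separate test set could look like the sketch below. It works on plain row indices with the standard library rather than on Dask dataframes, and the function name and the 15%/15% fractions are my own illustrative choices, not what this notebook uses.

```python
import random


def three_way_split(indices, valid_frac=0.15, test_frac=0.15, seed=42):
    """Shuffle indices and split them into train, validation and test lists."""
    shuffled = list(indices)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_frac)
    n_valid = round(len(shuffled) * valid_frac)
    test = shuffled[:n_test]
    valid = shuffled[n_test:n_test + n_valid]
    train = shuffled[n_test + n_valid:]
    return train, valid, test


train_idx, valid_idx, test_idx = three_way_split(range(2600))
print(len(train_idx), len(valid_idx), len(test_idx))  # 1820 390 390
```

The validation part would drive early stopping, and the test part would be touched only once, for the final metrics.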

In [None]:
def load_groundfire(
    path,
) -> Tuple[
    dask_cudf.DataFrame, dask_cudf.Series, dask_cudf.DataFrame, dask_cudf.Series
]:
    df = dask_cudf.read_parquet(path)

    y = df["label"]
    X = df[df.columns.difference(["label"])]

    X_train, X_valid, y_train, y_valid = train_test_split(
        X, y, test_size=0.25, random_state=42, shuffle=True
    )
    X_train, X_valid, y_train, y_valid = client.persist(
        [X_train, X_valid, y_train, y_valid]
    )
    wait([X_train, X_valid, y_train, y_valid])

    return X_train, X_valid, y_train, y_valid

time: 14.8 ms (started: 2021-11-30 20:15:38 +00:00)


In [None]:
def fit_model_customized_es(client, X, y, X_valid, y_valid):
    early_stopping_rounds = 10

    es = xgb.callback.EarlyStopping(rounds=early_stopping_rounds, save_best=True)

    Xy = dxgb.DaskDeviceQuantileDMatrix(client, X, y)

    Xy_valid = dxgb.DaskDMatrix(client, X_valid, y_valid)

    booster = xgb.dask.train(
        client,
        {
            "objective": "binary:logistic",
            "eval_metric": "error",
            "tree_method": "gpu_hist",
        },
        Xy,
        evals=[(Xy_valid, "Valid")],
        num_boost_round=1000,
        callbacks=[es],
    )["booster"]
    return booster

time: 13.1 ms (started: 2021-11-30 20:15:38 +00:00)
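
To make the early-stopping rule concrete, here is a simplified plain-Python sketch of the idea: stop once a fixed number of rounds has passed without a strict improvement in the validation error. It is an illustration of the concept, not XGBoost's implementation, and the error values are made up.

```python
def early_stopping_round(errors, patience):
    """Return (best_round, stop_round) for a sequence of validation errors.

    Training stops at the first round that is `patience` rounds past the
    best round without a strict improvement. stop_round is None if the
    sequence ends before that happens.
    """
    best_round, best_error = 0, float("inf")
    for round_idx, error in enumerate(errors):
        if error < best_error:
            best_round, best_error = round_idx, error
        elif round_idx - best_round >= patience:
            return best_round, round_idx
    return best_round, None


# Made-up validation errors, for illustration only
errors = [0.30, 0.27, 0.25, 0.26, 0.25, 0.27, 0.24]
print(early_stopping_round(errors, patience=3))  # (2, 5)
```

Note that the later, better value of 0.24 is never reached: that is the trade-off of a short patience, and the reason the function above uses 10 rounds rather than 3.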


In [None]:
def explain(client, model, X):
   # Use array instead of dataframe in case of output dim is greater than 2.
   X_array = X.values
   contribs = dxgb.predict(
       client, model, X_array, pred_contribs=True, validate_features=False
   )
   # Use the result for further analysis
   return contribs

time: 3.56 ms (started: 2021-11-30 20:15:38 +00:00)


In [None]:
def predict(client, model, X):
    predt = dxgb.predict(client, model, X)
    assert isinstance(predt, dd.Series)
    return predt

time: 2.16 ms (started: 2021-11-30 20:15:38 +00:00)
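
With the `binary:logistic` objective the predictions are probabilities, which the notebook rounds to 0/1 labels with `np.rint` before scoring. The sketch below shows the same idea in plain Python with made-up numbers; `to_labels` and `confusion_counts` are hypothetical helpers, not scikit-learn functions.

```python
def to_labels(probabilities):
    """Round each probability to the nearest integer label (0 or 1)."""
    return [round(p) for p in probabilities]


def confusion_counts(y_true, y_pred):
    """Return (true_neg, false_pos, false_neg, true_pos) for binary labels."""
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    return tn, fp, fn, tp


# Made-up probabilities and labels, for illustration only
probs = [0.91, 0.12, 0.67, 0.40, 0.08, 0.75]
truth = [1, 0, 0, 1, 0, 1]

labels = to_labels(probs)
tn, fp, fn, tp = confusion_counts(truth, labels)
print(labels)                  # [1, 0, 1, 0, 0, 1]
print(tn, fp, fn, tp)          # 2 1 1 2
print((tn + tp) / len(truth))  # accuracy
```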


The cell below preserves pointers to the model, dataframes and results locally.

In a true distributed environment (i.e., in production), we would perform all operations inside Dask's context manager so that the load could be distributed over multiple nodes. But in Colab we only have one node, and keeping these pointers allows us to inspect and manipulate the data after the context manager has exited (or even if it crashes!).

One thing we cannot do is alter the dataframes. If we do, Dask detects it and raises an error:

```
Inputs contain futures that were created by another client
```

A comment on timing: the data processing runs 4-5x slower in Dask/cuDF than it does in scikit-learn/NumPy. Distributing data only saves time in a true distributed environment.

In [None]:
X_train, X_valid, y_train, y_valid, booster, preds, contribs = None, None, None, None, None, None, None

time: 2.03 ms (started: 2021-11-30 20:15:38 +00:00)
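
The pattern is simply "declare the names first, assign them inside the `with` block". A minimal sketch with a hypothetical stand-in context manager (`fake_cluster`, not a Dask class) shows that the assigned value survives both the exit of the block and a crash inside it:

```python
from contextlib import contextmanager


@contextmanager
def fake_cluster():
    """Hypothetical stand-in for a cluster/client context manager."""
    print("cluster up")
    try:
        yield "client"
    finally:
        print("cluster down")


# Declare the name first so it exists even if the block below fails early
result = None

try:
    with fake_cluster() as client:
        result = [1, 2, 3]          # work done while the cluster is alive
        raise RuntimeError("simulated crash")
except RuntimeError as err:
    print("caught:", err)

# The context manager has exited, but the result is still reachable
print(result)  # [1, 2, 3]
```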


In [None]:
if __name__ == "__main__":
   start_time = time()
   with LocalCUDACluster() as cluster:
       print("dashboard:", cluster.dashboard_link)
       with Client(cluster) as client:
          print("Load data ...")
          X_train, X_valid, y_train, y_valid = load_groundfire(config.PARQUET_PATH)
          print("Load complete.")
          print("Begin model training...")
          booster = fit_model_customized_es(client, X_train, y_train, X_valid, y_valid)
          print("Training complete.")
          preds = predict(client, booster, X_valid)
          #print("type of preds and yvalid")
          #print(type(preds)) #dask.dataframe.core.series
          #print(type(y_valid)) #dask_cudf.core.series
          preds_np = cupy.asnumpy(preds.to_dask_array())
          preds_np = np.rint(preds_np) #round predictions to nearest integer (0,1)
          #print(type(preds_np)) #numpy array
          y_valid_np = cupy.asnumpy(y_valid.compute().values)
          #print(type(y_valid_np)) #numpy array
          # contribs = explain(client, booster, X_train)
          # preds = preds.as_matrix().compute()
          # preds = np.asarray(preds.to_dask_array())
          # y_valid = y_valid.as_matrix().compute()
          # y_valid = np.asarray(y_valid.to_dask_array())
          # contribs = contribs.compute().as_matrix()
          print("---METRICS---"); print(metrics.classification_report(y_valid_np, preds_np))
          print("---CONFUSION MATRIX---"); print(metrics.confusion_matrix(y_valid_np, preds_np))
          # print("---CONTRIBUTIONS TO PREDICTION---"); print(contribs)
   print("Execution time: %s seconds" % (time() - start_time))

dashboard: http://127.0.0.1:8787/status
Load data ...
Load complete.
Begin model training...
[0]	Valid-error:0.30194
[1]	Valid-error:0.27803
[2]	Valid-error:0.27354
[3]	Valid-error:0.27055
[4]	Valid-error:0.26906
[5]	Valid-error:0.24066
[6]	Valid-error:0.23019
[7]	Valid-error:0.23916
[8]	Valid-error:0.24365
[9]	Valid-error:0.24215
[10]	Valid-error:0.23318
[11]	Valid-error:0.22571
[12]	Valid-error:0.23468
[13]	Valid-error:0.23019
[14]	Valid-error:0.22720
[15]	Valid-error:0.22571
[16]	Valid-error:0.22421
[17]	Valid-error:0.22123
[18]	Valid-error:0.21674
[19]	Valid-error:0.22272
[20]	Valid-error:0.21375
[21]	Valid-error:0.20777
[22]	Valid-error:0.20329
[23]	Valid-error:0.19731
[24]	Valid-error:0.19581
[25]	Valid-error:0.19581
[26]	Valid-error:0.19581
[27]	Valid-error:0.19731
[28]	Valid-error:0.20329
[29]	Valid-error:0.20030
[30]	Valid-error:0.19880
[31]	Valid-error:0.19432
[32]	Valid-error:0.19731
[33]	Valid-error:0.19581
[34]	Valid-error:0.19880
[35]	Valid-error:0.19432
[36]	Valid-error:

# Image Classification: LANDSAT-8 Imagery

## Setup

The commented cells are for reference purposes only, to document how I created the parquet file from the original image data. We don't want these cells to run every time, because they take around 30 minutes.

In [None]:
class Config:
    # initialize the path to the fire and non-fire dataset directories
    FIRE_PATH = os.path.sep.join(["/content/drive/MyDrive/datasets/landsat_mini/Training",
        "Fire"])
    NON_FIRE_PATH = os.path.sep.join(["/content/drive/MyDrive/datasets/landsat_mini/Training", "No_Fire"])
    
    # initialize the class labels in the dataset
    CLASSES = ["Non-Fire", "Fire"]

    # define the size of the training and testing split
    TRAIN_SPLIT = 0.75
    TEST_SPLIT = 0.25
    MAX_SAMPLES = 1300
    
    # define the initial learning rate, batch size, and number of epochs
    INIT_LR = .01
    BATCH_SIZE = 64
    NUM_EPOCHS = 50

    # set the path to the serialized model after training
    MODEL_PATH = os.path.sep.join(["/content/drive/MyDrive/School/NYU/Big Data Fall 2021/Project/", "fire_detection_xgb_landsat.model"])

    PARQUET_PATH = os.path.sep.join(["/content/drive/MyDrive/datasets/", "landsatfire.parquet"])

    
    # define the path to the output learning rate finder plot and
    # training history plot
    LRFIND_PLOT_PATH = os.path.sep.join(["/content/drive/MyDrive/School/NYU/Big Data Fall 2021/Project/", "lrfind_plot_xgb_landsat.png"])
    TRAINING_PLOT_PATH = os.path.sep.join(["/content/drive/MyDrive/School/NYU/Big Data Fall 2021/Project/", "training_plot_xgb_landsat.png"])

    # define the path to the output directory that will store our final
    # output with labels/annotations along with the number of images to
    # sample
    OUTPUT_IMAGE_PATH = os.path.sep.join(["/content/drive/MyDrive/School/NYU/Big Data Fall 2021/Project/", "examples"])
    SAMPLE_SIZE = 50

# initialize the configuration object
config = Config()

time: 24.6 ms (started: 2021-11-30 23:00:22 +00:00)


In [None]:
# def load_dataset(datasetPath, shape, max_samples):
#   # grab the paths to all images in our dataset directory, then
#   # initialize our lists of images
#   imagePaths = list(paths.list_images(datasetPath))
#   data = []

#   # loop over the image paths, limiting size of dataset to 1300 samples per class
#   count = 0
#   for imagePath in tq.tqdm(imagePaths):
#     count += 1
#     if count > max_samples:
#       print("breaking at {}".format(max_samples))
#       break
#     # read bands 2, 3 and 7 and reshape them to (rows, cols, bands)
#     img = rasterio.open(imagePath)
#     image = rasterio.plot.reshape_as_image(img.read((2,3,7)))
#     if shape == "row":
#       image = image.flatten()
#     # add the image to the data lists
#     data.append(image)

#   # return the data list as a NumPy array
#   np_data = np.array(data, dtype="float32")
#   # print(np_data.shape)
#   # Scale by the value range (this maps to [0,1] only when the minimum is 0)
#   np_data /= (np_data.max()-np_data.min())
#   return np_data

time: 16.3 ms (started: 2021-11-30 22:51:51 +00:00)
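
One caveat about the scaling step in the loader above: dividing by `max - min` lands in [0,1] only if the minimum pixel value is 0. I have not checked whether that holds for every LANDSAT tile, so the sketch below contrasts the two variants on made-up values; both function names are hypothetical.

```python
def scale_by_range(values):
    """Divide by (max - min), as the commented loader does."""
    span = max(values) - min(values)
    return [v / span for v in values]


def min_max_scale(values):
    """Subtract the minimum first, so the result always spans [0, 1]."""
    low, span = min(values), max(values) - min(values)
    return [(v - low) / span for v in values]


# Made-up pixel values with a non-zero minimum
pixels = [100.0, 150.0, 200.0]
print(scale_by_range(pixels))  # [1.0, 1.5, 2.0]
print(min_max_scale(pixels))   # [0.0, 0.5, 1.0]
```

Tree-based models such as XGBoost are insensitive to this kind of monotonic rescaling, so the difference should not change the results here, but it would matter for a model that expects inputs in a fixed range.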


In [None]:
# fireData = load_dataset(config.FIRE_PATH, "row", config.MAX_SAMPLES)
# # print(fireData.shape)
# nonFireData = load_dataset(config.NON_FIRE_PATH, "row", config.MAX_SAMPLES)
# fireData = np.concatenate([fireData, np.ones((fireData.shape[0],1),dtype=fireData.dtype)], axis=1)
# nonFireData = np.concatenate([nonFireData, np.zeros((nonFireData.shape[0],1),dtype=nonFireData.dtype)], axis=1)
# data = np.vstack([fireData, nonFireData])
# # print(data.shape)

  0%|          | 0/242 [00:00<?, ?it/s]

  0%|          | 0/128 [00:00<?, ?it/s]

time: 2min 59s (started: 2021-11-30 22:51:51 +00:00)


In [None]:
# data_T = data.transpose()
# arrays = [
#   pa.array(col)  # Create one arrow array per column
#   for col in data_T
# ]
# table = pa.Table.from_arrays(
#     arrays,
#     names=['feature-{}'.format(i) for i in range(len(arrays)-1)]+["label"] # give names to each columns
# )
# pa.parquet.write_table(table, 'landsatfire.parquet')

time: 4.27 s (started: 2021-11-30 22:54:50 +00:00)


In [None]:
## Debug cells

# !ls -lh landsatfire.parquet
# df = dask_cudf.read_parquet('landsatfire.parquet')

time: 1.94 ms (started: 2021-11-30 22:54:55 +00:00)


In [None]:
# # Download the parquet file

# from google.colab import files

# files.download('landsatfire.parquet')

time: 10.6 ms (started: 2021-11-30 22:54:55 +00:00)


## Train and Test

If we were running a true simulation, we would not reuse our validation data as test data. However, for the purposes of demonstrating the methodology, this is not a concern, so long as we remember that the accuracy values reported here are optimistic: early stopping selects the model on the same data that is then used to score it.

In [None]:
def load_groundfire(
    path,
) -> Tuple[
    dask_cudf.DataFrame, dask_cudf.Series, dask_cudf.DataFrame, dask_cudf.Series
]:
    df = dask_cudf.read_parquet(path)

    y = df["label"]
    X = df[df.columns.difference(["label"])]

    X_train, X_valid, y_train, y_valid = train_test_split(
        X, y, test_size=0.25, random_state=42, shuffle=True
    )
    X_train, X_valid, y_train, y_valid = client.persist(
        [X_train, X_valid, y_train, y_valid]
    )
    wait([X_train, X_valid, y_train, y_valid])

    return X_train, X_valid, y_train, y_valid

time: 9.59 ms (started: 2021-11-30 23:00:10 +00:00)


In [None]:
def fit_model_customized_es(client, X, y, X_valid, y_valid):
    early_stopping_rounds = 10

    es = xgb.callback.EarlyStopping(rounds=early_stopping_rounds, save_best=True)

    Xy = dxgb.DaskDeviceQuantileDMatrix(client, X, y)

    Xy_valid = dxgb.DaskDMatrix(client, X_valid, y_valid)

    booster = xgb.dask.train(
        client,
        {
            "objective": "binary:logistic",
            "eval_metric": "error",
            "tree_method": "gpu_hist",
        },
        Xy,
        evals=[(Xy_valid, "Valid")],
        num_boost_round=1000,
        callbacks=[es],
    )["booster"]
    return booster

time: 9.5 ms (started: 2021-11-30 23:00:10 +00:00)


In [None]:
def explain(client, model, X):
   # Use array instead of dataframe in case of output dim is greater than 2.
   X_array = X.values
   contribs = dxgb.predict(
       client, model, X_array, pred_contribs=True, validate_features=False
   )
   # Use the result for further analysis
   return contribs

time: 4.74 ms (started: 2021-11-30 23:00:10 +00:00)


In [None]:
def predict(client, model, X):
    predt = dxgb.predict(client, model, X)
    assert isinstance(predt, dd.Series)
    return predt

time: 3.82 ms (started: 2021-11-30 23:00:10 +00:00)


The cell below preserves pointers to the model, dataframes and results locally.

In a true distributed environment (i.e., in production), we would perform all operations inside Dask's context manager so that the load could be distributed over multiple nodes. But in Colab we only have one node, and keeping these pointers allows us to inspect and manipulate the data after the context manager has exited (or even if it crashes!).

One thing we cannot do is alter the dataframes. If we do, Dask detects it and raises an error:

```
Inputs contain futures that were created by another client
```

A comment on timing: the data processing runs 4-5x slower in Dask/cuDF than it does in scikit-learn/NumPy. Distributing data only saves time in a true distributed environment.

In [None]:
X_train, X_valid, y_train, y_valid, booster, preds, contribs = None, None, None, None, None, None, None

In [None]:
if __name__ == "__main__":
   start_time = time()
   with LocalCUDACluster() as cluster:
       print("dashboard:", cluster.dashboard_link)
       with Client(cluster) as client:
          print("Load data ...")
          X_train, X_valid, y_train, y_valid = load_groundfire("./landsatfire.parquet")
          print("Load complete.")
          print("Begin model training...")
          booster = fit_model_customized_es(client, X_train, y_train, X_valid, y_valid)
          print("Training complete.")
          preds = predict(client, booster, X_valid)
          #print("type of preds and yvalid")
          #print(type(preds)) #dask.dataframe.core.series
          #print(type(y_valid)) #dask_cudf.core.series
          preds_np = cupy.asnumpy(preds.to_dask_array()) 
          preds_np = np.rint(preds_np) #round predictions to nearest integer (0,1)
          print("type of preds_np and y_valid_df")
          #print(type(preds_np)) #numpy array
          y_valid_np = cupy.asnumpy(y_valid.compute().values)
          # print(type(y_valid_np)) #numpy array
          # contribs = explain(client, booster, X_train)
          # contribs = contribs.compute().as_matrix()
          print("---METRICS---"); print(metrics.classification_report(y_valid_np, preds_np))
          print("---CONFUSION MATRIX---"); print(metrics.confusion_matrix(y_valid_np, preds_np))
          # print("---CONTRIBUTIONS TO PREDICTION---"); print(contribs)
   print("Execution time: %s seconds" % (time() - start_time))

dashboard: http://127.0.0.1:8787/status
Load data ...
Load complete.
Begin model training...
[0]	Valid-error:0.13043
[1]	Valid-error:0.09783
[2]	Valid-error:0.09783
[3]	Valid-error:0.10870
[4]	Valid-error:0.08696
[5]	Valid-error:0.08696
[6]	Valid-error:0.06522
[7]	Valid-error:0.05435
[8]	Valid-error:0.06522
[9]	Valid-error:0.05435
[10]	Valid-error:0.05435
[11]	Valid-error:0.05435
[12]	Valid-error:0.05435
[13]	Valid-error:0.05435
[14]	Valid-error:0.05435
[15]	Valid-error:0.05435
[16]	Valid-error:0.06522
[17]	Valid-error:0.04348
[18]	Valid-error:0.05435
[19]	Valid-error:0.04348
[20]	Valid-error:0.05435
[21]	Valid-error:0.04348
[22]	Valid-error:0.04348
[23]	Valid-error:0.04348
[24]	Valid-error:0.04348
[25]	Valid-error:0.04348
[26]	Valid-error:0.04348
Training complete.
type of preds_np and y_valid_df
---METRICS---
              precision    recall  f1-score   support

         0.0       0.94      0.94      0.94        35
         1.0       0.96      0.96      0.96        57

    accuracy 