# Environment Sanity Check #

Click the _Runtime_ dropdown at the top of the page, then _Change Runtime Type_ and confirm the instance type is _GPU_.

Check the output of `!nvidia-smi` to make sure you've been allocated a Tesla T4, P4, or P100.

In [1]:
!nvidia-smi

Tue Nov 30 13:54:23 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.44       Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   44C    P0    29W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces
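
The same check can be scripted instead of read off the table by eye. The sketch below is only an illustration: the helper names `is_supported_gpu` and `query_gpu_names` are hypothetical, and the list of accepted models is taken from the text above.

```python
import subprocess

# GPU models named in the text above; adjust if your requirements differ.
SUPPORTED_GPUS = ("T4", "P4", "P100")


def is_supported_gpu(gpu_name):
    """Return True if a supported model appears as a whole token in the name."""
    tokens = gpu_name.replace("-", " ").split()
    return any(model in tokens for model in SUPPORTED_GPUS)


def query_gpu_names():
    """Ask nvidia-smi for the GPU names; return an empty list if it is unavailable."""
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
            capture_output=True, text=True, check=True,
        )
    except (FileNotFoundError, subprocess.CalledProcessError):
        return []
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


for name in query_gpu_names():
    print(name, "->", "OK" if is_supported_gpu(name) else "not in the supported list")
```

Matching on whole tokens avoids accepting a "P40" when only "P4" is listed.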

# Setup xgboost, dask-ml, RAPIDS
The setup script:
1. Updates gcc in Colab.
1. Installs Conda.
1. Installs the current stable version of the RAPIDS libraries, as well as some external libraries, including:
  1. cuDF
  1. cuML
  1. cuGraph
  1. cuSpatial
  1. cuSignal
  1. BlazingSQL
  1. xgboost
1. Copies the RAPIDS .so files into the current working directory, a necessary workaround for RAPIDS+Colab integration.
1. Installs additional modules (dask-ml, pytorch lightning) using pip to avoid extremely slow conda environment checks. The modules will be visible in conda, as will the native Colab modules.

Overall, expect the installs to take 20-30 minutes. They also require multiple restarts of the Colab instance.
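
Because the install spans several kernel restarts, it can help to confirm which packages are importable before moving on. This is a minimal sketch using only the standard library; `missing_modules` is a hypothetical helper, and the module list is copied from the import cell later in this notebook.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Top-level packages imported later in this notebook.
REQUIRED = ["dask_cudf", "dask_cuda", "xgboost", "dask_ml", "rasterio", "cupy"]
print("missing:", missing_modules(REQUIRED))
```

`find_spec` only locates a package without importing it, so the check is quick and does not initialize the GPU.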

In [1]:
# This gets the RAPIDS-Colab install files and checks your GPU.  Run this and the next cell only.
# Please read the output of this cell.  If your Colab Instance is not RAPIDS compatible, it will warn you and give you remediation steps.
!git clone https://github.com/rapidsai/rapidsai-csp-utils.git
!python rapidsai-csp-utils/colab/env-check.py

fatal: destination path 'rapidsai-csp-utils' already exists and is not an empty directory.
***********************************************************************
Woo! Your instance has the right kind of GPU, a Tesla P100-PCIE-16GB!
***********************************************************************



In [None]:
# This will update the Colab environment and restart the kernel.  Don't run the next cell until you see the session crash.
!bash rapidsai-csp-utils/colab/update_gcc.sh
import os
os._exit(00)

Updating your Colab environment.  This will restart your kernel.  Don't Panic!
Hit:1 http://security.ubuntu.com/ubuntu bionic-security InRelease
Hit:2 http://ppa.launchpad.net/c2d4u.team/c2d4u4.0+/ubuntu bionic InRelease
Hit:3 http://ppa.launchpad.net/cran/libgit2/ubuntu bionic InRelease
Hit:4 https://cloud.r-project.org/bin/linux/ubuntu bionic-cran40/ InRelease
Hit:5 http://ppa.launchpad.net/deadsnakes/ppa/ubuntu bionic InRelease
Hit:6 http://ppa.launchpad.net/graphics-drivers/ppa/ubuntu bionic InRelease
Hit:7 http://archive.ubuntu.com/ubuntu bionic InRelease
Hit:8 http://ppa.launchpad.net/ubuntu-toolchain-r/test/ubuntu bionic InRelease
Hit:9 http://archive.ubuntu.com/ubuntu bionic-updates InRelease
Hit:10 http://archive.ubuntu.com/ubuntu bionic-backports InRelease
Ign:11 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  InRelease
Ign:12 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64  InRelease
Hit:13 https://developer.

In [1]:
!pip install dask-ml rasterio ipython-autotime
%load_ext autotime

Collecting dask-ml
  Using cached dask_ml-2021.11.16-py3-none-any.whl (147 kB)
Collecting rasterio
  Downloading rasterio-1.2.10-cp37-cp37m-manylinux1_x86_64.whl (19.3 MB)
     |████████████████████████████████| 19.3 MB 1.3 MB/s
Collecting ipython-autotime
  Downloading ipython_autotime-0.3.1-py2.py3-none-any.whl (6.8 kB)
Collecting distributed>=2.4.0
  Using cached distributed-2021.11.2-py3-none-any.whl (802 kB)
Collecting multipledispatch>=0.4.9
  Using cached multipledispatch-0.6.0-py3-none-any.whl (11 kB)
Collecting numba>=0.51.0
  Downloading numba-0.54.1-cp37-cp37m-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (3.3 MB)
     |████████████████████████████████| 3.3 MB 42.4 MB/s
Collecting dask[array,dataframe]>=2.4.0
  Downloading dask-2021.11.2-py3-none-any.whl (1.0 MB)
     |████████████████████████████████| 1.0 MB 34.5 MB/s
Collecting dask-glm>=0.2.0
  Using cached dask_glm-0.2.0-py2.py3-none-any.whl (12 kB)
Collecting scikit-learn>=1.0.0
  Download

time: 167 µs (started: 2021-11-30 13:57:44 +00:00)


In [2]:
# This will install CondaColab.  This will restart your kernel one last time.  Run this cell by itself and only run the next cell once you see the session crash.
import condacolab
condacolab.install()

⏬ Downloading https://github.com/jaimergp/miniforge/releases/latest/download/Mambaforge-colab-Linux-x86_64.sh...
📦 Installing...
📌 Adjusting configuration...
🩹 Patching environment...
⏲ Done in 0:00:31
🔁 Restarting kernel...


In [9]:
# you can now run the rest of the cells as normal
import condacolab
condacolab.check()

✨🍰✨ Everything looks OK!


In [2]:
# Installing RAPIDS is now 'python rapidsai-csp-utils/colab/install_rapids.py <release> <packages>'
# The <release> options are 'stable' and 'nightly'.  Leaving it blank or adding any other words will default to stable.
# The <packages> options are blank (the default) or 'core'.  By default, we install RAPIDSAI and BlazingSQL.  The 'core' option installs only RAPIDSAI, without BlazingSQL.
!python rapidsai-csp-utils/colab/install_rapids.py stable
import os
os.environ['NUMBAPRO_NVVM'] = '/usr/local/cuda/nvvm/lib64/libnvvm.so'
os.environ['NUMBAPRO_LIBDEVICE'] = '/usr/local/cuda/nvvm/libdevice/'
os.environ['CONDA_PREFIX'] = '/usr/local'

Installing RAPIDS Stable 21.10
Starting the RAPIDS install on Colab.  This will take about 15 minutes.
Collecting package metadata (current_repodata.json): ...working... done
failed with initial frozen solve. Retrying with flexible solve.
failed with repodata from current_repodata.json, will retry with next repodata source.
Collecting package metadata (repodata.json): ...working... done
done

## Package Plan ##

  environment location: /usr/local

  added / updated specs:
    - cudatoolkit=11.2
    - gcsfs
    - llvmlite
    - openssl
    - python=3.7
    - rapids=21.10


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    abseil-cpp-20210324.2      |       h9c3ff4c_0        1010 KB  conda-forge
    aiohttp-3.7.4.post0        |   py37h5e8e339_0         625 KB  conda-forge
    anyio-3.4.0                |   py37h89c1867_0         149 KB  conda-forge
    appdirs-1.4.4              |     pyh9f0

In [10]:
import warnings
warnings.filterwarnings("ignore")

In [12]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [60]:
import sys
import os
import random
import argparse
from time import time
from typing import Tuple

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn.model_selection import cross_val_score

import cv2
import rasterio
from rasterio.plot import show
from imutils import paths
import tqdm.notebook as tq
import imutils

import xgboost as xgb
from xgboost import dask as dxgb
from dask import dataframe as dd
import dask_cudf
from dask_cuda import LocalCUDACluster
from distributed import Client, wait
from dask_ml.model_selection import train_test_split
import pyarrow as pa
import pyarrow.parquet  # makes pa.parquet.write_table available
import cupy

# Image Classification: Ground-Level Fire Imagery

## Setup

The commented cells are for reference purposes only, to document how I created the parquet file from the original image data. We don't want these cells to run every time, because they take around 30 minutes.

In [13]:
class Config:
    # initialize the path to the fire and non-fire dataset directories
    FIRE_PATH = os.path.sep.join(["/content/drive/MyDrive/datasets/Robbery_Accident_Fire_Database2",
        "Fire"])
    NON_FIRE_PATH = "/content/drive/MyDrive/datasets/spatial_envelope_256x256_static_8outdoorcategories"
    
    # initialize the class labels in the dataset
    CLASSES = ["Non-Fire", "Fire"]

    # define the size of the training and testing split
    TRAIN_SPLIT = 0.75
    TEST_SPLIT = 0.25
    MAX_SAMPLES = 1300
    
    # define the initial learning rate, batch size, and number of epochs
    INIT_LR = .01
    BATCH_SIZE = 64
    NUM_EPOCHS = 50

    # set the path to the serialized model after training
    MODEL_PATH = os.path.sep.join(["/content/drive/MyDrive/School/NYU/Big Data Fall 2021/Project/", "fire_detection.model"])
    # set the path to the parquet file
    PARQUET_PATH = os.path.sep.join(["/content/drive/MyDrive/datasets/", "groundfire.parquet"])
    
    # define the path to the output learning rate finder plot and
    # training history plot
    LRFIND_PLOT_PATH = os.path.sep.join(["/content/drive/MyDrive/School/NYU/Big Data Fall 2021/Project/", "lrfind_plot.png"])
    TRAINING_PLOT_PATH = os.path.sep.join(["/content/drive/MyDrive/School/NYU/Big Data Fall 2021/Project/", "training_plot.png"])

    # define the path to the output directory that will store our final
    # output with labels/annotations along with the number of images to
    # sample
    OUTPUT_IMAGE_PATH = os.path.sep.join(["/content/drive/MyDrive/School/NYU/Big Data Fall 2021/Project/", "examples"])
    SAMPLE_SIZE = 50

# initialize the configuration object
config = Config()

In [2]:
# def load_dataset(datasetPath, shape, max_samples):
#   # grab the paths to all images in our dataset directory, then
#   # initialize our lists of images
#   imagePaths = list(paths.list_images(datasetPath))
#   data = []

#   # loop over the image paths, limiting size of dataset to 1300 samples per class
#   count = 0
#   for imagePath in tq.tqdm(imagePaths):
#     count += 1
#     if count > max_samples:
#       print("breaking at {}".format(max_samples))
#       break
#     # load the image
#     image = cv2.imread(imagePath)
#     # resize it to a fixed 128x128 pixels, ignoring aspect ratio
#     image = cv2.resize(image, (128, 128))
#     if shape == "row":
#       image = np.asarray(image).flatten()
#     # add the image to the data lists
#     data.append(image)

#   # return the data list as a NumPy array
#   np_data = np.array(data, dtype="float32")
#   # Scale to [0,1]
#   np_data /= 255
#   return np_data

In [3]:
# fireData = load_dataset(config.FIRE_PATH, "row", config.MAX_SAMPLES)
# # print(fireData.shape)
# nonFireData = load_dataset(config.NON_FIRE_PATH, "row", config.MAX_SAMPLES)
# fireData = np.concatenate([fireData, np.ones((fireData.shape[0],1),dtype=fireData.dtype)], axis=1)
# nonFireData = np.concatenate([nonFireData, np.zeros((nonFireData.shape[0],1),dtype=nonFireData.dtype)], axis=1)
# data = np.vstack([fireData, nonFireData])
# # print(data.shape)
# #2600 rows, 49153 cols
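
The shape noted in the last comment follows directly from the preprocessing. The short sketch below reproduces the arithmetic, assuming both classes contain at least `MAX_SAMPLES` images and that `cv2.imread` returns three color channels; `table_shape` is a hypothetical helper.

```python
def table_shape(samples_per_class, n_classes, height, width, channels):
    """Rows and columns of the flattened image table, including the label column."""
    rows = samples_per_class * n_classes
    cols = height * width * channels + 1  # +1 for the appended label column
    return rows, cols


print(table_shape(1300, 2, 128, 128, 3))  # -> (2600, 49153)
```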

In [4]:
# data_T = data.transpose()
# arrays = [
#   pa.array(col)  # Create one arrow array per column
#   for col in data_T
# ]
# table = pa.Table.from_arrays(
#     arrays,
#     names=['feature-{}'.format(i) for i in range(len(arrays)-1)]+["label"] # give names to each columns
# )
# pa.parquet.write_table(table, 'groundfire.parquet')

In [5]:
## Debug cells

# !ls -lh groundfire.parquet
# df = dask_cudf.read_parquet('groundfire.parquet')

In [8]:
## Download the parquet file

# from google.colab import files

# files.download('groundfire.parquet')

## Train and Test

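If we were running a rigorous evaluation, we would not reuse our validation data as test data: early stopping already picks the best boosting round on the validation set, so metrics computed on that same set are optimistic. However, for the purposes of demonstrating the methodology, this is not a concern.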
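
For reference, a held-out test set only requires one more cut of the shuffled rows. The sketch below is not what this notebook does; it is a dependency-free illustration on row indices, and the function name and the 15%/15% fractions are assumptions.

```python
import random


def three_way_split(n_rows, valid_frac=0.15, test_frac=0.15, seed=42):
    """Shuffle row indices and cut them into train, validation, and test lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = round(n_rows * test_frac)
    n_valid = round(n_rows * valid_frac)
    test = indices[:n_test]
    valid = indices[n_test:n_test + n_valid]
    train = indices[n_test + n_valid:]
    return train, valid, test


train_idx, valid_idx, test_idx = three_way_split(2600)
print(len(train_idx), len(valid_idx), len(test_idx))
```

The validation rows would drive early stopping, and the test rows would be touched only once, for the final metrics.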

In [49]:
def load_groundfire(
    path,
) -> Tuple[
    dask_cudf.DataFrame, dask_cudf.Series, dask_cudf.DataFrame, dask_cudf.Series
]:
    df = dask_cudf.read_parquet(path)

    y = df["label"]
    X = df[df.columns.difference(["label"])]

    X_train, X_valid, y_train, y_valid = train_test_split(
        X, y, test_size=0.25, random_state=42, shuffle=True
    )
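    # `client` is the module-level Dask client opened by the training cell;
    # persist the splits on the workers and wait until they are materialized
    X_train, X_valid, y_train, y_valid = client.persist(
        [X_train, X_valid, y_train, y_valid]
    )
    wait([X_train, X_valid, y_train, y_valid])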

    return X_train, X_valid, y_train, y_valid

In [16]:
def fit_model_customized_es(client, X, y, X_valid, y_valid):
    early_stopping_rounds = 10

    es = xgb.callback.EarlyStopping(rounds=early_stopping_rounds, save_best=True)

    Xy = dxgb.DaskDeviceQuantileDMatrix(client, X, y)

    Xy_valid = dxgb.DaskDMatrix(client, X_valid, y_valid)

    booster = xgb.dask.train(
        client,
        {
            "objective": "binary:logistic",
            "eval_metric": "error",
            "tree_method": "gpu_hist",
        },
        Xy,
        evals=[(Xy_valid, "Valid")],
        num_boost_round=1000,
        callbacks=[es],
    )["booster"]
    return booster
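
The early-stopping rule used above can be illustrated without XGBoost. The sketch below is a simplified stand-in rather than the library's implementation: it stops once `patience` rounds pass without a new best validation error and reports which round was best, which is the round `save_best=True` is meant to keep. The function name and the sample error values are made up for illustration.

```python
def best_round_with_early_stopping(errors, patience):
    """Return (best_round, rounds_run) for a sequence of per-round validation errors.

    Iteration stops once `patience` rounds have passed without a new best error.
    """
    best_round, best_error = 0, float("inf")
    rounds_run = 0
    for round_idx, err in enumerate(errors):
        rounds_run = round_idx + 1
        if err < best_error:
            best_round, best_error = round_idx, err
        elif round_idx - best_round >= patience:
            break
    return best_round, rounds_run


validation_errors = [0.30, 0.25, 0.20, 0.21, 0.22, 0.23, 0.19]
print(best_round_with_early_stopping(validation_errors, patience=3))  # -> (2, 6)
```

With a patience of 3 the later improvement at round 6 is never reached, which is the usual trade-off between training time and the risk of stopping early.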

In [17]:
def explain(client, model, X):
   # Use array instead of dataframe in case of output dim is greater than 2.
   X_array = X.values
   contribs = dxgb.predict(
       client, model, X_array, pred_contribs=True, validate_features=False
   )
   # Use the result for further analysis
   return contribs
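
To interpret the result: with `pred_contribs=True`, each output row holds one contribution per feature plus a final bias term, and the row sums to the raw margin (the score before the sigmoid). The sketch below shows how one such row maps back to a probability; the numbers are invented and `probability_from_contribs` is a hypothetical helper.

```python
import math


def probability_from_contribs(contribs_row):
    """Sum the per-feature contributions and bias, then apply the sigmoid."""
    margin = sum(contribs_row)
    return 1.0 / (1.0 + math.exp(-margin))


# Three made-up feature contributions followed by the bias term.
row = [1.2, -0.4, 0.3, -0.1]
print(round(probability_from_contribs(row), 4))
```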

In [18]:
def predict(client, model, X):
    predt = dxgb.predict(client, model, X)
    assert isinstance(predt, dd.Series)
    return predt

The cell below keeps module-level references to the model, dataframes, and results.

In a true distributed environment (i.e., in production), we would perform all operations inside Dask's context manager so that the load could be distributed over multiple nodes. In Colab we have only one node, and keeping these references lets us inspect the data after the context manager has exited, or even after it crashes.

One thing we cannot do is alter the dataframes afterwards. If we do, Dask detects it and raises an error:

```
Inputs contain futures that were created by another client
```

A comment on timing: the data processing runs 4-5x slower in Dask/cuDF than it does in scikit-learn/NumPy. Distributing data only saves time in a true distributed environment.
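
A comparison like this is easiest to reproduce when each stage is timed the same way. The helper below is a small sketch using only the standard library; `timed` and the stage labels are hypothetical and not part of the notebook.

```python
from contextlib import contextmanager
from time import perf_counter


@contextmanager
def timed(label, results):
    """Record the wall-clock duration of the enclosed block under `label`."""
    start = perf_counter()
    try:
        yield
    finally:
        results[label] = perf_counter() - start


timings = {}
with timed("toy workload", timings):
    total = sum(i * i for i in range(100_000))
print(timings)
```

Wrapping the load, train, and predict stages separately would show which one accounts for the slowdown.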

In [55]:
X_train, X_valid, y_train, y_valid, booster, preds, contribs = None, None, None, None, None, None, None

In [84]:
# y_valid.to_dask_array()

        Array    Chunk
Bytes   unknown  unknown
Shape   (nan,)   (nan,)
Count   2 Tasks  1 Chunks
Type    float32  numpy.ndarray


In [None]:
if __name__ == "__main__":
   start_time = time()
   with LocalCUDACluster() as cluster:
       print("dashboard:", cluster.dashboard_link)
       with Client(cluster) as client:
          print("Load data ...")
          X_train, X_valid, y_train, y_valid = load_groundfire(config.PARQUET_PATH)
          print("Load complete.")
          print("Begin model training...")
          booster = fit_model_customized_es(client, X_train, y_train, X_valid, y_valid)
          print("Training complete.")
          preds = predict(client, booster, X_valid)
          # contribs = explain(client, booster, X_train)
          # scikit-learn needs host arrays of class labels: gather the results,
          # copy them to NumPy, and threshold the binary:logistic probabilities at 0.5
          y_true = cupy.asnumpy(y_valid.compute().values).astype("int32")
          y_pred = (cupy.asnumpy(preds.compute().values) > 0.5).astype("int32")
          print("---METRICS---"); print(metrics.classification_report(y_true, y_pred))
          print("---CONFUSION MATRIX---"); print(metrics.confusion_matrix(y_true, y_pred))
          # print("---CONTRIBUTIONS TO PREDICTION---"); print(contribs)
   print("Execution time: %s seconds" % (time() - start_time))

dashboard: http://127.0.0.1:8787/status
Load data ...
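
Because the `binary:logistic` objective produces probabilities rather than class labels, predictions have to be thresholded before a confusion matrix can be computed. The dependency-free sketch below shows that step on invented values; `to_labels` and `confusion_matrix_2x2` are hypothetical helpers, with rows as true classes and columns as predicted classes.

```python
def to_labels(probabilities, threshold=0.5):
    """Map probabilities to 0/1 labels; values above the threshold become 1."""
    return [1 if p > threshold else 0 for p in probabilities]


def confusion_matrix_2x2(y_true, y_pred):
    """Count outcomes with rows = true class and columns = predicted class."""
    matrix = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        matrix[int(t)][int(p)] += 1
    return matrix


y_true = [0, 1, 1, 0]
y_pred = to_labels([0.10, 0.90, 0.40, 0.70])
print(confusion_matrix_2x2(y_true, y_pred))  # -> [[1, 1], [1, 1]]
```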


# Image Classification: XGBoost, LANDSAT-8 Imagery

## Setup

In [None]:
class Config:
    # initialize the path to the fire and non-fire dataset directories
    FIRE_PATH = os.path.sep.join(["/content/drive/MyDrive/datasets/landsat_mini/Training",
        "Fire"])
    NON_FIRE_PATH = os.path.sep.join(["/content/drive/MyDrive/datasets/landsat_mini/Training", "No_Fire"])
    
    # initialize the class labels in the dataset
    CLASSES = ["Non-Fire", "Fire"]

    # define the size of the training and testing split
    TRAIN_SPLIT = 0.75
    TEST_SPLIT = 0.25
    MAX_SAMPLES = 1300
    
    # define the initial learning rate, batch size, and number of epochs
    INIT_LR = .01
    BATCH_SIZE = 64
    NUM_EPOCHS = 50

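    # set the path to the serialized model after training
    MODEL_PATH = os.path.sep.join(["/content/drive/MyDrive/School/NYU/Big Data Fall 2021/Project/", "fire_detection_xgb_landsat.model"])
    # NOTE: the training cell below reads config.PARQUET_PATH, which this Config
    # does not define; set it to the LANDSAT parquet file before running that cell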
    
    # define the path to the output learning rate finder plot and
    # training history plot
    LRFIND_PLOT_PATH = os.path.sep.join(["/content/drive/MyDrive/School/NYU/Big Data Fall 2021/Project/", "lrfind_plot_xgb_landsat.png"])
    TRAINING_PLOT_PATH = os.path.sep.join(["/content/drive/MyDrive/School/NYU/Big Data Fall 2021/Project/", "training_plot_xgb_landsat.png"])

    # define the path to the output directory that will store our final
    # output with labels/annotations along with the number of images to
    # sample
    OUTPUT_IMAGE_PATH = os.path.sep.join(["/content/drive/MyDrive/School/NYU/Big Data Fall 2021/Project/", "examples"])
    SAMPLE_SIZE = 50

# initialize the configuration object
config = Config()

time: 20.1 ms (started: 2021-11-26 19:10:42 +00:00)


In [None]:
def load_dataset(datasetPath, shape, max_samples):
  # grab the paths to all images in our dataset directory, then
  # initialize our lists of images
  imagePaths = list(paths.list_images(datasetPath))
  data = []

  # loop over the image paths, limiting size of dataset to 1300 samples per class
  count = 0
  for imagePath in tq.tqdm(imagePaths):
    count += 1
    if count > max_samples:
      print("breaking at {}".format(max_samples))
      break
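    # read bands 2, 3, and 7, closing the file once the pixels are in memory
    with rasterio.open(imagePath) as img:
      bands = img.read((2,3,7))
    # reorder from (bands, rows, cols) to (rows, cols, bands)
    image = rasterio.plot.reshape_as_image(bands)
    if shape == "row":
      image = image.flatten()
    # add the image to the data lists
    data.append(image)

  # return the data list as a NumPy array
  np_data = np.array(data, dtype="float32")
  # print(np_data.shape)
  # Min-max scale to [0,1]; subtracting the minimum first keeps the result
  # in range even when the smallest pixel value is not zero
  np_data -= np_data.min()
  np_data /= np_data.max()
  return np_data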

time: 9.28 ms (started: 2021-11-26 19:10:42 +00:00)
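
Scaling to [0, 1] means subtracting the minimum before dividing by the range; dividing by the range alone only lands in [0, 1] when the minimum is zero. The sketch below contrasts the two on a toy list, with hypothetical helper names.

```python
def min_max_scale(values):
    """Scale values linearly so the smallest maps to 0.0 and the largest to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant input: avoid dividing by zero
    return [(v - lo) / (hi - lo) for v in values]


def divide_by_range(values):
    """Divide by (max - min) without shifting, for comparison."""
    span = max(values) - min(values)
    return [v / span for v in values]


pixels = [10, 20, 30]
print(min_max_scale(pixels))    # -> [0.0, 0.5, 1.0]
print(divide_by_range(pixels))  # -> [0.5, 1.0, 1.5]
```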


In [None]:
# fireData = load_dataset(config.FIRE_PATH, "row", config.MAX_SAMPLES)
# # print(fireData.shape)
# nonFireData = load_dataset(config.NON_FIRE_PATH, "row", config.MAX_SAMPLES)
# fireData = np.concatenate([fireData, np.ones((fireData.shape[0],1),dtype=fireData.dtype)], axis=1)
# nonFireData = np.concatenate([nonFireData, np.zeros((nonFireData.shape[0],1),dtype=nonFireData.dtype)], axis=1)
# data = np.vstack([fireData, nonFireData])
# # print(data.shape)
# #2600 rows, 49153 cols

In [None]:
# data_T = data.transpose()
# arrays = [
#   pa.array(col)  # Create one arrow array per column
#   for col in data_T
# ]
# table = pa.Table.from_arrays(
#     arrays,
#     names=['feature-{}'.format(i) for i in range(len(arrays)-1)]+["label"] # give names to each columns
# )
# pa.parquet.write_table(table, 'groundfire.parquet')

In [None]:
## Debug cells

# !ls -lh groundfire.parquet
# df = dask_cudf.read_parquet('groundfire.parquet')

In [None]:
## Download the parquet file

# from google.colab import files

# files.download('landsatfire.parquet')

## Train and Test

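If we were running a rigorous evaluation, we would not reuse our validation data as test data: early stopping already picks the best boosting round on the validation set, so metrics computed on that same set are optimistic. However, for the purposes of demonstrating the methodology, this is not a concern.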

In [None]:
def load_groundfire(
    path,
) -> Tuple[
    dask_cudf.DataFrame, dask_cudf.Series, dask_cudf.DataFrame, dask_cudf.Series
]:
    df = dask_cudf.read_parquet(path)

    y = df["label"]
    X = df[df.columns.difference(["label"])]

    X_train, X_valid, y_train, y_valid = train_test_split(
        X, y, test_size=0.25, random_state=42, shuffle=True
    )
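    # `client` is the module-level Dask client opened by the training cell;
    # persist the splits on the workers and wait until they are materialized
    X_train, X_valid, y_train, y_valid = client.persist(
        [X_train, X_valid, y_train, y_valid]
    )
    wait([X_train, X_valid, y_train, y_valid])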

    return X_train, X_valid, y_train, y_valid

In [None]:
def fit_model_customized_es(client, X, y, X_valid, y_valid):
    early_stopping_rounds = 10

    es = xgb.callback.EarlyStopping(rounds=early_stopping_rounds, save_best=True)

    Xy = dxgb.DaskDeviceQuantileDMatrix(client, X, y)

    Xy_valid = dxgb.DaskDMatrix(client, X_valid, y_valid)

    booster = xgb.dask.train(
        client,
        {
            "objective": "binary:logistic",
            "eval_metric": "error",
            "tree_method": "gpu_hist",
        },
        Xy,
        evals=[(Xy_valid, "Valid")],
        num_boost_round=1000,
        callbacks=[es],
    )["booster"]
    return booster

In [None]:
def explain(client, model, X):
   # Use array instead of dataframe in case of output dim is greater than 2.
   X_array = X.values
   contribs = dxgb.predict(
       client, model, X_array, pred_contribs=True, validate_features=False
   )
   # Use the result for further analysis
   return contribs

In [None]:
def predict(client, model, X):
    predt = dxgb.predict(client, model, X)
    assert isinstance(predt, dd.Series)
    return predt

The cell below keeps module-level references to the model, dataframes, and results.

In a true distributed environment (i.e., in production), we would perform all operations inside Dask's context manager so that the load could be distributed over multiple nodes. In Colab we have only one node, and keeping these references lets us inspect the data after the context manager has exited, or even after it crashes.

One thing we cannot do is alter the dataframes afterwards. If we do, Dask detects it and raises an error:

```
Inputs contain futures that were created by another client
```

A comment on timing: the data processing runs 4-5x slower in Dask/cuDF than it does in scikit-learn/NumPy. Distributing data only saves time in a true distributed environment.

In [None]:
X_train, X_valid, y_train, y_valid, booster, preds, contribs = None, None, None, None, None, None, None

In [None]:
if __name__ == "__main__":
   start_time = time()
   with LocalCUDACluster() as cluster:
       print("dashboard:", cluster.dashboard_link)
       with Client(cluster) as client:
          print("Load data ...")
          X_train, X_valid, y_train, y_valid = load_groundfire(config.PARQUET_PATH)
          print("Load complete.")
          print("Begin model training...")
          booster = fit_model_customized_es(client, X_train, y_train, X_valid, y_valid)
          print("Training complete.")
          preds = predict(client, booster, X_valid)
          contribs = explain(client, booster, X_train)
          # scikit-learn needs host arrays of class labels: gather the results,
          # copy them to NumPy, and threshold the binary:logistic probabilities at 0.5
          y_true = cupy.asnumpy(y_valid.compute().values).astype("int32")
          y_pred = (cupy.asnumpy(preds.compute().values) > 0.5).astype("int32")
          print("---METRICS---"); print(metrics.classification_report(y_true, y_pred))
          print("---CONFUSION MATRIX---"); print(metrics.confusion_matrix(y_true, y_pred))
          print("---CONTRIBUTIONS TO PREDICTION---"); print(contribs)
   print("Execution time: %s seconds" % (time() - start_time))

dashboard: http://127.0.0.1:8787/status
Load data ...
Load complete.
Begin model training...


# Example: XGBoost

https://developer.nvidia.com/blog/accelerating-xgboost-on-gpu-clusters-with-dask/

https://medium.com/rapids-ai/a-new-official-dask-api-for-xgboost-e8b10f3d1eb7

In [None]:
#!rm HIGGS.csv.gz

In [None]:
#!wget -N https://archive.ics.uci.edu/ml/machine-learning-databases/00280/HIGGS.csv.gz

In [None]:
#!gunzip HIGGS.csv.gz

In [None]:
# from dask.distributed import Client
# import xgboost as xgb
# from dask import dataframe as dd
# import dask_cudf
# from dask_cuda import LocalCUDACluster

### Training

## 1.3 Version

In [None]:
# def main(client):
#     # We use HIGGS as the dataset for demonstration.
#     fname = 'HIGGS.csv'
#     colnames = ['label'] + ['feature-%02d' % i for i in range(1, 29)]
#     # By default dask dataframe uses pandas as data handling backend. Use dask cudf for acceleration
#     dask_df = dask_cudf.read_csv(fname, header=None, names=colnames)
#     y = dask_df['label']
#     X = dask_df[dask_df.columns.difference(['label'])]
#     # DaskDMatrix acts like normal DMatrix, works as a proxy for local
#     # DMatrix scattered around the workers.
#     dtrain = xgb.dask.DaskDMatrix(client, X, y)
#     # Use train method from xgboost.dask instead of xgboost.  This
#     # distributed version of train returns a dictionary containing the
#     # resulting booster and evaluation history obtained from
#     # evaluation metrics.
#     output = xgb.dask.train(client,
#                             # Use GPU training algorithm
#                             {'tree_method': 'gpu_hist'},
#                             dtrain,
#                             num_boost_round=100,
#                             evals=[(dtrain, 'train')])
#     booster = output['booster']  # booster is the trained model
#     history = output['history']  # A dictionary containing evaluation results
#     # Save the model to file
#     booster.save_model('xgboost-model')
#     print('Training evaluation history:', history)
#     booster.set_param({'predictor': 'gpu_predictor'})
#     # where X is a dask DataFrame or dask Array.
#     prediction = xgb.dask.predict(client, booster, dtrain)

In [None]:
# if __name__ == '__main__':
#     # `LocalCUDACluster` is used for assigning GPU to XGBoost
#     # processes. Here `n_workers` represents the number of GPUs
#     # since we use one GPU per worker process.
#     with LocalCUDACluster() as cluster:
#         with Client(cluster) as client:
#             main(client)

## 1.4 Version

In [None]:
# import os
# from time import time
# from typing import Tuple

# from dask import dataframe as dd
# from dask_cuda import LocalCUDACluster
# from distributed import Client, wait
# import dask_cudf
# from dask_ml.model_selection import train_test_split

# import xgboost as xgb
# from xgboost import dask as dxgb
# import numpy as np
# import argparse

In [None]:
# def main(client):
#     # We use HIGGS as the dataset for demonstration.
#     parquet = to_parquet()
#     X_train, X_valid, y_train, y_valid = load_higgs(parquet)
#     booster = fit_model_customized_objective(client, X_train, y_train, X_valid, y_valid)
#     preds = inplace_predict_multi_parts(client, booster, X_train, X_valid)
#     print(preds)

In [None]:
# def to_parquet() -> str:
#    """Convert the HIGGS.csv file to parquet files."""
#    dirpath = "./"
#    parquet_path = os.path.join(dirpath, "HIGGS.parquet")
#    if os.path.exists(parquet_path):
#        return parquet_path
#    csv_path = os.path.join(dirpath, "HIGGS.csv")
#    colnames = ["label"] + ["feature-%02d" % i for i in range(1, 29)]
#    df = dask_cudf.read_csv(csv_path, header=None, names=colnames, dtype=np.float32)
#    df.to_parquet(parquet_path)
#    return parquet_path

In [None]:
# def load_higgs(
#     path,
# ) -> Tuple[
#     dask_cudf.DataFrame, dask_cudf.Series, dask_cudf.DataFrame, dask_cudf.Series
# ]:
#     df = dask_cudf.read_parquet(path)

#     y = df["label"]
#     X = df[df.columns.difference(["label"])]

#     X_train, X_valid, y_train, y_valid = train_test_split(
#         X, y, test_size=0.33, random_state=42
#     )
#     X_train, X_valid, y_train, y_valid = client.persist(
#         [X_train, X_valid, y_train, y_valid]
#     )
#     wait([X_train, X_valid, y_train, y_valid])

#     return X_train, X_valid, y_train, y_valid

In [None]:
# def fit_model_es(client, X, y, X_valid, y_valid) -> xgb.Booster:
#    early_stopping_rounds = 5
#    Xy = dxgb.DaskDeviceQuantileDMatrix(client, X, y)
#    Xy_valid = dxgb.DaskDMatrix(client, X_valid, y_valid)
#    # train the model
#    booster = dxgb.train(
#        client,
#        {
#            "objective": "binary:logistic",
#            "eval_metric": "error",
#            "tree_method": "gpu_hist",
#        },
#        Xy,
#        evals=[(Xy_valid, "Valid")],
#        num_boost_round=1000,
#        early_stopping_rounds=early_stopping_rounds,
#    )["booster"]
#    return booster

In [None]:
# def fit_model_customized_es(client, X, y, X_valid, y_valid):
#     early_stopping_rounds = 5
#     es = xgb.callback.EarlyStopping(rounds=early_stopping_rounds, save_best=True)
#     Xy = dxgb.DaskDeviceQuantileDMatrix(client, X, y)
#     Xy_valid = dxgb.DaskDMatrix(client, X_valid, y_valid)
#     # train the model
#     booster = xgb.dask.train(
#         client,
#         {
#             "objective": "binary:logistic",
#             "eval_metric": "error",
#             "tree_method": "gpu_hist",
#         },
#         Xy,
#         evals=[(Xy_valid, "Valid")],
#         num_boost_round=1000,
#         callbacks=[es],
#     )["booster"]
#     return booster

In [None]:
# def fit_model_customized_objective(client, X, y, X_valid, y_valid) -> dxgb.Booster:
#     def logit(predt: np.ndarray, Xy: xgb.DMatrix) -> Tuple[np.ndarray, np.ndarray]:
#         predt = 1.0 / (1.0 + np.exp(-predt))
#         labels = Xy.get_label()
#         grad = predt - labels
#         hess = predt * (1.0 - predt)
#         return grad, hess

#     def error(predt: np.ndarray, Xy: xgb.DMatrix) -> Tuple[str, float]:
#         label = Xy.get_label()
#         r = np.zeros(predt.shape)
#         predt = 1.0 / (1.0 + np.exp(-predt))
#         gt = predt > 0.5
#         r[gt] = 1 - label[gt]
#         le = predt <= 0.5
#         r[le] = label[le]
#         return "CustomErr", float(np.average(r))

#     # Use early stopping with custom objective and metric.
#     early_stopping_rounds = 5
#     # Specify the metric we want to use for early stopping.
#     es = xgb.callback.EarlyStopping(
#     rounds=early_stopping_rounds, save_best=True, metric_name="CustomErr"
#     )

#     Xy = dxgb.DaskDeviceQuantileDMatrix(client, X, y)
#     Xy_valid = dxgb.DaskDMatrix(client, X_valid, y_valid)
#     booster = dxgb.train(
#         client,
#         {"eval_metric": "error", "tree_method": "gpu_hist"},
#         Xy,
#         evals=[(Xy_valid, "Valid")],
#         num_boost_round=1000,
#         obj=logit,  # pass the custom objective
#         feval=error,  # pass the custom metric
#         callbacks=[es],
#     )["booster"]
#     return booster
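
The custom objective in the commented cell above returns the first and second derivatives of the logistic loss with respect to the raw margin. As a sanity check, the sketch below computes the same two quantities for a single example and compares the gradient against a finite-difference estimate; it uses only the standard library, and all function names are hypothetical.

```python
import math


def sigmoid(margin):
    """Map a raw margin to a probability."""
    return 1.0 / (1.0 + math.exp(-margin))


def log_loss(margin, label):
    """Logistic loss of one example, written in terms of the raw margin."""
    p = sigmoid(margin)
    return -(label * math.log(p) + (1.0 - label) * math.log(1.0 - p))


def logistic_grad_hess(margin, label):
    """Gradient (p - y) and hessian p * (1 - p) of the loss w.r.t. the margin."""
    p = sigmoid(margin)
    return p - label, p * (1.0 - p)


def numeric_grad(margin, label, eps=1e-6):
    """Central finite-difference estimate of the gradient, for checking."""
    return (log_loss(margin + eps, label) - log_loss(margin - eps, label)) / (2 * eps)


grad, hess = logistic_grad_hess(0.0, 1.0)
print(grad, hess)  # -> -0.5 0.25
print(abs(numeric_grad(0.7, 1.0) - logistic_grad_hess(0.7, 1.0)[0]) < 1e-6)  # -> True
```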

In [None]:
# def explain(client, model, X):
#    # Use array instead of dataframe in case of output dim is greater than 2.
#    X_array = X.values
#    contribs = dxgb.predict(
#        client, model, X_array, pred_contribs=True, validate_features=False
#    )
#    # Use the result for further analysis
#    return contribs

In [None]:
# def predict(client, model, X):
#     predt = dxgb.predict(client, model, X)
#     assert isinstance(predt, dd.Series)
#     return predt

In [None]:
# def inplace_predict(client, model, X):
#     # Use inplace_predict instead of standard predict.
#     predt = dxgb.inplace_predict(client, model, X)
#     assert isinstance(predt, dd.Series)
#     return predt

In [None]:
# def inplace_predict_multi_parts(client, model, X_train, X_valid):
#     """Simulate the scenario that we need to run prediction on multiple datasets using train
#     and valid. In the real world the number of datasets is unlimited.
#     """
#     # prescatter the model onto workers
#     model_f = client.scatter(model)
#     predictions = []
#     for X in [X_train, X_valid]:
#         # Use inplace_predict instead of standard predict.
#         predt = dxgb.inplace_predict(client, model_f, X)
#         assert isinstance(predt, dd.Series)
#         predictions.append(predt)
#     # return after the loop so that both datasets are predicted
#     return predictions

In [None]:
# if __name__ == "__main__":
#    with LocalCUDACluster() as cluster:
#        print("dashboard:", cluster.dashboard_link)
#        with Client(cluster) as client:
#            main(client)