# Scale Out Demo: K-means (CPU-to-GPU-to-MNMG)

### Motivation
How can a data scientist streamline the path from a proof-of-concept to a fully distributed product feature?

It is a great feeling when you demonstrate an interesting new capability that impresses your leadership so much that they want to fast-track it into production.  It is a not-so-great feeling when you realize how much work is needed to transform your early-stage prototype into customer-ready software at a global scale.

RAPIDS helps position your software to scale out quickly, with minimal code changes, even for multi-node, multi-GPU (MNMG) deployments.  RAPIDS does this by aligning its APIs with those of standard, CPU-based Python libraries.  RAPIDS enables both the productivity you expect from Python and the performance you need for global scale.

In general, consider scale-out technologies when your data is too big to fit on a single GPU or when you need to simultaneously analyze data that is spread across multiple files.

### This Notebook
This notebook is intended to:
1. Introduce scaling out with RAPIDS and Dask.
2. Introduce how Dask works "under the hood."
3. Show how to modify Python code to leverage RAPIDS and Dask.

This notebook is *not* intended to:
1. Address complexities from data pre-processing and messy datasets.
2. Address specific algorithm-level details.

Key items to note:
1. Adopting RAPIDS typically involves only changing import statements and occasional syntax differences.
    * RAPIDS strives to maintain the same API as other Python libraries.
    * You won't have to refactor a lot of function signatures.
2. Scaling out with Dask requires additional code on both CPUs and GPUs.
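As a rough illustration of the "only change the import" idea, the sketch below picks a DataFrame library at import time and then runs the same code against it. The helper `load_first_available` and the alias `xdf` are hypothetical names used only for this example; they are not part of RAPIDS or pandas.

```python
import importlib


def load_first_available(module_names):
    """Return the first importable module from module_names, or None."""
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue
    return None


# Prefer the GPU library and fall back to its CPU counterpart.
xdf = load_first_available(["cudf", "pandas"])

column_total = None
if xdf is not None:
    # The same DataFrame code runs against either library.
    df = xdf.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]})
    column_total = int(df["x"].sum())
```

In a real project you would usually just write `import cudf` instead of `import pandas as pd`; the fallback is shown only to make the sketch runnable on machines without a GPU.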

### Technology overview
The software libraries you use depend on what hardware you will use.

![image-4.png](attachment:image-4.png)
> The choice of technologies to use depends on the hardware you intend to use.

pandas is a Python library for data manipulation and analysis on CPUs.

cuDF is a Python GPU DataFrame library that provides a pandas-like API.  cuDF is built on the Apache Arrow in-memory columnar format, which supports fast analytics and efficient data exchange between systems.

Dask is a Python library for parallel computing.  Dask is a core technology for scaling out software.  The Dask API is largely consistent with locally run Python libraries (pandas, NumPy, scikit-learn), which makes getting up to speed relatively easy.  The partitions of a Dask DataFrame are each pandas DataFrames.
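To make the partitioning idea concrete, here is a toy sketch in plain Python. It is not Dask's API: ordinary lists stand in for the pandas DataFrames that make up a Dask DataFrame, and the function names are hypothetical. Each partition is reduced independently, then the partial results are combined.

```python
def split_into_partitions(rows, npartitions):
    """Split a list of rows into contiguous, nearly equal partitions."""
    size, extra = divmod(len(rows), npartitions)
    partitions, start = [], 0
    for i in range(npartitions):
        stop = start + size + (1 if i < extra else 0)
        partitions.append(rows[start:stop])
        start = stop
    return partitions


def partial_sum_and_count(partition):
    """Work done independently on one partition (the 'map' step)."""
    return sum(partition), len(partition)


rows = [float(v) for v in range(10)]
partitions = split_into_partitions(rows, npartitions=3)

# Each partition could be processed by a different worker.
partials = [partial_sum_and_count(p) for p in partitions]

# Combine the per-partition results (the 'reduce' step).
total, count = map(sum, zip(*partials))
distributed_mean = total / count
```

The combined result matches the mean computed over the whole list, which is the property that lets a library spread work across partitions without changing the answer.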

Dask-cuDF is the Python GPU DataFrame library for distributed computation on Dask.

For distributed algorithms, the API requires a different input format than the locally run algorithms.  Dask's API requires a Dask DataFrame or Array as input.  cuML's API for distributed algorithms is very similar to the Dask DataFrame API.  The difference is that the underlying data frames are cuDF, not pandas.  Dask arrays backed by CuPy are also available.

For more information, see:

* pandas: https://pandas.pydata.org/
* scikit-learn: https://scikit-learn.org/stable/
* RAPIDS: https://github.com/rapidsai | https://docs.rapids.ai/ | https://medium.com/rapids-ai
* cuDF: https://docs.rapids.ai/api/cudf/stable
* cuML: https://github.com/rapidsai/cuml
* Dask: https://dask.org/
* Dask-cuDF: https://github.com/rapidsai/dask-cudf


### K-means
K-means is a convenient algorithm for showcasing scaling out because it is easy to understand and is implemented for all quadrants discussed in this notebook.

K-means is a basic yet powerful clustering method optimized via Expectation Maximization.  It starts by randomly selecting K data points in X as initial centroids and assigns each sample to its closest centroid.  For every resulting cluster of points, a mean is computed, and this becomes the new centroid.  The assignment and update steps repeat until the centroids stop changing.
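The loop described above can be sketched in a few lines of plain Python. This is a toy version for small 2-D examples, not the scikit-learn or cuML implementation; the function name `kmeans` and its parameters are assumptions made for this illustration.

```python
import math
import random


def kmeans(points, k, n_iter=20, seed=0):
    """Toy k-means for a list of equal-length coordinate tuples."""
    rng = random.Random(seed)
    # Initialization: pick k distinct data points as the starting centroids.
    centroids = rng.sample(points, k)
    labels = []
    for _ in range(n_iter):
        # Assignment step: label each point with its nearest centroid.
        labels = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                  for p in points]
        # Update step: move each centroid to the mean of its assigned points.
        new_centroids = []
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                new_centroids.append(
                    tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(centroids[j])  # Keep an empty cluster's centroid.
        if new_centroids == centroids:
            break  # Converged: the centroids stopped moving.
        centroids = new_centroids
    return centroids, labels


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, labels = kmeans(points, k=2)
```

On this tiny dataset the two well-separated groups end up in different clusters, with each centroid at the mean of its group.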

For CPU architectures, a k-means implementation is available in scikit-learn for local computation and in Dask-ML for distributed computation.  The "k-means++" initialization method is more stable than randomly selecting k points.
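As a minimal sketch of the idea behind "k-means++" seeding (a toy in plain Python, not the scikit-learn code; the function name is hypothetical): the first centroid is chosen uniformly at random, and each later centroid is sampled with probability proportional to its squared distance from the nearest centroid chosen so far.

```python
import math
import random


def kmeans_plus_plus_init(points, k, seed=0):
    """Toy k-means++ seeding: spread the initial centroids apart."""
    rng = random.Random(seed)
    # First centroid: chosen uniformly at random.
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        # Squared distance from each point to its nearest chosen centroid.
        weights = [min(math.dist(p, c) ** 2 for c in centroids) for p in points]
        # Next centroid: sampled in proportion to that squared distance, so
        # far-away points are favored and already-chosen points have weight 0.
        centroids.append(rng.choices(points, weights=weights, k=1)[0])
    return centroids


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
initial_centroids = kmeans_plus_plus_init(points, k=3)
```

Because points close to an existing centroid get small weights, the starting centroids tend to land in different clusters, which is why this initialization is less likely to produce a poor clustering than uniform random selection.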

Scikit-learn does not have a distributed k-means implementation, but Dask-ML does.

For GPU architectures, a k-means implementation is available in cuML, which can be run on a single GPU or distributed in a multi-node multi-GPU (MNMG) fashion.

cuML's k-means supports the "k-means||" initialization method, which is a scalable variant of "k-means++."

cuML models can take input as array-like objects, either in host (CPU) as NumPy arrays or in device (GPU) as Numba or cuda_array_interface-compliant arrays, as well as cuDF data frames.

For more information, see:

* scikit-learn's k-means: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
* Dask ML's k-means: https://ml.dask.org/modules/generated/dask_ml.cluster.KMeans.html
* cuML's k-means:  https://docs.rapids.ai/api/cuml/stable/api.html#cuml.KMeans

## Imports

In [None]:
# Data generation.
from cuml.datasets import make_blobs

# Local compute.
import pandas as pd                              # Q1
from sklearn.cluster import KMeans as cpuKMeans  # Q1
import cudf                                      # Q2
from cuml.cluster import KMeans as gpuKMeans     # Q2

# Distributed compute.
from dask.distributed import Client, wait
from dask_ml.cluster import KMeans as daskKMeans           # Q3
import dask_cudf                                           # Q4
from dask_cuda import LocalCUDACluster                     # Q4
from cuml.dask.cluster.kmeans import KMeans as mnmgKMeans  # Q4

# Comparing results.
import cupy
from sklearn.metrics import adjusted_rand_score

# Install matplotlib on demand if it is missing.
try:
    import matplotlib.pyplot as plt
except ImportError:
    print('Installing matplotlib.')
    !conda install -q -c conda-forge -y matplotlib
    import matplotlib.pyplot as plt

# Enable drawing images in this notebook.  No need for interactive graphics.
%matplotlib inline

## Generate data
This dataset is for illustration purposes only.

In [None]:
# Generate a dataset.
n_samples = 100000
n_features = 2
n_clusters = 6
random_state = 0
cluster_std = 0.1

input_data, input_labels = make_blobs(n_samples=n_samples,
                                      n_features=n_features,
                                      centers=n_clusters,
                                      random_state=random_state,
                                      cluster_std=cluster_std)

# Save the data for CPU compute.
data_cpu = pd.DataFrame(input_data.get())   # Explicit copy from GPU (CuPy) to host (NumPy).
labels_cpu = pd.Series(input_labels.get())  # Explicit conversion to NumPy array.

# Save the data for GPU compute.
data_gpu = cudf.DataFrame(input_data)
labels_gpu = cudf.Series(input_labels)  # Same API as for pandas, but you don't need to call "get()" here.

# Plot the raw data and labels.
fig = plt.figure(figsize=(16,10))
plt.scatter(data_cpu.iloc[:,0], data_cpu.iloc[:,1], c=labels_cpu, s=50, cmap='viridis')

## Local compute

### Q1: CPU model (Host)

In [None]:
# Instantiate, train and predict.
kmeans_cpu = cpuKMeans(init="k-means++",  # "k-means++" avoids poor clustering.
                       n_clusters=n_clusters,
                       random_state=random_state)
kmeans_cpu.fit(data_cpu)
labels_cpu = kmeans_cpu.predict(data_cpu)

print('CPU done.')

### Q2: GPU model (Device)
RAPIDS has the same API for k-means as scikit-learn.

In [None]:
# Instantiate, train and predict.
kmeans_gpu = gpuKMeans(init="k-means||",  # "k-means||" is a scalable variant of "k-means++".
                       n_clusters=n_clusters,
                       random_state=random_state)
kmeans_gpu.fit(data_gpu)
labels_gpu = kmeans_gpu.predict(data_gpu)

print('GPU done.')

## Dask introduction

The additional steps required to scale out your code with Dask are:
* Setting up Dask.
* Implementing your code to be compatible with Dask's execution paradigm.

![image.png](attachment:image.png)
> High level collections are used to generate task graphs which can be executed on a single machine or a cluster. Using the Distributed scheduler enables creation of a Dask cluster for multi-machine computation.
(image from: https://docs.dask.org/en/latest/how-to/deploy-dask-clusters.html)

### Distributed setup
Setting up Dask requires instantiating a cluster and a Client.
1. The Dask cluster has a distributed scheduler and a number of workers.
    * We'll use the default schedulers.
    * Typically, the number of workers equals the number of available computing resources (for example, one worker per GPU).
2. A Client points to the Dask cluster.
    * Instantiating the Client will make code run on the Dask cluster it points to.

Note: Using Dask on a single computer does not require any setup.  Dask will distribute computation using the computer's thread pool.
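The cluster, worker and client roles can be loosely illustrated with the standard library's thread pool. This is only an analogy, not Dask code: the executor stands in for the cluster, its threads for the workers, and `submit` for handing work to the cluster through a client. The function `chunk_sum` and the sample data are made up for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk_sum(chunk):
    """Work that one worker performs on its piece of the data."""
    return sum(chunk)


chunks = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

# The executor plays the role of the cluster; its threads are the workers.
with ThreadPoolExecutor(max_workers=3) as pool:
    # Submitting returns futures immediately; the work runs on the workers.
    futures = [pool.submit(chunk_sum, chunk) for chunk in chunks]
    # Gather the results back from the workers.
    partial_sums = [future.result() for future in futures]

total = sum(partial_sums)
```

Dask's `Client.submit` follows the same submit-then-collect pattern, with the difference that the workers may live in other processes or on other machines.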

### Distributed implementation
Unlike the local CPU and GPU implementations, the Dask implementation is a two-step process.
1. Build the Dask task graph.
    * This step largely mirrors the familiar local Python APIs (pandas, NumPy, scikit-learn).
    * All Dask methods prior to the next step create the Dask task graph.
2. Execute the Dask task graph.
    * Do this by calling <font color=red>`compute`</font> on the Dask DataFrame.
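The two steps can be sketched with a toy task graph in plain Python. The dictionary below borrows the general shape of a Dask graph (keys mapped to either data or a tuple of a function and its arguments), but the `execute` function is a simplified stand-in written for this example, not Dask's scheduler.

```python
from operator import mul


def execute(graph, key, cache=None):
    """Recursively evaluate one key of a toy task graph."""
    cache = {} if cache is None else cache
    if key in cache:
        return cache[key]
    node = graph[key]
    if isinstance(node, tuple) and callable(node[0]):
        func, *args = node
        # Arguments that name other keys are computed first; others are literals.
        resolved = [execute(graph, a, cache) if isinstance(a, str) and a in graph else a
                    for a in args]
        result = func(*resolved)
    else:
        result = node  # Plain data stored in the graph.
    cache[key] = result
    return result


calls = []


def traced_add(x, y):
    """Addition that records when it actually runs."""
    calls.append("add")
    return x + y


# Step 1: build the task graph.  Nothing is computed yet.
graph = {
    "x": 2,
    "y": 3,
    "sum": (traced_add, "x", "y"),
    "scaled": (mul, "sum", 10),
}
calls_before = list(calls)

# Step 2: execute the task graph (the analogue of calling compute).
result = execute(graph, "scaled")
```

Building the graph records what should happen without running it, so `traced_add` is called only during execution. A real scheduler uses the same graph to decide which tasks can run in parallel.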


### Q3: Distributed CPU model
To scale out on CPUs, you follow the pattern outlined above: set up Dask, create the Dask task graph and execute the Dask task graph.

In [None]:
# Set up the distributed CPU Dask cluster.
client_dask = Client()  # This is a CPU client with the default cluster.

n_total_cpu_cores = sum(client_dask.ncores().values())

# Examine the Dask status.
print('Dask CPU cluster')
print('ncores per worker node: ' + str(client_dask.ncores()))
print('Total number of CPU cores: ' + str(n_total_cpu_cores))

# Build the Dask task graph.
# Instantiate, train and predict.
kmeans_dask = daskKMeans(init="k-means||",
                         n_clusters=n_clusters,
                         n_jobs=-1,  # -1 means use all CPUs
                         random_state=random_state)
kmeans_dask.fit(data_cpu)
kmeans_dask_df = kmeans_dask.predict(data_cpu)

# Execute the Dask task graph.
labels_dask = kmeans_dask_df.compute()

client_dask.close()  # Close the CPU Dask client.

# Display the output
print('Dask (CPU only) k-means labels:')
print(labels_dask)

### Q4: Distributed MNMG model
To scale out on GPUs, you follow the pattern outlined above, which is the same as for CPUs.  Note that we have to put the data into a Dask-cuDF object.

In [None]:
# Set up the MNMG Dask cluster.
cluster_mnmg = LocalCUDACluster()  # This is a GPU cluster.
client_mnmg = Client(cluster_mnmg)

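# By default, LocalCUDACluster starts one single-threaded worker per GPU,
# so this thread count equals the number of GPU workers.
n_total_gpu_cores = sum(client_mnmg.ncores().values())

print('Dask GPU cluster')
print('Threads per worker: ' + str(client_mnmg.ncores()))
print('Total number of GPU workers: ' + str(n_total_gpu_cores))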

# Convert data into Dask-cuDF format for distributed compute.
# To minimize memory consumption, ensure that each Dask worker has exactly one partition.
data_mnmg = dask_cudf.from_cudf(data_gpu, npartitions=n_total_gpu_cores)

# Build the Dask task graph.
# Instantiate, train and predict.
kmeans_mnmg = mnmgKMeans(init="k-means||",
                         n_clusters=n_clusters,
                         random_state=random_state)
kmeans_mnmg.fit(data_mnmg)
kmeans_mnmg_df = kmeans_mnmg.predict(data_mnmg)

# Execute the Dask task graph.
labels_mnmg = kmeans_mnmg_df.compute()

# Display the output.
print('MNMG k-means labels:')
print(labels_mnmg)

# NOTE: Keep the MNMG client open for analysis below.


## Analysis

While the MNMG client is still open, the Dask diagnostics dashboard can be used to inspect task progress and worker utilization.

### Visualization
Scikit-learn's k-means implementation uses the k-means++ initialization strategy while cuML's k-means uses k-means||. As a result, the centroids found by the two implementations may not match exactly, and the differences can grow as the standard deviation of the points around the centroids in make_blobs is increased.

Note: Visualizing the centroids will only work when n_features = 2.
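Besides plotting, centroids from two runs can be compared numerically. The catch is that each implementation may number its clusters differently, so the centroids have to be paired up first. The sketch below is a toy helper (the name `max_centroid_gap` and the sample coordinates are made up for illustration) that tries every pairing, which is practical only for a small number of clusters such as the six used here.

```python
import math
from itertools import permutations


def max_centroid_gap(centers_a, centers_b):
    """Smallest achievable worst-case distance when pairing the two centroid sets."""
    best = math.inf
    for order in permutations(range(len(centers_b))):
        # Largest distance between paired centroids for this pairing.
        worst = max(math.dist(a, centers_b[j]) for a, j in zip(centers_a, order))
        best = min(best, worst)
    return best


centers_run_1 = [(0.0, 0.0), (5.0, 5.0), (9.0, 1.0)]
centers_run_2 = [(9.0, 1.1), (0.1, 0.0), (5.0, 4.9)]  # Same clusters, different order.
gap = max_centroid_gap(centers_run_1, centers_run_2)
```

A small gap means the two runs found essentially the same centroids, even though the cluster numbering differs.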

In [None]:
fig = plt.figure(figsize=(16,10))
plt.scatter(data_cpu.iloc[:,0], data_cpu.iloc[:,1], c=labels_cpu, s=5, alpha=0.005, cmap='viridis')

# Plot the CPU centers (Q1).
centers_cpu = kmeans_cpu.cluster_centers_
plt.scatter(centers_cpu[:,0], centers_cpu[:,1], c='blue', s=100, alpha=0.5)

# Plot the GPU centers (Q2).
centers_gpu = kmeans_gpu.cluster_centers_
plt.scatter(cupy.asnumpy(centers_gpu[0].values), cupy.asnumpy(centers_gpu[1].values), facecolors='none', edgecolors='red', s=100)

# Plot the Dask centers (Q3).
centers_dask = kmeans_dask.cluster_centers_
plt.scatter(centers_dask[:,0], centers_dask[:,1], marker='s', facecolors='none', edgecolors='blue', s=250)

# Plot the MNMG centers (Q4).
centers_mnmg = kmeans_mnmg.cluster_centers_
plt.scatter(cupy.asnumpy(centers_mnmg[0].values), cupy.asnumpy(centers_mnmg[1].values), marker='s', facecolors='none', edgecolors='red', s=400)

plt.title('CPU, GPU and distributed k-means clustering')
plt.show()

### Results comparison

In [None]:
# Use the CPU labels as "truth".
# Local compute.
score_cpu = adjusted_rand_score(labels_cpu, kmeans_cpu.labels_)
score_gpu = adjusted_rand_score(labels_cpu, kmeans_gpu.labels_.to_numpy())

threshold = 1e-4
passed = abs(score_cpu - score_gpu) < threshold
print('Local CPU and GPU k-means results are ' + ('equal' if passed else 'NOT equal') + '.')

# Distributed compute.
score_dask = adjusted_rand_score(labels_cpu, kmeans_dask.labels_)
score_mnmg = adjusted_rand_score(labels_cpu, labels_mnmg.to_numpy())

passed = abs(score_dask - score_mnmg) < threshold
print('Distributed CPU and GPU k-means results are ' + ('equal' if passed else 'NOT equal') + '.')
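The comparison above relies on scikit-learn's `adjusted_rand_score` because cluster IDs are arbitrary: two implementations can find the same grouping yet number the clusters differently. The following plain-Python sketch of the adjusted Rand index illustrates that property. It is a simplified toy for small label lists, not scikit-learn's implementation, and the function name is hypothetical.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Toy adjusted Rand index for two equal-length label sequences."""
    n = len(labels_a)
    # Pairs of samples grouped together in both labelings.
    index = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs grouped together within each labeling on its own.
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)  # Agreement expected by chance.
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0  # Degenerate case, e.g. a single cluster in both labelings.
    return (index - expected) / (max_index - expected)


reference = [0, 0, 0, 1, 1, 1]
permuted = [1, 1, 1, 0, 0, 0]   # Same grouping, cluster IDs swapped.
different = [0, 0, 1, 1, 2, 2]  # A genuinely different grouping.

score_same = adjusted_rand_index(reference, permuted)
score_diff = adjusted_rand_index(reference, different)
```

Swapping the cluster IDs leaves the score at its maximum, while a different grouping lowers it, which is why the score is a reasonable basis for comparing the CPU and GPU results.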


In [None]:
# Close the MNMG cluster.
client_mnmg.close()

## Conclusion
RAPIDS makes it easy to position your code for hardware acceleration at any scale.  Because RAPIDS APIs largely match standard Python libraries, the step from local CPU to local GPU is minimal.

Scaling out from a single GPU to multi-node, multi-GPU (MNMG) deployments requires:
1. Selecting a Dask cluster.
2. Instantiating a Dask Client.
3. Building the Dask task graph.
4. Executing the Dask task graph.

Using RAPIDS from the beginning of your project could save you a lot of time and effort when it comes time to take your code to the next level.

Need to work with SQL?  Check out dask-sql: https://dask-sql.readthedocs.io/en/latest/