<font color="white">.</font> | <font color="white">.</font> | <font color="white">.</font>
-- | -- | --
![NASA](http://www.nasa.gov/sites/all/themes/custom/nasatwo/images/nasa-logo.svg) | <h1><font size="+3">ASTG Python Courses</font></h1> | ![NASA](https://www.nccs.nasa.gov/sites/default/files/NCCS_Logo_0.png)

---

<CENTER>
<H1 style="color:red">
RAPIDS cuPy, cuDF and cuML
</H1>
</CENTER>

---

## <font color='red'>Reference Documents</font>
- [Introduction to GPUs](https://nyu-cds.github.io/python-gpu/01-introduction/)
- [RAPIDS: GPU DATA SCIENCE](https://rapids.ai/)
- [RAPIDS Notebooks](https://github.com/rapidsai/notebooks)
- [CuPy: A NumPy-compatible array library accelerated by CUDA](https://cupy.dev/)
- [cuML - GPU Machine Learning Algorithms](https://github.com/rapidsai/cuml)
- [Introduction to RAPIDS and GPU Data Science: CUDF/Dask vs. Pandas](https://code-love.com/2020/12/06/rapids-introduction/)
- [Introduction to RAPIDS](https://github.com/rapidsai-community/notebooks-contrib/blob/branch-0.14/getting_started_notebooks/intro_tutorials/01_Introduction_to_RAPIDS.ipynb) by Paul Hendricks
- [GPU Hackathons](https://www.gpuhackathons.org/technical-resources)

---

## <font color='red'>What will be Covered?</font>

* Introduction to GPUs
* What is RAPIDS?
* CuPy
* CuDF
* CuML

---

# <font color="red">GPUs</font>


- Graphics Processing Units (GPUs) are custom designed to be very efficient at handling computer graphics and image processing.
- GPUs are designed to handle billions of repetitive low level tasks, like arithmetic operations.
- Central Processing Units (CPUs) handle computations serially, meaning the logic is handled in one stream: the next task starts only after the previous task has finished. CPUs can also execute tasks in parallel across cores, but most desktop CPUs have only a handful of them (for example two, four or six).
- In comparison, GPUs have hundreds of 'cores'. This massively parallel architecture is what gives the GPU its high compute performance.




![CPU_GPU](https://nyu-cds.github.io/python-gpu/fig/01-cpugpuarch.png)
Image Source: [nyu-cds.github.io](https://nyu-cds.github.io/python-gpu/fig/01-cpugpuarch.png)

- Compared to the CPU, the GPU is specialized for compute-intensive, highly parallel computation - exactly what graphics rendering is about - and therefore designed such that more transistors are devoted to data processing rather than data caching and flow control.
- The GPU is especially well-suited to address problems that can be expressed as data-parallel computations - the same program is executed on many data elements in parallel - with a high ratio of arithmetic operations to memory operations.
- Because the same program is executed for each data element, there is a lower requirement for sophisticated flow control, and because it is executed on many data elements and has high arithmetic intensity, the memory access latency can be hidden with calculations instead of big data caches.

### Programming with GPUs

- GPU programming is the use of a GPU together with a CPU to accelerate computation in applications traditionally handled only by the CPU.
- Parallel computing platforms allow us to write code to be executed on GPUs:
   - **Nvidia’s CUDA** (Compute Unified Device Architecture): can be called from C, C++, Fortran, or Python code without any graphics programming skills.
   - **OpenCL**: The most popular open, royalty-free standard for cross-platform, parallel programming. OpenCL defines a C-like language for writing programs, but third-party APIs exist for other programming languages and platforms such as Python or Java.
   - **OpenACC**: Designed for scientists and engineers interested in porting their codes to a wide variety of heterogeneous HPC hardware platforms and architectures. It is used by annotating C, C++, and Fortran source code to mark the regions that should be accelerated.
- A GPU program comprises two parts: 
   - a host part that runs on the CPU, and 
   - one or more kernels that run on the GPU. 
- Typically, the CPU portion of the program is used to set up the parameters and data for the computation, while the kernel portion performs the actual computation. 
- In some cases the CPU portion may comprise a parallel program that performs message passing operations using MPI.

![GPUs](http://www.nvidia.com/docs/IO/143716/how-gpu-acceleration-works.png)
Image Source: NVIDIA


**Useful Terminology**

| Term | Meaning |
| ---  | --- |
| `host` | the CPU |
| `device` | the GPU |
| `host memory` | the system main memory |
| `device memory` | onboard memory on a GPU card |
| `kernels` | a GPU function launched by the host and executed on the device |
| `device function` | a GPU function executed on the device which can only be called from the device  |
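
The terms above can be illustrated with a small CPU-only simulation of the host/kernel split. No GPU is involved and every name here (`scale`, `saxpy_kernel`, `launch`) is hypothetical; on a real device the loop inside `launch` would be replaced by many threads running the kernel at the same time.

```python
# CPU-only simulation of the host/kernel structure; no GPU is used here.
# All names are illustrative and do not belong to any GPU library.

def scale(a, value):
    """Plays the role of a device function: only kernel code calls it."""
    return a * value

def saxpy_kernel(thread_id, a, x, y, out):
    """Plays the role of a kernel: each thread handles one element."""
    out[thread_id] = scale(a, x[thread_id]) + y[thread_id]

def launch(kernel, n_threads, *args):
    """Stand-in for a kernel launch: run the kernel once per thread index."""
    for thread_id in range(n_threads):
        kernel(thread_id, *args)

# Host part: set up parameters and data, launch the kernel, use the result.
a = 2.0
x = [1.0, 2.0, 3.0, 4.0]
y = [10.0, 20.0, 30.0, 40.0]
out = [0.0] * len(x)
launch(saxpy_kernel, len(x), a, x, y, out)
print(out)
```

Because every thread index touches a different element, the iterations are independent, which is the property that lets a GPU run them in parallel.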

## Using a GPU

1. You must retarget code for the GPU
2. The working set must fit in GPU RAM
3. You must copy data to/from GPU RAM
4. Data accesses should be streaming, or use scratchpad as user-managed cache
5. Lots of parallelism preferred (throughput, not latency)
6. SIMD-style parallelism best suited
7. High arithmetic intensity (FLOPs/byte) preferred
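
To make the last item concrete, the sketch below estimates arithmetic intensity for two operations. It assumes float32 data and that every operand is read or written exactly once, which is the usual textbook approximation; the function names are illustrative only.

```python
# Rough arithmetic-intensity estimates (FLOPs per byte of memory traffic).
# Assumes float32 data and that each value is read or written exactly once.
BYTES_PER_FLOAT32 = 4

def vector_add_intensity(n):
    """c = a + b: n additions, reading 2n values and writing n values."""
    flops = n
    bytes_moved = 3 * n * BYTES_PER_FLOAT32
    return flops / bytes_moved

def matmul_intensity(n):
    """C = A @ B for n x n matrices: about 2*n**3 FLOPs over 3*n**2 values."""
    flops = 2 * n**3
    bytes_moved = 3 * n**2 * BYTES_PER_FLOAT32
    return flops / bytes_moved

for n in (1_000, 10_000):
    print(n, vector_add_intensity(n), matmul_intensity(n))
```

Under these assumptions vector addition stays at 1/12 FLOP per byte whatever the size, so it is limited by memory bandwidth, while the intensity of the matrix product grows with `n`. This is one reason dense linear algebra tends to map well to GPUs.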

# <font color="red"> What is RAPIDS?</font>

> RAPIDS utilizes NVIDIA CUDA® primitives for low-level compute optimization, and exposes GPU parallelism and high-bandwidth memory speed through user-friendly Python interfaces.

- A suite of open-source software libraries for executing end-to-end data science and analytics pipelines entirely on GPUs. 
- Is designed to look and feel like Python. All RAPIDS libraries are based on Python and are designed to have Pandas and Sklearn like interfaces to facilitate adoption.
- Accelerates data science pipelines to create more productive workflows.
- Works with different machine learning algorithms to provide a faster processing speed without serialization costs. 
- The main components are:
    - **cuDF**: used to perform data processing tasks (Pandas like).
    - **cuML**: used to create Machine Learning models (Sklearn like).
    - **cuGraph**: used to perform graphing tasks (Graph Theory).

![rapids](https://pbs.twimg.com/media/D2CeyaYVAAAe3kM.jpg)
Image Source: NVIDIA


![scientist](https://qph.fs.quoracdn.net/main-qimg-040086613ce3f56014c109a918ee8a4f)
Image Source: NVIDIA

## Accessing the GPU on Google Colab

In order to access GPUs for free:

1. Go to the `Runtime` menu,
2. Click on `Change runtime type`, and 
3. In the pop-up box, under `Hardware accelerator`, select `GPU` and click on `SAVE`.

## Environment Sanity Check ##

- <font color='red'>Click the _Runtime_ dropdown at the top of the page, then _Change Runtime Type_ and confirm the instance type is _GPU_.</font>
- Check the output of `!nvidia-smi` to make sure you've been allocated a Tesla T4.

In [None]:
!nvidia-smi

#### Verify that you were allocated the GPU compatible with RAPIDS

In [None]:
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
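gpu_name = pynvml.nvmlDeviceGetName(handle)
# Older pynvml releases return bytes, newer ones return str.
if isinstance(gpu_name, bytes):
    gpu_name = gpu_name.decode('UTF-8')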

if('K80' not in gpu_name):
   print('***********************************************************************')
   print('Woo! Your instance has the right kind of GPU, a '+ str(gpu_name)+'!')
   print('***********************************************************************')
   print()
else:
   raise Exception("""
                  Unfortunately Colab didn't give you a RAPIDS compatible GPU (P4, P100, T4, or V100), but gave you a """+ gpu_name +""".
                  Make sure you've configured Colab to request a GPU Instance Type.                
                  If you get an incompatible GPU (i.e., a K80), use 'Runtime -> Factory Reset Runtimes...' to try again"""
                  )

#### Install RAPIDS

- Install the most recent Miniconda release compatible with Google Colab's Python install (3.7.10)
- Remove incompatible files
- Install the current stable version of the RAPIDS libraries, including:
   - cuDF
   - cuML
   - cuGraph
   - cuSpatial
   - cuSignal
   - xgboost
- Set necessary environment variables
- Copy RAPIDS .so files into current working directory, a workaround for conda/colab interactions

In [None]:
!git clone https://github.com/rapidsai/rapidsai-csp-utils.git
!bash rapidsai-csp-utils/colab/rapids-colab.sh stable

import sys, os

dist_package_index = sys.path.index('/usr/local/lib/python3.7/dist-packages')
sys.path = sys.path[:dist_package_index] + ['/usr/local/lib/python3.7/site-packages'] + sys.path[dist_package_index:]
sys.path
exec(open('rapidsai-csp-utils/colab/update_modules.py').read(), globals())

---

In [None]:
import numpy as np
import pandas as pd
import datetime as dt

# <font color="red">CuPy</font>

- An open-source array library accelerated with NVIDIA CUDA. 
- Provides GPU accelerated computing with Python.
- Is an implementation of a NumPy-compatible multi-dimensional array on CUDA. In general, it can be used as a drop-in replacement: in many cases all you need to do is replace NumPy with CuPy in your Python code.
- Supports [methods](https://docs.cupy.dev/en/stable/reference/comparison.html) such as indexing, data types, broadcasting, array manipulation routines (`reshape`, `concatenate`, etc.), and array creation routines (`empty`, `ones`, `ones_like`, etc.).
- Includes the following features for performance:
     - User-defined elementwise CUDA kernels
     - User-defined reduction CUDA kernels
     - Fusing CUDA kernels to optimize user-defined calculation
     - Customizable memory allocator and memory pool
     - cuDNN utilities
- Uses on-the-fly kernel synthesis:
     - When a kernel call is required, it compiles a kernel code optimized for the shapes and dtypes of given arguments, sends it to the GPU device, and executes the kernel.
     - The compiled code is cached in the `$(HOME)/.cupy/kernel_cache` directory (this path can be overridden by setting the `CUPY_CACHE_DIR` environment variable).
     - Compilation may make the first kernel call slower, but the slowdown disappears from the second execution onward.
     - CuPy also caches the kernel code sent to GPU device within the process, which reduces the kernel transfer time on further calls.
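
The first-call overhead described above can be observed with a rough timing sketch. It assumes a working CuPy installation; when no usable GPU is found it falls back to NumPy so the code still runs, in which case the two timings are not expected to differ in any meaningful way. The helper `timed_sum_of_squares` is a name introduced here for illustration.

```python
import time

try:
    import cupy as xp
    xp.cuda.runtime.getDeviceCount()  # raises if no usable GPU is present
    on_gpu = True
except Exception:
    import numpy as xp  # CPU fallback so the sketch still runs
    on_gpu = False

def timed_sum_of_squares(a):
    """Return (result, elapsed seconds) for one sum-of-squares evaluation."""
    start = time.perf_counter()
    result = float((a * a).sum())  # float() waits for the computation to finish
    return result, time.perf_counter() - start

a = xp.arange(1_000_000, dtype=xp.float64)
first_result, first_time = timed_sum_of_squares(a)    # may include compilation
second_result, second_time = timed_sum_of_squares(a)  # reuses cached kernels
print(f"on_gpu={on_gpu} first={first_time:.4f}s second={second_time:.4f}s")
```

Converting the result with `float()` copies it back to the host, which forces the GPU work to complete before the timer is read.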

In [None]:
import cupy as cp

## `cupy.ndarray`

- The `cupy.ndarray` class is at the core of CuPy; it is a GPU-compatible alternative to `numpy.ndarray`.
- The main difference between `cupy.ndarray` and `numpy.ndarray` is that the content is allocated in device memory.
- The `cupy.ndarray` data is allocated on the current device.

In [None]:
x_gpu = cp.array([1, 2, 3, 4, 5, 6])

Most array manipulations are done in the same way as in NumPy. For example, the L2 norm of an array can be computed on the CPU with NumPy:

In [None]:
x_cpu = np.array([1, 2, 3, 4, 5, 6])
L2_cpu = np.linalg.norm(x_cpu)

We can calculate it on GPU with CuPy in a similar way:

In [None]:
x_gpu = cp.array([1, 2, 3, 4, 5, 6])
L2_gpu = cp.linalg.norm(x_gpu)

- CuPy implements many functions on `cupy.ndarray` objects.
- Familiarity with NumPy helps in using most features of CuPy.

## Current Device

- CuPy has a concept of the current device, which is the default device on which the allocation, manipulation, calculation, etc. of arrays take place.

Suppose the ID of current device is 0. The following code allocates array contents on GPU 0.

In [None]:
x_on_gpu0 = cp.array([1, 2, 3, 4, 5, 6])

The current device can be changed by `cupy.cuda.Device.use()` as follows:

In [None]:
x_on_gpu0 = cp.array([1, 2, 3, 4, 5, 6])
cp.cuda.Device(1).use()
x_on_gpu1 = cp.array([1, 2, 3, 4, 5, 6])

If you only need to switch the current GPU temporarily, the `with` statement comes in handy.

In [None]:
with cp.cuda.Device(1):
     x_on_gpu1 = cp.array([1, 2, 3, 4, 5, 6])
x_on_gpu0 = cp.array([1, 2, 3, 4, 5, 6])

Most CuPy operations are performed on the current device. Be careful: processing an array that lives on a non-current device can cause an error:

In [None]:
with cp.cuda.Device(0):
     x_on_gpu0 = cp.array([1, 2, 3, 4, 5, 6])
with cp.cuda.Device(1):
     x_on_gpu0 * 2 # raises error

The `cupy.ndarray.device` attribute indicates the device on which the array is allocated.

In [None]:
with cp.cuda.Device(1):
     x = cp.array([1, 2, 3, 4, 5, 6])

x.device

If the environment has only one device, such explicit device switching is not needed.
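
Since the cells above refer to device 1, it can help to check how many devices are visible before switching. The following is a minimal sketch that treats a missing CuPy installation or a missing GPU as zero devices; `n_devices` is a name chosen here for illustration.

```python
# Count the visible CUDA devices, treating "no CuPy" or "no GPU" as zero.
try:
    import cupy as cp
    n_devices = cp.cuda.runtime.getDeviceCount()
except Exception:
    n_devices = 0

if n_devices > 1:
    print(f"{n_devices} devices found; switching to device 1 is possible")
else:
    print(f"{n_devices} device(s) found; stay on the default device")
```

On a single-GPU Colab instance this is expected to report one device, so the examples that use `cp.cuda.Device(1)` cannot run there.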

### Data Transfer

#### Move Data to a Device
`cupy.asarray()` can be used to move a `numpy.ndarray`, a list, or any object that can be passed to `numpy.array()` to the current device:

In [None]:
x_cpu = np.array([1, 2, 3, 4, 5, 6])
x_gpu = cp.asarray(x_cpu) # move the data to the current device.

`cupy.asarray()` can accept `cupy.ndarray`, which means we can transfer the array between devices with this function.

In [None]:
with cp.cuda.Device(0):
     x_gpu_0 = cp.array([1, 2, 3]) # create an array in GPU 0

with cp.cuda.Device(1):
     x_gpu_1 = cp.asarray(x_gpu_0) # move the array to GPU 1

- `cupy.asarray()` does not copy the input array if possible. So, if you pass an array that is already on the current device, it returns the input object itself.
- If you do want a copy in this situation, use `cupy.array()` with `copy=True`.
- In fact, `cupy.asarray()` is equivalent to `cupy.array(arr, dtype, copy=False)`.
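
The difference between the two calls can be checked directly. This sketch uses CuPy when a GPU is available and otherwise falls back to NumPy, which follows the same copy semantics for `asarray` and `array(..., copy=True)`.

```python
try:
    import cupy as xp
    xp.cuda.runtime.getDeviceCount()  # raises if no usable GPU is present
except Exception:
    import numpy as xp  # NumPy behaves the same way for this comparison

a = xp.array([1, 2, 3])
same = xp.asarray(a)             # already a suitable array: no copy is made
copied = xp.array(a, copy=True)  # always allocates a new array

copied[0] = 99  # modifying the copy leaves `a` untouched
same[1] = 42    # `same` is `a`, so this modifies `a` as well
print(same is a, copied is a, a)
```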

#### Move array from a device to the host

Moving a device array to the host can be done by `cupy.asnumpy()` as follows:

In [None]:
x_gpu = cp.array([1, 2, 3, 4, 5, 6]) # create an array in the current device
x_cpu = cp.asnumpy(x_gpu) # move the array to the host.

We can also use `cupy.ndarray.get()`:

In [None]:
x_cpu = x_gpu.get()

### How to write CPU/GPU agnostic code

- The compatibility of CuPy with NumPy enables us to write CPU/GPU generic code. 
- It can be made easy by the `cupy.get_array_module()` function. 
- This function returns the numpy or cupy module based on arguments.

A CPU/GPU generic function can be defined with it as follows:

In [None]:
# Stable implementation of log(1 + exp(x))
def softplus(x):
    xp = cp.get_array_module(x)
    return xp.maximum(0, x) + xp.log1p(xp.exp(-abs(x)))
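
When CuPy itself may be missing, for instance on a machine without a GPU, the same pattern can be kept by providing a fallback. The sketch below is one possible approach, not a CuPy feature: the replacement `get_array_module` is a hypothetical helper that always returns NumPy.

```python
import numpy as np

try:
    import cupy as cp
    get_array_module = cp.get_array_module
except ImportError:
    def get_array_module(*args):
        """Fallback used when CuPy is not installed: always NumPy."""
        return np

# Stable implementation of log(1 + exp(x)) for NumPy or CuPy input
def softplus(x):
    xp = get_array_module(x)
    return xp.maximum(0, x) + xp.log1p(xp.exp(-abs(x)))

x_cpu = np.array([-1000.0, 0.0, 1000.0])
y_cpu = softplus(x_cpu)  # NumPy in, NumPy out
print(type(y_cpu).__module__, y_cpu)
```

The extreme inputs show why this form is called stable: `exp` is only ever evaluated on non-positive values, so it cannot overflow.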

- Sometimes, an explicit conversion to a host or device array may be required. 
- `cupy.asarray()` and `cupy.asnumpy()` can be used in agnostic implementations to get host or device arrays from either CuPy or NumPy arrays.

In [None]:
y_cpu = np.array([7, 8, 9, 10, 11, 12])
x_cpu + y_cpu

In [None]:
x_gpu + y_cpu # raises TypeError: CuPy does not implicitly convert NumPy arrays

In [None]:
cp.asnumpy(x_gpu) + y_cpu

In [None]:
cp.asnumpy(x_gpu) + cp.asnumpy(y_cpu)

In [None]:
x_gpu + cp.asarray(y_cpu)

In [None]:
cp.asarray(x_gpu) + cp.asarray(y_cpu)

---

#  [cuDF](https://docs.rapids.ai/api/cudf/stable/10min.html)

- CuDF is a library that aims to bring Pandas functionality to GPU.

In [None]:
import cudf

Use `cudf` to create a dataframe and perform operations:

In [None]:
num_rows = 100000
df = cudf.DataFrame({'X':np.random.randint(1000, size=num_rows),
                     'Y':np.random.randint(1000, size=num_rows)})
df

In [None]:
def add_squares(df):
    return df.X**2 + df.Y**2

In [None]:
%%time

df['add_squares'] = add_squares(df)

In [None]:
df

**Time Series Data**

A cuDF `DataFrame` supports `datetime` typed columns, which allow users to interact with and filter data based on specific timestamps.

In [None]:
date_df = cudf.DataFrame()
date_df['date'] = pd.date_range('01/05/1980', periods=15000, freq='D')
date_df['value'] = cp.random.sample(len(date_df))
date_df

In [None]:
search_date1 = dt.datetime.strptime('2001-09-11', '%Y-%m-%d')
search_date2 = dt.datetime.strptime('2019-11-23', '%Y-%m-%d')

In [None]:
%%time

date_df.query('date >= @search_date1 and date <= @search_date2')

In [None]:
import dask_cudf
date_ddf = dask_cudf.from_cudf(date_df, npartitions=4)

In [None]:
date_ddf.head()

In [None]:
%%time

date_ddf.query('date >= @search_date1 and date <= @search_date2', 
               local_dict={'search_date1':search_date1, 
                           'search_date2':search_date2}).compute()

### Speed for Reading Datasets

In [None]:
import pandas as pd
url = 'https://data.ny.gov/api/views/xe9x-a24f/rows.csv'

In [None]:
%%time

df = pd.read_csv(url)

In [None]:
df

In [None]:
%%time

gdf = cudf.read_csv(url)

In [None]:
gdf

### Reading AERONET Observations at Goddard

In [None]:
url = "https://raw.githubusercontent.com/astg606/py_materials/master/aeronet/"
filename = url+"19930101_20210102_GSFC.lev20"

In [None]:
dateparse = lambda x: dt.datetime.strptime(x, '%d:%m:%Y %H:%M:%S')
df = pd.read_csv(filename, skiprows=6, na_values=-999,
                   parse_dates={'datetime': [0, 1]}, 
                   date_parser=dateparse, index_col=0, 
                   squeeze=True)
df

In [None]:
gdf = cudf.DataFrame(df)

**Renaming Columns**

In [None]:
old_cols = ['Day_of_Year', 'AOD_675nm', 'AOD_440nm', 
            '440-675_Angstrom_Exponent']

new_cols = ['DoY', 'A675', 'A440', 'Alpha']

In [None]:
ngdf = gdf[old_cols]
ngdf.columns = new_cols

In [None]:
ngdf

**Basic Operations**

In [None]:
ngdf['A550'] = ngdf['A675']*(675.0/550.0)**ngdf['Alpha']
ngdf
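
The cell above appears to apply the Ångström power law, in which AOD scales with wavelength as `AOD(λ) = AOD(λ0) * (λ0 / λ)**alpha`, to estimate the AOD at 550 nm from the 675 nm value. The same relation in plain Python, with a hypothetical helper name and made-up input values, looks like this:

```python
def aod_at_wavelength(aod_ref, alpha, ref_nm, target_nm):
    """Scale AOD from ref_nm to target_nm using the Angstrom exponent alpha."""
    return aod_ref * (ref_nm / target_nm) ** alpha

# Illustrative values, not taken from the AERONET file.
a675 = 0.10
alpha = 1.5
a550 = aod_at_wavelength(a675, alpha, ref_nm=675.0, target_nm=550.0)
print(round(a550, 4))
```

With a positive exponent the AOD at the shorter wavelength comes out larger than the reference value, and with `alpha = 0` it is unchanged.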

# <font color='red'>CuML</font>

- The mathematical operations underlying many machine learning algorithms are often matrix multiplications. 
- These types of operations are highly parallelizable and can be greatly accelerated using a GPU. 
- cuML makes it easy to build machine learning models in an accelerated fashion while still using an interface nearly identical to Scikit-Learn. 
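
As an illustration of the first point, an ordinary least-squares fit can be written with two matrix products and a small linear solve (the normal equations). This is only a sketch of the underlying arithmetic on made-up data, not how cuML implements its solvers; it runs on the GPU through CuPy when one is available and falls back to NumPy otherwise.

```python
try:
    import cupy as xp
    xp.cuda.runtime.getDeviceCount()  # raises if no usable GPU is present
except Exception:
    import numpy as xp  # CPU fallback so the sketch still runs

# Noise-free data generated from y = 2*x0 - 3*x1 + 1 (illustrative values)
X = xp.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y = 2.0 * X[:, 0] - 3.0 * X[:, 1] + 1.0

# Append a column of ones so the intercept is learned as a third weight.
A = xp.concatenate([X, xp.ones((X.shape[0], 1))], axis=1)

# Normal equations: (A^T A) w = A^T y
weights = xp.linalg.solve(A.T @ A, A.T @ y)
print(weights)  # two slopes followed by the intercept
```

The products `A.T @ A` and `A.T @ y` dominate the cost for large datasets, and they are exactly the kind of operation a GPU parallelizes well.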

In [None]:
%matplotlib inline
import numpy as np
import matplotlib
from matplotlib.colors import ListedColormap
import matplotlib.pyplot as plt

In [None]:
import cuml
from cuml import make_regression, train_test_split
from cuml.linear_model import LinearRegression as LinearRegression_GPU
from cuml.metrics.regression import r2_score
from sklearn.linear_model import LinearRegression as skLinearRegression

### First Example

#### Using Scikit-Learn

In [None]:
# create the relationship: y = 2.0 * x + 1.0
n_rows = 40000  # let's use 40 thousand data points
w = 2.0
x = np.random.normal(loc=0, scale=1, size=(n_rows,))
b = 1.0
y = w * x + b

# add a bit of noise
noise = np.random.normal(loc=0, scale=2, size=(n_rows,))
y_noisy = y + noise

In [None]:
plt.scatter(x, y_noisy, label='empirical data points')
plt.plot(x, y, color='black', label='true relationship')
plt.legend()

Instantiate and fit model:

In [None]:
linear_regression = skLinearRegression()

In [None]:
%%time

linear_regression.fit(np.expand_dims(x, 1), y_noisy)  # fit on the noisy observations, as in the cuML example

Create new data and perform inference:

In [None]:
inputs = np.linspace(start=-5, stop=5, num=1000)
outputs = linear_regression.predict(np.expand_dims(inputs, 1))

In [None]:
plt.scatter(x, y_noisy, label='empirical data points')
plt.plot(x, y, color='black', label='true relationship')
plt.plot(inputs, outputs, color='red', 
         label='predicted relationship (cpu)')
plt.legend()

#### Using CuML 

Create a cuDF DataFrame:

In [None]:
df = cudf.DataFrame({'x': x, 'y': y_noisy})
print(df.head())

Load the GPU accelerated LinearRegression class from cuML, instantiate it, and fit it to our data:

In [None]:
linear_regression_gpu = LinearRegression_GPU()

In [None]:
%%time

linear_regression_gpu.fit(df[['x']], df['y'])

Create new data and perform inference:

In [None]:
new_data_df = cudf.DataFrame({'inputs': inputs})
outputs_gpu = linear_regression_gpu.predict(new_data_df[['inputs']])

In [None]:
plt.scatter(x, y_noisy, label='empirical data points')
plt.plot(x, y, color='black', label='true relationship')
plt.plot(inputs, outputs, color='red', 
         label='predicted relationship (cpu)')
plt.plot(inputs, outputs_gpu.to_array(), color='green', 
         label='predicted relationship (gpu)')
plt.legend()

### Second Example

Define parameters:
+ If you are running on a GPU with less than 16 GB of memory, change `n_samples` to `2**19`, or you could run out of memory.

In [None]:
n_samples = 2**20
n_features = 399
n_info = 70
random_state = 23
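
A back-of-the-envelope estimate shows how these parameters translate into memory. The sketch assumes 4-byte (float32) values, which is an assumption about the generated data rather than something stated above; with 8-byte values the figures double. The helper name is illustrative.

```python
# Rough size of the feature matrix X for the parameters above.
# Assumes 4-byte (float32) values; 8-byte values would double the figures.
BYTES_PER_VALUE = 4
n_features = 399

def feature_matrix_gib(n_samples, n_features, bytes_per_value=BYTES_PER_VALUE):
    """Return the size of an n_samples x n_features array in GiB."""
    return n_samples * n_features * bytes_per_value / 2**30

for exponent in (19, 20):
    size = feature_matrix_gib(2**exponent, n_features)
    print(f"n_samples = 2**{exponent}: about {size:.2f} GiB")
```

A single copy of the matrix is modest, but the workflow keeps several copies alive at once (the original data, the train/test split, and whatever workspace the solver needs), which is likely why halving `n_samples` is suggested for smaller GPUs.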

Generate data:
+ Use the `make_regression` function (which generates a random regression problem) to create the dataset:
   + `n_samples`: The number of samples (default=100)
   + `n_features`: The number of features (default=2).
   + `n_info`: The number of informative features, i.e., the number of features used to build the linear model used to generate the output (default=2).
   + `random_state`: Seed for the random number generator for dataset creation.
+ The function `make_regression` returns:
   + The input samples: data array of shape `[n_samples, n_features]`
   + The output values: data array of shape `[n_samples, 1]`
+ Split the data into a training set and a test set

In [None]:
%%time
X, y = make_regression(n_samples=n_samples, 
                       n_features=n_features, 
                       n_informative = n_info,
                       random_state=random_state)

X = cudf.DataFrame(X)
y = cudf.DataFrame(y)[0]

X_cudf, X_cudf_test, y_cudf, y_cudf_test = train_test_split(X, y, 
                                                            test_size = 0.2, 
                                                            random_state=random_state)

Copy dataset from GPU memory to host memory. This is done to later compare CPU and GPU results.

In [None]:
X_train = X_cudf.to_pandas()
X_test = X_cudf_test.to_pandas()
y_train = y_cudf.to_pandas()
y_test = y_cudf_test.to_pandas()

#### Using Scikit-Learn

In [None]:
%%time
ols_sk = skLinearRegression(fit_intercept=True,
                            normalize=True,
                            n_jobs=-1)

ols_sk.fit(X_train, y_train)

In [None]:
%%time
predict_sk = ols_sk.predict(X_test)

In [None]:
%%time
r2_score_sk = r2_score(y_cudf_test, predict_sk)

#### Using CuML

In [None]:
%%time
ols_cuml = LinearRegression_GPU(fit_intercept=True,
                                normalize=True,
                                algorithm='eig')

ols_cuml.fit(X_cudf, y_cudf)

In [None]:
%%time
predict_cuml = ols_cuml.predict(X_cudf_test)

In [None]:
%%time
r2_score_cuml = r2_score(y_cudf_test, predict_cuml)

In [None]:
print(f"R^2 score (SKL):  {r2_score_sk}")
print(f"R^2 score (cuML): {r2_score_cuml}")
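
For reference, the R² score compared above is the coefficient of determination, `1 - SS_res / SS_tot`. A plain-Python sketch of that definition, with a hypothetical function name and toy data, is:

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0]
perfect = r2(y_true, y_true)                # predictions match exactly
baseline = r2(y_true, [2.5, 2.5, 2.5, 2.5])  # always predict the mean
print(perfect, baseline)
```

A perfect prediction scores 1 and always predicting the mean scores 0, so two models with nearly identical scores close to 1 fit the test data about equally well.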