# NERSC RAPIDS Workshop 2020 
## RAPIDS: Open GPU Data Science

This tutorial will cover the basics of Python data science on a GPU. It starts with an overview of GPU DataFrames. Next, it provides an introduction to GPU-based machine learning capabilities. It concludes with an introduction to how Dask scales workflows across multiple GPUs.


```
Nick Becker, NVIDIA
Randy Gelhausen, NVIDIA
Dante Gama Dessavre, NVIDIA
```

<a id="introduction"></a>
## Introduction to RAPIDS
#### By Paul Hendricks
-------

While the world’s data doubles each year, CPU computing has hit a brick wall with the end of Moore’s law. For the same reasons that scientific computing and deep learning have turned to NVIDIA GPU acceleration, data analytics and machine learning are workloads where GPU acceleration is ideal.

NVIDIA created RAPIDS, an open-source data analytics and machine learning acceleration platform that leverages GPUs to accelerate computations. RAPIDS is based on Python, has Pandas-like and Scikit-Learn-like interfaces, is built on the Apache Arrow in-memory data format, and can scale from a single GPU to multiple GPUs to multiple nodes. RAPIDS integrates easily into the world’s most popular Python-based data science workflows and accelerates data science end to end: from data prep, to machine learning, to deep learning. Through Arrow, Spark users can easily move data into the RAPIDS platform for acceleration.

In these notebooks, we will discuss and show, at a high level, what several packages in the RAPIDS ecosystem are and what they do. Subsequent notebooks will dive deeper into the various areas of data science and machine learning and show how you can use RAPIDS to accelerate your workflow in each of these areas.
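To make the "Pandas-like" description above a little more concrete, here is a minimal sketch of a group-by that runs unchanged on either cuDF or pandas. It is an illustration only: the helper names `load_dataframe_library` and `total_by_key` and the toy data are hypothetical, and the fallback to pandas is an assumption made so the cell still runs on a machine without a GPU.

```python
import importlib


def load_dataframe_library():
    """Return cudf if it can be imported, otherwise pandas, otherwise None."""
    for name in ("cudf", "pandas"):
        try:
            return importlib.import_module(name)
        except ImportError:
            continue
    return None


def total_by_key(xdf):
    """Sum `value` per `key` using whichever DataFrame library was loaded."""
    # Toy data, made up for this sketch.
    df = xdf.DataFrame({"key": ["a", "b", "a", "b"], "value": [1, 2, 3, 4]})
    totals = df.groupby("key")["value"].sum()
    # cuDF results live on the GPU; copy them to the host before converting.
    if hasattr(totals, "to_pandas"):
        totals = totals.to_pandas()
    return {str(k): int(v) for k, v in totals.to_dict().items()}


xdf = load_dataframe_library()
if xdf is None:
    print("Neither cudf nor pandas is installed.")
else:
    print(xdf.__name__, total_by_key(xdf))
```

The DataFrame construction and the `groupby(...).sum()` call are written the same way for both libraries, and only the final copy back to the host is cuDF-specific.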

<a id="setup"></a>
## Setup

Before we begin, let's check out our hardware setup by running the `nvidia-smi` command.

```
!nvidia-smi
```

```
Tue Apr  7 19:05:12 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  Tesla P100-SXM2...  On   | 00000000:06:00.0 Off |                    0 |
| N/A   32C    P0    33W / 300W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P100-SXM2...  On   | 00000000:07:00.0 Off |                    0 |
| N/A   33C    P0    34W / 300W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla P100-SXM2...  On   | 00000000:0A:00.0 Off |                    0 |
| N/A
```

You should see output that starts with the date, followed by something like:

```
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-SXM2...  On   | 00000000:06:00.0 Off |                    0 |
| N/A   32C    P0    33W / 300W |      0MiB / 16280MiB |      0%      Default |
```

If the cell above does not produce this output, please confirm that you have followed the setup instructions correctly. If you still have issues after a second attempt, please email the NERSC team for help.
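If you would rather run the same check from Python, the sketch below calls `nvidia-smi` with its CSV query options and parses the result. The helper names `parse_gpu_csv` and `list_gpus` are hypothetical, and the code assumes GPU names contain no commas. It returns an empty list instead of raising when `nvidia-smi` is missing or fails.

```python
import shutil
import subprocess


def parse_gpu_csv(text):
    """Parse 'index, name, memory.total' CSV rows into dictionaries."""
    gpus = []
    for line in text.strip().splitlines():
        if not line.strip():
            continue
        # Assumes the GPU name itself contains no commas.
        index, name, memory = [field.strip() for field in line.split(",", 2)]
        gpus.append({"index": int(index), "name": name, "memory": memory})
    return gpus


def list_gpus():
    """Return the visible GPUs, or an empty list if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,name,memory.total", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return []
    return parse_gpu_csv(result.stdout)


gpus = list_gpus()
if not gpus:
    print("No GPUs found; check your setup before continuing.")
for gpu in gpus:
    print(gpu)
```

An empty result here points to the same setup problem as a failing `!nvidia-smi` cell.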

## Tutorial Guide

This tutorial is designed to be run in a specific order. The notebooks are broken up into three segments, each in its own directory. First, proceed to the `cudf/` directory and work through both notebooks. Note that the filenames are prepended with `01-` or `02-`. These indicate the order in which they should be run (first `01`, then `02`).

After completing the `cuDF` notebooks, please proceed to the `cuml/` directory and work through its three sets of exercises.

Finally, proceed to the `dask/` directory, and work through the notebook.
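The order described above can be listed programmatically. The following is a minimal sketch, assuming the three directories sit under the current working directory and that sorting filenames alphabetically matches the numeric prefixes. The function name `notebooks_in_order` is hypothetical.

```python
from pathlib import Path

# Directory order taken from the guide above: cuDF, then cuML, then Dask.
SEGMENTS = ["cudf", "cuml", "dask"]


def notebooks_in_order(root):
    """Return notebook paths in tutorial order: by segment, then by filename."""
    ordered = []
    for segment in SEGMENTS:
        # Alphabetical sorting places `01-` before `02-`.
        ordered.extend(sorted(Path(root, segment).glob("*.ipynb")))
    return ordered


for notebook in notebooks_in_order("."):
    print(notebook)
```

Directories that are missing simply contribute no entries, so the loop prints nothing when run outside the tutorial folder.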

After completing these six notebooks, you should have a solid grasp of the RAPIDS suite of open source GPU data science and analytics libraries.