## References

- https://rapids.ai/start.html
- https://github.com/rapidsai/cudf
- https://hub.docker.com/r/rapidsai/rapidsai/
- https://docs.rapids.ai/api/cudf/stable/10min.html
- https://github.com/beckernick/nersc-rapids-workshop
- https://towardsdatascience.com/heres-how-you-can-speedup-pandas-with-cudf-and-gpus-9ddc1716d5f2

### Videos

- ["cuDF: RAPIDS GPU-Accelerated Dataframe Library" - Mark Harris (PyCon AU 2019)](https://www.youtube.com/watch?reload=9&v=lV7rtDW94do)
- [Introduction to cuDF - NERSC NVIDIA RAPIDS Workshop on April 14, 2020](https://www.youtube.com/watch?v=pXnEniQRAdQ)

Notes:

- cuDF requires an NVIDIA GPU.

Prerequisites:

- NVIDIA Pascal™ GPU architecture or better
- CUDA 10.1/10.2/11.0 with a compatible NVIDIA driver
- Ubuntu 16.04/18.04 or CentOS 7
- Docker CE v18+
- nvidia-docker v2+
    
Installation:

```
conda install -c nvidia -c rapidsai -c numba -c conda-forge -c defaults cudf
```
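After installation, a quick check can confirm that the package is visible from the active conda environment. The sketch below uses only the standard library; the helper name `cudf_available` is made up for illustration. Finding the package does not by itself prove that a compatible GPU and driver are present.

```python
import importlib.util


def cudf_available():
    """Return True if a package named `cudf` can be found in this environment."""
    return importlib.util.find_spec("cudf") is not None


if cudf_available():
    # The import itself may still fail without a usable GPU or driver.
    import cudf

    print("cudf version:", cudf.__version__)
else:
    print("cudf is not installed in this environment")
```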

## Misc

Check which GPU is installed:

```
sudo lshw -numeric -C display
```

```
sudo lspci -v | less
```

Source:
https://www.howtogeek.com/508993/how-to-check-which-gpu-is-installed-on-linux/
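The same check can be scripted from Python. This is a minimal sketch that assumes the NVIDIA driver is installed and that its `nvidia-smi` tool is on the `PATH` (neither is stated above); the function name `query_nvidia_gpus` is hypothetical. It returns an empty list when the tool is missing or fails.

```python
import shutil
import subprocess


def query_nvidia_gpus():
    """Return one text line per GPU from nvidia-smi, or [] if it is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []  # driver tools are not on PATH
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=name,memory.total,driver_version",
            "--format=csv,noheader",
        ],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return []  # tool present but no usable GPU or driver
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


for gpu in query_nvidia_gpus():
    print(gpu)
```

The reported total memory is worth noting, since it bounds how large a DataFrame can be held on the device.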

Check the CUDA version:

- https://varhowto.com/check-cuda-version/
- https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html

CUDA download guide:

- https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&target_distro=Ubuntu&target_version=1804&target_type=deblocal

Troubleshooting:

- https://stackoverflow.com/a/64593288/2670476

## What is cuDF?

cuDF is a Python GPU DataFrame library (built on the Apache Arrow columnar memory format) for loading, joining, aggregating, filtering, and otherwise manipulating tabular data using a DataFrame-style API.

In [1]:
input_file = 'data/data_1gb.csv'

In [2]:
import pandas as pd
import numpy as np
import cudf

pandas_df = pd.DataFrame({'a': np.random.randint(0, 100000000, size=100000000),
                          'b': np.random.randint(0, 100000000, size=100000000)})
                          
cudf_df = cudf.DataFrame.from_pandas(pandas_df)

In [3]:
pandas_df.head()

|   | a | b |
|---|---|---|
| 0 | 44735207 | 10848825 |
| 1 | 49546445 | 31889142 |
| 2 | 49814898 | 17431315 |
| 3 | 50733617 | 37062480 |
| 4 | 93751776 | 87856412 |


In [4]:
cudf_df.head()

|   | a | b |
|---|---|---|
| 0 | 44735207 | 10848825 |
| 1 | 49546445 | 31889142 |
| 2 | 49814898 | 17431315 |
| 3 | 50733617 | 37062480 |
| 4 | 93751776 | 87856412 |


In [5]:
%%time
pandas_df.a.mean()

CPU times: user 74.8 ms, sys: 0 ns, total: 74.8 ms
Wall time: 73.9 ms


50001905.82887279

In [6]:
%%time
cudf_df.a.mean()

CPU times: user 10.4 ms, sys: 353 µs, total: 10.7 ms
Wall time: 10 ms


50001905.82887279

In [7]:
%%time
pandas_df.merge(pandas_df, on='b')

CPU times: user 41.7 s, sys: 3.52 s, total: 45.2 s
Wall time: 45.2 s


|   | a_x | b | a_y |
|---|---|---|---|
| 0 | 44735207 | 10848825 | 44735207 |
| 1 | 49546445 | 31889142 | 49546445 |
| 2 | 49546445 | 31889142 | 80781139 |
| 3 | 80781139 | 31889142 | 49546445 |
| 4 | 80781139 | 31889142 | 80781139 |
| ... | ... | ... | ... |
| 200003245 | 2610370 | 96587349 | 2610370 |
| 200003246 | 8941536 | 88230194 | 8941536 |
| 200003247 | 10655846 | 11617539 | 10655846 |
| 200003248 | 94339097 | 52507869 | 94339097 |


In [8]:
%%time
cudf_df.merge(cudf_df, on='b')

MemoryError: std::bad_alloc: CUDA error at: /home/developer/miniconda3/envs/rapids-0.17/include/rmm/mr/device/cuda_memory_resource.hpp:69: cudaErrorMemoryAllocation out of memory
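A rough size estimate helps explain why this merge can exhaust device memory. The sketch below is back-of-the-envelope arithmetic only: it assumes 8-byte integers (the `np.random.randint` default on 64-bit Linux) and takes the output row count from the pandas result above (last index 200003248). The helper `frame_bytes` is hypothetical, and index storage, join intermediates, and allocator overhead are ignored, so the real peak is higher.

```python
BYTES_PER_INT64 = 8  # assumption: both columns are stored as int64


def frame_bytes(rows, columns, bytes_per_value=BYTES_PER_INT64):
    """Raw size of a dense numeric table, ignoring index and allocator overhead."""
    return rows * columns * bytes_per_value


input_bytes = frame_bytes(rows=100_000_000, columns=2)   # columns a, b
output_bytes = frame_bytes(rows=200_003_249, columns=3)  # columns a_x, b, a_y

print(f"input frame:  {input_bytes / 1e9:.2f} GB")
print(f"merge result: {output_bytes / 1e9:.2f} GB")
print(f"lower bound while merging: {(input_bytes + output_bytes) / 1e9:.2f} GB")
```

Comparing the last figure with the GPU's total memory gives a first indication of whether the operation can fit on a single device.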

In [9]:
import cudf
gdf = cudf.read_csv(input_file)
for column in gdf.columns:
    print(gdf[column].mean())

# gdf

MemoryError: std::bad_alloc: CUDA error at: /home/developer/miniconda3/envs/rapids-0.17/include/rmm/mr/device/cuda_memory_resource.hpp:69: cudaErrorMemoryAllocation out of memory
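One possible way around this error on a single GPU is to read the file in byte ranges and combine per-chunk sums, so that only one chunk is resident at a time. The sketch below relies on the `byte_range`, `names`, and `header` parameters of `cudf.read_csv`; the helpers `byte_ranges` and `chunked_mean` and the 256 MiB chunk size are assumptions for illustration. It has not been run against the RAPIDS 0.17 environment used here, and the column must be numeric.

```python
import os


def byte_ranges(file_size, chunk_bytes):
    """Yield (offset, size) pairs that cover file_size bytes with no gaps or overlap."""
    offset = 0
    while offset < file_size:
        size = min(chunk_bytes, file_size - offset)
        yield offset, size
        offset += size


def chunked_mean(path, column, names, chunk_bytes=256 * 1024 * 1024):
    """Mean of one numeric column, reading the CSV one byte range at a time."""
    import cudf  # imported lazily so byte_ranges() works without RAPIDS

    total, count = 0.0, 0
    for offset, size in byte_ranges(os.path.getsize(path), chunk_bytes):
        if offset == 0:
            # The first range contains the header row.
            chunk = cudf.read_csv(path, byte_range=(offset, size))
        else:
            # Later ranges have no header, so column names are passed explicitly.
            chunk = cudf.read_csv(
                path, byte_range=(offset, size), names=names, header=None
            )
        total += float(chunk[column].sum())
        count += int(chunk[column].count())  # non-null values only
    return total / count if count else float("nan")


# Pure-Python illustration of the range splitting (no GPU needed):
print(list(byte_ranges(file_size=10, chunk_bytes=4)))
```

A call would look like `chunked_mean(input_file, "some_numeric_column", names=[...])`, where `names` must list the file's header columns in order.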

In [10]:
import cudf
import requests
from io import StringIO

url = "https://github.com/plotly/datasets/raw/master/tips.csv"
content = requests.get(url).content.decode('utf-8')

tips_df = cudf.read_csv(StringIO(content))
tips_df['tip_percentage'] = tips_df['tip'] / tips_df['total_bill'] * 100

# display average tip by dining party size
print(tips_df.groupby('size').tip_percentage.mean())

size
1    21.729202
2    16.571919
3    15.215685
4    14.594901
5    14.149549
6    15.622920
Name: tip_percentage, dtype: float64


## Dask-CUDA

https://rapids.ai/start.html#get-rapids

Environment updated using:

```
conda create -n rapids-0.17 -c rapidsai -c nvidia -c conda-forge \
    -c defaults rapids-blazing=0.17 python=3.7 cudatoolkit=11.0
```

```
conda activate rapids-0.17
```

In [11]:
# source: https://github.com/rapidsai/cudf/issues/2288

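import cudf as cu
import dask_cudf as dkcu

# Single-GPU read of the same file (the read that ran out of memory in In [9]):
# x = cu.read_csv("data/data_1gb.csv")
# dask_cudf builds a lazy, partitioned DataFrame instead of loading everything at once.
x = dkcu.read_csv("data/data_1gb.csv")
x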

| npartitions=4 | mno_ms_id | pos_time | mno_cell_id |
|---|---|---|---|
|   | int64 | object | int64 |
|   | ... | ... | ... |
|   | ... | ... | ... |
|   | ... | ... | ... |
|   | ... | ... | ... |


In [12]:
import time

from dask.distributed import Client, wait
from dask_cuda import LocalCUDACluster

cluster = LocalCUDACluster()
client = Client(cluster)
client

Client

- Scheduler: tcp://127.0.0.1:42119
- Dashboard: http://127.0.0.1:8787/status

Cluster

- Workers: 1
- Cores: 1
- Memory: 67.26 GB
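With a client connected, work on a `dask_cudf` DataFrame is executed by the cluster's workers when `.compute()` is called. The following is a minimal sketch of computing a column mean that way; the function name `gpu_column_mean` is hypothetical, and the column name `mno_cell_id` is taken from the DataFrame structure printed in In [11]. The guard lets the snippet run, and do nothing, on machines without RAPIDS or without the data file.

```python
import importlib.util
import os


def gpu_column_mean(path, column):
    """Mean of one column, computed partition by partition through dask_cudf."""
    import dask_cudf  # imported lazily so this module loads without RAPIDS

    ddf = dask_cudf.read_csv(path)       # lazy: builds a task graph only
    return ddf[column].mean().compute()  # executes on the active Client's workers


path = "data/data_1gb.csv"
if importlib.util.find_spec("dask_cudf") is not None and os.path.exists(path):
    print(gpu_column_mean(path, "mno_cell_id"))
else:
    print("dask_cudf or the input file is missing; nothing to compute")
```

Because each partition is read and reduced separately, the whole file does not need to fit in GPU memory at once.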
