# Computing basic Stats with the GPU

Instead of computing the basic stats on CPU, we COULD leverage GPU and CUDA processing...

A few frameworks in the tech stack push for GPU-based processing, including the **Rapids.ai** collection of tools, an open-source project developed in part by Nvidia and dedicated to data analytics and classic, statistics-driven ML. It can be paired with Dask and XGBoost for distributed ML. A few components of interest ([full list here](https://docs.rapids.ai/api)):

* `cuDF`, a library for DataFrame manipulation built on Arrow
* `cuML`, a library of ML algorithms and ML primitives
* `cuGraph`, a graph analytics library whose API is modeled on the well-known, pure-Python NetworkX library

**...BUT,** until hardware converges on unified memory, this initiative has very limited use cases across scientific ML algorithms as a whole.

We'll explore those limits in this notebook with the [CuPy framework](https://cupy.dev/).

## Push to the GPU

CuPy is an open-source array library for GPU-accelerated computing with Python. 

CuPy utilizes CUDA Toolkit libraries including cuBLAS, cuRAND, cuSOLVER, cuSPARSE, cuFFT, cuDNN and NCCL to make full use of the GPU architecture.
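
Because CuPy mirrors most of the NumPy API, the same array code can usually run on either backend. The sketch below is only an illustration: the `xp` alias, the NumPy fallback and the `column_means` helper are assumptions made for this example, not part of `kosmoss`. The fallback keeps the cell runnable on a machine where CuPy or a CUDA device is missing.

```python
import numpy as np

try:
    import cupy as cp
    # Importing CuPy is not enough: a CUDA device must also be visible.
    if cp.cuda.runtime.getDeviceCount() < 1:
        raise RuntimeError("no CUDA device visible")
    xp, to_host = cp, cp.asnumpy
except Exception:
    # Assumption for this sketch: silently fall back to NumPy on the CPU.
    xp, to_host = np, np.asarray


def column_means(host_array: np.ndarray) -> np.ndarray:
    """Hypothetical helper: per-column mean computed with whichever backend is active."""
    device_array = xp.asarray(host_array)   # host-to-device copy when xp is CuPy
    means = xp.mean(device_array, axis=0)   # same call signature as NumPy
    return to_host(means)                   # always hand back a NumPy array


sample = np.arange(12, dtype=np.float64).reshape(4, 3)
result = column_means(sample)
print(result)  # [4.5 5.5 6.5]
```

Aliasing the array module as `xp` is a common convention for backend-agnostic code; only the two copies (`xp.asarray` in, `to_host` out) differ from a pure NumPy version.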

In [1]:
from kosmoss import CONFIG, PROCESSED_DATA_PATH
from kosmoss.utils import timing

In [2]:
import cupy as cp
import dask.array as da
import numpy as np
import os.path as osp

step = CONFIG['timestep']
num_workers = CONFIG['num_workers']
features_path = osp.join(PROCESSED_DATA_PATH, f'features-{step}')

ModuleNotFoundError: No module named 'cupy'

Let's first load the data from NumPy files using Dask lazy loading. For the sake of the demonstration, let's load only `x` for now...

In [None]:
x = da.from_npy_stack(osp.join(features_path, 'x'))

In [None]:
@timing
def load_and_push_gpu(x: da.Array) -> None:

    with cp.cuda.Device(0):

        # Materializing the Dask array in CPU memory, then copying it to GPU memory
        x_ = cp.array(x.compute(num_workers=num_workers))

        # Computing the per-feature mean on the GPU
        x_mean_gpu = cp.mean(x_, axis=0)

        # Copying the result back from GPU to CPU memory
        x_mean_cpu_back = cp.asnumpy(x_mean_gpu)

You can watch CPU and GPU memory grow, and see how much of the compute capacity is being used, by running `htop` and `watch -n 1 nvitop`.

In [None]:
load_and_push_gpu(x)
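
`load_and_push_gpu` times the load, the transfer and the reduction as a single block. To see where the time actually goes, the phases can be timed separately. This is a minimal sketch under stated assumptions: `timed_mean` and `sync` are hypothetical names, the input is a plain NumPy array rather than the Dask array used above, and the code falls back to NumPy when CuPy or a CUDA device is unavailable. CuPy launches kernels asynchronously, so the stream is synchronized before each clock read.

```python
import time

import numpy as np

try:
    import cupy as cp
    if cp.cuda.runtime.getDeviceCount() < 1:
        raise RuntimeError("no CUDA device visible")
    xp, to_host = cp, cp.asnumpy

    def sync() -> None:
        # Block until all work queued on the default stream has finished.
        cp.cuda.Stream.null.synchronize()
except Exception:
    # Assumption for this sketch: fall back to NumPy, where nothing is asynchronous.
    xp, to_host = np, np.asarray

    def sync() -> None:
        pass


def timed_mean(host_array: np.ndarray):
    """Hypothetical helper: return the column means and the seconds spent in each phase."""
    timings = {}

    start = time.perf_counter()
    device_array = xp.asarray(host_array)
    sync()
    timings["host_to_device"] = time.perf_counter() - start

    start = time.perf_counter()
    means = xp.mean(device_array, axis=0)
    sync()
    timings["compute"] = time.perf_counter() - start

    start = time.perf_counter()
    means_host = to_host(means)
    timings["device_to_host"] = time.perf_counter() - start

    return means_host, timings


means_host, timings = timed_mean(np.ones((1000, 8), dtype=np.float32))
for phase, seconds in timings.items():
    print(f"{phase}: {seconds:.6f} s")
```

On a real GPU, comparing `host_to_device` with `compute` shows how much of the total is spent moving data rather than reducing it.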

## Reminder on CPU

In [None]:
@timing
def multi_cpu_stats_compute_reminder(x: da.Array) -> None:
    x_mean_multi_cpu = da.mean(x, axis=0).compute(num_workers=num_workers)

In [None]:
multi_cpu_stats_compute_reminder(x)

No Comment.

Actually yes. Three. 

* The GPU only starts to become interesting as the original dataset grows. But since GPU memory is limited, it quickly becomes the bottleneck. We are still waiting for a definitive unified memory architecture.
* If you want to overcome the memory bottleneck, you have to take care of all the boilerplate yourself, which is exactly what you turn to a framework to avoid in the first place.
* Using GPUDirect could improve overall performance.
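
To give an idea of the boilerplate mentioned in the second point, here is a minimal sketch of a chunked mean that never holds more than one slice of the data on the device. Everything in it is an assumption made for illustration: the `chunked_column_means` helper, the `chunk_rows` parameter, the in-memory NumPy input (the notebook would iterate over chunks of the Dask array instead) and the NumPy fallback when CuPy or a CUDA device is missing.

```python
import numpy as np

try:
    import cupy as cp
    if cp.cuda.runtime.getDeviceCount() < 1:
        raise RuntimeError("no CUDA device visible")
    xp, to_host = cp, cp.asnumpy
except Exception:
    # Assumption for this sketch: fall back to NumPy on the CPU.
    xp, to_host = np, np.asarray


def chunked_column_means(host_array: np.ndarray, chunk_rows: int) -> np.ndarray:
    """Hypothetical helper: mean over axis 0, moving `chunk_rows` rows to the device at a time."""
    n_rows = host_array.shape[0]
    # The running sum is the only array that stays on the device for the whole loop.
    running_sum = xp.zeros(host_array.shape[1:], dtype=xp.float64)

    for start in range(0, n_rows, chunk_rows):
        chunk = xp.asarray(host_array[start:start + chunk_rows])
        # Accumulate in float64 to limit rounding error across chunks.
        running_sum += chunk.sum(axis=0, dtype=xp.float64)
        # Drop the reference so the buffer can be reused for the next chunk.
        del chunk

    return to_host(running_sum / n_rows)


data = np.arange(30, dtype=np.float32).reshape(10, 3)
chunked = chunked_column_means(data, chunk_rows=4)
```

Even this simple reduction needs explicit chunking, accumulation and memory management, and anything less trivial than a mean needs more of it.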