****************************************************************

In [1]:
# Importing the root of this bootcamp
import os.path as osp
import sys

sys.path.append(osp.abspath('..'))

# Computing basic Stats with the GPU

Many frameworks in the tech stack push for GPU-based processing, including the *Rapids.ai* collection of tools, an open-source project developed partly by **Nvidia**. But until hardware converges on unified memory, this approach has very limited use cases in scientific ML algorithms overall.

We'll explore those limits in this notebook.

Let's first load the data from NumPy files using Dask lazy loading. For the sake of the demonstration, let's load only `x` for now...

In [2]:
import cupy as cp
import dask.array as da
import numpy as np
import os.path as osp

import config
import utils 

step = config.config['timestep']
num_workers = config.config['num_workers']

feats_path = osp.join(config.processed_data_path, f'features-{step}')

x = da.from_npy_stack(osp.join(feats_path, 'x'))

In [8]:
@utils.timing
def load_and_push_gpu(x: da.Array) -> None:
    
    with cp.cuda.Device(0):

        # Loading the data into CPU memory, then copying it to GPU memory
        x_ = cp.array(x.compute(num_workers=num_workers))

        # Computing the mean on the GPU
        x_mean_gpu = cp.mean(x_, axis=0)

        # Retrieving the result from GPU to CPU
        x_mean_cpu_back = cp.asnumpy(x_mean_gpu)

In [9]:
load_and_push_gpu(x)

97405.09 ms
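
The `utils.timing` decorator is not shown in this notebook. A minimal sketch of what such a helper might look like, assuming it simply prints the wall-clock time in milliseconds (the implementation and the `busy_sum` workload below are hypothetical, not the bootcamp's actual code):

```python
import functools
import time


def timing(func):
    """Hypothetical stand-in for utils.timing: print wall-clock time in ms."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        print(f"{elapsed_ms:.2f} ms")
        return result

    return wrapper


@timing
def busy_sum(n: int) -> int:
    # Small CPU-bound placeholder workload, only here to exercise the decorator
    return sum(range(n))


total = busy_sum(10)
```

If the real helper works this way, the figure reported for `load_and_push_gpu` covers the whole function body: the Dask load into CPU memory, the host-to-device copy, the GPU reduction, and the copy of the result back to the host.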


In [25]:
import gc
import torch

gc.collect()
# Releases PyTorch's cached GPU blocks only; CuPy's memory pool is separate
torch.cuda.empty_cache()

In [10]:
@utils.timing
def multi_cpu_stats_compute_reminder(x: da.Array) -> None:
    x_mean_multi_cpu = da.mean(x, axis=0).compute(num_workers=num_workers)

In [11]:
multi_cpu_stats_compute_reminder(x)

1946.51 ms
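
To put the two timings side by side, the ratio can be computed directly from the outputs printed above. Both values are copied from the cell outputs in this notebook; the variable names are only illustrative:

```python
# Timings copied from the two cell outputs above, in milliseconds
gpu_path_ms = 97405.09   # load_and_push_gpu
multi_cpu_ms = 1946.51   # multi_cpu_stats_compute_reminder

ratio = gpu_path_ms / multi_cpu_ms
print(f"The GPU path took about {ratio:.0f}x longer than the multi-CPU Dask path")
```

On this run, the GPU path is therefore roughly fifty times slower end to end, with data loading and transfer included in the measurement.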


No Comment.

Actually yes. Three. 

* The GPU starts to become interesting as the original dataset size grows. But since GPU memory is limited, it rapidly becomes the bottleneck, at least until a definitive unified memory architecture arrives.
* If you want to overcome the memory bottleneck, you have to take care of all the boilerplate yourself, which is exactly what you turn to a framework to avoid in the first place.
* Using GPUDirect could improve overall performance.
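
To give an idea of the boilerplate mentioned in the second point, here is a minimal sketch of a chunked column-wise mean that keeps only one chunk on the device at a time. It is an illustration under assumptions, not the approach used in this notebook: `chunked_mean` and its `xp` parameter are hypothetical names, and the example runs with NumPy so that it works without a GPU. Passing `cupy` as `xp` would move each chunk to the GPU instead.

```python
import numpy as np


def chunked_mean(chunks, xp=np):
    """Column-wise mean over an iterable of row chunks.

    `xp` is the array module (assumption: numpy here, cupy for the GPU).
    Only one chunk and the running sum live in `xp` memory at any time.
    """
    total = None
    n_rows = 0
    for chunk in chunks:
        block = xp.asarray(chunk)        # host-to-device copy when xp is cupy
        partial = block.sum(axis=0)      # per-chunk column sums
        total = partial if total is None else total + partial
        n_rows += block.shape[0]
    if total is None:
        raise ValueError("chunked_mean needs at least one chunk")
    return total / n_rows


# Toy data standing in for the real features: 6 rows, 2 columns, chunks of 4 rows
data = np.arange(12, dtype=np.float64).reshape(6, 2)
chunks = (data[i:i + 4] for i in range(0, data.shape[0], 4))
result = chunked_mean(chunks)
```

With the Dask array `x`, the chunks could for example be materialized one block at a time with `x.blocks[i].compute()`. Even in this simple case, the chunk loop, the running sum, and the row count are all code that a framework would normally hide.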