# Basic GPU usage with cupy
- Checking the local system GPU
- Required packages
- Using cupy
- Host (CPU) memory and GPU memory
- GPU calculation


# PSL GPU usage 
Check out the [wiki created by Chris](https://userdocs.psd.esrl.noaa.gov/Machine%20Learning#linux-gpu1) (VPN needed).

## Checking the NVIDIA GPU in the local system 

In [18]:
!nvidia-smi

Tue Mar 21 16:06:41 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA RTX A6000                Off| 00000000:81:00.0 Off |                  Off |
| 30%   19C    P8                4W / 300W|  31014MiB / 49140MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
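
The same device information can also be queried from inside Python. The following is a minimal sketch, assuming cupy is installed; `describe_gpus` is a hypothetical helper name (not part of cupy), and it returns an empty list when cupy or a usable CUDA device is missing.

```python
def describe_gpus():
    """Return one dict per visible CUDA device; empty list if CuPy/CUDA is unavailable."""
    try:
        import cupy as cp
        n_devices = cp.cuda.runtime.getDeviceCount()
    except Exception:  # CuPy not installed, no driver, or no visible device
        return []

    gpus = []
    for i in range(n_devices):
        props = cp.cuda.runtime.getDeviceProperties(i)     # dict of device properties
        free_bytes, total_bytes = cp.cuda.Device(i).mem_info
        gpus.append({
            "id": i,
            "name": props["name"].decode(),                # name is stored as bytes
            "free_MiB": free_bytes // 2**20,
            "total_MiB": total_bytes // 2**20,
        })
    return gpus


for gpu in describe_gpus():
    print(gpu)
```

On a shared machine the free memory figure is worth checking before allocating large arrays, since other users' processes may already hold part of the GPU memory.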
                                                                    

## Packages used
- cupy (https://cupy.dev)
    - cupyx: a SciPy-mirroring package for the GPU
    - cudatoolkit: the CUDA toolkit for Python/cupy use
- numpy (https://numpy.org)

---

>Comparison table between cupy and numpy
https://docs.cupy.dev/en/stable/reference/comparison.html

>Writing a CUDA kernel from cupy
https://docs.cupy.dev/en/stable/user_guide/kernel.html

>Writing a CUDA kernel from numba
https://numba.readthedocs.io/en/stable/cuda/kernels.html
(CUDA grid concept: https://towardsdatascience.com/cuda-by-numba-examples-1-4-e0d06651612f)
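
As a taste of what the cupy kernel guide covers, here is a minimal sketch of an elementwise kernel, modeled on the squared-difference example in the cupy documentation. The `gpu_available` flag and the CPU fallback are assumptions added so the script also runs on a machine without a GPU.

```python
import numpy as np

try:
    import cupy as cp
    gpu_available = cp.cuda.runtime.getDeviceCount() > 0
except Exception:  # CuPy not installed or no usable CUDA device
    cp = None
    gpu_available = False

x = np.arange(5, dtype=np.float32)
y = np.full(5, 2.0, dtype=np.float32)
expected = (x - y) ** 2  # NumPy reference computed on the CPU

if gpu_available:
    # Arguments: input parameters, output parameters, per-element body, kernel name.
    squared_diff = cp.ElementwiseKernel(
        "float32 x, float32 y",
        "float32 z",
        "z = (x - y) * (x - y)",
        "squared_diff",
    )
    z = cp.asnumpy(squared_diff(cp.asarray(x), cp.asarray(y)))
else:
    z = expected  # no GPU: reuse the CPU reference so the script still runs

print(z)
```

The kernel is compiled on first call, so the first invocation is noticeably slower than later ones.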



# Basic cupy and numpy comparison

In [1]:
import cupy as cp
import numpy as np

In [3]:
x_gpu = cp.array([1, 2, 3])
x_cpu = np.array([1, 2, 3])

In [4]:
type(x_gpu)

cupy.ndarray

In [5]:
type(x_cpu)

numpy.ndarray

In [6]:
x_gpu.device

<CUDA Device 0>

In [7]:
x_cpu.device

AttributeError: 'numpy.ndarray' object has no attribute 'device'

## Data transfer

### From CPU to GPU (from host to device)

In [8]:
x_gpu_from_cpu = cp.asarray(x_cpu)  # move the data to the current device.

In [9]:
type(x_gpu_from_cpu)

cupy.ndarray

In [10]:
x_gpu_from_cpu.device

<CUDA Device 0>

### From GPU to CPU (from device to host)

In [11]:
x_cpu_from_gpu = cp.asnumpy(x_gpu)  # move the array to the host.

In [12]:
type(x_cpu_from_gpu)

numpy.ndarray

In [13]:
x_cpu_from_gpu.device

AttributeError: 'numpy.ndarray' object has no attribute 'device'
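
The two transfer directions can be combined into a round trip. The sketch below is illustrative only: `normalize` is a hypothetical function, and the `gpu_available` fallback is an assumption added so the code also runs without a GPU. It uses `cp.get_array_module` so the same function works on either array type.

```python
import numpy as np

try:
    import cupy as cp
    gpu_available = cp.cuda.runtime.getDeviceCount() > 0
except Exception:  # CuPy not installed or no usable CUDA device
    cp = None
    gpu_available = False


def normalize(x):
    """Scale x to unit maximum magnitude on whichever device x already lives on."""
    xp = cp.get_array_module(x) if gpu_available else np
    return x / xp.max(xp.abs(x))


x_cpu = np.array([1.0, -4.0, 2.0])
if gpu_available:
    x_gpu = cp.asarray(x_cpu)   # host -> device copy
    y_gpu = normalize(x_gpu)    # computed on the GPU; the result stays there
    y = y_gpu.get()             # device -> host copy, equivalent to cp.asnumpy(y_gpu)
else:
    y = normalize(x_cpu)        # CPU-only fallback

print(type(y), y)
```

Each transfer copies data across the PCIe bus, so it usually pays to keep intermediate results on the GPU and copy back only the final answer.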

## GPU calculation using CuPy

In [14]:
%time x_gpu = cp.arange(0,1e9)

CPU times: user 1.56 ms, sys: 17.4 ms, total: 19 ms
Wall time: 17.5 ms


In [15]:
%time x_cpu = np.arange(0,1e9)

CPU times: user 478 ms, sys: 549 ms, total: 1.03 s
Wall time: 1.03 s


In [16]:
%time cp.sum(x_gpu)

CPU times: user 1.67 ms, sys: 16.7 ms, total: 18.4 ms
Wall time: 17.4 ms


array(5.e+17)

In [17]:
%time np.sum(x_cpu)

CPU times: user 303 ms, sys: 703 µs, total: 304 ms
Wall time: 303 ms


4.999999995e+17

### Multiple run testing

In [10]:
%timeit -r 5 -n 100 x_gpu = cp.arange(0,1e9)

19 µs ± 8.35 µs per loop (mean ± std. dev. of 5 runs, 100 loops each)


In [11]:
%timeit -r 5 -n 100 x_cpu = np.arange(0,1e9)

1.06 s ± 3.36 ms per loop (mean ± std. dev. of 5 runs, 100 loops each)


In [12]:
x_gpu = cp.arange(0,1e9)
%timeit -r 5 -n 100 cp.sum(x_gpu)

14.2 µs ± 923 ns per loop (mean ± std. dev. of 5 runs, 100 loops each)


In [13]:
x_cpu = np.arange(0,1e9)
%timeit -r 5 -n 100 np.sum(x_cpu)

307 ms ± 4.58 ms per loop (mean ± std. dev. of 5 runs, 100 loops each)
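
A caveat on the GPU numbers above: cupy launches kernels asynchronously, so `%time` and `%timeit` can return before the GPU has finished. Microsecond timings for a 1e9-element array therefore most likely measure launch overhead rather than the computation. The sketch below is one way to time with explicit synchronization; `wall_time` is a hypothetical helper, the array size is reduced so it runs on modest hardware, and the GPU branch is skipped when no device is available.

```python
import time
import numpy as np

try:
    import cupy as cp
    gpu_available = cp.cuda.runtime.getDeviceCount() > 0
except Exception:  # CuPy not installed or no usable CUDA device
    cp = None
    gpu_available = False


def wall_time(fn, sync=None, repeats=5):
    """Best-of-N wall time of fn(); sync() is called before stopping the clock."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()  # block until queued GPU work has finished
        best = min(best, time.perf_counter() - start)
    return best


n = 10_000_000  # much smaller than 1e9 (assumption, to keep memory use low)
x_cpu = np.arange(n, dtype=np.float64)
t_cpu = wall_time(lambda: np.sum(x_cpu))
print(f"numpy sum: {t_cpu * 1e3:.3f} ms")

if gpu_available:
    x_gpu = cp.arange(n, dtype=cp.float64)
    sync = cp.cuda.Stream.null.synchronize
    wall_time(lambda: cp.sum(x_gpu), sync, repeats=1)   # warm-up run (kernel compilation)
    t_launch = wall_time(lambda: cp.sum(x_gpu))          # no sync: mostly launch overhead
    t_gpu = wall_time(lambda: cp.sum(x_gpu), sync)       # launch plus execution
    print(f"cupy sum, unsynchronized: {t_launch * 1e3:.3f} ms")
    print(f"cupy sum, synchronized:   {t_gpu * 1e3:.3f} ms")
```

cupy also provides `cupyx.profiler.benchmark`, which handles warm-up and synchronization and reports CPU and GPU times separately; it may be a more convenient choice than a hand-rolled timer.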
