# High-performance and parallel computing for AI - Practical 8: NVIDIA-SMI, environment variables, and CuPy programming

IMPORTANT
=========

* CuPy sometimes behaves erratically for me. If you encounter unexplained errors, restart the kernel.
* For these practicals we will be using a different `conda environment`. When opening a notebook or a terminal, make sure you are using the **CuPy Kernel**!
* It's fine if you do not finish everything.

## Question 1 - CUDA driver and available GPUs

Open the launcher, open a terminal, and enter the bash command `nvidia-smi`. This command gives you information about the installed CUDA driver version, the GPUs available on the system (our server, Goliat), and their current utilization.

The three NVIDIA L40S (48 GB RAM) are the most powerful GPUs available. Can you find the following information online?

* Number of CUDA cores.
* Number of Tensor cores.
* Number of SMs.
* Size of L1 SM-shared cache.
* Size of L2 GPU-shared cache.

Use the above to roughly compute the number of CUDA cores and tensor cores per SM.

Then run the command `nvidia-smi -q`, which outputs a lot of information about the GPUs. For the next question we will need the GPU UUIDs, which are used to select the GPUs we want to use. You can obtain them more easily via `grep`: run `nvidia-smi -q | grep UUID` and check that the first two UUIDs correspond to the Quadro GPUs and the last three to the L40s GPUs. To see which UUID belongs to which GPU, page through the full output with `nvidia-smi -q | less`, or save it with `nvidia-smi -q | tee info.txt` and read `info.txt`.
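
If you prefer to collect the names and UUIDs from Python, the following is a minimal sketch. It assumes the `--query-gpu=name,uuid --format=csv,noheader` options of `nvidia-smi`; the helper names `parse_gpu_csv` and `query_gpus` are made up for this example, and the sample text uses placeholder names and UUIDs rather than real output from Goliat.

```python
import shutil
import subprocess

def parse_gpu_csv(text):
    """Turn 'name, uuid' CSV lines into a list of (name, uuid) tuples."""
    gpus = []
    for line in text.strip().splitlines():
        name, uuid = (field.strip() for field in line.split(","))
        gpus.append((name, uuid))
    return gpus

def query_gpus():
    """Ask nvidia-smi for GPU names and UUIDs; returns [] if it is not installed."""
    if shutil.which("nvidia-smi") is None:
        return []
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,uuid", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_csv(out)

# Placeholder sample (NOT real output): names and UUIDs are illustrative only.
sample = """\
Quadro GPU, GPU-00000000-0000-0000-0000-000000000000
NVIDIA L40S, GPU-11111111-1111-1111-1111-111111111111
"""
for name, uuid in parse_gpu_csv(sample):
    print(f"{name:12s} {uuid}")
```

On the server you would call `query_gpus()` instead of parsing the sample string.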

## Answer to Question 1

**On each L40s:**
* Number of CUDA cores: 18,176
* Number of Tensor cores: 568
* Number of SMs: 142.
* Size of L1 SM-shared cache: 128 KB.
* Size of L2 GPU-shared cache: 48 MB.

* CUDA cores per SM: 128.
* Tensor cores per SM: 4.
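
The per-SM figures are obtained by dividing the totals listed above by the number of SMs. A quick check of that arithmetic, using only the numbers from this answer:

```python
# Totals for one L40s, as listed in the answer above.
cuda_cores = 18_176
tensor_cores = 568
num_sms = 142

cuda_cores_per_sm = cuda_cores // num_sms
tensor_cores_per_sm = tensor_cores // num_sms

# Both divisions are exact, so nothing is lost by integer division.
assert cuda_cores % num_sms == 0 and tensor_cores % num_sms == 0

print("CUDA cores per SM:", cuda_cores_per_sm)
print("Tensor cores per SM:", tensor_cores_per_sm)
```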

The UUIDs are shown in the next question.

## Question 2 - Memory movement in CuPy - Part 2

Before you start (and before running any other GPU code on the servers) please run the following code, which limits the maximum GPU memory usage to $1.5$ GB and picks an L40s GPU and a Quadro GPU at random. **Please only run the code below once every time you restart the kernel!** 

While it won't hurt to understand what the whole code does, it is especially useful to understand the GPU selection part and the setting of the `CUDA_VISIBLE_DEVICES` environment variable, since this works for any GPU software, not just CuPy.

In [1]:
import os

# CuPy-specific environment variables
os.environ["CUPY_GPU_MEMORY_LIMIT"] = "1573741824" # roughly 1.5 GB
os.environ["CUPY_ACCELERATORS"]="cutensor" # activates cutensor acceleration
os.environ["CUPY_TF32"] = "1" # activates tf32 tensor cores

## On Goliat we have FIVE GPUs, so here we pick two of them at random
## so that we do not overload the system.
## We do this by looking up the GPU UUIDs and then setting
## the CUDA_VISIBLE_DEVICES environment variable.
## Note: this is useful for other libraries as well (e.g., JAX, PyTorch, TF) on multi-GPU servers.

# To get these UUIDs you need to run nvidia-smi -q on the command line
quadro_UUIDs = ["GPU-4efa947b-abbd-7c6e-84f5-61241d34bb4b",
                "GPU-5eb524b0-2b1b-fe98-e6ed-b8fb5185e993"]

L40s_UUIDs = ["GPU-7bba1f33-03d2-016b-d42e-ced83c3ac243",
              "GPU-179d068a-3bea-91d7-1a8c-7017f55d6298",
              "GPU-ae634859-dd49-de46-9182-195639405eaa"]

from numpy.random import randint
# Picks an L40s and a Quadro GPU at random. The others will be invisible to CuPy
# NOTE: this only works if the environment variable is set BEFORE CuPy is first imported.
os.environ["CUDA_VISIBLE_DEVICES"] = L40s_UUIDs[randint(3)] + "," + quadro_UUIDs[randint(2)]

## CuPy will only see these GPUs and will assign them these device numbers:
L40sID = 0
quadro_ID = 1
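
The device numbers above follow from the order of the entries in `CUDA_VISIBLE_DEVICES`: the first entry becomes device 0, the second device 1, and so on. The sketch below illustrates that mapping in plain Python. The helper `visible_device_map` is hypothetical (it is not part of CuPy or CUDA), and the example value uses the first L40s and first Quadro UUID from the cell above.

```python
def visible_device_map(cuda_visible_devices):
    """Map the logical device index a CUDA program sees to the listed identifier."""
    entries = [e.strip() for e in cuda_visible_devices.split(",") if e.strip()]
    return dict(enumerate(entries))

# Example value: an L40s UUID first, then a Quadro UUID (taken from the cell above).
value = ("GPU-7bba1f33-03d2-016b-d42e-ced83c3ac243,"
         "GPU-4efa947b-abbd-7c6e-84f5-61241d34bb4b")

device_map = visible_device_map(value)
for index, uuid in device_map.items():
    print(index, uuid)
```

Once CuPy is imported, `cp.cuda.runtime.getDeviceCount()` should then report two devices.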

First of all, have a quick look at Q3 in Practical 1 to remember how to move data between host and device (i.e., CPU and GPU). You won't need this here, but it is good to remember.

Then, create two random $n$-by-$n$ single-precision matrices on the L40s GPU, where $n=4096$, and compute and time their matrix product. Next, copy both matrices to the Quadro GPU and compute and time the matrix product again. Which GPU is faster?

Finally, change the dtype so that computations use double precision. Time everything again. How do the timings change? This is why working in reduced precision is important!
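
To put the precision change in perspective, the sketch below computes the memory footprint of one matrix and the approximate operation count of one product. The $2n^3$ count is the usual estimate for a dense matrix product. The operation count does not depend on the dtype, but double precision moves twice as many bytes, and to my understanding GPUs of this class typically offer much lower double-precision throughput.

```python
import numpy as np

n = 4096
flops = 2 * n**3  # approximate count: n^3 multiplications and about n^3 additions

footprint_mib = {}
for dtype in (np.float32, np.float64):
    name = np.dtype(dtype).name
    bytes_per_matrix = n * n * np.dtype(dtype).itemsize
    footprint_mib[name] = bytes_per_matrix / 2**20
    print(f"{name}: {footprint_mib[name]:.0f} MiB per matrix, "
          f"{flops / 1e9:.1f} GFLOP per product")
```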

### Hints - Please read

**Hint 1:** Always run the above code to select the GPUs to use. If you get weird memory errors, restart the kernel (the circular arrow above) and run the GPU selection code again. To check which device a CuPy array lives on, use `myarray.device` (it returns a `Device` object whose `id` attribute is the GPU number). In Practical 1 we saw how to move an array from the CPU to the GPU using `cp.asarray`. This function can also be used to copy data to another GPU, e.g.,
```python
# Make sure to keep track of where arrays live in memory!!!
with cp.cuda.Device(0):
    x_gpu_0 = cp.array([1, 2, 3])  # create an array on GPU 0

with cp.cuda.Device(1):
    x_gpu_1 = cp.asarray(x_gpu_0)  # copy the array to GPU 1
```

**Hint 2:** I recommend using `with cp.cuda.Device(deviceID):` to ensure computations are actually done on device number `deviceID` and not on another one. That is, use:
```python
def matmul(aa,bb):
    with cp.cuda.Device(aa.device): # It becomes superslow if I do not do this.
        return aa@bb
```
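I have not pinned down exactly why this `with` is needed. A likely explanation is that CuPy launches kernels on the *current* device rather than on the device that holds the arrays, so without the `with` the current device has to access the arrays on the other GPU through peer-to-peer memory access, which is slow. This only matters when using multiple devices, and there may be a better way of handling it.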

**Hint 3:** To time each operation, put it into a function and use `cupyx.profiler.benchmark`, e.g.,
```python
from cupyx.profiler import benchmark

def my_func(a):
    return cp.sqrt(cp.sum(a**2, axis=-1))

a = cp.random.random((256, 1024))
print(benchmark(my_func, (a,), n_repeat=128, devices=(a.device,)))
```
Note that `benchmark` reports the time spent on both the CPU and the GPU.

**Hint 4:** It is always very good practice to use variable names that remind you where each variable is stored in memory (e.g., a `_h` suffix for host, `_d` for device).

## Solution of question 2

In [6]:
import numpy as np
import cupy as cp
from cupyx.profiler import benchmark

n = 4096
dtype = cp.float32
a = cp.random.randn(n, n, dtype=dtype)
b = cp.random.randn(n, n, dtype=dtype)

L40sID = 0
quadro_ID = 1

assert a.device.id == L40sID

with cp.cuda.Device(quadro_ID):
    aq = cp.asarray(a)
    bq = cp.asarray(b)

assert aq.device.id == quadro_ID and bq.device.id == quadro_ID
 
def matmul(aa,bb):
    with cp.cuda.Device(aa.device): # It becomes superslow if I do not do this.
        return aa@bb

# Note: print statements will be executed by the CPU!
print("\nL40s:\n")
print(benchmark(matmul, (a,b), n_repeat=50, devices=(a.device,)))

print("\n\nQuadro:\n")
print(benchmark(matmul, (aq,bq), n_repeat=50, devices=(aq.device,)))

# NOTE: in the run recorded below the L40s is clearly faster than the Quadro (compare the GPU times).


L40s:

matmul              :    CPU:    23.669 us   +/-  3.909 (min:    20.950 / max:    44.153) us     GPU-<CUDA Device 0>:  1300.982 us   +/- 85.719 (min:  1202.176 / max:  1408.704) us


Quadro:

matmul              :    CPU:    32.123 us   +/-  4.273 (min:    27.813 / max:    50.015) us     GPU-<CUDA Device 1>: 10880.613 us   +/- 687.561 (min: 10320.896 / max: 12567.552) us
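
To quantify the difference, the sketch below converts the mean GPU times from the benchmark output above into an approximate throughput, again assuming the usual $2n^3$ operation count for a dense matrix product. The numbers are specific to this run and will vary between runs.

```python
n = 4096
flops = 2 * n**3  # approximate floating-point operations in one n-by-n product

# Mean GPU times in microseconds, copied from the benchmark output above.
gpu_time_us = {"L40s": 1300.982, "Quadro": 10880.613}

# Throughput in TFLOP/s: operations divided by time in seconds, scaled by 1e12.
tflops = {name: flops / (t_us * 1e-6) / 1e12 for name, t_us in gpu_time_us.items()}
for name, value in tflops.items():
    print(f"{name}: {value:.1f} TFLOP/s")

speedup = gpu_time_us["Quadro"] / gpu_time_us["L40s"]
print(f"L40s speedup over Quadro: {speedup:.1f}x")
```

Keep in mind that `CUPY_TF32` was enabled in the setup cell, so the single-precision product may be running on the tensor cores.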


## Tutorial 1 - An overview of CuPy kernels

Here you will learn a bit about how to write your own kernels in CuPy. It is good to be aware of these options, although in my personal opinion they are not that great. Note that:

* You can always use CuPy built-in NumPy-like functions without having to write any kernels.
* CuPy offers many ways of writing kernels, but most are limited and poorly documented.
* I am not sure I would recommend using CuPy kernels unless you use the Raw CUDA kernels (see below) or you are trying to do something simple.

To learn more, see the [official CuPy docs](https://docs.cupy.dev/en/latest/user_guide/kernel.html#).

### Possible CuPy Kernel writing strategies with my personal comments

* **Elementwise and reduction kernels (poor documentation)**. These let you write and JIT-compile your own kernels using a simplified syntax. However, the set of allowed operations and the syntax are poorly documented: the kernel body is passed as a string containing a CUDA C/C++ fragment. I find this unclear, so I would skip it.

* **Kernel fusion (limited, but easy)**. This option is easy to use, yet its functionality is limited. CuPy provides a decorator, `cupy.fuse` which you can use to turn a set of CuPy operations into a single kernel which will then be JIT-ed and should then run faster. For instance:
```python
import cupy as cp
@cp.fuse
def myfun(x, y):
    return cp.sin(x - y)*x + cp.exp(y)
```
However, important operations such as `matmul` are unsupported. Worth keeping in mind, but not great.

* **Raw Kernels (powerful, but requires C/C++)**. This option lets you write your own CUDA C/C++ kernels, which are then JIT-compiled so that you can pass CuPy arrays to them. This is an easy option for wrapping CUDA code into Python and it is powerful since you can wrap whatever you want. However, it requires knowing how to write CUDA C/C++ code.

* **JIT Rawkernels (Perhaps a good compromise)**. This option is similar to the way numba works. CuPy provides a JIT decorator - `cupyx.jit.rawkernel` - which can be used like `numba.jit` to turn a Python function into a CUDA kernel which is then JIT-ed. The good thing is that doing so allows the user to write CUDA-like code without ever leaving Python and that CuPy maths functions can be used. The problem is that documentation is poor, array operations are unsupported, and the functionality is marked as experimental.
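
For comparison with the Raw Kernels option described above, here is a minimal sketch that uses `cp.RawKernel` to compute $\sin(x)^2$ entrywise. The kernel name `sin_squared`, the wrapper `sin_squared_gpu`, and the default grid and block sizes are choices made for this example. The code assumes a contiguous 1D `float32` CuPy array, and the wrapper needs a CUDA GPU to run.

```python
import numpy as np

# CUDA C source for a grid-stride kernel computing y[i] = sin(x[i])^2.
SIN_SQUARED_SRC = r"""
extern "C" __global__
void sin_squared(const float* x, float* y, int size) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    int ntid = gridDim.x * blockDim.x;                // total number of threads
    for (int i = tid; i < size; i += ntid) {
        float s = sinf(x[i]);
        y[i] = s * s;
    }
}
"""

def sin_squared_gpu(x, grid=(128,), block=(1024,)):
    """Apply the raw kernel to a contiguous 1D float32 CuPy array (needs a GPU)."""
    import cupy as cp  # imported here so this module also loads without a GPU

    kernel = cp.RawKernel(SIN_SQUARED_SRC, "sin_squared")
    y = cp.empty_like(x)
    # Scalar arguments must match the C signature, hence np.int32 for `int size`.
    kernel(grid, block, (x, y, np.int32(x.size)))
    return y
```

A call would look like `y = sin_squared_gpu(cp.random.normal(size=(2**22,), dtype=cp.float32))`.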

Example of `cupyx.jit.rawkernel`:

In [3]:
import cupy as cp
from cupyx import jit

@jit.rawkernel()
def myfun(x, y, size): # can pass arrays and scalar variables
    tid = jit.blockIdx.x * jit.blockDim.x + jit.threadIdx.x # thread ID as in CUDA
    ntid = jit.gridDim.x * jit.blockDim.x # total number of threads in the grid/stride
    for i in range(tid, size, ntid):
        y[i] = cp.sin(x[i])**2 # y gets overwritten

gridDim = (128,)
blockSize = (1024,)

n = gridDim[0]*blockSize[0]*2**5
x = cp.random.normal(size=(n,), dtype=cp.float32)
y = cp.zeros((n,), dtype=cp.float32)

# Kernel invocation as CuPy RawKernels
myfun(gridDim, blockSize, (x, y, n))  
assert (cp.sin(x)**2 == y).all()

# Kernel invocation as in Numba-CUDA
myfun[gridDim[0], blockSize[0]](x, y, n)
assert (cp.sin(x)**2 == y).all()

  cupy._util.experimental('cupyx.jit.rawkernel')
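
The `for i in range(tid, size, ntid)` loop in the kernel above is a grid-stride loop: each thread starts at its own index and jumps ahead by the total number of threads. The plain-Python sketch below simulates this on the CPU with a deliberately tiny grid to show which entries each thread handles. The function name `grid_stride_indices` is made up for this illustration.

```python
def grid_stride_indices(tid, ntid, size):
    """Indices visited by thread `tid` when `ntid` threads share `size` entries."""
    return list(range(tid, size, ntid))

ntid = 4   # tiny illustrative grid: 4 threads in total
size = 10  # 10 array entries to process

visited = []
for tid in range(ntid):
    idx = grid_stride_indices(tid, ntid, size)
    print(f"thread {tid}: {idx}")
    visited.extend(idx)

# Every entry is handled exactly once, even though size is not a multiple of ntid.
assert sorted(visited) == list(range(size))
```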


## Question 3 - Custom CuPy kernels

Write two CuPy kernels (one via `cupy.fuse` and the other via JIT-rawkernel) that compute the entrywise cosine of the entrywise product of two square single-precision matrices of size $n=4096$ (initialise the matrix entries at random). For the JIT-rawkernel version, use entrywise for loops and 2D grid and block sizes with `gridDim = (64, 64)` and `blockSize = (32, 32)`. Time both functions using `cupyx.profiler.benchmark`.

Then, modify the JIT-rawkernel version:
* Try exchanging the order of the for loops. One of the two versions will be faster. Why?
* Try varying the gridDim and (reducing the) blockSize and see how the timings change.

## Solution to question 3

In [4]:
import cupy as cp
from cupyx import jit
from cupyx.profiler import benchmark

@cp.fuse
def myfun_fused(a,b):
    return cp.cos(a*b)

gridDim = (128, 128)  # one thread per entry for n = 4096; the question suggests starting from (64, 64)
blockSize = (32, 32)

#n = gridDim[0]*blockSize[0]*2
n = 4096 

a = cp.random.randn(n, n, dtype=cp.float32)
b = cp.random.randn(n, n, dtype=cp.float32)
c = cp.empty((n, n), dtype=cp.float32)

@jit.rawkernel()
def myfun_rk(a, b, c):
    tidx = jit.blockIdx.x * jit.blockDim.x + jit.threadIdx.x 
    tidy = jit.blockIdx.y * jit.blockDim.y + jit.threadIdx.y
    ntidx = jit.gridDim.x * jit.blockDim.x
    ntidy = jit.gridDim.y * jit.blockDim.y
    for i in range(tidy, n, ntidy):
        for j in range(tidx, n, ntidx):
            c[i,j] = cp.cos(a[i,j]*b[i,j])

cf = myfun_fused(a,b)
myfun_rk(gridDim, blockSize, (a,b,c))

assert (cp.cos(a*b) == c).all()
assert (cp.cos(a*b) == cf).all()

print(benchmark(myfun_fused, (a,b), n_repeat=20, devices=(a.device,)))
print(benchmark(myfun_rk, (gridDim, blockSize, (a,b,c)), n_repeat=20, devices=(a.device,)))

myfun_fused         :    CPU:    14.584 us   +/-  1.851 (min:    12.904 / max:    20.108) us     GPU-<CUDA Device 0>:   311.402 us   +/-  4.609 (min:   304.128 / max:   323.328) us
myfun_rk            :    CPU:    60.099 us   +/-  3.332 (min:    56.217 / max:    67.397) us     GPU-<CUDA Device 0>:   346.930 us   +/-  4.285 (min:   340.992 / max:   357.120) us
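
The solution does not spell out why the loop order matters. A likely explanation is memory coalescing: CuPy arrays are row-major by default, and the 32 threads of a warp have consecutive `threadIdx.x`, so they should access consecutive columns `j`. The NumPy sketch below compares the byte offsets touched by one warp for the two loop orders. It only illustrates the addressing and runs on the CPU.

```python
import numpy as np

n = 4096
itemsize = np.dtype(np.float32).itemsize

def byte_offset(i, j):
    """Offset of entry (i, j) in a C-contiguous (row-major) n-by-n float32 array."""
    return (i * n + j) * itemsize

lanes = np.arange(32)  # the 32 threads of one warp, i.e. consecutive threadIdx.x

# Loop order used in the solution: threadIdx.x drives the column index j.
x_to_j = byte_offset(0, lanes)
# Exchanged order: threadIdx.x drives the row index i.
x_to_i = byte_offset(lanes, 0)

stride_j = int(np.diff(x_to_j)[0])
stride_i = int(np.diff(x_to_i)[0])
print("stride with x -> j:", stride_j, "bytes")
print("stride with x -> i:", stride_i, "bytes")
```

With the first ordering a warp reads one contiguous 128-byte segment. With the second, every thread touches a different row, so the accesses are spread far apart in memory.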


## Question 4 - Reductions

Implementing reductions in CUDA is non-trivial. We won't cover it in detail in these lectures, but luckily we can use CuPy built-in reduction kernels to do the work for us.
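
To give a feel for why reductions need care on a GPU: a sum cannot simply be computed by one thread per entry, since all threads would have to update the same result. A common approach is a tree of pairwise additions that takes about $\log_2 n$ parallel steps. The NumPy sketch below shows this pattern on the CPU. It is only an illustration of the idea, not how CuPy implements its reductions, and the name `tree_sum` is made up for this example.

```python
import numpy as np

def tree_sum(values):
    """Sum a 1D array by repeatedly adding its second half onto its first half."""
    buf = np.array(values, dtype=np.float64)  # work on a copy
    size = buf.size
    if size == 0:
        return 0.0
    while size > 1:
        half = (size + 1) // 2
        # The `size - half` additions in this step are independent of each other,
        # so on a GPU each one could be done by a different thread.
        buf[: size - half] += buf[half:size]
        size = half
    return float(buf[0])

print(tree_sum(np.arange(10)))
```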

Write a kernel which does the same computation as in the previous exercise, but this time returns the sum of all the entries of the output. Again, there are multiple ways of doing it. Do it in four ways:

* Using a lambda function and standard CuPy operations (i.e., `cp.cos(a*b).sum()`).
* Using `cupy.fuse`.
* Applying `cp.sum` to what the JIT-Rawkernel computes.
* (Optional) Using a CuPy [ElementwiseKernel](https://docs.cupy.dev/en/latest/user_guide/kernel.html#basics-of-elementwise-kernels) followed by a CuPy [ReductionKernel](https://docs.cupy.dev/en/latest/user_guide/kernel.html#reduction-kernels).

Time all of them. Which one is faster? Surprisingly, for me the `fuse` and the Elementwise/Reduction kernel options are extremely slow (more than ten times slower than the other two in the timings below). I do not know why, but this suggests that these options are not great in general, even though they feature prominently in CuPy's documentation.

## Solution to question 4

In [5]:
mysum_cp = lambda a,b : cp.cos(a*b).sum()

@cp.fuse
def mysum_fuse(a,b):
    return cp.sum(cp.cos(a*b))

def mysum_rk(a,b):
    c = cp.empty((n, n), dtype=cp.float32)
    myfun_rk(gridDim, blockSize, (a,b,c))
    return cp.sum(c)

myfun_ewk = cp.ElementwiseKernel(
   'float32 x, float32 y',
   'float32 z',
   'z = cos(x*y)',
   'myfun_ewk')

onlysum = cp.ReductionKernel(
    'float32 x',  # input params
    'float32 y',  # output params
    'x',  # map
    'a + b',  # reduce
    'y = a',  # post-reduction map
    '0',  # identity value
    'onlysum'  # kernel name
)

mysum_ewk = lambda a,b : onlysum(myfun_ewk(a,b))

print(benchmark(mysum_cp, (a,b), n_repeat=100, devices=(a.device,)))
print(benchmark(mysum_fuse, (a,b), n_repeat=100, devices=(a.device,)))
print(benchmark(mysum_rk, (a,b), n_repeat=100, devices=(a.device,)))
print(benchmark(mysum_ewk, (a,b), n_repeat=100, devices=(a.device,)))

<lambda>            :    CPU:    71.319 us   +/-  5.542 (min:    67.979 / max:   117.493) us     GPU-<CUDA Device 0>:   510.621 us   +/-  5.207 (min:   500.704 / max:   548.864) us
mysum_fuse          :    CPU:    25.876 us   +/-  2.829 (min:    23.545 / max:    43.252) us     GPU-<CUDA Device 0>: 10829.380 us   +/-  3.613 (min: 10821.632 / max: 10844.160) us
mysum_rk            :    CPU:   110.492 us   +/-  5.721 (min:   104.377 / max:   128.203) us     GPU-<CUDA Device 0>:   405.068 us   +/-  5.886 (min:   395.264 / max:   423.040) us
<lambda>            :    CPU:    40.066 us   +/- 23.753 (min:    34.756 / max:   274.409) us     GPU-<CUDA Device 0>:  6840.793 us   +/- 83.616 (min:  6689.792 / max:  7085.056) us
