# PyTorch handling of GPU devices

Generally, PyTorch is pretty transparent about how it handles devices. When using some high-level wrappers (like [PyTorch Lightning](https://pytorch-lightning.readthedocs.io/)) you do not need to do much work to move a deep learning model to a GPU and train it there.

However, it's beneficial to have some basic understanding of how GPUs work, know some basic terminology and understand how to move things around manually. As a bonus, this notebook also brings some very basic GPGPU understanding.

# Parallel linear algebra

A **G**raphics **P**rocessing **U**nit is a highly parallel type of computing accelerator. It can handle thousands of parallel computations in a **data-parallel** manner. That being said, we need to clarify why it is important to have such a capability.

Deep learning models intrinsically perform linear algebra operations of various sorts.

For example, **feed-forward neural nets** (the one you had in HW3 was exactly that) contain matrix-vector multiplications. **Convolutional neural networks** contain a lot of **convolutions**.

Hence, in virtually all practical cases we need to deal with vectors, matrices, and higher-dimensional objects (generally called *arrays* in NumPy and *tensors* in PyTorch). The main property of operations on them is that they are usually quite easy to parallelize (at least *conceptually*, although the actual implementation may be quite complex).

Consider, for example, matrix-vector product:

$$u_i = A_{ij}v_j.$$

Or, rewriting it in a more explicit form:

$$u_i = \sum_j A_{ij}v_j = A_{i,0}v_0 + A_{i,1}v_1 + \cdots + A_{i,N-1}v_{N-1}.$$

Each component of $\vec u$ can be calculated independently of all the others. You just need to take the **0-th row of $A$** (which is a vector in its own right), calculate its **element-wise product** with $\vec v$, **reduce** the result with $+$, and there's the **0-th component** of $\vec u$. There's no reason why you cannot calculate the 1st (or 2nd, and so on) component at the same time.

Ideally, given an $N\times M$ matrix $A$ and an $M\times 1$ vector $\vec v$, parallel computation should bring a speed-up of $N$, since all $N$ components of $\vec u$ are calculated in parallel. In reality it is not that simple for many technical reasons, but for simple calculations it's almost true. In general, the actual speed-up depends on the operation.
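To make the independence of the rows concrete, here is a minimal pure-Python sketch (no NumPy or GPU needed). The helper names `row_dot` and `matvec_parallel` are made up for this illustration, and the thread pool is there only to stress that every row is a separate task; it is not expected to be faster in CPython.

```python
from concurrent.futures import ThreadPoolExecutor


def row_dot(row, v):
    """One component of u: element-wise product of a row with v, reduced with +."""
    return sum(a_ij * v_j for a_ij, v_j in zip(row, v))


def matvec_parallel(A, v, workers=4):
    """Compute u = A v, handing every row to the pool as an independent task."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves the input order, so u[i] corresponds to row i
        return list(pool.map(lambda row: row_dot(row, v), A))


A = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # N = 3 rows, M = 2 columns
v = [10.0, 1.0]
u = matvec_parallel(A, v)
print(u)  # [12.0, 34.0, 56.0]
```

No task reads anything written by another task, which is exactly the property a GPU exploits.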

# CPU vs. GPU

A CPU performs such operations with very little parallelism (at least compared to a GPU). Yes, CPUs have multiple cores and vector extensions (like SSE or AVX), which make this process somewhat parallel and help utilize the hardware to its full capacity.

But the main thing is that CPUs were designed to do various things - basically, everything imaginable. GPUs, in contrast, were designed to do one thing (compute) and do that thing very well and with a tremendous amount of parallelism (since graphics computations are intrinsically parallel, thank you, gamers!).

# CUDA

It was 2007 when NVIDIA introduced **CUDA** (**C**ompute **U**nified **D**evice **A**rchitecture) - the first usable framework for **g**eneral-**p**urpose computing on **g**raphics **p**rocessing **u**nits (GPGPU). Before that, researchers harnessed the computing power of GPUs through clever tricks with the OpenGL shading language. The procedure was error-prone and extremely inconvenient. CUDA simplified all of that dramatically.

At the same time, CUDA was exactly what was needed for deep learning and led to outstanding progress in the field. Initially, you had to do everything in C (with some CUDA-specific extensions), but very quickly wrappers for Python and other programming languages appeared. Now you can use Python (through general-purpose Numba, or the deep-learning-specific PyTorch and TensorFlow), Julia (through `CUDA.jl`, which superseded the earlier `CUDAnative.jl` and `CuArrays.jl` packages) and other languages.

It's important to understand that CUDA works only with NVIDIA GPUs (the de facto standard for GPGPU in general and deep learning in particular). Neither PyTorch nor TensorFlow handles other GPUs equally well (there are efforts to support other devices, such as AMD GPUs or Intel GPU cores, but this support is very limited). This happened because NVIDIA treated CUDA as a first-class citizen and not as a byproduct of gaming and graphical applications.


# GPUs and CUDA programming model

*The following is a basic introduction to GPGPU with CUDA. If you're only interested in reading about PyTorch on GPUs, you can skip to the next section.*

The CUDA programming model is very simple and straightforward. A GPU chip:

- is composed of **multiprocessors** (MPs),
- has its own **global memory**.

Each multiprocessor, in turn:

- is composed of a number of computing **cores** (cores are different and the exact composition depends on the device),
- has its own **shared memory**, available to all of its computing cores.

These are the main resources to consider when launching a computing **kernel**, i.e. a function that runs on a GPU.

CUDA programming model reflects hardware architecture almost one-to-one:
- kernels are launched on a grid of **blocks**,
- each block contains a grid of **threads**,
- a block will run on one multiprocessor,
- a thread will run on one core,
- all threads have access to global memory,
- all threads *inside a block* have access to multiprocessor's shared memory.

Hence, it's a direct hardware to software correspondence: **multiprocessor** $\rightarrow$ **block**, **core** $\rightarrow$ **thread**.

But what does it mean to launch a *grid* of something? Let's consider an example. Imagine that you want to add two vectors $\vec a$ and $\vec b$ of length $N$ on a GPU (too simple a thing to do on a GPU, but simple enough to explain the idea).

Our basic execution unit is a thread. A thread will take some component, say $i$, from each vector and add them together. Since our problem is intrinsically 1-dimensional, we will launch a 1-D grid of blocks, each containing a 1-D grid of threads. How many blocks should we have? GPUs have some limitations. The first to consider is the **maximum number of threads per block**.

Below I use Julia for simplicity: the syntax is self-evident and the Julia REPL is the fastest way to do this (we will use Numba and PyTorch later on, as this is a Python class, not a Julia one). This is the output on my machine:

```julia
julia> using CUDA
julia> dev = CUDA.CuDevice(0)
CuDevice(0): GeForce GTX 1650

julia> CUDA.attribute(dev, CUDA.CU_DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT)
16

julia> CUDA.attribute(dev, CUDA.CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK)
1024
```

OK, we may try to push as many as 1024 threads per block. How many blocks do we need? `ceil(N/1024)`, of course. Imagine, for example, that $N=5466$: we'll need 6 blocks with 1024 threads each. It seems that we'll launch more threads than needed, and that's true. But CUDA kernels operate in a **data-parallel** fashion: we *do not have any loops*; instead, each thread knows where it is and takes the corresponding element from each array to add together. So, we will just add an `if` to check whether the thread is in bounds.
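The block count and the number of surplus threads for this example can be checked with a few lines of plain Python. This is only the arithmetic from the paragraph above; the variable names are chosen for illustration.

```python
N = 5466                  # number of vector components
threads_per_block = 1024  # device limit reported above

# Integer ceiling division: the smallest number of blocks that covers all N elements
blocks_per_grid = (N + threads_per_block - 1) // threads_per_block

total_threads = blocks_per_grid * threads_per_block
idle_threads = total_threads - N  # threads that fail the bounds check and do nothing

print(blocks_per_grid, total_threads, idle_threads)  # 6 6144 678
```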

How is that possible? CUDA provides a way for each thread to calculate its position in the grid. Four variables are available to each thread:

- `blockIdx` (a C structure, having `x`, `y` and `z` field), which tells the thread to which block it belongs,
- `threadIdx` (having a similar structure), which tells a thread where it is inside a block,
- `gridDim` (you got it, right?), which tells the dimensions of the grid of blocks,
- `blockDim` (...), which tells the dimensions of the grid of threads.

Let's go from bottom to top: `blockDim` is `(1024, 1, 1)`, since the grid is 1-D and we want the maximum number of threads per block. `gridDim` is `(6, 1, 1)`, since we need 6 blocks on a 1-D grid. Then, each thread can calculate its absolute position as follows:

```
thread_idx = blockIdx.x * blockDim.x + threadIdx.x
```

which is *unique* by construction (since `threadIdx.x` is always smaller than `blockDim.x`). So, each thread should just take component `thread_idx` from each vector and add them together.
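The indexing scheme can be emulated on the CPU with two nested loops, one over blocks and one over threads within a block. This is a sketch of the index arithmetic only: on a real GPU the loop bodies run in parallel, and `launch_1d` is a made-up helper, not a CUDA or Numba function.

```python
def launch_1d(kernel, grid_dim, block_dim, *args):
    """Sequentially emulate a 1-D kernel launch: call `kernel` once per thread."""
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, block_dim, thread_idx, *args)


def add_vectors(block_idx, block_dim, thread_idx, a, b, u, visited):
    i = block_idx * block_dim + thread_idx  # absolute position of this thread
    if i < len(a):                          # bounds check for the surplus threads
        u[i] = a[i] + b[i]
        visited.append(i)                   # bookkeeping for the check below


N = 5466
a = [float(i) for i in range(N)]
b = [2.0 * i for i in range(N)]
u = [0.0] * N
visited = []

launch_1d(add_vectors, 6, 1024, a, b, u, visited)

# Every element was handled by exactly one thread
print(sorted(visited) == list(range(N)))  # True
```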

Stay tuned, we're approaching the actual code for this. The last thing we need to understand is the memory hierarchy and how we `return` things (spoiler alert: we do not):

- CUDA kernels **cannot operate on arrays, which are in main memory**, we need to transfer them to GPU memory first,
- CPU-to-GPU copies are **costly** and should be minimized,
- we'll provide **three** arrays to our GPU kernel: two operands and an array to hold the result. Yes, you cannot directly return an array from a GPU kernel (actually, you can work around this with pointers, but I would not recommend doing so).

Back then, in C times, it was like this: you have CPU arrays, you allocate GPU memory, you copy from CPU memory to GPU memory (from pointer to pointer saying how many bytes you want to copy, very similar to `memcpy`), launch your kernel, copy the result back to CPU memory. With Python or Julia it's way simpler, as inner mechanics will do a lot of things for you.

So, let's start. Note that for this to run you need to have an NVIDIA GPU, a driver and the CUDA toolkit installed. This code:

- works as is in Colab without additional installations,
- will not work in Yandex DataSphere (to my knowledge),
- may not work properly on Windows (GPGPU is Linux-dominated).

We'll use Numba, which provides convenient wrappers for CUDA:

In [1]:
import numpy as np
from numba import cuda

We first create the arrays to add together (I'm using a much larger array to see at least some performance gain):

In [2]:
N = 100000

a = np.random.randn(N).astype(np.float32)
b = np.random.randn(N).astype(np.float32)
u = np.zeros(N).astype(np.float32)

u_cpu = a + b

Note that we're using `float32`, as GPUs are somewhat less performant with `float64`. It depends on the GPU, but mine is 32 times slower on `float64` compared to `float32`.

Simple enough so far. Now we need to get them on the GPU (you can skip this stage, as Numba will wrap this for you, but with a high performance overhead):

In [3]:
a_gpu = cuda.to_device(a)
b_gpu = cuda.to_device(b)
u_gpu = cuda.to_device(u)

We need 1024 threads per block:

In [4]:
threads_per_block = 1024
blocks_per_grid = int(np.ceil(N / threads_per_block))

print(f"Will launch {threads_per_block} threads in {blocks_per_grid} blocks")

Will launch 1024 threads in 98 blocks


Now we need to write the kernel itself:

In [5]:
@cuda.jit
def cu_add_vectors(a, b, u):
    # Absolute position of this thread, i.e., which element this thread will calculate
    tidx = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x  # or shortcut: cuda.grid(1)

    # Is this thread in bounds?
    if tidx < a.shape[0]:
        u[tidx] += a[tidx] + b[tidx]

The main thing to understand here is that we **do not write any loops** (depending on your algorithm you may need them, but not for the "outer" loop).

Otherwise, it's pretty simple. On the first run, Numba will JIT-compile the kernel for the given argument types, and it will add host-to-device copies if the operands are CPU arrays. Again, this is not recommended, as copies are costly and it's better to do them once. In this case it doesn't matter much, as we only calculate a very simple thing anyway. Other algorithms, with heavier computations, will benefit from minimizing data transfers.

We will operate on the GPU arrays we created, to measure the **computational** performance of the GPU we have:

In [6]:
cu_add_vectors[blocks_per_grid, threads_per_block](a_gpu, b_gpu, u_gpu)

Now our kernel is compiled and will run immediately next time. Note how the **launch configuration** is specified: `[blocks_per_grid, threads_per_block]`. OK, now we can measure the computational performance of the GPU (our kernel is compiled and our data is already on the GPU, so there's no overhead from compilation or copies). Note that I skip refilling `u_gpu` with `0` before each run (in general you have to do that, since the kernel uses `+=` and accumulates the results from previous runs):

In [7]:
%timeit cu_add_vectors[blocks_per_grid, threads_per_block](a_gpu, b_gpu, u_gpu)

304 µs ± 25.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


And now Numpy:

In [8]:
%timeit -n 1000 -r 3 u = a + b

77.1 µs ± 1.76 µs per loop (mean ± std. dev. of 3 runs, 1000 loops each)


Of course, in this simple case it's useless to use a GPU - the launch overhead is too high and the computations are too small. But now you understand the point and will understand the following code (which multiplies a matrix by a vector):

In [9]:
N = 100000
M = 100

A = np.random.randn(N, M).astype(np.float32)
v = np.random.randn(M).astype(np.float32)
u = np.zeros(N).astype(np.float32)

u_cpu = np.dot(A, v)

In [10]:
threads_per_block = 1024
blocks_per_grid = int(np.ceil(N / threads_per_block))

print(f"Will launch {threads_per_block} threads in {blocks_per_grid} blocks")

Will launch 1024 threads in 98 blocks


In [11]:
@cuda.jit
def cu_mv_product(m, v, u):
    tidx = cuda.grid(1)

    # Is this thread in bounds?
    if tidx < m.shape[0]:
        # Accumulator for the result
        tmp = 0.

        # Scalar product between row `tidx` of `m` and `v`
        for i in range(m.shape[1]):
            tmp += m[tidx, i] * v[i]

        # Putting the result into u
        u[tidx] = tmp

In [12]:
A_gpu = cuda.to_device(A)
v_gpu = cuda.to_device(v)
u_gpu = cuda.to_device(u)

In [13]:
cu_mv_product[blocks_per_grid, threads_per_block](A_gpu, v_gpu, u_gpu)

In [14]:
u_gpu.copy_to_host() # this will copy the data from GPU memory to host (CPU) memory and return a usual Numpy array

array([15.337545 , -4.719008 ,  6.903495 , ...,  4.0871673, 10.019787 ,
       -0.4688534], dtype=float32)

In [15]:
u_cpu

array([15.337545 , -4.7190094,  6.903495 , ...,  4.0871677, 10.019786 ,
       -0.4688539], dtype=float32)

Note that the results are slightly different (Mrs. CUDA and Mr. IEEE 754 standard are [a complicated couple](https://docs.nvidia.com/cuda/floating-point/index.html)):

In [16]:
np.allclose(u_cpu, u_gpu.copy_to_host(), atol=1e-5) 

True
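One common source of such discrepancies is that floating-point addition is not associative, so the order in which the partial products are accumulated (which may differ between a GPU kernel and a CPU library) affects the last digits. A minimal illustration in plain Python, using double precision rather than `float32`:

```python
import math

left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)

print(left == right)                            # False
print(math.isclose(left, right, rel_tol=1e-9))  # True
```

This is why the comparison above uses `np.allclose` with a tolerance instead of exact equality.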

If we measure this one (which involves more computation), the results will be very different:

In [17]:
%timeit -n 100 -r 3 cu_mv_product[blocks_per_grid, threads_per_block](A_gpu, v_gpu, u_gpu)

264 µs ± 18.3 µs per loop (mean ± std. dev. of 3 runs, 100 loops each)


In [18]:
%timeit -n 100 -r 3 np.dot(A, v)

1.25 ms ± 64.9 µs per loop (mean ± std. dev. of 3 runs, 100 loops each)


Now we can see the difference (about `5X` with the timings above: 1.25 ms vs. 264 µs). The GPU was utilized properly, meaning that the launch overhead was small compared to the actual computations.

The last thing to cover here is a confusing situation: the actual number of MPs we have on the GPU is 16 (in my case), number of cores per MP is 32, but we launch 98 blocks with 1024 threads each. How does that work?

GPU will do the following:

- each block will be **assigned to an MP**,
- each MP will get **multiple blocks to run**,
- blocks will be **queued to run**,
- threads will be queued to run **on cores**,
- threads are executed in batches of 32 called **warps**; the threads of a warp run at the same time.

Hence, each MP will get about 6 or 7 blocks (98 blocks over 16 MPs), each block will consist of 32 warps (1024 threads divided by the warp size of 32), and all the threads of a warp will run at the same time, one per core.
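The numbers above can be reproduced with simple arithmetic. The sketch below assumes that blocks are handed out to multiprocessors round-robin, which is a simplification for illustration (the actual hardware scheduler assigns blocks dynamically as resources free up):

```python
num_mps = 16             # multiprocessors on the GPU used in this notebook
warp_size = 32           # threads per warp
blocks_per_grid = 98
threads_per_block = 1024

# Simplified round-robin assignment of blocks to multiprocessors
blocks_on_mp = [0] * num_mps
for block in range(blocks_per_grid):
    blocks_on_mp[block % num_mps] += 1

warps_per_block = threads_per_block // warp_size

print(min(blocks_on_mp), max(blocks_on_mp))  # 6 7
print(warps_per_block)                       # 32
```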

We do not consider *streams*, *shared memory*, *pinned memory* and other advanced topics here (like achieving the best utilization, handling resource constraints and so on).

If you want to get a deeper understanding, try [CUDA Programming Guide](https://docs.nvidia.com/cuda/pdf/CUDA_C_Programming_Guide.pdf): it's written in a clear and simple language. You can start with Chapters 1 and 2 and launch non-trivial code in several hours.

# PyTorch and GPU

If you followed the previous section, you have probably already realized that GPGPU is conceptually straightforward, but involves a huge number of technical details. Luckily enough, deep learning frameworks like PyTorch or TensorFlow abstract away almost all of them.

The main idea in training deep learning models on a GPU is placing the tensors properly. If you recall, a PyTorch tensor always has an associated storage object. Tensors that live on a GPU have the corresponding storage type. That's it: for most applications that's all you need to know. From a user's perspective, GPU tensors are used in the same way as CPU ones. All the technical complications are handled by PyTorch itself.

Let's look at this further. The subpackage that exposes some convenient GPU routines is `torch.cuda`:

In [19]:
import torch
torch.cuda.device_count()

0

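This call returns the number of CUDA devices PyTorch can see. On my machine with the GTX 1650 it is `1`, i.e. a single GPU; the `0` above, like the `AssertionError` outputs in some cells below, apparently comes from a run with a PyTorch build without CUDA support (you can see a different number depending on where you run this notebook). We can explicitly get the GPU device: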

In [22]:
device = torch.device("cuda")
device

device(type='cuda')

PyTorch doesn't expose deep details about GPU devices, but we can get

- GPU **name**

In [21]:
torch.cuda.get_device_name(device)

AssertionError: Torch not compiled with CUDA enabled

- **compute capability** (which determines MP composition and what GPU can and cannot do, see CUDA Programming guide for more details)

In [None]:
torch.cuda.get_device_capability(device)

AssertionError: Torch not compiled with CUDA enabled

- **memory** currently allocated on the device (by tensors in this Python kernel)

In [23]:
torch.cuda.memory_allocated(device)

0

Now, we only need to understand how to place a tensor on a GPU. It's very simple: most PyTorch routines that create tensors **let you specify which device you'd like to use**:

In [25]:
t = torch.tensor([1., 2.], device="cuda")
t

AssertionError: Torch not compiled with CUDA enabled

You can use a **specific device** as well (in case you have multiple GPUs):

In [25]:
t = torch.tensor([1., 2.], device=device)
t

tensor([1., 2.], device='cuda:0')

Now, the storage for this tensor is different:

In [26]:
t.storage()

 1.0
 2.0
[torch.cuda.FloatStorage of size 2]

CPU tensors have storage of type `torch.<dtype>Storage`, while GPU tensors have `torch.cuda.<dtype>Storage`:

In [27]:
torch.tensor([1., 2.]).storage()

 1.0
 2.0
[torch.FloatStorage of size 2]

Now we can repeat the experiment we performed in the previous section without creating any kernel manually (note that we're using `N` and `M` from the previous section):

In [28]:
x = torch.rand(N, M, device=device)
y = torch.rand(M, device=device)

PyTorch will perform proper dispatching based on storage type and data type and run an appropriate kernel for us (the actual kernel is buried somewhere under the hood of PyTorch):

In [29]:
torch.matmul(x, y)

tensor([26.7014, 24.7969, 28.0365,  ..., 29.2888, 27.5010, 27.9509],
       device='cuda:0')

In terms of performance, PyTorch is a clear winner due to highly optimized and specialized kernels it has:

In [30]:
%timeit -n 100 -r 3 torch.matmul(x, y)

10.4 µs ± 2.92 µs per loop (mean ± std. dev. of 3 runs, 100 loops each)


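The CPU version, as a reminder, is about `100X` slower (remember that this number is specific to the GPU; also, since CUDA calls are asynchronous, a precise measurement would call `torch.cuda.synchronize()` inside the timed statement):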

In [31]:
x_cpu = x.cpu().numpy()
y_cpu = y.cpu().numpy()

In [32]:
%timeit  -n 100 -r 3  np.dot(x_cpu, y_cpu)

1.37 ms ± 123 µs per loop (mean ± std. dev. of 3 runs, 100 loops each)


Note also that you first need to copy a tensor from the GPU back to host memory (with `x.cpu()`) before you can use it with CPU-only tools such as NumPy.

Let's summarize this:

- based on tensor location and `dtype`, PyTorch will dispatch a proper call to CPU or GPU implementation of an operation,
- hence, the only thing you need to do to start using a GPU is to move all the tensors involved to the GPU.
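The dispatch idea can be pictured with a toy model in plain Python. Everything below (`ToyTensor`, `ADD_KERNELS`, `add`) is hypothetical and written for this illustration only; it is not how PyTorch's dispatcher is actually implemented, and the "GPU" kernel here simply runs on the CPU.

```python
class ToyTensor:
    """Hypothetical stand-in for a tensor: a list of values plus device and dtype tags."""

    def __init__(self, data, device="cpu", dtype="float32"):
        self.data = list(data)
        self.device = device
        self.dtype = dtype

    def to(self, device):
        # A real framework would copy the memory here; the toy only relabels it
        return ToyTensor(self.data, device, self.dtype)


def add_cpu(a, b):
    return [x + y for x, y in zip(a, b)]


def add_gpu(a, b):
    # Placeholder for a GPU kernel launch; computed on the CPU in this toy
    return [x + y for x, y in zip(a, b)]


# Dispatch table: (device, dtype) -> implementation
ADD_KERNELS = {("cpu", "float32"): add_cpu, ("cuda", "float32"): add_gpu}


def add(a, b):
    if a.device != b.device:
        raise RuntimeError("operands live on different devices")
    kernel = ADD_KERNELS[(a.device, a.dtype)]
    return ToyTensor(kernel(a.data, b.data), a.device, a.dtype)


x = ToyTensor([1.0, 2.0])
y = ToyTensor([3.0, 4.0])
z = add(x.to("cuda"), y.to("cuda"))
print(z.device, z.data)  # cuda [4.0, 6.0]
```

The toy also shows why *all* operands have to be moved: mixing a CPU operand with a GPU one leaves no single implementation to dispatch to.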

Hence, the only thing you need to do to use the model from HW3 on GPU is to move input tensors and weights to GPU. Nothing else.

For multi-GPU training it's a bit more elaborate, but don't worry: PyTorch has a lot of automation, and we never create a neural network the way we did in HW3. You do not need to move each tensor manually: in real code you'll use `torch.nn.Module` (or PyTorch Lightning) to create your deep learning models, which lets you move the entire model **at once**.