# GPU Computing

Most of the [TOP500](https://top500.org/lists/top500/2023/11/) systems use GPUs as accelerators, and GPU-accelerated systems dominate the list.

<img src="./imgs/cpu_gpu_fraction.svg" width=768>

**Source:** [J. Apostolakis et al., *Detector simulation challenges for future accelerator experiments.*, Frontiers in Physics 10 (2022)](https://doi.org/10.3389/fphy.2022.913510)

<!-- <img src="./imgs/Julia-code-cpu-gpu.png" width=768>

* **host**: CPU + system memory (host memory)
* **device**: GPU with its memory (device memory) -->

## Hardware topology

<img src="./imgs/gpu_topology.svg" width=500px>

* **host**: CPU + system memory (host memory)
* **device**: GPU with its memory (device memory)
* **SM**: Streaming Multiprocessor

Communication bottleneck:
* Host-device bandwidth (PCIe 4.0 x16): **31.5 GB/s** per direction

#### NVIDIA A100

##### Streaming Multiprocessor

Source: [NVIDIA whitepaper](https://images.nvidia.com/aem-dam/en-zz/Solutions/data-center/nvidia-ampere-architecture-whitepaper.pdf)

<img src="./imgs/a100_SM.png" width=512px>

**An entire NVIDIA A100 has 108 of these SMs.**

## CPU vs GPU

|                   | CPU                               | GPU                                 |
|:-----------------:|:---------------------------------:|:-----------------------------------:|
| optimized for     | latency and per-core performance  | computational throughput            |
| cores             | complex                           | rather simple                       |

### Peak arithmetic performance

|               | compute units   | maximum clock frequency [GHz] | FP32 peak performance [TFLOPS] |
|:-------------:|:---------------:|:-----------------------------:|:------------------------------:|
| AMD EPYC 7763 |  64 x86 cores   |  3.50                         |  5.0                           |
| NVIDIA A100   | 6912 CUDA cores |  1.41                         | 19.5 (155.9 for Tensor cores)  |


That's a **factor of ~4** for plain FP32 and a **factor of ~31** with tensor cores, in favor of the GPU.

### Peak memory bandwidth

The peak memory bandwidth of GPUs is much higher than for CPUs (we're even considering a dual-CPU node here):

|               | peak memory bandwidth [GB/s] |
|:-------------:|:---------------:|
| 2x AMD EPYC 7763 |  400   |
| NVIDIA A100   | 1560 |

That's a **factor of ~4** in favor of the GPU.

### What matters more?

Taking the ratio of the peak values:

$$
\dfrac{19.5 \ [\textrm{TFLOPS}]}{1.56 \ [\textrm{TB/s}]} \cdot 4 \ \textrm{B} \approx 50
$$

That's 50 floating-point operations for each FP32 number read from memory. With tensor cores, it's even 400.
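
The ratio can be reproduced with a few lines of plain Julia. This is only a sketch of the arithmetic above; the variable names and the helper `flops_per_load` are made up for illustration, and the numbers are the A100 peak values from the tables.

```julia
# Peak values for the NVIDIA A100, taken from the tables above
peak_fp32_tflops   = 19.5             # FP32 peak on CUDA cores [TFLOPS]
peak_tensor_tflops = 155.9            # peak on tensor cores [TFLOPS]
peak_bandwidth_tbs = 1.56             # peak memory bandwidth [TB/s]
bytes_per_fp32     = sizeof(Float32)  # 4 bytes

# Floating-point operations that fit into the time needed to load one FP32 number
# (hypothetical helper, not part of any library)
flops_per_load(tflops) = tflops / peak_bandwidth_tbs * bytes_per_fp32

println("CUDA cores:   ", round(flops_per_load(peak_fp32_tflops); digits=1))
println("Tensor cores: ", round(flops_per_load(peak_tensor_tflops); digits=1))
```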

For most scientific codes this means that they are **memory-bound** (limited by how fast numbers can be fetched from memory) rather than compute-bound (limited by how fast arithmetic can be performed).

→ **Floating-point arithmetic is essentially free!**
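
To make this concrete, here is a rough, roofline-style estimate for SAXPY (`y = a*x + y`) in FP32 on the A100. It is a back-of-the-envelope sketch that assumes every element is moved between memory and the GPU cores exactly once and ignores caches; all variable names are illustrative.

```julia
# Per element, SAXPY performs one multiply and one add ...
flops_per_element = 2
# ... and moves three FP32 numbers: load x[i], load y[i], store y[i]
bytes_per_element = 3 * sizeof(Float32)

intensity = flops_per_element / bytes_per_element  # FLOP per byte

peak_bandwidth = 1560.0    # A100 peak memory bandwidth [GB/s]
peak_compute   = 19_500.0  # A100 FP32 peak [GFLOPS]

# Performance is capped by whichever resource saturates first
attainable = min(peak_compute, intensity * peak_bandwidth)  # [GFLOPS]

println("attainable:       ", round(attainable; digits=1), " GFLOPS")
println("fraction of peak: ", round(100 * attainable / peak_compute; digits=1), " %")
```

Under these assumptions the memory system, not the arithmetic units, limits SAXPY to a small fraction of the FP32 peak.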

(→ exercise **saxpy_gpu** and **daxpy_cpu** etc.)

## CUDA.jl

(We'll focus on NVIDIA GPUs but there is [support](https://juliagpu.org/) for GPUs by other vendors (AMD, Intel, etc.) as well.)

Relevant links:
* [CUDA language extension](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html)
* [CUDA.jl](https://github.com/JuliaGPU/CUDA.jl)

**CUDA.jl provides:**

* High-level abstraction `CuArray`
* Tools for writing CUDA kernels
* Wrappers to proprietary NVIDIA libraries (e.g. cuBLAS, cuFFT, cuSOLVER, cuSPARSE)

CUDA.jl leverages LLVM to compile **native GPU code** (compare to `nvcc`).

### CUDA toolkit (binary dependency)

By default, the CUDA toolkit is installed automatically when using CUDA.jl for the first time.

**Note:** You can readily add and precompile CUDA.jl on a machine without GPUs, say, a login node.

#### Using a system-wide CUDA installation

```julia
CUDA.set_runtime_version!(v"12.2"; local_toolkit=true)
```

In [None]:
using CUDA

In [None]:
CUDA.versioninfo()

In [None]:
CUDA.functional() # if this works, you're good to go 👍

In [None]:
device() # the currently selected GPU

## `CuArray`: High-level array programming

The simplest way to use a GPU is via **vectorized array operations** (e.g. broadcasting). Each of these operations will be backed by one or more GPU kernels, either natively written in Julia or from some application library.

You use the `CuArray` type, which serves a dual purpose:

* a managed container that represents GPU memory
* a way to dispatch to operations that execute on the GPU

In [None]:
x_gpu = CuArray{Float32}(undef, 4)

In [None]:
CUDA.rand(4) # Note: defaults to Float32

In [None]:
CUDA.zeros(4)

We can readily move data to the GPU by converting to `CuArray`.

<img src="./imgs/cpu_gpu_transfer.svg" width=180px>

In [None]:
x_cpu = [1,2,3,4]
x_gpu = CuArray(x_cpu)

(or by using `copyto!` or `copy!` to move it into already allocated memory)
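
A minimal sketch of the `copyto!` variant, assuming a functional GPU. The array names are illustrative; `Array(x_gpu)` is used for the way back to the CPU.

```julia
using CUDA

x_cpu = Float32[1, 2, 3, 4]
x_gpu = CUDA.zeros(Float32, 4)  # preallocated device memory

copyto!(x_gpu, x_cpu)           # host -> device, reuses the existing GPU buffer
x_gpu .*= 2                     # the computation stays on the GPU

y_cpu = Array(x_gpu)            # device -> host, allocates a new CPU array
copyto!(x_cpu, x_gpu)           # device -> host, into existing CPU memory
```

Reusing preallocated buffers avoids repeated GPU allocations when the same transfer happens many times, e.g. inside a time-stepping loop.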

For better performance the data movement between CPU and GPU should be minimized as much as possible.

### Array computations on the GPU

In [None]:
CuArray <: AbstractArray

Therefore, we should be able to do all kinds of operations with it!

#### Example: Matrix multiplication

In [None]:
N = 2048
A_gpu = CUDA.rand(N,N);
B_gpu = CUDA.rand(N,N);

In [None]:
CUDA.@sync A_gpu * B_gpu # we need CUDA.@sync because GPU operations are typically asynchronous
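
The effect of the asynchronous execution can be made visible with a small timing experiment. This is only a sketch, assuming a functional GPU; the measured numbers depend on the hardware, and the variable names are illustrative.

```julia
using CUDA

A = CUDA.rand(2048, 2048)
B = CUDA.rand(2048, 2048)
CUDA.@sync A * B                      # warm-up (compilation, library initialization)

t_launch = @elapsed A * B             # returns as soon as the work is queued
CUDA.synchronize()                    # wait for the queued work to finish
t_total  = @elapsed CUDA.@sync A * B  # also waits for the GPU to finish

println("without CUDA.@sync: ", t_launch, " s")
println("with CUDA.@sync:    ", t_total, " s")
```

Without synchronization, the timing typically reflects only the cost of queueing the operation, not the computation itself.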

In [None]:
using BenchmarkTools

@btime A_cpu * B_cpu setup=(A_cpu = rand(Float32, N,N); B_cpu = rand(Float32, N,N);); # CPU: no GPU synchronization needed
@btime CUDA.@sync(A_gpu * B_gpu) setup=(A_gpu = CUDA.rand(N,N); B_gpu = CUDA.rand(N,N););

#### More examples: Broadcasting, `map`, `reduce`, etc.

In [None]:
CUDA.@sync A_gpu .+ B_gpu; # runs on the GPU!

In [None]:
CUDA.@sync sqrt.(A_gpu.^2 .+ B_gpu.^2); # fully fused broadcast, runs on the GPU!

In [None]:
CUDA.@sync mapreduce(sin, +, A_gpu); # runs on the GPU!

**The power of simple GPU array programming should not be underestimated!**

Entire codes (like machine learning frameworks etc.) can be ported to GPU without ever writing a single CUDA kernel manually.

#### "Counter-example:" Scalar indexing

Scalar access contradicts the inherent parallelism of the GPU.

In [None]:
A_gpu[1]

In [None]:
CUDA.@allowscalar A_gpu[1]

You must express arithmetic operations in terms of arrays and treat the `CuArray` as a whole entity, e.g.

```julia
CUDA.@sync C .= A .* B
```
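
As a slightly longer sketch (assuming a functional GPU; the array names are illustrative), this is how an element-wise loop can be replaced by a single broadcast, and how individual values can still be inspected on the CPU:

```julia
using CUDA

A = CUDA.rand(1024)
B = CUDA.rand(1024)
C = similar(A)

# Avoid: a loop with scalar indexing; every access is a separate
# host-device round trip (and CUDA.jl warns or errors by default)
# for i in eachindex(C)
#     C[i] = A[i] * B[i]
# end

# Prefer: one fused broadcast that runs as a GPU kernel
CUDA.@sync C .= A .* B

# To look at individual values, copy the result to the CPU first
C_cpu = Array(C)
first_value = C_cpu[1]
```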

### Memory management

`CuArray`s are managed by Julia's **garbage collector**. However, the GC is CPU-focused and isn't good at sensing GPU memory pressure.

By default CUDA.jl uses a **memory pool** to speed up future allocations. So it might sometimes appear as if the objects have not been freed.

In [None]:
CUDA.pool_status()

In [None]:
x_gpu = CUDA.rand(10_000_000);

In [None]:
sizeof(x_gpu) |> Base.format_bytes

In [None]:
CUDA.pool_status()

In [None]:
x_gpu = nothing; GC.gc(true)

In [None]:
CUDA.pool_status()

We can use `CUDA.unsafe_free!(x_gpu)` to release the memory more aggressively.

In [None]:
x_gpu = CUDA.rand(10_000_000);

In [None]:
CUDA.unsafe_free!(x_gpu)

But there is a reason the function is named "unsafe": one still holds the handle `x_gpu`, which now points to freed memory.

Reasonable application:

```julia
function myfunction(x::CuArray)
    tmp_memory = similar(x)
    expensive_operation!(x, tmp_memory)
    CUDA.unsafe_free!(tmp_memory)
    return x
end
```

In [None]:
x_gpu = nothing # to be safe :)

## Kernel programming

High-level array programming doesn't cover all kinds of computations and doesn't always give the (absolute) best performance. In these cases, you can manually write a CUDA kernel directly in Julia.

### Our first CUDA kernel
**CUDA kernel**: a function that will be executed by all *GPU threads* in parallel.

Using its thread index, each thread can operate on a different piece of the given data.

(It might be helpful to think of the CUDA kernel as being the body of a parallel loop.)
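
For comparison, a serial CPU loop that increments every element of an array could look as follows. This is just an illustration of the analogy (the function name `add_one_loop!` is made up): the loop body corresponds to what a single GPU thread executes for "its" index `i`.

```julia
# Serial CPU analogue: one loop iteration per element
function add_one_loop!(x)
    for i in eachindex(x)  # on the GPU, many threads take the place of this loop
        x[i] += 1          # this body is what a single thread would execute
    end
    return nothing
end

x = zeros(Float32, 8)
add_one_loop!(x)
```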

In [None]:
function cuda_kernel!(x)
    i = threadIdx().x # the thread index ("loop index")
    x[i] += 1
    return nothing # CUDA kernels must not return a value
end

One can (asynchronously) launch the kernel on the GPU with the `@cuda` macro. (Think of it as a batch version of `@spawn` for the GPU.)

In [None]:
x = CUDA.zeros(1024)

CUDA.@sync @cuda threads=length(x) cuda_kernel!(x)

In [None]:
x

### Caveats of kernel programming
Kernel programming has significant limitations and can become more difficult, especially if you care about performance. Some of the reasons are:

* some computations require **communication between GPU threads** (e.g. reductions)
* you need to respect **hardware limitations** of the GPU
* kernels execute on the GPU where the **Julia runtime isn't available**

In particular due to the last point, kernel code has limitations:
  * no GC
  * no `print` etc. (→ `@cuprint`)
  * code must be fully type inferred (no dynamic dispatch allowed)
  * no `try ... catch ... end`
  * ...

**You can't just write arbitrary Julia code in kernels.** Fortunately though, many things just work and can get you very far.
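
As an example of working within these limits, printing from a kernel can be done with the device-side macro `@cuprintln` instead of `println`. A minimal sketch, assuming a functional GPU; the kernel name `hello_kernel!` is made up for illustration:

```julia
using CUDA

function hello_kernel!()
    # `println` is not available in kernels; `@cuprintln` prints from the device
    @cuprintln("Hello from thread ", threadIdx().x, " in block ", blockIdx().x)
    return nothing
end

# Launch 2 blocks with 2 threads each; the order of the output lines is not guaranteed
CUDA.@sync @cuda threads=2 blocks=2 hello_kernel!()
```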

#### Example: Hardware limitation

In [None]:
x = CUDA.zeros(1025) # one more element than before

CUDA.@sync @cuda threads=length(x) cuda_kernel!(x)

### CUDA programming model

<!-- <img src="./imgs/CUDA_programming_model.png" width=1024> -->
<img src="imgs/cuda_prog_model.svg" width=1024>

Conceptual mapping:

* **Grid** of blocks → entire GPU
* **Blocks** of threads → SMs
* **Threads** → CUDA cores

**Note**: up to three dimensions, $(x, y, z)$, can be used to organize the thread blocks and threads in each block.

In [None]:
CUDA.attribute(device(), CUDA.DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK)

In [None]:
function cuda_kernel_blocks!(x)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x # global thread index
    if i <= length(x) # make sure that we're in bounds (cf. "loop" iteration range)
        @inbounds x[i] += 1
    end
    return nothing
end

In [None]:
x = CUDA.zeros(1025);

**Execution configuration** for a CUDA kernel:
* `threads`: number of threads **in each block**
* `blocks`: number of blocks in the grid
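
How the two parameters interact with the bounds check in the kernel can be emulated on the CPU. The helper `covered_indices` below is hypothetical and only serves as an illustration: it enumerates the global indices computed by all threads and keeps those that pass the `i <= length(x)` check.

```julia
# CPU emulation of the global thread index for a 1D launch
# (`covered_indices` is a made-up helper, not part of CUDA.jl)
function covered_indices(n; threads, blocks)
    active = Int[]
    for block in 1:blocks, thread in 1:threads
        i = (block - 1) * threads + thread  # same formula as in the kernel
        i <= n && push!(active, i)          # surplus threads fail the bounds check
    end
    return active
end

n = 1025
threads = 1024
blocks = cld(n, threads)  # ceiling division: 2 blocks are needed for 1025 elements
idx = covered_indices(n; threads, blocks)
```

With `threads=1024` and `blocks=2`, 2048 threads are launched in total, but only the first 1025 of them touch the array.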

In [None]:
CUDA.@sync @cuda threads=1024 blocks=2 cuda_kernel_blocks!(x)

In [None]:
x
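
The same pattern extends to more than one dimension. The following sketch (assuming a functional GPU; the kernel name `add_one_2d!` and the block size of 16×16 are arbitrary choices for illustration) uses the `x` and `y` components of the thread and block indices to cover a matrix:

```julia
using CUDA

function add_one_2d!(A)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x  # row index
    j = (blockIdx().y - 1) * blockDim().y + threadIdx().y  # column index
    if i <= size(A, 1) && j <= size(A, 2)                  # bounds check in both dimensions
        @inbounds A[i, j] += 1
    end
    return nothing
end

A = CUDA.zeros(100, 50)
threads = (16, 16)                                    # 256 threads per block
blocks  = (cld(size(A, 1), 16), cld(size(A, 2), 16))  # enough blocks to cover A
CUDA.@sync @cuda threads=threads blocks=blocks add_one_2d!(A)
```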

#### How does our CUDA kernel compare to broadcasting?

In [None]:
x = CUDA.rand(1024*1024);

function add_one_broadcasting(x)
    CUDA.@sync x .+= 1 # in-place, like the kernels below
end

# same CUDA kernel, but with different execution configurations
function add_one_kernel_1024_1k(x)
    CUDA.@sync @cuda threads=1024 blocks=1024 cuda_kernel_blocks!(x)
end

function add_one_kernel_256_4k(x)
    CUDA.@sync @cuda threads=256 blocks=4*1024 cuda_kernel_blocks!(x)
end

function add_one_kernel_64_16k(x)
    CUDA.@sync @cuda threads=64 blocks=16*1024 cuda_kernel_blocks!(x)
end

# use arrays with 1024*1024 elements so that every launched thread has an element to work on
@btime add_one_broadcasting(x)   setup=(x = CUDA.zeros(1024*1024););
@btime add_one_kernel_1024_1k(x) setup=(x = CUDA.zeros(1024*1024););
@btime add_one_kernel_256_4k(x)  setup=(x = CUDA.zeros(1024*1024););
@btime add_one_kernel_64_16k(x)  setup=(x = CUDA.zeros(1024*1024););

The performance of a CUDA kernel is affected by the execution configuration (because of the runtime scheduling of thread blocks and threads).

#### How to obtain a good execution configuration for a CUDA kernel? → [Occupancy API](https://developer.nvidia.com/blog/cuda-pro-tip-occupancy-api-simplifies-launch-configuration/)

The occupancy API is an automatic tool that can be used to obtain *reasonably good* execution configurations.

**Occupancy** measures the ratio of the number of active *warps* per SM to the maximum number of possible warps per SM.

*warp*: a group of 32 parallel GPU threads executing the same instructions.

* Low occupancy usually implies low performance (because of underutilized hardware resources).
* High occupancy, however, may not necessarily imply the best performance.
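
The definition can be illustrated with a small calculation in plain Julia. The numbers are assumptions for the sake of the example: 64 is the maximum number of resident warps per SM on an A100, while the number of resident blocks per SM is hypothetical (in practice it depends on the kernel's register and shared memory usage).

```julia
warp_size              = 32
max_warps_per_sm       = 64   # assumption: A100 (2048 resident threads per SM)
threads_per_block      = 256
resident_blocks_per_sm = 4    # hypothetical: limited e.g. by register usage

warps_per_block = cld(threads_per_block, warp_size)
active_warps    = min(resident_blocks_per_sm * warps_per_block, max_warps_per_sm)
occupancy       = active_warps / max_warps_per_sm

println("occupancy: ", 100 * occupancy, " %")
```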

In [None]:
kernel = @cuda launch=false cuda_kernel_blocks!(x) # don't launch the kernel

In [None]:
config = CUDA.launch_configuration(kernel.fun)

Here, the number `blocks` indicates how many blocks we would need to fully occupy the GPU. For a given input `x`, we might need fewer or more blocks.

In [None]:
threads = min(length(x), config.threads)

In [None]:
blocks = cld(length(x), threads)

Launching the kernel with the dynamic launch parameters:

In [None]:
kernel(x; threads=threads, blocks=blocks) # calling `kernel` like a regular function with keyword arguments

In [None]:
@btime CUDA.@sync(kernel(x; threads=$threads, blocks=$blocks)) setup=(x = CUDA.zeros(1024*1024);); # same length as the `x` used to compute `blocks`

## Introspection

Similar to the `@code_*` macros for the CPU, there are `@device_code_*` macros for the GPU. (Note that the GPU counterpart of `@code_native` is `@device_code_ptx`.)

**PTX**: a low-level **p**arallel **t**hread e**x**ecution virtual machine and instruction set architecture used in NVIDIA CUDA programming.

In [None]:
@device_code_warntype @cuda threads=1024 blocks=1024 cuda_kernel_blocks!(x)

In [None]:
@device_code_llvm debuginfo=:none @cuda threads=1024 blocks=1024 cuda_kernel_blocks!(x)

In [None]:
@device_code_ptx @cuda threads=1024 blocks=1024 cuda_kernel_blocks!(x)

## Multi-GPU computing

**(Note: You can't run this part because you only have a single GPU.)**

In a GPU node of a supercomputer there might be more than one GPU:

<!-- <img src="./imgs/Noctua2_GPU_node.png" width=320px> -->
<img src="imgs/Noctua2_GPU_node.svg" width=320px>

### Multi-GPU via Tasks

Each Julia task gets its own local CUDA execution environment.

Hence, it is easy to use multiple GPUs in parallel by launching GPU computations on different GPUs from different Julia tasks.

In [None]:
using Base.Threads

function gpu_computation(A, B, C)
    for i in 1:512
        C = A * B
        A = B * C
        B = C * A
    end
    return sin.(B) # return the result instead of discarding it
end

function multi_gpu()
    n = 2048
    @sync begin
        # Julia task for the 1st GPU
        @spawn begin
            device!(0) # first GPU
            A = CUDA.rand(n, n)
            B = CUDA.rand(n, n)
            C = CUDA.zeros(n, n)
            println("GPU 1: running gpu_computation")
            CUDA.@sync gpu_computation(A,B,C)
            println("GPU 1: done")
        end
        # Julia task for the 2nd GPU
        @spawn begin
            device!(1) # second GPU
            A = CUDA.rand(n, n)
            B = CUDA.rand(n, n)
            C = CUDA.zeros(n, n)
            println("GPU 2: running gpu_computation")
            CUDA.@sync gpu_computation(A,B,C)
            println("GPU 2: done")
        end
    end
    return nothing
end

In [None]:
multi_gpu()
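
The same idea generalizes to any number of GPUs by spawning one task per device. This is a sketch rather than a tuned implementation: the function name `multi_gpu_all` is made up, and a single matrix product stands in for the actual workload.

```julia
using CUDA
using Base.Threads: @spawn

function multi_gpu_all(n = 2048)
    @sync for (i, dev) in enumerate(devices())
        @spawn begin
            device!(dev)           # bind this task to one GPU
            A = CUDA.rand(n, n)    # allocated on the selected GPU
            B = CUDA.rand(n, n)
            println("GPU ", i, ": running")
            CUDA.@sync A * B       # placeholder for the real computation
            println("GPU ", i, ": done")
        end
    end
    return nothing
end

multi_gpu_all()
```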

In [None]:
# call CUDA.pool_status() for all GPUs
for dev in devices()
    device!(dev)
    println()
    CUDA.pool_status()
end
device!(0);

### Outlook: Multi-GPU via MPI

**Strategy:** One GPU per MPI rank (Julia process)

More information + exercise tomorrow (→ **diffusion_2d_mpi_gpu**).
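
A minimal sketch of this strategy, assuming MPI.jl is installed and that each node runs as many ranks as it has GPUs. The simple modulo mapping below is an assumption for illustration; it relies on the ranks of a node being numbered consecutively.

```julia
using MPI
using CUDA

MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)

# Assumption: consecutive rank numbering within a node, so the modulo
# gives every rank on a node a different GPU
device!(rank % length(devices()))

println("rank ", rank, " uses ", device())
```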

## Benchmarking and profiling GPU code

In [None]:
A = CUDA.rand(1024, 1024)
B = CUDA.rand(1024, 1024)

@btime CUDA.@sync A .* B;

Note that "allocations" here means CPU allocations.

### `CUDA.@time`

In [None]:
CUDA.@time A .* B;

### `CUDA.@profile`

In [None]:
CUDA.@profile A .* B

In [None]:
CUDA.@profile trace=true A .* B

### [NVIDIA Nsight Systems](https://developer.nvidia.com/nsight-systems)

Use [**NVTX.jl**](https://github.com/JuliaGPU/NVTX.jl) to annotate (i.e. label and colorize) code blocks.
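
A minimal sketch of such annotations, assuming NVTX.jl's `NVTX.@range` macro and a functional GPU; the labels and arrays are made up for illustration.

```julia
using CUDA
using NVTX

A = CUDA.rand(1024, 1024)
B = CUDA.rand(1024, 1024)

# Each labelled range appears as a named bar on the Nsight Systems timeline
NVTX.@range "elementwise product" begin
    CUDA.@sync A .* B
end

NVTX.@range "matrix product" begin
    CUDA.@sync A * B
end
```

The ranges become visible when the script is run under the profiler, e.g. `nsys profile julia script.jl` (with `script.jl` being a placeholder for your own file).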

<img src="imgs/nsight.png" width=800px>