In [None]:
gethostname()

# GPU Computing

## Overview: GPU topology

<img src="./imgs/gpu_topology.png" width=1300px>

**Source:** [Sivalingam, Karthee. "GPU Acceleration of a Theoretical Particle Physics Application." Master's Thesis, The University of Edinburgh (2010).](https://static.epcc.ed.ac.uk/dissertations/hpc-msc/2009-2010/Karthee%20Sivalingam.pdf)

* **SM** = Streaming Multiprocessor
* **SP** = Streaming Processor

### NVIDIA A100 SXM4

<img src="./imgs/a100_front.png" width=600px>
<br>

**Streaming Multiprocessor:**

<img src="./imgs/a100_SM.png" width=600px>

**Source:** [NVIDIA whitepaper](https://images.nvidia.com/aem-dam/en-zz/Solutions/data-center/nvidia-ampere-architecture-whitepaper.pdf)

| Kind                       | Count            |
|----------------------------|------------------|
| **SMs**                    | 108              |
| **CUDA cores** / FP32 ALUs | 6912 (64 per SM) |
| **Tensor cores**           | 432 (4 per SM)   |

* **ALU** = Arithmetic Logic Unit


## Quick comparison: CPU vs GPU

### AMD EPYC 7763 vs NVIDIA A100

|               | number of cores    | maximum clock frequency [GHz] | FP32 peak performance [TFLOPS] |
|:-------------:|:------------------:|:-----------------------------:|:------------------------------:|
| AMD EPYC 7763 |   64               |  3.50                         |  5.0                           |
| NVIDIA A100   | 6912               |  1.41                         | 19.5 (**155.9** for Tensor cores)  |

The computing power of the leading [top500](https://top500.org/lists/top500/2023/06/) systems lies in GPUs.

### Differences between CPU and GPU

|                   | CPU                               | GPU                                 |
|:-----------------:|:---------------------------------:|:-----------------------------------:|
| designed for      | task parallelism (MIMD/MISD)      | **data parallelism (SIMD)**         |
| optimized for     | latency and per-core performance  | computational throughput            |
| cores             | complex                           | rather simple                       |
| number of threads | O(100)                            | **O(10000) (millions can be scheduled)** |
| thread pinning    | a must for good performance       | not required                        |

### Memory-bound scientific computing

The performance of most scientific codes is **memory-bound** (limited by memory access speed) rather than compute-bound (limited by how fast computations can be done). In a given time interval, GPUs (and CPUs) can perform far more arithmetic operations than they can read numbers from memory.

**Peak performance over peak memory bandwidth** (for A100)

$$
\dfrac{19.5 \ [\textrm{TFlop/s}]}{1.5 \ [\textrm{TB/s}]} \cdot 4 \ \textrm{B} = 52
$$

An A100 (using only CUDA cores) can thus perform 52 floating-point operations for each number (4 bytes, i.e. `Float32`) read from memory.

**Floating point operations are essentially "free"** in this regime!

Crucially, the peak memory bandwidth of GPUs is much higher than for CPUs: **~1.5 TB/s** (A100) vs **~400 GB/s** (2x AMD EPYC 7763).
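The estimate above can be reproduced with a few lines of plain Julia. This is only a back-of-the-envelope sketch: the helper `machine_balance` is a hypothetical name introduced here, and the SAXPY traffic model (two reads and one write per element, no caching effects) is a simplifying assumption.

```julia
# Machine balance: floating-point operations that can be performed per number loaded.
# (`machine_balance` is a hypothetical helper, not part of any library.)
machine_balance(peak_flops, peak_bandwidth, bytes_per_number) =
    peak_flops / peak_bandwidth * bytes_per_number

# A100 numbers from the text: 19.5 TFLOP/s (FP32) and ~1.5 TB/s
balance_a100 = machine_balance(19.5e12, 1.5e12, sizeof(Float32))

# Assumption: SAXPY (y = a*x + y) performs 2 flops per element while moving
# 3 Float32 values (read x, read y, write y), i.e. 12 bytes.
saxpy_flops_per_byte = 2 / (3 * sizeof(Float32))

# Upper bound on SAXPY performance when limited by memory bandwidth
saxpy_bound = 1.5e12 * saxpy_flops_per_byte

println("A100 machine balance: ", balance_a100, " flops per Float32")
println("Bandwidth-limited SAXPY bound: ", saxpy_bound / 1e12, " TFLOP/s")
```

Under these assumptions, a bandwidth-limited SAXPY can reach only about 0.25 TFLOP/s, a small fraction of the 19.5 TFLOP/s peak.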

(→ exercise **saxpy_gpu** and **daxpy_cpu** from yesterday)

### GPU acceleration

<img src="./imgs/Julia-code-cpu-gpu.png" width=900px>

**host**: System CPU(s) + system memory (host memory) etc.

**device**: the GPU with its own memory (device memory)



## Julia + GPU (NVIDIA)

Website: https://juliagpu.org/

We'll focus on **NVIDIA GPUs** but there is [support for other GPUs](https://juliagpu.org/) (AMD, Intel, etc.) as well.

The interface to NVIDIA GPU computing is the [CUDA language extension](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html). In Julia there is [CUDA.jl](https://github.com/JuliaGPU/CUDA.jl).

It leverages LLVM, specifically parts of the Julia compiler as well as [GPUCompiler.jl](https://github.com/JuliaGPU/GPUCompiler.jl), to compile **native GPU code**. (compare to `nvcc`)

It provides:

* **High-level abstraction `CuArray`**
* **Tools for writing custom CUDA kernels**
* **Wrappers to proprietary NVIDIA libraries (e.g. CUBLAS, CUFFT, CUSPARSE)**

### Getting CUDA

By default, **it's automatic**. The CUDA toolkit is installed automatically when **using** CUDA.jl for the first time. The only requirement is a working NVIDIA driver.

**Note:** You can readily add CUDA.jl to a Julia environment on a machine without GPUs, say, a login node. See [Precompiling CUDA.jl without CUDA](https://cuda.juliagpu.org/stable/installation/overview/#Precompiling-CUDA.jl-without-CUDA) for more information.

#### System CUDA

You can opt out of the automatic system by setting a Julia preference, e.g.

```julia
CUDA.set_runtime_version!(v"12.2"; local_toolkit=true)
```

In [None]:
using CUDA

In [None]:
CUDA.versioninfo()

In [None]:
CUDA.functional() # if this works, you're good to go 👍

In [None]:
device() # the currently selected GPU

## Array programming: `CuArray`

The simplest way to use a GPU is via **vectorized array operations** (e.g. broadcasting). Each of these operations will be backed by one or more GPU kernels, either natively written in Julia or from some application library. As long as your data is large enough you should be able to get nice speed-ups in many cases.

You use the `CuArray` type, which serves a dual purpose:

* a managed container for GPU memory
* a way to dispatch to operations that execute on the GPU

A `CuArray` is a **CPU object representing GPU memory**.

In [None]:
x_gpu = CuArray{Float32}(undef, 3)

In [None]:
CUDA.rand(3) # Note: defaults to Float32

In [None]:
CUDA.zeros(3)

We can readily move data to the GPU by converting to `CuArray`.

<img src="./imgs/cpu_gpu_transfer.svg" width=180px>

In [None]:
x_cpu = [1,2,3]
x_gpu = CuArray(x_cpu) 

(or by using `copyto!` to move it into already allocated memory)
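As a small sketch of the `copyto!` route, the following moves data in both directions. The variable names are illustrative only, and a functional GPU is assumed.

```julia
using CUDA

x_cpu = Float32[1, 2, 3]
x_gpu = CUDA.zeros(Float32, 3)  # preallocated device memory

copyto!(x_gpu, x_cpu)           # host → device, reuses the existing buffer
x_gpu .*= 2                     # compute on the GPU

y_cpu = Array(x_gpu)            # device → host: allocates a new Array
copyto!(x_cpu, x_gpu)           # device → host into existing host memory
```

Reusing preallocated buffers in this way avoids repeated allocations when the same transfer happens many times.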

### Array computations on GPU

In [None]:
CuArray <: AbstractArray

Therefore, we should be able to do all kinds of operations with it that we'd also do with regular `Array`s. (**duck typing**)
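To illustrate the duck-typing idea, here is a minimal sketch of a generic function that contains no GPU-specific code. The name `normalize_sum` is made up for this example, and a functional GPU is assumed.

```julia
using CUDA

# Written against AbstractArray, with no GPU-specific code
normalize_sum(x::AbstractArray) = x ./ sum(x)

r_cpu = normalize_sum(Float32[1, 2, 5])           # returns an Array
r_gpu = normalize_sum(CuArray(Float32[1, 2, 5]))  # returns a CuArray; sum and broadcast run on the GPU

Array(r_gpu) ≈ r_cpu                              # compare on the host
```

The same source code is dispatched to CPU or GPU implementations purely based on the array type that is passed in.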

#### Example: Matrix multiplication

In [None]:
N = 1000
A_gpu = CUDA.rand(N,N)
B_gpu = CUDA.rand(N,N)

In [None]:
CUDA.@sync A_gpu * B_gpu # we need CUDA.@sync because GPU operations are typically asynchronous

In [None]:
using BenchmarkTools

@btime A_cpu * B_cpu setup=(A_cpu = rand(Float32, N,N); B_cpu = rand(Float32, N,N);); # CPU: no CUDA.@sync needed
@btime CUDA.@sync(A_gpu * B_gpu) setup=(A_gpu = CUDA.rand(N,N); B_gpu = CUDA.rand(N,N););

(Note: `*` for `CuArray`s uses a cuBLAS kernel under the hood)

#### More examples: Broadcasting, `map`, `reduce`, etc.

In [None]:
CUDA.@sync A_gpu .+ B_gpu # runs on the GPU!

In [None]:
CUDA.@sync sqrt.(A_gpu.^2 .+ B_gpu.^2) # runs on the GPU! (all dots fuse into a single kernel)

In [None]:
CUDA.@sync mapreduce(sin, +, A_gpu) # runs on the GPU!

**The power of simple GPU array programming cannot be overstated!** Entire codes (like deep learning frameworks etc.) can be ported to the GPU without ever writing a single CUDA kernel manually.

(Of course, it isn't always this easy, and performance can often be improved by writing custom kernels. → exercise **heat_diffusion**)

#### "Counter-example": Scalar indexing

In [None]:
A_gpu[1]

In [None]:
CUDA.@allowscalar A_gpu[1]

In [None]:
function gpu_not_actually!(C, A, B)
    CUDA.@sync CUDA.@allowscalar for i in eachindex(A,B)
        C[i] = A[i] * B[i] # multiplication will happen on CPU!
    end
end

function gpu_broadcasting!(C, A, B)
    CUDA.@sync C .= A .* B
end

In [None]:
using BenchmarkTools

N = 10
@btime gpu_not_actually!(C, A, B) setup=(A = CUDA.rand(10,10); B = CUDA.rand(10,10); C = CUDA.rand(10,10););
@btime gpu_broadcasting!(C, A, B) setup=(A = CUDA.rand(10,10); B = CUDA.rand(10,10); C = CUDA.rand(10,10););

##### Side note: CUDA executor for FLoops

```julia
using FLoops, FoldsCUDA

function gpu_floops!(C, A, B)
    CUDA.@sync @floop CUDAEx() for i in eachindex(A,B,C)
        C[i] = A[i] * B[i]
    end
end
```

#### A few words on memory management

`CuArray`s are managed by Julia's **garbage collector**. If they are unreachable, they will get cleaned up automatically during a GC run. However, keep in mind that the (CPU-focused) GC isn't good at sensing GPU memory pressure.

In [None]:
CUDA.memory_status()

In [None]:
x_gpu = CUDA.rand(10_000_000);

In [None]:
sizeof(x_gpu) |> Base.format_bytes

In [None]:
CUDA.memory_status()

In [None]:
x_gpu = nothing; GC.gc(true)

In [None]:
CUDA.memory_status()

What's going on?

By default, CUDA.jl uses a **memory pool** to speed up future allocations. So it might appear as if the objects have not been freed. (You can disable the pool with `JULIA_CUDA_MEMORY_POOL=none`.)

We can use `CUDA.unsafe_free!(x_gpu)` to release an array's memory eagerly, without waiting for the GC, and `CUDA.reclaim()` to ask the pool to give back cached memory.

In [None]:
x_gpu = CUDA.rand(10_000_000);

In [None]:
CUDA.memory_status()

In [None]:
CUDA.unsafe_free!(x_gpu)

In [None]:
CUDA.memory_status()

Of course, one must be careful with `CUDA.unsafe_free!` because one still has the handle `x_gpu`, which now points to freed memory. But it is fine and very useful in a pattern like this:

```julia
function myfunction(x::CuArray)
    tmp_memory = similar(x)
    expensive_operation!(x, tmp_memory)
    CUDA.unsafe_free!(tmp_memory)
    return x
end
```

## Kernel programming: Writing CUDA kernels

A CUDA kernel is a function that will be executed by all GPU *threads* in parallel.

Based on each thread's index, we can make the threads operate on different pieces of the given data (SPMD/SIMD programming model, similar to MPI).

(It might be helpful to think of the CUDA kernel as being the body of a loop (that you never write).)
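To make the loop analogy concrete, here is a plain CPU function (the name `cpu_loop!` is chosen for illustration) in which the loop is written out explicitly. In a kernel, the `for` line disappears and the loop index is replaced by the thread index.

```julia
# CPU analogue: the loop body is what a kernel would contain,
# and the loop index `i` plays the role of the thread index.
function cpu_loop!(x)
    for i in eachindex(x)
        x[i] += 1
    end
    return nothing
end

x = zeros(Float32, 8)
cpu_loop!(x)   # every element is incremented once
```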

In [None]:
function cuda_kernel!(x)
    i = threadIdx().x # the thread index ("loop index")
    x[i] += 1
    return nothing # CUDA kernels should never return anything
end

One can launch the kernel on the GPU with the `@cuda` macro (non-blocking, asynchronous):

In [None]:
x = CUDA.zeros(1024)

CUDA.@sync @cuda threads=length(x) cuda_kernel!(x)

In [None]:
x

As you can imagine, kernel programming can of course become (much) more difficult, especially if you care about performance. A few reasons:

* you need to respect **hardware limitations** of the GPU
* **not all operations can readily be expressed as scalar kernels** (Example: reduction)
* kernels execute on the GPU where the **Julia runtime isn't available**

In particular, due to the last point, kernel code has limitations:
  * no GC
  * no `print` etc. (-> `@cuprint`)
  * code must be fully type inferred (no dynamic dispatch allowed)
  * no `try ... catch ... end`
  * ...

**You can't just write arbitrary Julia code in kernels.** Fortunately though, many things just work and can get you far (see e.g. exercises).
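As an example of working within these limits, the sketch below prints from inside a kernel with `@cuprintln` instead of `println`. The kernel name `print_kernel` is made up for this example; depending on the frontend, the output may appear in the terminal that runs the Julia process rather than in the notebook.

```julia
using CUDA

function print_kernel(x)
    i = threadIdx().x
    # `println` is not available in kernels; `@cuprintln` is the device-side replacement
    @cuprintln("thread ", i, " sees x[i] = ", x[i])
    return nothing
end

x = CuArray(Float32[10, 20, 30, 40])
CUDA.@sync @cuda threads=length(x) print_kernel(x)
```

The order in which the lines are printed is not something to rely on.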

#### Example: Hardware limitation

In [None]:
x = CUDA.zeros(1025) # one more element than before

CUDA.@sync @cuda threads=length(x) cuda_kernel!(x)

In [None]:
CUDA.attribute(device(), CUDA.DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK)

### CUDA programming model

<img src="./imgs/cuda_blocks_threads.png" width=700px>

(Note: in Julia indices start at 1)

**Source:** [Sivalingam, Karthee. "GPU Acceleration of a Theoretical Particle Physics Application." Master's Thesis, The University of Edinburgh (2010).](https://static.epcc.ed.ac.uk/dissertations/hpc-msc/2009-2010/Karthee%20Sivalingam.pdf)

Conceptual mapping:

* **Threads** → CUDA cores
* **Blocks** of threads → SMs
* **Grid** of blocks → entire GPU

In [None]:
function cuda_kernel_blocks!(x)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x # global thread index
    if i <= length(x) # make sure that we're inbounds (cf. "loop" iteration range)
        @inbounds x[i] += 1
    end
    return nothing
end

In [None]:
x = CUDA.zeros(1025)

CUDA.@sync @cuda threads=1024 blocks=2 cuda_kernel_blocks!(x)

In [None]:
x
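The index arithmetic used in `cuda_kernel_blocks!` can be checked on the CPU without a GPU. The following sketch (with the illustrative helper `global_index`) enumerates all launched threads for `threads=1024`, `blocks=2` and counts how many pass the bounds check.

```julia
# 1-based global index, computed the same way as in cuda_kernel_blocks!
global_index(block_idx, block_dim, thread_idx) = (block_idx - 1) * block_dim + thread_idx

n       = 1025
threads = 1024
blocks  = cld(n, threads)   # ceiling division: enough blocks to cover all elements

# Enumerate every (block, thread) pair of the launch and keep the in-bounds indices
indices = [global_index(b, threads, t) for b in 1:blocks for t in 1:threads]
active  = filter(<=(n), indices)

println("blocks = ", blocks)
println("launched threads = ", length(indices), ", active threads = ", length(active))
```

Two blocks launch 2048 threads, of which only 1025 do any work. This is why the bounds check in the kernel is needed.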

#### How does our custom CUDA kernel compare to broadcasting?

In [None]:
function add_one_kernel(x)
    CUDA.@sync @cuda threads=1024 blocks=1 cuda_kernel_blocks!(x)
end

function add_one_broadcasting(x)
    CUDA.@sync x .+= 1 # in-place, like the kernel (x .+ 1 would allocate a new array)
end

@btime add_one_kernel(x) setup=(x = CUDA.zeros(1024););
@btime add_one_broadcasting(x) setup=(x = CUDA.zeros(1024););

### Simplifying kernel launches: [Occupancy API](https://developer.nvidia.com/blog/cuda-pro-tip-occupancy-api-simplifies-launch-configuration/)

Hardcoding limits (1024 above) is rarely a good idea. A few reasons:

* In reality, the actual maximal number of threads can depend on kernel details, like how many resources the kernel is using.
* You might want to support different GPUs with different hardware limitations.

**The occupancy API is an automatic tool that can be used to obtain good launch parameters.**

In [None]:
kernel = @cuda launch=false cuda_kernel_blocks!(x) # don't launch the kernel

In [None]:
config = CUDA.launch_configuration(kernel.fun)

Here, the number `blocks` indicates how many blocks we would need to fully occupy the GPU. For a given input `x`, we might need fewer or more blocks.

In [None]:
threads = min(length(x), config.threads)

In [None]:
blocks = cld(length(x), threads)

Launching the kernel with the dynamic launch parameters:

In [None]:
kernel(x; threads, blocks) # calling `kernel` like a regular function
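The individual steps can be bundled into a small helper. This is a sketch only: `launch_auto!` and `increment_kernel!` are hypothetical names introduced here, and the helper assumes a one-dimensional kernel that takes the array as its only argument.

```julia
using CUDA

function increment_kernel!(x)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(x)
        @inbounds x[i] += 1
    end
    return nothing
end

# Hypothetical helper: compile, query the occupancy API, and launch
function launch_auto!(f, x)
    kernel  = @cuda launch=false f(x)                 # compile only
    config  = CUDA.launch_configuration(kernel.fun)   # suggested launch parameters
    threads = min(length(x), config.threads)
    blocks  = cld(length(x), threads)                 # cover all elements
    CUDA.@sync kernel(x; threads, blocks)
    return x
end

x = CUDA.zeros(100_000)
launch_auto!(increment_kernel!, x)
all(Array(x) .== 1)
```

Because the thread count comes from the occupancy API, the same code should adapt to GPUs with different limits.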

### Introspection

Similar to `@code_*` for the CPU, there are `@device_code_*` macros. Note, however, that the GPU counterpart of `@code_native` is `@device_code_ptx`.

In [None]:
@device_code_warntype @cuda threads=1024 blocks=1 cuda_kernel_blocks!(x)

In [None]:
@device_code_llvm debuginfo=:none @cuda threads=1024 blocks=1 cuda_kernel_blocks!(x)

In [None]:
@device_code_ptx @cuda threads=1024 blocks=1 cuda_kernel_blocks!(x)

## Tasks + GPU

Each Julia task gets its own local CUDA execution environment. That makes it easy to use, e.g., one task per device, or to use tasks for independent operations that can be overlapped.

**Note:** In the following we will use `Threads.@spawn`. Since multithreading support is a rather recent addition to CUDA.jl, you may want to fall back to `@async` if you run into problems.

In [None]:
using Base.Threads

### Overlapping CPU and GPU operations

In [None]:
function overlap_cpu_and_gpu()
    @sync begin
        @spawn begin
            println("GPU task: begin")
            A = CUDA.rand(1024, 1024)
            B = CUDA.rand(1024, 1024)
            A * B
            println("GPU task: wait")
            CUDA.synchronize()
            println("GPU task: end")
        end
        @spawn begin
            println("CPU task: begin")
            for x in 1:10
                A = rand(2048, 2048)
                B = rand(2048, 2048)
                A .* B
            end
            println("CPU task: end")
        end
    end
    return nothing
end

In [None]:
overlap_cpu_and_gpu()

### Overlapping GPU operations

With modern GPUs becoming more and more powerful, it's getting harder to have every kernel use all of the device's hardware resources.

Potential solution: issue multiple independent (streams of) GPU computations concurrently, so that the GPU can overlap them whenever possible.

In [None]:
using LinearAlgebra

function gpu_computation(A, B, C)
    mul!(C, A, B)
    C .= sin.(C) # store the result in place (a bare `sin.(C)` would discard it)
    return C
end

function overlap_gpu()
    A = CUDA.rand(1024, 1024)
    B = CUDA.rand(1024, 1024)
    C = CUDA.zeros(1024, 1024)

    D = CUDA.rand(1024, 1024)
    E = CUDA.rand(1024, 1024)
    F = CUDA.zeros(1024, 1024)

    @sync begin
        println("Spawning gpu_computation on GPU")
        @spawn CUDA.@sync gpu_computation(A, B, C)
        println("Spawning gpu_computation on GPU")
        @spawn CUDA.@sync gpu_computation(D, E, F)
        println("Waiting...")
    end
    println("Everything done.")
    return nothing
end

In [None]:
overlap_gpu()

### Multi-GPU (same machine)

In [None]:
function multi_gpu()
    @sync begin
        @spawn begin
            device!(0) # first GPU
            A = CUDA.rand(1024, 1024)
            B = CUDA.rand(1024, 1024)
            C = CUDA.zeros(1024, 1024)

            println("GPU 1: running gpu_computation")
            CUDA.@sync gpu_computation(A,B,C)
            println("GPU 1: done")
        end

        @spawn begin
            device!(1) # second GPU
            A = CUDA.rand(1024, 1024)
            B = CUDA.rand(1024, 1024)
            C = CUDA.zeros(1024, 1024)

            println("GPU 2: running gpu_computation")
            CUDA.@sync gpu_computation(A,B,C)
            println("GPU 2: done")
        end
    end
    return nothing
end

In [None]:
multi_gpu()

In [None]:
# call CUDA.memory_status() for all GPUs
for dev in devices()
    device!(dev)
    println()
    CUDA.memory_status()
end
device!(0);

## Benchmarking + Profiling (comments)

In [None]:
using CUDA

In [None]:
device!(0)

In [None]:
A = CUDA.rand(1024, 1024)
B = CUDA.rand(1024, 1024)

@btime CUDA.@sync A .* B;

Note that "allocations" here means CPU allocations. For GPU allocations you can e.g. use `CUDA.@time`.

In [None]:
CUDA.@time A .* B;
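If only the elapsed GPU time is of interest, `CUDA.@elapsed` is another option. The sketch below assumes a functional GPU and runs the operation once beforehand so that compilation is not included in the measurement.

```julia
using CUDA

A = CUDA.rand(1024, 1024)
B = CUDA.rand(1024, 1024)

A .* B;                       # warm-up so that compilation is not timed
t = CUDA.@elapsed A .* B      # elapsed GPU time in seconds

println("elapsed: ", t * 1e6, " μs")
```

A single measurement like this is noisy. For statistics over many runs, `@btime CUDA.@sync ...` as used above is the better tool.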

### Integrated profiler: `CUDA.@profile`

In [None]:
CUDA.@profile A .* B;

In [None]:
CUDA.@profile trace=true A .* B;

### External profiler: [NVIDIA Nsight Systems](https://developer.nvidia.com/nsight-systems)

See https://cuda.juliagpu.org/stable/development/profiling/#External-profilers

**Command**: `CUDA.@profile external=true`

Use [**NVTX.jl**](https://github.com/JuliaGPU/NVTX.jl) to annotate (i.e. label and colorize) code blocks.
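A minimal annotation sketch could look as follows. It assumes that NVTX.jl is installed and uses its `NVTX.@range` macro with made-up labels. The ranges only become visible when the code is run under Nsight Systems; otherwise the code runs as usual.

```julia
using CUDA, NVTX

A = CUDA.rand(1024, 1024)
B = CUDA.rand(1024, 1024)

# Each labeled range is intended to show up as a named block in the profiler timeline
NVTX.@range "elementwise product" begin
    CUDA.@sync A .* B
end

NVTX.@range "matrix product" begin
    CUDA.@sync A * B
end
```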

<img src="./imgs/nsight_systems.png" width=800px>

**Note: also great for MPI profiling**

<img src="./imgs/report1.png" width=800px>

(see `notebooks/backup/mpi_profiling_nsys`)

## Case study: Three ways to SAXPY on the GPU

**SAXPY** = **S**ingle precision **A** times **X** **P**lus **Y**
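As a reference for what is being computed, here is a plain CPU version (the function name `saxpy_reference!` is chosen for illustration). It is only meant to pin down the operation, not to be one of the GPU variants.

```julia
# Reference SAXPY on the CPU: y ← a*x + y, all in Float32
function saxpy_reference!(y::AbstractVector{Float32}, a::Float32, x::AbstractVector{Float32})
    y .= a .* x .+ y   # fused, in-place broadcast
    return y
end

x = Float32[1, 2, 3]
y = Float32[10, 20, 30]
saxpy_reference!(y, 2f0, x)   # y is now Float32[12, 24, 36]
```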

→ exercise **saxpy_gpu**

<img src="./imgs/a100_saxpy_results.png" width=1000px>