# GPU Computing with Julia

## Topology of a GPU

<img src="../imgs/gpu_topology.png" width=1300px>

**Source:** [Sivalingam, Karthee. "GPU Acceleration of a Theoretical Particle Physics Application." Master's Thesis, The University of Edinburgh (2010).](https://static.epcc.ed.ac.uk/dissertations/hpc-msc/2009-2010/Karthee%20Sivalingam.pdf)

* **SM** = Streaming Multiprocessor
* **SP** = Streaming Processor

### NVIDIA A100 SXM4

<img src="../imgs/a100_front.png" width=800px>

<img src="../imgs/a100_SM.png" width=400px>

**Source:** [NVIDIA whitepaper](https://images.nvidia.com/aem-dam/en-zz/Solutions/data-center/nvidia-ampere-architecture-whitepaper.pdf)

| Kind                       | Count            |
|----------------------------|------------------|
| **SMs**                    | 108              |
| **CUDA cores** / FP32 ALUs | 6912 (64 per SM) |
| **Tensor cores**           | 432 (4 per SM)   |

* **ALU** = Arithmetic Logic Unit


In [10]:
# using GPUInspector
# gpuinfo()

## JuliaGPU

Website: https://juliagpu.org/

GitHub Org: https://github.com/JuliaGPU




(We'll focus on NVIDIA GPUs, but there is [support for other GPUs](https://juliagpu.org/) as well.)

The interface to NVIDIA GPU computing in Julia is [CUDA.jl](https://github.com/JuliaGPU/CUDA.jl).



It provides:

* **High-level abstraction `CuArray`**
* **Tools for writing custom CUDA kernels**
* **Wrappers to proprietary NVIDIA libraries (e.g. CUBLAS, CUFFT, CUSPARSE)**

In [1]:
using CUDA

In [3]:
CUDA.versioninfo() # automatically downloads the CUDA toolkit if necessary

CUDA toolkit 11.7, artifact installation
NVIDIA driver 510.47.3, for CUDA 11.6
CUDA driver 11.7

Libraries: 
- CUBLAS: 11.10.1
- CURAND: 10.2.10
- CUFFT: 10.7.2
- CUSOLVER: 11.3.5
- CUSPARSE: 11.7.3
- CUPTI: 17.0.0
- NVML: 11.0.0+510.47.3
- CUDNN: 8.30.2 (for CUDA 11.5.0)
- CUTENSOR: 1.4.0 (for CUDA 11.5.0)

Toolchain:
- Julia: 1.7.3
- LLVM: 12.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0
- Device capability support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80

8 devices:
  0: NVIDIA A100-SXM4-40GB (sm_80, 39.406 GiB / 40.000 GiB available)
  1: NVIDIA A100-SXM4-40GB (sm_80, 39.406 GiB / 40.000 GiB available)
  2: NVIDIA A100-SXM4-40GB (sm_80, 39.406 GiB / 40.000 GiB available)
  3: NVIDIA A100-SXM4-40GB (sm_80, 39.406 GiB / 40.000 GiB available)
  4: NVIDIA A100-SXM4-40GB (sm_80, 39.406 GiB / 40.000 GiB available)
  5: NVIDIA A100-SXM4-40GB (sm_80, 39.406 GiB / 40.000 GiB available)
  6: NVIDIA A100-SXM4-40GB 

In [4]:
CUDA.functional()

true

## High-level abstraction: `CuArray`

### GPU memory

A `CuArray` is a CPU handle to GPU memory.

In [33]:
x_gpu = CuArray{Float32}(undef, 3)

3-element CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}:
  7.925843f-12
  3.7553988
 -1.0145275f-18

In [28]:
CUDA.rand(3) # Note: defaults to Float32

3-element CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}:
 0.20692769
 0.71898437
 0.28365776

In [29]:
CUDA.zeros(3) # Note: defaults to Float32

3-element CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}:
 0.0
 0.0
 0.0

We can readily move data to the GPU by converting to `CuArray`.

<img src="../imgs/cpu_gpu_transfer.svg" width=180px>

In [37]:
x_cpu = [1,2,3]
x_gpu = CuArray(x_cpu) 

3-element CuArray{Int64, 1, CUDA.Mem.DeviceBuffer}:
 1
 2
 3

(or by using `copyto!` to move it into already allocated memory)
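As a minimal sketch of the `copyto!` route, assuming a working CUDA.jl installation: the GPU buffer is allocated once and then filled from the CPU, and `Array(...)` copies the data back. The variable names `x_buf` and `x_back` are only illustrative.

```julia
using CUDA

x_cpu = Float32[1, 2, 3]
x_buf = CUDA.zeros(Float32, 3)  # preallocated GPU buffer
copyto!(x_buf, x_cpu)           # host -> device copy into existing memory
x_back = Array(x_buf)           # device -> host copy into a new CPU array
@assert x_back == x_cpu
```

Reusing a preallocated buffer avoids a fresh GPU allocation for every transfer, which can matter in loops.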

#### Memory management

`CuArray`s are managed by Julia's **garbage collector**. In principle, once they become unreachable, they are cleaned up automatically.

However, by default CUDA.jl uses a **memory pool** to speed up allocations, so freed memory may stay reserved by the pool and it can look as if the objects have not been freed. (You can disable the pool with `JULIA_CUDA_MEMORY_POOL=none`.)

In [101]:
CUDA.memory_status()

Effective GPU memory usage: 2.24% (901.938 MiB/39.409 GiB)
Memory pool usage: 30.518 MiB (64.000 MiB reserved)

In [59]:
x_gpu = CUDA.rand(10_000_000);

In [60]:
Base.format_bytes(sizeof(x_gpu))

"38.147 MiB"

In [41]:
x_gpu = nothing; GC.gc(true)

In [68]:
CUDA.memory_status()

Effective GPU memory usage: 2.12% (853.938 MiB/39.409 GiB)
Memory pool usage: 19.074 MiB (64.000 MiB reserved)

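To free memory more eagerly, `CUDA.unsafe_free!(x_gpu)` releases the memory of a specific `CuArray` right away (the array must not be used afterwards), and `CUDA.reclaim()` asks the memory pool to hand cached memory back.

In [70]:
CUDA.unsafe_free!(x_gpu) # `x_gpu` must still refer to a `CuArray` here, not `nothing`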

In [71]:
CUDA.memory_status()

Effective GPU memory usage: 2.12% (853.938 MiB/39.409 GiB)
Memory pool usage: 19.074 MiB (64.000 MiB reserved)
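The freeing steps can be combined into one sequence. This is a sketch under the assumption that a GPU is available; how much memory is actually returned depends on the pool and driver, so no output is shown.

```julia
using CUDA

x_gpu = CUDA.rand(10_000_000)

CUDA.unsafe_free!(x_gpu)  # release this array's memory right away
x_gpu = nothing           # drop the now-invalid reference
GC.gc(true)               # collect any other unreachable CuArrays
CUDA.reclaim()            # ask the pool to hand cached memory back
CUDA.memory_status()      # inspect the effect
```

In ordinary code, relying on the garbage collector is usually enough. Explicit freeing is mainly useful when large temporaries are created in a tight loop.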

### GPU computation

In [86]:
CuArray <: AbstractArray

true

Therefore, we should be able to do all kinds of operations with it that we would also do with regular `Array`s. (**duck typing**)

#### Example: Matrix multiplication

In [2]:
N = 1000
A_gpu = CUDA.rand(N,N)
B_gpu = CUDA.rand(N,N)

1000×1000 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
 0.056945   0.839401    0.813055   …  0.163615   0.114515   0.119574
 0.0182427  0.00105949  0.594533      0.234975   0.69579    0.788363
 0.473413   0.30368     0.273531      0.885249   0.402535   0.915592
 0.903833   0.688429    0.982785      0.944259   0.897768   0.711426
 0.835315   0.305328    0.764077      0.751928   0.0361518  0.170335
 0.617305   0.710764    0.150943   …  0.268591   0.665839   0.623263
 0.74807    0.0914348   0.0219869     0.0830913  0.159481   0.623509
 0.668722   0.0886      0.119556      0.770323   0.742287   0.2565
 0.945929   0.0817244   0.924021      0.220601   0.913536   0.139473
 0.181236   0.378013    0.247321      0.903421   0.647854   0.421166
 ⋮                                 ⋱                        
 0.339104   0.553526    0.300516      0.381663   0.243767   0.940985
 0.11222    0.545936    0.0460838     0.23775    0.505224   0.0717938
 0.295375   0.535546    0.0783167     0.35869    0.371998 

In [3]:
A_gpu * B_gpu

1000×1000 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
 245.87   246.475  250.683  256.683  …  252.145  252.805  250.717  258.358
 236.848  241.92   239.251  246.927     246.033  249.34   242.422  252.599
 240.154  237.077  235.73   250.582     243.024  246.821  246.604  253.176
 253.03   255.178  247.617  258.726     252.536  255.284  256.25   263.432
 242.494  235.547  238.986  244.049     249.6    244.453  244.351  251.282
 240.303  234.647  242.09   249.004  …  244.226  242.981  242.436  249.887
 241.933  243.031  245.059  255.049     251.519  247.378  249.662  257.829
 242.441  242.162  240.999  245.55      244.424  246.555  244.995  257.67
 248.442  247.131  249.131  250.933     254.358  248.887  253.567  260.742
 243.909  243.63   251.266  256.549     254.509  251.602  252.063  257.38
   ⋮                                 ⋱                             
 249.256  243.497  250.247  250.393     251.865  252.332  255.247  258.112
 252.923  246.637  244.23   254.123     251.577  250.08

In [110]:
using BenchmarkTools

@btime A_cpu * B_cpu setup=(A_cpu = rand(Float32, N,N); B_cpu = rand(Float32, N,N););
@btime A_gpu * B_gpu setup=(A_gpu = CUDA.rand(N,N); B_gpu = CUDA.rand(N,N););

  5.468 ms (2 allocations: 3.81 MiB)
  17.814 μs (29 allocations: 592 bytes)


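Note how the timescales change: **milliseconds -> microseconds**! (Caveat: GPU operations run asynchronously and the GPU benchmark above does not synchronize, so its timing largely reflects the launch rather than the full multiplication. Wrapping the call in `CUDA.@sync` gives a more faithful measurement.)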

(`*` for `CuArray`s uses a cuBLAS kernel under the hood)
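A quick plausibility check on the GPU timing: a dense `N×N` product needs roughly `2N^3` floating-point operations, so 17.814 μs would imply far more than the 19.5 TFLOP/s FP32 peak that NVIDIA quotes for the A100. The sketch below, assuming a GPU is available, times the multiplication with `CUDA.@sync` and computes that implied throughput. The name `implied_tflops` is illustrative.

```julia
using CUDA, BenchmarkTools

N = 1000

# Block until the GPU has finished, so the timing covers the whole multiplication.
@btime CUDA.@sync(A_gpu * B_gpu) setup=(A_gpu = CUDA.rand(N, N); B_gpu = CUDA.rand(N, N););

# Throughput implied by the unsynchronized timing of 17.814 μs quoted above.
implied_tflops = 2 * N^3 / 17.814e-6 / 1e12
println("implied throughput: ", round(implied_tflops; digits=1), " TFLOP/s")
```

The synchronized timing is expected to be noticeably larger than the unsynchronized one. The exact value depends on the hardware and library versions.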

#### Example: Broadcasting, `map`, `reduce`, etc.

In [88]:
A_gpu .+ B_gpu # runs on the GPU!

1000×1000 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
 1.23208   0.944068  1.38154   1.08824   …  0.87209   0.672663  1.58735
 0.984475  1.5205    0.716376  1.15109      1.27782   1.61814   1.07899
 0.942673  0.548932  1.47714   1.54724      0.902197  1.20323   0.834586
 0.497924  0.712583  1.22956   0.885402     1.36429   1.54674   0.79038
 0.871517  1.31615   1.4691    1.55913      1.39146   0.750631  1.02665
 0.394154  1.30235   0.922221  1.75687   …  1.2703    0.493146  0.447018
 1.22646   1.1364    0.387659  0.527573     0.77348   0.510331  1.03629
 1.63194   1.04339   1.32871   0.530629     1.15481   1.22      0.865854
 1.07494   0.596293  0.566374  0.423238     0.791183  1.80143   0.859054
 0.640312  1.16007   1.23231   1.56074      1.21121   1.71157   0.88117
 ⋮                                       ⋱                      
 0.616944  0.325943  0.985325  0.959406     1.09151   1.54474   1.78911
 1.1108    1.27737   1.51201   0.923711     1.06877   0.79366   0.88223
 0.732747  1.

In [98]:
sqrt.(A_gpu.^2 + B_gpu.^2) # runs on the GPU!

1000×1000 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
 0.882729  0.905507  1.04831   0.919461   …  0.624264  0.53254   1.13817
 0.975172  1.10301   0.70086   0.819059      0.904389  1.1475    0.772294
 0.721972  0.445355  1.066     1.10104       0.776077  0.953321  0.762243
 0.467867  0.678417  0.880172  0.869848      1.0625    1.09393   0.620383
 0.753092  1.04051   1.03993   1.10247       1.06317   0.64698   0.804795
 0.379931  0.988231  0.696314  1.25082    …  0.923021  0.482442  0.322914
 1.02472   0.925946  0.287411  0.382556      0.642757  0.424038  0.981718
 1.16397   0.811652  1.0435    0.455143      0.889739  0.991057  0.703174
 0.764239  0.519799  0.566025  0.381462      0.56915   1.27814   0.622201
 0.499043  0.894359  0.990353  1.11118       0.878017  1.2255    0.816925
 ⋮                                        ⋱                      
 0.500579  0.289495  0.945863  0.695202      0.993257  1.09441   1.26558
 0.78599   0.942759  1.08715   0.653199      0.755846  0.565297  0.7

In [39]:
mapreduce(sin, +, A_gpu) # runs on the GPU!

459653.28f0
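One detail in `sqrt.(A_gpu.^2 + B_gpu.^2)`: the `+` has no dot, so it is not part of the fused broadcast and intermediate arrays are created first. A hedged sketch of a fully fused, in-place variant using the `@.` macro follows. `C_gpu` and `C_unfused` are illustrative names.

```julia
using CUDA

A_gpu = CUDA.rand(1000, 1000)
B_gpu = CUDA.rand(1000, 1000)
C_gpu = similar(A_gpu)  # preallocated output

# `+` without a dot: temporaries for A.^2, B.^2 and their sum are created first.
C_unfused = sqrt.(A_gpu.^2 + B_gpu.^2)

# `@.` dots every call and the assignment: a single fused broadcast
# that writes directly into C_gpu.
@. C_gpu = sqrt(A_gpu^2 + B_gpu^2)

@assert C_gpu ≈ C_unfused
```

Fusing avoids extra GPU allocations and extra passes over memory, which tends to matter for memory-bound elementwise operations.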

#### "Counter-example:" Scalar indexing

In [107]:
A_gpu[1]

ErrorException: Scalar indexing is disallowed.
Invocation of getindex resulted in scalar indexing of a GPU array.
This is typically caused by calling an iterating implementation of a method.
Such implementations *do not* execute on the GPU, but very slowly on the CPU,
and therefore are only permitted from the REPL for prototyping purposes.
If you did intend to index this array, annotate the caller with @allowscalar.

In [108]:
CUDA.@allowscalar A_gpu[1]

0.5155314f0

In [16]:
function gpu_not_actually!(C, A, B)
    CUDA.@allowscalar for i in eachindex(A,B)
        C[i] = A[i] * B[i]
    end
end

function gpu_broadcasting!(C, A, B)
    C .= A .* B
end

gpu_broadcasting! (generic function with 1 method)

In [14]:
using BenchmarkTools

N = 10
@btime gpu_not_actually!(C, A, B) setup=(A = CUDA.rand(N,N); B = CUDA.rand(N,N); C = CUDA.rand(N,N););
@btime gpu_broadcasting!(C, A, B) setup=(A = CUDA.rand(N,N); B = CUDA.rand(N,N); C = CUDA.rand(N,N););

  4.834 ms (901 allocations: 140.64 KiB)
  4.181 μs (11 allocations: 576 bytes)
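Because accidental scalar indexing is this costly, it can help to forbid it globally and opt in only where it is intended. A minimal sketch, assuming a GPU is available (`first_elem` and `A_cpu` are illustrative names):

```julia
using CUDA

CUDA.allowscalar(false)  # scalar indexing now throws, also in interactive sessions

A = CUDA.rand(10, 10)
first_elem = CUDA.@allowscalar A[1]  # explicit, local opt-in for a single access
A_cpu = Array(A)                     # or copy once and index on the CPU
@assert A_cpu[1] == first_elem
```

With this setting, a slow CPU fallback shows up as an error instead of a silent slowdown.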


##### FLoops: CUDA executor

In [13]:
using FLoops, FoldsCUDA

function gpu_floops!(C, A, B)
    @floop CUDAEx() for i in eachindex(A,B,C)
        C[i] = A[i] * B[i]
    end
end

gpu_floops! (generic function with 1 method)

In [14]:
@btime gpu_floops!(C, A, B) setup=(A = CUDA.rand(10,10); B = CUDA.rand(10,10); C = CUDA.rand(10,10););

  40.577 μs (237 allocations: 19.67 KiB)


## Kernel programming: Writing CUDA kernels

A CUDA kernel is a function that is executed by every GPU *thread* separately. Using the thread's index, we can make each thread operate on a different piece of the data. This is the Single Program Multiple Data (SPMD) programming model, similar to MPI.

In [36]:
x = CUDA.zeros(1024)

function cuda_kernel!(x)
    i = threadIdx().x
    x[i] += 1
    return nothing # CUDA kernels must not return a value
end

cuda_kernel! (generic function with 2 methods)

In [37]:
@cuda threads=length(x) cuda_kernel!(x) # asynchronous launch: returns before the kernel has finished

CUDA.HostKernel{typeof(cuda_kernel!), Tuple{CuDeviceVector{Float32, 1}}}(cuda_kernel!, CuFunction(Ptr{Nothing} @0x000000000b751240, CuModule(Ptr{Nothing} @0x000000000b7e6ad0, CuContext(0x0000000006f118a0, instance 8462ee47003f6eed))), CUDA.KernelState(Ptr{Nothing} @0x00007f2642400000))

In [38]:
CUDA.@sync @cuda threads=length(x) cuda_kernel!(x) # second launch, this time waiting for completion

CUDA.HostKernel{typeof(cuda_kernel!), Tuple{CuDeviceVector{Float32, 1}}}(cuda_kernel!, CuFunction(Ptr{Nothing} @0x000000000b751240, CuModule(Ptr{Nothing} @0x000000000b7e6ad0, CuContext(0x0000000006f118a0, instance 8462ee47003f6eed))), CUDA.KernelState(Ptr{Nothing} @0x00007f2642400000))

In [39]:
x # incremented once per launch, so 2.0 after the two launches above

1024-element CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}:
 2.0
 2.0
 2.0
 2.0
 2.0
 2.0
 2.0
 2.0
 2.0
 2.0
 ⋮
 2.0
 2.0
 2.0
 2.0
 2.0
 2.0
 2.0
 2.0
 2.0

Kernel programming can quickly become (much) more difficult, though, because:
* you need to respect **hardware limitations** of the GPU
* **not all operations can readily be expressed as scalar kernels** (e.g. reductions)
* since kernels execute on the GPU, the Julia runtime isn't available and kernel code has limitations (**you can't just write arbitrary Julia code in kernels**)
  * no GC / no allocations
  * must be fully type inferred
  * no `try ... catch ... end`
  * no strings
  * ...

Simple example of a hardware limitation: **the A100 supports at most 1024 threads per block.**

In [75]:
x = CUDA.zeros(1025)
CUDA.@sync @cuda threads=length(x) cuda_kernel!(x)

CuError: CUDA error: invalid argument (code 1, ERROR_INVALID_VALUE)
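Rather than hardcoding such limits, they can be queried from the device. A small sketch, assuming a GPU is available (`dev`, `max_threads`, and `num_sms` are illustrative names):

```julia
using CUDA

dev = device()
max_threads = CUDA.attribute(dev, CUDA.DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK)
num_sms = CUDA.attribute(dev, CUDA.DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT)

println("max threads per block: ", max_threads)
println("number of SMs:         ", num_sms)
```

On the A100 described above, this should report 1024 threads per block and 108 SMs. Other GPUs may differ.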


What if we want to go larger?

<img src="../imgs/cuda_blocks_threads.png" width=700px>

(Note: in Julia indices start at 1)

**Source:** [Sivalingam, Karthee. "GPU Acceleration of a Theoretical Particle Physics Application." Master's Thesis, The University of Edinburgh (2010).](https://static.epcc.ed.ac.uk/dissertations/hpc-msc/2009-2010/Karthee%20Sivalingam.pdf)

In [60]:
function cuda_kernel_blocks!(x)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(x)
        @inbounds x[i] += 1
    end
    return nothing # CUDA kernels must not return a value
end

cuda_kernel_blocks! (generic function with 1 method)

In [53]:
x = CUDA.zeros(1025)
CUDA.@sync @cuda threads=1024 blocks=2 cuda_kernel_blocks!(x)

CUDA.HostKernel{typeof(cuda_kernel_blocks!), Tuple{CuDeviceVector{Float32, 1}}}(cuda_kernel_blocks!, CuFunction(Ptr{Nothing} @0x0000000009150f20, CuModule(Ptr{Nothing} @0x000000000b793d90, CuContext(0x0000000006f118a0, instance 8462ee47003f6eed))), CUDA.KernelState(Ptr{Nothing} @0x00007f2642400000))

In [54]:
x

1025-element CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}:
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 ⋮
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
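The index arithmetic used with multiple blocks can be checked without a GPU. The sketch below emulates the launch on the CPU with plain loops and counts how often each element would be updated. The function `emulate_launch` and its variable names are illustrative and not part of CUDA.jl.

```julia
"Emulate a 1D kernel launch on the CPU and count how often each element is updated."
function emulate_launch(n, threads)
    blocks = cld(n, threads)           # ceiling division: enough blocks to cover n
    hits = zeros(Int, n)               # number of updates per element
    idle = 0                           # threads that fail the bounds check
    for b in 1:blocks, t in 1:threads
        i = (b - 1) * threads + t      # same formula as in the kernel (1-based)
        if i <= n
            hits[i] += 1
        else
            idle += 1
        end
    end
    return blocks, hits, idle
end

blocks, hits, idle = emulate_launch(1025, 1024)
println((blocks = blocks, all_once = all(==(1), hits), idle_threads = idle))
```

For 1025 elements and 1024 threads per block, this gives 2 blocks, every element updated exactly once, and 1023 threads of the second block masked out by the bounds check.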

How does our custom CUDA kernel compare to broadcasting? (Note that `x .+ 1` allocates a new array, whereas the kernel updates `x` in place.)

In [63]:
function launch_kernel(x)
    CUDA.@sync @cuda threads=1024 blocks=1 cuda_kernel_blocks!(x)
end

@btime launch_kernel(x) setup=(x = CUDA.zeros(1024););
@btime CUDA.@sync(x .+ 1) setup=(x = CUDA.zeros(1024););

  12.724 μs (5 allocations: 304 bytes)
  15.779 μs (30 allocations: 1.58 KiB)
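The launch above hardcodes `threads=1024` and `blocks=1`. For arbitrary lengths, a common pattern is to compile the kernel without launching it, ask CUDA.jl for a suggested configuration, and derive the number of blocks from the vector length. The following is a sketch assuming a GPU is available, with the kernel repeated so the block is self-contained.

```julia
using CUDA

function cuda_kernel_blocks!(x)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(x)
        @inbounds x[i] += 1
    end
    return nothing
end

x = CUDA.zeros(100_000)

kernel = @cuda launch=false cuda_kernel_blocks!(x)  # compile only, do not launch
config = launch_configuration(kernel.fun)           # occupancy-based suggestion
threads = min(length(x), config.threads)
blocks = cld(length(x), threads)                    # enough blocks to cover x

CUDA.@sync kernel(x; threads, blocks)
@assert all(Array(x) .== 1)
```

The bounds check inside the kernel is what makes it safe to launch more threads than there are elements.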


## Case study: Three ways to SAXPY on the GPU

**SAXPY** = **S**ingle precision **A** times **X** **P**lus **Y**

In [65]:
"Computes the SAXPY on the CPU using broadcasting"
function saxpy_broadcast_cpu!(a, x, y)
    y .= a .* x .+ y
end

saxpy_broadcast_cpu!

In [66]:
"Computes the SAXPY on the GPU using broadcasting"
function saxpy_broadcast_gpu!(a, x, y)
    CUDA.@sync y .= a .* x .+ y
end

saxpy_broadcast_gpu!

In [68]:
"CUDA kernel for computing the SAXPY on the GPU"
function _saxpy_kernel!(a, x, y)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(y)
        @inbounds y[i] = a * x[i] + y[i]
    end
    return nothing
end

"Computes the SAXPY on the GPU using the custom CUDA kernel `_saxpy_kernel!`"
function saxpy_cuda_kernel!(a, x, y; nthreads, nblocks)
    CUDA.@sync @cuda(
        threads = nthreads,
        blocks = nblocks,
        _saxpy_kernel!(a, x, y)
    )
end

saxpy_cuda_kernel!

In [70]:
function saxpy_cublas!(a, x, y)
    CUDA.@sync CUBLAS.axpy!(length(x), a, x, y)
end

saxpy_cublas! (generic function with 1 method)

In [73]:
using PrettyTables

"Computes the GFLOP/s from the vector length `len` and the measured runtime `t`."
saxpy_flops(t; len) = 2.0 * len * 1e-9 / t # GFLOP/s

"Computes the GB/s from the vector length `len`, the vector element type `dtype`, and the measured runtime `t`."
saxpy_bandwidth(t; dtype, len) = 3.0 * sizeof(dtype) * len * 1e-9 / t # GB/s

function main()
    if !contains(lowercase(name(device())), "a100")
        @warn("This script was tuned for an NVIDIA A100 GPU. Your GPU: $(name(device())).")
    end
    dtype = Float32
    nthreads = 1024 # CUDA.DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK
    nblocks = 500_000
    len = nthreads * nblocks # vector length
    a = convert(dtype, 3.1415)
    x = ones(dtype, len)
    y = ones(dtype, len)
    xgpu = CUDA.ones(dtype, len)
    ygpu = CUDA.ones(dtype, len)

    t_broadcast_cpu = @belapsed saxpy_broadcast_cpu!($a, $x, $y) samples = 10 evals = 2
    t_broadcast_gpu = @belapsed saxpy_broadcast_gpu!($a, $xgpu, $ygpu) samples = 10 evals = 2
    t_cuda_kernel = @belapsed saxpy_cuda_kernel!($a, $xgpu, $ygpu; nthreads=$nthreads, nblocks=$nblocks) samples = 10 evals = 2
    t_cublas = @belapsed saxpy_cublas!($a, $xgpu, $ygpu) samples = 10 evals = 2
    times = [t_broadcast_cpu, t_broadcast_gpu, t_cuda_kernel, t_cublas]

    flops = saxpy_flops.(times; len)
    bandwidths = saxpy_bandwidth.(times; dtype, len)

    labels = ["Broadcast (CPU)", "Broadcast (GPU)", "CUDA kernel", "CUBLAS"]
    data = hcat(labels, 1e3 .* times, flops, bandwidths)
    pretty_table(data; header=(["Variant", "Runtime", "FLOPS", "Bandwidth"], ["", "ms", "GFLOP/s", "GB/s"]))
    println("Theoretical Memory Bandwidth: 1555 GB/s")
    return nothing
end

main (generic function with 1 method)

In [74]:
main()

┌─────────────────┬─────────┬─────────┬───────────┐
│         Variant │ Runtime │   FLOPS │ Bandwidth │
│                 │      ms │ GFLOP/s │      GB/s │
├─────────────────┼─────────┼─────────┼───────────┤
│ Broadcast (CPU) │ 202.666 │ 5.05265 │   30.3159 │
│ Broadcast (GPU) │ 5.09245 │ 201.082 │   1206.49 │
│     CUDA kernel │ 4.68801 │  218.43 │   1310.58 │
│          CUBLAS │ 4.67729 │  218.93 │   1313.58 │
└─────────────────┴─────────┴─────────┴───────────┘
Theoretical Memory Bandwidth: 1555 GB/s
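The table entries follow directly from the formulas in `saxpy_flops` and `saxpy_bandwidth`. As a worked check that needs no GPU, the sketch below recomputes the CUBLAS row from its measured runtime. The names `flop`, `bytes`, `intensity`, and `peak_fraction` are illustrative.

```julia
len = 1024 * 500_000                  # vector length used in `main`
t = 4.67729e-3                        # CUBLAS runtime from the table, in seconds

flop = 2.0 * len                      # one multiply and one add per element
bytes = 3.0 * sizeof(Float32) * len   # read x, read y, write y

gflops = flop * 1e-9 / t              # GFLOP/s
gbs = bytes * 1e-9 / t                # GB/s
intensity = flop / bytes              # FLOP per byte moved
peak_fraction = gbs / 1555            # relative to the quoted theoretical bandwidth

println("GFLOP/s:            ", round(gflops; digits=2))
println("GB/s:               ", round(gbs; digits=2))
println("FLOP per byte:      ", round(intensity; digits=3))
println("fraction of peak:   ", round(peak_fraction; digits=3))
```

With only about 0.17 FLOP per byte moved, SAXPY is limited by memory bandwidth rather than arithmetic. That is why the bandwidth column (about 84% of the theoretical 1555 GB/s for CUBLAS) is the more informative metric here.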


<img src="../imgs/a100_saxpy_results.png" width=1000px>

## Remarks

* [KernelAbstractions.jl](https://github.com/JuliaGPU/KernelAbstractions.jl): Writing hardware agnostic computational kernels.
* [Tullio.jl](https://github.com/mcabbott/Tullio.jl): Also supports NVIDIA GPUs and can produce more efficient kernels than simple broadcasting.

More on GPU computing in Julia? See e.g. https://www.youtube.com/watch?v=Hz9IMJuW5hU