# Exercise: SAXPY on an NVIDIA A100 GPU

In this exercise, you will implement two GPU variants of the **SAXPY** kernel (`y[i] = a * x[i] + y[i]`):

1) A version using array abstractions, i.e. `CuArray`s and simple broadcasting.
2) A hand-written SAXPY CUDA kernel.

Afterwards, you'll benchmark these variants and compare them to the CUBLAS implementation by NVIDIA (which ships with CUDA). Since SAXPY is memory bound, we'll use the achieved memory bandwidth (GB/s) as the performance metric.
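
To see why SAXPY is memory bound, a rough back-of-the-envelope estimate may help. The sketch below assumes that each element costs two floating-point operations and three memory accesses (read `x[i]`, read `y[i]`, write `y[i]`), and it uses the theoretical A100 bandwidth of 1555 GB/s quoted at the end of this exercise. The variable names are illustrative only.

```julia
# Rough estimate, assuming 2 FLOPs and 3 memory accesses per element.
bytes_per_elem = 3 * sizeof(Float32)          # read x[i], read y[i], write y[i]
flops_per_elem = 2                            # one multiply, one add
intensity = flops_per_elem / bytes_per_elem   # FLOP per byte moved

peak_bw = 1555e9                              # B/s, theoretical A100 memory bandwidth
n_elems = 1024 * 500_000                      # vector length used in the benchmark of this exercise
t_min = bytes_per_elem * n_elems / peak_bw    # lower bound on the runtime in seconds

println("arithmetic intensity: ", round(intensity; digits = 3), " FLOP/B")
println("runtime lower bound:  ", round(1e3 * t_min; digits = 2), " ms")
```

With so few operations per byte, the time spent moving data dominates, so no variant can be expected to run faster than this lower bound.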

The exercise tasks are marked in the code cells below.

In [1]:
using CUDA
using BenchmarkTools
using PrettyTables

In [4]:
"Computes the SAXPY via broadcasting"
function saxpy_broadcast_gpu!(a, x, y)
    # --------
    #
    # Task 1: Use broadcasting to implement a SAXPY kernel. Since we will
    #         run the kernel on the GPU, don't forget to synchronize!
    #
    # --------
    CUDA.@sync y .= a .* x .+ y
end

saxpy_broadcast_gpu!

In [5]:
"CUDA kernel for computing SAXPY on the GPU"
function _saxpy_kernel!(a, x, y)
    # --------
    #
    # Task 2: Define the "scalar" SAXPY kernel here. Make sure to check that
    #         the global index `i` is within the bounds of `y` (and `x`).
    #
    # --------
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(y)
        @inbounds y[i] = a * x[i] + y[i]
    end
    return nothing
end

_saxpy_kernel!
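
The index arithmetic in the kernel can be checked on the CPU without a GPU. The following is a minimal sketch that emulates the 1-based global index for every (block, thread) pair and counts how often each element is touched. `emulate_global_indices` is a hypothetical helper written for this illustration, not part of CUDA.jl.

```julia
# CPU-only emulation of the global index used in `_saxpy_kernel!`.
# `emulate_global_indices` is a hypothetical helper, not a CUDA.jl function.
function emulate_global_indices(n, nthreads, nblocks)
    hits = zeros(Int, n)                      # how often each element is visited
    for block in 1:nblocks, thread in 1:nthreads
        i = (block - 1) * nthreads + thread   # same formula as in the kernel
        if i <= n                             # same bounds check as in the kernel
            hits[i] += 1
        end
    end
    return hits
end

# 4 blocks of 256 threads give 1024 threads for 1000 elements.
hits = emulate_global_indices(1000, 256, 4)
println(all(==(1), hits))  # true if every element is visited exactly once
```

The 24 surplus threads in this example fail the bounds check and do nothing, which is exactly what the `if i <= length(y)` guard is for.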

In [6]:
"Computes SAXPY on the GPU using the custom CUDA kernel `_saxpy_kernel!`"
function saxpy_cuda_kernel!(a, x, y; nthreads, nblocks)
    # --------
    #
    # Task 3: Use the `@cuda` macro to run the kernel defined above (`_saxpy_kernel!`).
    #         Spawn the kernel with `nthreads` many threads and `nblocks` many blocks.
    #         Don't forget to synchronize :)
    #
    # --------
    CUDA.@sync @cuda(threads=nthreads,
                     blocks=nblocks,
                     _saxpy_kernel!(a, x, y))
end

saxpy_cuda_kernel!
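
The benchmark in this exercise chooses the vector length as an exact multiple of `nthreads`. For arbitrary lengths, one common approach is to round the block count up with `cld` and rely on the bounds check in the kernel. The sketch below assumes that approach; `launch_config` is a hypothetical helper, not a CUDA.jl function.

```julia
# Hypothetical helper: pick enough blocks to cover `len` elements.
launch_config(len; nthreads = 1024) = (threads = nthreads, blocks = cld(len, nthreads))

len = 1_000_000
cfg = launch_config(len)
surplus = cfg.threads * cfg.blocks - len   # threads that fail the bounds check

println("threads = ", cfg.threads, ", blocks = ", cfg.blocks, ", surplus threads = ", surplus)
```

The resulting values could then be passed on as `saxpy_cuda_kernel!(a, x, y; nthreads = cfg.threads, nblocks = cfg.blocks)`.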

In [7]:
"Computes SAXPY using the CUBLAS function `CUBLAS.axpy!` provided by NVIDIA"
function saxpy_cublas!(a, x, y)
    CUDA.@sync CUBLAS.axpy!(length(x), a, x, y)
end

saxpy_cublas!
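
Before benchmarking, it can be useful to have a CPU reference to compare the GPU results against. The following is a minimal sketch using `axpy!` from the `LinearAlgebra` standard library on small host arrays; the array length of 8 is an arbitrary choice for illustration.

```julia
using LinearAlgebra  # standard library; provides a CPU `axpy!`

a = 3.1415f0
x = ones(Float32, 8)
y = ones(Float32, 8)

expected = a .* x .+ y   # out-of-place reference, computed before `y` is overwritten
axpy!(a, x, y)           # in place: y = a * x + y

println(y ≈ expected)    # true if the in-place result matches the reference
```

A GPU result could be checked the same way by copying it back to the host first, e.g. `Array(ygpu) ≈ expected`, assuming `ygpu` started from the same initial values.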

In [8]:
"Computes the GFLOP/s from the vector length `len` and the measured runtime `t`."
saxpy_flops(t; len) = 2.0 * len * 1e-9 / t # GFLOP/s (2 FLOPs per element: multiply + add)

"Computes the GB/s from the vector length `len`, the vector element type `dtype`, and the measured runtime `t`."
saxpy_bandwidth(t; dtype, len) = 3.0 * sizeof(dtype) * len * 1e-9 / t # GB/s (3 accesses per element: read x, read y, write y)

function saxpy_gpu_bench()
    if !contains(lowercase(name(device())), "a100")
        @warn("This script was tuned for an NVIDIA A100 GPU. Your GPU: $(name(device())).")
    end
    dtype = Float32
    nthreads = 1024
    nblocks = 500_000
    len = nthreads * nblocks # vector length
    a = convert(dtype, 3.1415)
    xgpu = CUDA.ones(dtype, len)
    ygpu = CUDA.ones(dtype, len)

    t_broadcast_gpu = @belapsed saxpy_broadcast_gpu!($a, $xgpu, $ygpu) samples=10 evals=2
    t_cuda_kernel = @belapsed saxpy_cuda_kernel!($a, $xgpu, $ygpu; nthreads = $nthreads,
                                                 nblocks = $nblocks) samples=10 evals=2
    t_cublas = @belapsed saxpy_cublas!($a, $xgpu, $ygpu) samples=10 evals=2
    times = [t_broadcast_gpu, t_cuda_kernel, t_cublas]

    flops = saxpy_flops.(times; len)
    bandwidths = saxpy_bandwidth.(times; dtype, len)

    labels = ["Broadcast", "CUDA kernel", "CUBLAS"]
    data = hcat(labels, 1e3 .* times, flops, bandwidths)
    pretty_table(data;
                 header = (["Variant", "Runtime", "FLOPS", "Bandwidth"],
                           ["", "ms", "GFLOP/s", "GB/s"]))
    println("Theoretical Memory Bandwidth of NVIDIA A100: 1555 GB/s")
    return nothing
end

saxpy_gpu_bench (generic function with 1 method)

In [9]:
# --------
#
# Task 4: Run the benchmark and interpret the results.
#         How does the performance of the different variants compare?
#
# --------
saxpy_gpu_bench()

┌─────────────┬─────────┬─────────┬───────────┐
│     Variant │ Runtime │   FLOPS │ Bandwidth │
│             │      ms │ GFLOP/s │      GB/s │
├─────────────┼─────────┼─────────┼───────────┤
│   Broadcast │ 4.92587 │ 207.882 │   1247.29 │
│ CUDA kernel │ 4.53087 │ 226.005 │   1356.03 │
│      CUBLAS │  4.5067 │ 227.217 │    1363.3 │
└─────────────┴─────────┴─────────┴───────────┘
Theoretical Memory Bandwidth of NVIDIA A100: 1555 GB/s
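
One way to interpret these numbers is to express each measured bandwidth as a fraction of the theoretical peak. The sketch below simply reuses the values from the table above; they come from this particular run and will differ on other hardware or software versions.

```julia
# Bandwidths (GB/s) copied from the benchmark table above.
peak = 1555.0
measured = ["Broadcast" => 1247.29, "CUDA kernel" => 1356.03, "CUBLAS" => 1363.3]

fractions = [bw / peak for (_, bw) in measured]
for ((label, _), frac) in zip(measured, fractions)
    println(rpad(label, 12), round(100 * frac; digits = 1), " % of peak")
end
```

In this run, all three variants reach roughly 80 % to 88 % of the theoretical bandwidth. The hand-written kernel is essentially on par with CUBLAS, and the broadcast version is somewhat slower.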
