# JuliaHEP 2023 Workshop - HPC Tutorial

**When:** November 7, 2023

**Where:** Erlangen Centre for Astroparticle Physics (ECAP)

**GitHub repository:** https://github.com/carstenbauer/juliahep-hpctutorial

## What's the plan for this tutorial?

* Use Julia on an HPC cluster (maybe for the first time?)
* Study the node-level performance scaling of a simple computational kernel
* Learn about thread pinning, NUMA, and how to control both from within Julia
* (If time permits: move the computation to an NVIDIA A100 GPU)

## Julia interactively on HPC clusters. How?

* Terminal approach (SSH + e.g. vim + REPL)
* **VS Code** → Remote SSH Extension
  * login node (easy)
  * compute node (tricky, sometimes impossible): [PC2 docs](https://upb-pc2.atlassian.net/wiki/spaces/PC2DOK/pages/1902225/Access+for+Applications+like+Visual+Studio+Code#Compute-nodes) and/or README.md

**For this tutorial, we'll use the PC2 JupyterHub for simplicity.**

### [PC2 JupyterHub](https://jh.pc2.uni-paderborn.de/)

**Link:** https://pc2.de/go/jupyterhub

Most participants have access to the [Noctua 2](https://pc2.uni-paderborn.de/hpc-services/available-systems/noctua2) cluster through the [PC2 JupyterHub](https://jh.pc2.uni-paderborn.de/hub/home).

In this case, **a browser is all that's needed!**

#### Getting started

* Log in to [PC2 JupyterHub](https://pc2.de/go/jupyterhub) with the provided credentials.
* After login, click on the "Start Server" button.
* Select the **"JuliaHEP - HPC Tutorial (full CPU node)"** preset (should already be the default) and click on "Start". This will start a Jupyter server on a Noctua 2 compute node (might take a little while).
* Once in Jupyter, you should see a folder with your username in the left side bar. Navigate into this folder. In it you'll find a local copy of this git repository that you can use for the tutorial.
* To make Julia (and the IJulia kernel) available, click on the little blue hexagon in the side bar on the left. Then, type "jupyter" into the search bar at the top. Hover over `JupyterKernel-Julia/1.9.3-foss-2022a-CUDA-11.7.0` and click on the "Load" button that appears.
* You should be all set up! Feel free to open the first notebook `1_axpy_cpu.ipynb` and, in the top right corner, select the kernel **"Julia (8 threads) 1.9.3"**.

## Computational kernel: AXPY

"*A times X plus Y*"

$$ \vec{y} = a \cdot \vec{x} + \vec{y} $$

Depending on the data type / precision:

* **S**AXPY (S = single precision, i.e. `Float32`)
* **D**AXPY (D = double precision, i.e. `Float64`)

In [None]:
function axpy_serial!(y, a, x)
    #
    # TODO: Implement the (serial) AXPY kernel.
    #
    for i in eachindex(x,y)
        @inbounds y[i] = a * x[i] + y[i]
    end
    return nothing
end
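
As a quick sanity check of the definition, the sketch below applies the same operation to tiny vectors in both precisions. It uses `axpy!` from Julia's `LinearAlgebra` standard library rather than the tutorial kernel, and the helper name `check_axpy` is made up for this illustration.

```julia
using LinearAlgebra: axpy!

# Hypothetical helper: run AXPY on a tiny vector of element type T
# and compare against an out-of-place reference computation.
function check_axpy(T)
    a = T(2)
    x = ones(T, 4)
    y = fill(T(3), 4)
    expected = a .* x .+ y   # out-of-place reference result
    axpy!(a, x, y)           # in-place update: y = a * x + y
    @assert y == expected
    return y
end

for T in (Float32, Float64)
    println(T, " (", sizeof(T), " bytes per element): ", check_axpy(T))
end
```

The hand-written `axpy_serial!` should give the same result when called as `axpy_serial!(y, a, x)`; note the different argument order.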

### Why AXPY?

#### What limits performance of computations?
* memory access speed (*memory-bound*)
* how fast floating-point operations (flops) can be done (*compute-bound*)

The performance of most scientific codes is **memory-bound** these days!

**CPU (AMD EPYC 7763)**

*Peak compute performance over peak memory bandwidth*

$$
\dfrac{3.5 \ [\textrm{TFlop/s}]}{200 \ [\textrm{GB/s}]} \cdot 8 \ \textrm{B} = 140
$$

140 flops per number read, i.e. 8 bytes for `Float64`

**GPU (NVIDIA A100)**

*Peak compute performance over peak memory bandwidth* (only using CUDA cores)

$$
\dfrac{19.5 \ [\textrm{TFlop/s}]}{1.5 \ [\textrm{TB/s}]} \cdot 4 \ \textrm{B} = 52
$$

52 flops per number read, i.e. 4 bytes for `Float32`
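
The two ratios above can be reproduced with a few lines of Julia. This is only a sketch: the function name `machine_balance` is hypothetical, and the peak numbers are simply the ones quoted in the formulas.

```julia
# Flops that can be performed in the time it takes to load one number
# of type T from memory. peak_flops is in Flop/s, peak_bw in B/s.
machine_balance(peak_flops, peak_bw, T) = peak_flops / peak_bw * sizeof(T)

cpu = machine_balance(3.5e12, 200e9, Float64)    # AMD EPYC 7763
gpu = machine_balance(19.5e12, 1.5e12, Float32)  # NVIDIA A100 (CUDA cores only)

println("CPU: ", cpu, " flops per Float64 read")
println("GPU: ", gpu, " flops per Float32 read")
```

A kernel that performs far fewer flops per loaded number than these values cannot keep the floating-point units busy and is limited by memory bandwidth instead.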

#### Questions
* How many **bytes** are transferred per iteration in AXPY?
* How many **flops** (floating point operations) are performed per iteration in AXPY?
* Is AXPY compute- or memory-bound?

**"Trick" question:** How many **bytes** would be transferred in a non-inplace variant, i.e. `z[i] = a * x[i] + y[i]`? (Hint: It's likely not what you think 😉)

Let's benchmark the performance of our AXPY kernel.

In [None]:
using BenchmarkTools

const N = 2^30

a = 3.141
x = rand(N)
y = rand(N)

@btime axpy_serial!($y, $a, $x) samples=5 evals=3;

Is this fast? What should we compare it to?

Let's look at the **memory bandwidth** (data transfer to/from memory per unit time) and the compute performance (flops per unit time) instead.

In [None]:
using BenchmarkTools

function generate_input_data(; N, dtype, kwargs...)
    a = dtype(3.141)
    x = rand(dtype, N)
    y = rand(dtype, N)
    return a,x,y
end

function measure_perf(f::F; N=2^30, dtype=Float64, verbose=true, kwargs...) where {F}  
    # input data
    a,x,y = generate_input_data(; N, dtype, kwargs...)

    # time measurement
    t = @belapsed $f($y, $a, $x) evals = 2 samples = 10
    
    # compute memory bandwidth and flops
    bytes = 3 * sizeof(dtype) * N # TODO: num bytes transferred in AXPY kernel (all iterations)
    flops = 2 * N # TODO: num flops performed in AXPY kernel (all iterations)
    mem_rate = bytes * 1e-9 / t # TODO: memory bandwidth in GB/s
    flop_rate = flops * 1e-9 / t # TODO: flops in GFLOP/s
    
    if verbose
        println("Dtype: $dtype")
        println("\tMemory Bandwidth (GB/s): ", round(mem_rate; digits=2))
        println("\tCompute (GFLOP/s): ", round(flop_rate; digits=2))
    end
    return mem_rate, flop_rate
end

In [None]:
measure_perf(axpy_serial!);

This is about 20% of the theoretical peak memory bandwidth of one entire CPU (with 64 cores). This will serve as our single-core performance reference point.
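
To put the measured flop rate into perspective, the following sketch estimates the highest flop rate AXPY could reach if memory traffic were the only limit. It assumes the 200 GB/s peak bandwidth quoted earlier and the same traffic model as `measure_perf` (three numbers moved and two flops per iteration); the function name `axpy_flop_bound` is hypothetical.

```julia
# Upper bound on the AXPY flop rate (in GFLOP/s) for a given memory
# bandwidth (in GB/s). Per iteration: read x[i], read y[i], write y[i]
# (3 numbers of type T) and 2 flops (one multiply, one add).
axpy_flop_bound(T; bw_gbs = 200.0) = bw_gbs * 2 / (3 * sizeof(T))

bound64 = axpy_flop_bound(Float64)
bound32 = axpy_flop_bound(Float32)

println("Float64 bound (GFLOP/s): ", round(bound64; digits = 1))
println("Float32 bound (GFLOP/s): ", round(bound32; digits = 1))
```

Under these assumptions the bound is tiny compared to the 3.5 TFlop/s peak compute performance, which is why the bandwidth is the more informative number for this kernel.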

## Node-level parallelisation (multithreading)

**SIMD:** `axpy_serial!` is already *parallel* at instruction level

In [None]:
@code_native debuginfo=:none axpy_serial!(y,a,x)

We want to parallelize our AXPY kernel via **multithreading**.

Julia provides the `@threads` macro to multithread for-loops.

**Make sure that you actually have multiple threads in this Julia session!** (I recommend 8 threads on Noctua 2.)

In [None]:
using Base.Threads: @threads, nthreads

@assert nthreads() > 1
nthreads()

In [None]:
function axpy_multithreading_dynamic!(y, a, x)
    #
    # TODO: Implement a naive multithreaded AXPY kernel (with @threads).
    #
    @threads for i in eachindex(x,y)
        @inbounds y[i] = a * x[i] + y[i]
    end
    return nothing
end

In [None]:
measure_perf(axpy_multithreading_dynamic!);

🙁 **What's going on?! Why no (or not much) speedup?!** 😢

### Pinning Julia threads

**Why** pin threads?

* stable performance (e.g. avoid fluctuations in benchmarks)
* avoid double occupation of CPU-cores / CPU-threads
* fixed memory locality
* (hardware performance monitoring → [LIKWID.jl](https://github.com/JuliaPerf/LIKWID.jl))

**How** to pin Julia threads? → [ThreadPinning.jl](https://github.com/carstenbauer/ThreadPinning.jl)

What about external tools like `numactl`, `taskset`, etc.? They don't work reliably because they often [can't distinguish](https://discourse.julialang.org/t/thread-affinitization-pinning-julia-threads-to-cores/58069/5) between Julia threads and other internal threads.

<br>
<img src="./imgs/threadpinning_pinthreads.svg" width=700>
<br>

(More? See my short talk at JuliaCon2023 @ MIT: https://youtu.be/6Whc9XtlCC0)

In [None]:
using ThreadPinning

In [None]:
threadinfo()

In [None]:
pinthreads(:cores)

In [None]:
threadinfo()

In [None]:
pinthreads(:sockets)

In [None]:
threadinfo()

#### Benchmark with pinned threads

In [None]:
pinthreads(:cores)
measure_perf(axpy_multithreading_dynamic!);

**Still the same performance?!** 😢

### Data placement (NUMA)

NUMA = **n**on-**u**niform **m**emory **a**ccess

One (of two) AMD Milan CPUs in Noctua 2:

<img src="./imgs/amd_milan_cpu_die.svg" width=800>

**Image source:** AMD, [High Performance Computing (HPC) Tuning Guide for AMD EPYC™ 7003 Series Processors](https://www.amd.com/system/files/documents/high-performance-computing-tuning-guide-amd-epyc7003-series-processors.pdf)

<img src="./imgs/noctua2_topo.svg" width=1000>

In [None]:
threadinfo(; groupby=:numa) # switch from socket/CPU grouping to NUMA grouping

#### How to control data placement (explicitly)?
→ [NUMA.jl](https://github.com/JuliaPerf/NUMA.jl)

`Vector{Float64}(numanode(i), length)` (kind of similar to `Vector{Float64}(undef, length)`)

In [None]:
using NUMA, Random

In [None]:
data = Vector{Float64}(numanode(1), 100); rand!(data);

In [None]:
which_numa_node(data)

In [None]:
data = Vector{Float64}(numanode(8), 100); rand!(data);

In [None]:
which_numa_node(data)

Let's do a quick and dirty benchmark to get an idea of how much this matters for performance.

In [None]:
node1 = current_numa_node()
node2 = mod1(current_numa_node() + nnumanodes()÷2, nnumanodes()) # numa node in other CPU/socket

println("local NUMA node")
x = Vector{Float64}(numanode(node1), N); rand!(x)
y = Vector{Float64}(numanode(node1), N); rand!(y)

@btime axpy_serial!($y, $a, $x) samples=5 evals=3;

println("distant NUMA node")
x = Vector{Float64}(numanode(node2), N); rand!(x)
y = Vector{Float64}(numanode(node2), N); rand!(y)

@btime axpy_serial!($y, $a, $x) samples=5 evals=3;

Note that the performance issue will be much more pronounced in multithreaded cases, where different threads might try to access the same non-local data over the same memory channel(s).

#### How to control data placement (implicitly)?

→ **"First-touch" policy**

```julia
x = Vector{Float64}(undef, 10)   # allocation, no "touch" yet
rand!(x)                         # first touch == first write
```

In [None]:
pinthreads(:numa)
threadinfo(; groupby=:numa)

In [None]:
for tid in 1:8
    @sync @tspawnat tid begin            # ThreadPinning.@tspawnat creates *sticky* tasks that don't migrate between threads
        x = Vector{Float64}(undef, 10)   # allocation, no "touch" yet
        rand!(x)                         # first touch
        @show tid, which_numa_node(x)
    end
end

##### NUMA-optimized AXPY

**Question**
* How can we modify our AXPY benchmark to optimize for local memory accesses (based on the first-touch policy)?

In [None]:
using Random

function generate_input_data(; N, dtype, parallel=false, kwargs...)
    #
    # TODO: introduce a new keyword argument that, when set to true, initializes the data in parallel
    #       (in the same way as we'll later use it)
    #
    a = dtype(3.141)
    x = Vector{dtype}(undef, N)
    y = Vector{dtype}(undef, N)
    if !parallel
        rand!(x)
        rand!(y)
    else
        @threads for i in eachindex(x,y)
            x[i] = rand(dtype)   # draw directly in the requested precision
            y[i] = rand(dtype)
        end
    end
    return a,x,y
end

In [None]:
pinthreads(:numa)
measure_perf(axpy_multithreading_dynamic!; parallel=false);
measure_perf(axpy_multithreading_dynamic!; parallel=true);

**Speedup! Yeah!** 😄 🎉

But... less than expected!? 😕

**Question**
* What kind of speedup would we expect (ideally)?

In [None]:
threadinfo(; groupby=:numa)

### Tasks vs Threads

Conceptually, Julia implements **task-based multithreading**.

**A user shouldn't care about threads but tasks!**

<img src="./imgs/julia_tasks_vs_threads.png" width=1000>

In **"traditional" HPC**, we typically care about threads directly, i.e. we tell every thread what it should do.

In Julia's **task-based multithreading**, a task - e.g. a computational piece of a code - is only marked for **parallel execution** (`@spawn`, `@threads`) on **any** of the available Julia threads. Julia's **dynamic scheduler** will then take care of running the task on any of the threads (the task might even migrate!).

*Advantages:*
* high-level abstraction
* **composability / nestability** (Multithreaded code can call multithreaded code can call multithreaded code ....)

*Disadvantages:*
* potential scheduling overhead
* **task → thread assignment uncertain (can vary dynamically + task migration)**
* can get in the way when performance engineering
  * scheduler has limited information (e.g. about the system topology)
  * low-level profiling (e.g. with LIKWID) requires fixed `task → thread → core` mapping.
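
To make the task-based model concrete, here is a minimal sketch of AXPY written with explicit tasks instead of `@threads`. The function name `axpy_spawn!` and the `ntasks` keyword are hypothetical and not part of the tutorial code, and splitting the index range into equally sized chunks is just one simple choice.

```julia
using Base.Threads: @spawn, nthreads

# Hypothetical task-based AXPY: split the index range into chunks and
# spawn one task per chunk. The scheduler decides which thread runs
# which task, so the chunk → thread mapping is not fixed.
function axpy_spawn!(y, a, x; ntasks = nthreads())
    idcs = eachindex(x, y)
    isempty(idcs) && return nothing
    chunks = Iterators.partition(idcs, cld(length(idcs), ntasks))
    tasks = map(chunks) do chunk
        @spawn for i in chunk
            @inbounds y[i] = a * x[i] + y[i]
        end
    end
    foreach(wait, tasks)   # block until all chunks are done
    return nothing
end

xs = ones(10)
ys = fill(2.0, 10)
axpy_spawn!(ys, 3.0, xs)
```

Because each call only creates tasks, a function like this can itself be called from other multithreaded code, which is the composability advantage listed above.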

#### Opting out of dynamic scheduling

We can opt out of Julia's dynamic scheduling and get **guarantees about the task-thread assignment** (and the iterations → task mapping).

Syntax: `@threads :static for ...`

 * splits up the iteration space into `nthreads()` even, contiguous blocks (in-order) and creates precisely one task per block
 * **statically** maps tasks to threads, specifically: task 1 → thread 1, task 2 → thread 2, etc.
   * no task migration, i.e. **fixed task-thread mapping** 👍
   * only little overhead 👍
   * not composable / nestable 👎
     

**In short:**

Dynamic scheduling: `@spawn`, `@threads :dynamic` (default)

Static scheduling (i.e. fixed task → thread mapping): `ThreadPinning.@tspawnat`, `@threads :static`
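
The fixed mapping can be observed directly. The sketch below records which thread handles each iteration of a statically scheduled loop; the function name `static_owners` is hypothetical. It should be called from top-level code, because `@threads :static` cannot be nested inside other threaded regions.

```julia
using Base.Threads: @threads, nthreads, threadid

# Record the ID of the thread that processes each iteration
# of a statically scheduled loop.
function static_owners(n)
    owners = zeros(Int, n)
    @threads :static for i in 1:n
        owners[i] = threadid()
    end
    return owners
end

owners = static_owners(16)
println("threads: ", nthreads())
println("owners:  ", owners)
```

With static scheduling, the thread IDs should appear as contiguous, in-order blocks, and repeated calls should give the same assignment. With the default dynamic scheduling there is no such guarantee.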

#### Statically scheduled AXPY

In [None]:
function axpy_multithreading_static!(y, a, x)
    #
    # TODO: Implement a statically scheduled multithreaded AXPY kernel (with @threads :static).
    #
    @threads :static for i in eachindex(x,y)
        @inbounds y[i] = a * x[i] + y[i]
    end
    return nothing
end

We also need to adapt the input data generation.

In [None]:
function generate_input_data(; N, dtype, parallel=false, static=false, kwargs...)
    #
    # TODO: introduce a new keyword argument `static` that, when set to true, initializes the data in parallel with static scheduling
    #       (in the same way as we'll later use it)
    #
    a = dtype(3.141)
    x = Vector{dtype}(undef, N)
    y = Vector{dtype}(undef, N)
    if !parallel
        rand!(x)
        rand!(y)
    else
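        if !static
            @threads for i in eachindex(x,y)
                x[i] = rand(dtype)   # draw directly in the requested precision
                y[i] = rand(dtype)
            end
        else
            @threads :static for i in eachindex(x,y)
                x[i] = rand(dtype)   # same static schedule as the kernel
                y[i] = rand(dtype)
            end
        end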
    end
    return a,x,y
end

In [None]:
pinthreads(:numa)
measure_perf(axpy_multithreading_static!; parallel=false, static=true);
measure_perf(axpy_multithreading_static!; parallel=true, static=true);

**Finally, we're in the ballpark of the expected speedup!** 😄 🎉