# Multithreading

## What are threads?
Threads are **execution units within a process** that can run simultaneously. While separate processes have separate memory, the threads of a process run in a **shared memory** space (heap); each thread only has its own stack.

<!-- <img src="./imgs/what-are-threads.png" width=500px> -->

<br>
<img src="imgs/stack_heap_threads.svg" width=450px>
<br>

## Starting Julia with multiple threads

By default, Julia starts with a single *user thread*. We must tell it explicitly to start multiple user threads. There are two ways to do this:

* Environment variable: `JULIA_NUM_THREADS=4`
* Command line argument: `julia -t 4`

**PC2 Jupyter Hub**

- Select a kernel with e.g. **8 threads**:

<img src="imgs/kernels.png" width=300px>

**It is currently not (easily) possible to change the number of threads at runtime!**

We can readily check how many threads we are running:

In [None]:
Threads.nthreads()

### User threads vs default threads

Technically, the Julia process already spawns multiple threads even in "single-threaded" mode, such as
* a thread for unix signal listening
* multiple OpenBLAS threads for BLAS/LAPACK operations
* GC threads

We call the threads that we can actually run computations on *user threads* or *Julia threads*.

In [None]:
using LinearAlgebra
BLAS.get_num_threads()

## Where are my threads running?

### Noctua 2 compute node

In [None]:
using ThreadPinning

In [None]:
threadinfo()

`lstopo_no_graphics`

<img src="imgs/lstopo_noctua2.svg" width=80%>

##### "Hyperthreading" (not active on Noctua 2)

<img src="imgs/threadinfo.png" width=1000px>

## Task-based multithreading

In traditional HPC, one typically cares about threads directly. Using e.g. OpenMP, one essentially tells each thread what to do.

Conceptually, Julia takes a different approach and implements **task-based** multithreading. In this paradigm, a task - e.g. a computational piece of a code - is marked for **parallel** execution on **any** of the available Julia threads. Julia's **dynamic scheduler** will automatically put the task on one of the threads and trigger the execution of the task on said thread.

<br>
<!-- <img src="imgs/task-based-parallelism.png" width=768px> -->
<img src="imgs/tasks_threads_cores.svg" width=650px>
<br>

Generally speaking, the user should **think about tasks and not threads**.
* The scheduler is controlling on which thread a task will eventually run.
* It might even dynamically [migrate tasks](https://docs.julialang.org/en/v1/manual/multi-threading/#man-task-migration) between threads.

**Advantages:**
* high-level abstraction
* nestability / composability (especially important for libraries)

**Disadvantages:**
* scheduling overhead
* uncertain and potentially suboptimal task → thread assignment
  * **can get in the way when performance engineering** because
    * scheduler has limited information (e.g. about the system topology)
    * profiling tools often don't know anything about tasks but monitor threads (or even CPU-cores) instead (e.g. LIKWID).

### Tasks

By default, Julia waits for commands to finish ("**blocking**") and runs everything sequentially.

**Tasks** are a feature that allows (parts of) computations to be scheduled (suspended and resumed) in a flexible manner to implement **concurrency** and **parallelism**.

* Concurrency
    * Dealing with lots of things *in a time period* ("multi-tasking").
    * Can be used on a single thread.
* Parallelism
    * Doing lots of things *at the same instant*.
    * Needs multiple threads (or processes).

Example (concurrency): **asynchronous I/O** like
  * **multiple user input** (Why not already process some of the input?)
  * **data dumping to disk** (Maybe it's possible to continue a calculation?)
 
Example (parallelism): **multithreading, distributed computing**
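
As a minimal sketch of concurrency on a single thread, the snippet below runs two tasks that only wait (a stand-in for I/O, simulated here with `sleep`). It uses `@async` from Base, which is not introduced above and is chosen only because it keeps both tasks on the current thread; the variable names are illustrative.

```julia
# Two tasks that each wait for 1 second (simulated I/O).
# `@async` schedules them on the current thread, so no parallelism is involved.
t0 = time()
@sync begin
    @async sleep(1.0)
    @async sleep(1.0)
end
elapsed = time() - t0
println("elapsed: ", round(elapsed; digits=2), " s")
```

Because each task yields while it waits, the two waits overlap and the total should be close to one second rather than two.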

### Spawning parallel tasks: `Threads.@spawn`
`Threads.@spawn` spawns a task to be run on any Julia thread. Specifically, it creates a `Task` and schedules it for execution on an available Julia thread (we don't control which one!).

Note that `Threads.@spawn` is **asynchronous** and **non-blocking**, that is, it doesn't wait for the task to actually run but immediately returns a `Task`.

In [None]:
using Base.Threads # afterwards we can just write @spawn instead of Threads.@spawn

In [None]:
@spawn 3+3

We can fetch the result of a task with `fetch`.

In [None]:
t = @spawn 3+3
fetch(t)

While `@spawn` returns right away, `fetch` is **blocking** as it has to wait for the task to actually finish.

In [None]:
@time t = @spawn begin
    sleep(3)
    return 3+3
end
@time fetch(t)

We can use the macro `@sync` to synchronize all encompassed asynchronous operations (`@spawn`).

In [None]:
@time @sync t = @spawn begin
    sleep(3)
    return 3+3
end
@time fetch(t)

#### Example: multithreaded `map`

`tmap`: *threaded map*

In [None]:
function tmap(fn, itr)
    tasks = map(i -> @spawn(fn(i)), itr)  # for each i ∈ itr, spawn a task to compute fn(i)
    return fetch.(tasks)                  # fetch and return all the results
end

In [None]:
M = [rand(200,200) for i in 1:8];

In [None]:
tmap(svdvals, M)

In [None]:
using BenchmarkTools

In [None]:
@btime tmap($svdvals, $M) samples=10 evals=3;
@btime map($svdvals, $M) samples=10 evals=3;

**performance issue**:

* Using Julia multithreading + BLAS multithreading
    - CPU cores may be *oversubscribed*, e.g. 256 total threads on 128 CPU cores! (red bars in `htop`)

If you use BLAS, it is important to carefully consider and configure the [interplay between Julia threads and BLAS threads](https://carstenbauer.github.io/ThreadPinning.jl/stable/explanations/blas/).

In [None]:
BLAS.set_num_threads(1)

In [None]:
@btime tmap($svdvals, $M) samples=10 evals=3;
@btime map($svdvals, $M) samples=10 evals=3;

#### Example: multithreading for-loops

In [None]:
using ThreadPinning: taskid

In [None]:
@sync for i in 1:2*nthreads()
    @spawn println("Task ", taskid(), " is running iteration ", i, " on thread ", threadid())
end

##### `@threads`

* **Splits up the iteration space into `nthreads()` contiguous chunks**
* Creates a task for each of them.

In [None]:
# creates nthreads() many tasks

@threads for i in 1:2*nthreads()
    println("Task ", taskid(), " is running iteration ", i, " on thread ", threadid())
end

### Load-balancing

If there are many tasks (e.g. many more than available threads), Julia's scheduler balances the load of these tasks among threads. This matters most for **non-uniform workloads**, where some tasks take much longer than others.

In [None]:
using StatsPlots

In [None]:
function compute_nonuniform_spawn!(a, workloads = [Int[] for _ in 1:nthreads()])
    @sync for i in 1:length(a)
        Threads.@spawn begin
            a[i] = sum(abs2, rand() for j in 1:(2^14*i))   # workload proportional to i
            push!(workloads[threadid()], i)                # (poor-man's) bookkeeping
        end
    end
    return workloads
end

In [None]:
a = zeros(nthreads()*10)
workloads = compute_nonuniform_spawn!(a)

# plotting
thread_workloads = zeros(Int, nthreads(), maximum(length, workloads))
for th in eachindex(workloads)
    for (i, w) in enumerate(workloads[th])
        thread_workloads[th, i] = w
    end
end
b = groupedbar(thread_workloads, xlab="threadid", ylab="workload", title="@spawn", legend=false, bar_position=:stack)
# b = bar(sum.(workloads), xlab="threadid", ylab="workload", title="Workload (@spawn)", legend=false, color=:green)
display(b)

#### No load-balancing with `@threads`

`@threads` doesn't give load-balancing because when it **divides the iteration interval into `nthreads()` tasks** there is no flexibility left to give a thread more than a single task.

In [None]:
function compute_nonuniform_threads!(a, workloads = [Dict() for _ in 1:nthreads()])
    @threads for i in 1:length(a)
        a[i] = sum(abs2, rand() for j in 1:(2^14*i)) # workload proportional to i

        # poor-man's bookkeeping
        d = workloads[threadid()]
        d[taskid()] = get!(d, taskid(), 0) + i
    end
    return workloads
end

In [None]:
a = zeros(nthreads()*10)
workloads = compute_nonuniform_threads!(a)

# plotting
@assert length(workloads) == nthreads()
b = bar([only(values(w)) for w in workloads], xlab="threadid", ylab="workload", title="@threads", legend=false)
display(b)

**Note:**
* There will likely be a scheduling option for `@threads` that implements load-balancing in the future (see e.g. https://github.com/JuliaLang/julia/pull/52096).
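
Independently of such an option, load balancing for a loop can be obtained with `@spawn` by creating more tasks than threads. The following is only a sketch under assumptions: `foreach_balanced` is a hypothetical helper name, and the default of `8 * nthreads()` tasks is an arbitrary choice, not a tuned value.

```julia
using Base.Threads: @spawn, nthreads

# Hypothetical helper: split `itr` into many small blocks and spawn one task per block.
# With more blocks than threads, the scheduler can balance uneven blocks across threads.
function foreach_balanced(f, itr; ntasks = 8 * nthreads())
    blocksize = max(1, cld(length(itr), ntasks))
    @sync for block in Iterators.partition(itr, blocksize)
        @spawn foreach(f, block)
    end
    return nothing
end

a = zeros(100)
foreach_balanced(eachindex(a)) do i
    a[i] = sum(abs2, rand() for _ in 1:(2^8 * i))  # workload proportional to i
end
```

More tasks give the scheduler more freedom, but each task adds scheduling overhead, so the block size is a trade-off.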

### Nestability / Composability

#### Example: Recursive Fibonacci series

$$ F(n) = F(n-1) + F(n-2), \qquad F(1) = F(2) = 1$$

We can nest `@spawn` calls freely!

In [None]:
function fib(n)
    n < 2 && return n
    t = @spawn fib(n-2)
    return fib(n-1) + fetch(t)
end

In [None]:
fib(20)

(Note: Algorithmically, this is a highly inefficient implementation of the Fibonacci series, of course!)
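
Spawning a task for every recursive call also incurs scheduling overhead for tiny pieces of work. A common remedy, sketched below, is to switch to a sequential version below some problem size. The names `fib_seq` and `fib_cutoff` and the default `cutoff = 15` are assumptions made for illustration, not part of the example above.

```julia
using Base.Threads: @spawn

# Plain sequential recursion, used for small inputs
fib_seq(n) = n < 2 ? n : fib_seq(n - 1) + fib_seq(n - 2)

# Spawn tasks only while the subproblem is larger than `cutoff`
function fib_cutoff(n; cutoff = 15)
    n <= cutoff && return fib_seq(n)
    t = @spawn fib_cutoff(n - 2; cutoff)
    return fib_cutoff(n - 1; cutoff) + fetch(t)
end

fib_cutoff(20)
```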

## Multithreading: Things to be aware of

### Instructive example: parallel summation

In [None]:
data = rand(1_000_000 * Threads.nthreads());

#### Naive approach

In [None]:
function sum_threads_naive(data)
    s = zero(eltype(data))
    @threads for i in eachindex(data)
        s += data[i] # all tasks update the same variable `s`
    end
    return s
end

In [None]:
@show sum(data);
@show sum_threads_naive(data);
@show sum_threads_naive(data);

**Wrong** result! Even worse, it's **non-deterministic** and different every time!

There is a [race condition](https://en.wikipedia.org/wiki/Race_condition), which typically appears when multiple tasks modify a shared value simultaneously.

→ **Don't modify shared "global" state!**

Sometimes things can be more subtle. Examples: random number generation, `Dict`. Note that not all of Julia and its packages in the ecosystem are thread-safe! In general, it is safer to assume that they're not unless documented/proven otherwise. (`rand()` is thread-safe, `Dict` isn't!)

#### Thread-focused partial sums (unsafe)

Our strategy:
* One accumulator variable per thread.

You might be inclined to write something similar to the following (intentionally written in a slightly more verbose form):

In [None]:
function sum_threads_unsafe(data)
    psums = zeros(eltype(data), nthreads())
    @threads for i in eachindex(data)
        current_sum = psums[threadid()] # read
        new_sum = current_sum + data[i] # "work"
        psums[threadid()] = new_sum     # write
    end
    return sum(psums)
end

Such an approach is generally **unsafe** because Julia's scheduler may **migrate tasks between threads**!
  * For example, a task might start on thread 1, is then paused (say, after "work") and migrated to thread 3, where it finishes execution.
  * → The output of `threadid()` might change within a task! To be safe, [don't use `threadid()`](https://julialang.org/blog/2023/07/PSA-dont-use-threadid/) at all!
  
It also goes against the idea of task-based multithreading, as we're **thinking about threads rather than tasks**.

(Note that, in spite of the comments above, the `threadid()` pattern will often still work correctly. This is because as of Julia 1.10 task migrations are very rare. Importantly, **you can't rely on it though!**)

#### Chunk-focused partial sums (safe)

Our strategy:
* Divide the data (indices) into **chunks** and use **one accumulator per chunk**.

The package [ChunkSplitters.jl](https://github.com/m3g/ChunkSplitters.jl) is helpful for chunking (`Iterators.partition` is a built-in alternative).
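
For reference, a minimal sketch of the built-in alternative. `Iterators.partition` takes a chunk *size* rather than a number of chunks, so the size is derived with `cld`; the names `data_small`, `chunksize` and `idx_chunks` are illustrative only.

```julia
data_small = collect(1:10)
nchunks = 3

# ceil(10 / 3) = 4 indices per chunk; the last chunk takes the remainder
chunksize = cld(length(data_small), nchunks)
idx_chunks = collect(Iterators.partition(eachindex(data_small), chunksize))
# three index ranges: 1:4, 5:8, 9:10
```

Note that the chunk sizes obtained this way may differ from those produced by ChunkSplitters.jl, since here only the last chunk absorbs the remainder.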

In [None]:
using ChunkSplitters

In [None]:
collect(chunks(data; n=nthreads())) # number of chunks chosen as nthreads()

In [None]:
function sum_threads_chunks(data; nchunks=nthreads())
    psums = zeros(eltype(data), nchunks)
    @threads for (c, idcs) in enumerate(chunks(data; n=nchunks))
        for i in idcs
            psums[c] += data[i]
        end
    end
    return sum(psums)
end

In [None]:
sum_threads_chunks(data) ≈ sum(data)

In [None]:
@btime sum($data);
@btime sum_threads_chunks($data);

Safe, but (horribly) slow?! Why?

* manual loop doesn't SIMD (need to add manual `@simd` + `@inbounds`)
* But more importantly: **false sharing**

##### Performance issue: [False sharing](https://en.wikipedia.org/wiki/False_sharing)

Why does `sum_threads_chunks` above have bad performance? Although arguably subtle, this is because different tasks mutate shared data (`psums`) in parallel. There is no *logical* sharing: tasks access different slots of `psums` and there is no data race. However, CPU cores operate on **cache lines** rather than single elements, which leads to *implicit* sharing of cache lines.

**Despite its subtlety, false sharing can lead to dramatic slowdown!**

In [None]:
using CpuId

In [None]:
cachelinesize() ÷ sizeof(Float64)

<img src="imgs/false_sharing.svg" width=850px>

Different tasks modify the same cache line
* need for synchronization to ensure cache coherency
* performance decreases (dramatically).

**The less you modify non-local state, the better!**
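
To make the cache-line picture concrete, here is a sketch of a different workaround than the one used in the rest of this section: keeping the shared accumulators, but spacing them at least one cache line apart (padding). It assumes a 64-byte cache line, i.e. 8 `Float64` values. The function name `sum_threads_padded` and the `pad` keyword are hypothetical, and `Iterators.partition` is used for chunking only to keep the snippet dependency-free.

```julia
using Base.Threads: @threads, nthreads

function sum_threads_padded(data; nchunks = nthreads(), pad = 8)
    # One column per chunk; only the first row is used as accumulator.
    # With pad = 8 Float64 values (64 bytes, an ASSUMED cache line size),
    # two accumulators never lie on the same cache line.
    psums = zeros(eltype(data), pad, nchunks)
    chunksize = max(1, cld(length(data), nchunks))
    idx_chunks = collect(Iterators.partition(eachindex(data), chunksize))
    @threads for c in eachindex(idx_chunks)
        for i in idx_chunks[c]
            psums[1, c] += data[i]
        end
    end
    return sum(psums)
end

x = rand(10_000)
sum_threads_padded(x) ≈ sum(x)
```

Padding wastes memory and depends on hardware details, which is one reason to prefer not mutating shared state frequently in the first place.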

#### Chunk-focused task-local partial sums (good)

In [None]:
function sum_threads_chunks_local(data; nchunks=nthreads())
    psums = zeros(eltype(data), nchunks)
    @threads for (c, idcs) in enumerate(chunks(data; n=nchunks))
        local s = zero(eltype(data))
        @simd for i in idcs
            @inbounds s += data[i]
        end
        psums[c] = s
    end
    return sum(psums)
end

* each task/iteration computes a local sum (`s`) independently
* no *frequent* non-local mutation

In [None]:
sum(data) ≈ sum_threads_chunks_local(data)

In [None]:
@btime sum($data);
@btime sum_threads_chunks_local($data);

#### Task-focused version (even better)

**Key questions for task-based parallelisation:**
* How to divide the computation into separate **tasks**?
* How many **tasks** should we create?

In [None]:
# Conceptually, this is just `tmap(mysum, chunks)`

function sum_map_spawn(data; nchunks=nthreads())
    ts = map(chunks(data, n=nchunks)) do idcs
        @spawn @views sum(data[idcs])
    end
    return sum(fetch.(ts))
end

<details>
    <summary>Loop analogue (click to unfold)</summary>
    
<br>
    
```julia
function sum_loop_spawn(data; nchunks=nthreads())
    ts = Vector{Task}(undef, nchunks)
    for (c, idcs) in enumerate(chunks(data; n=nchunks))
        ts[c] = @spawn @views sum(data[idcs])
    end
    return sum(fetch.(ts))
end
```
</details>

* This version is task-focused → We're **explicitly** spawning one task per chunk.
* In this form, we don't need a manual pre-allocation (it is hidden in the map operation)
  * → no explicit indexing necessary (and thus no `enumerate` around `chunks`).
  * We have automatically circumvented the false sharing performance issue!

In [None]:
sum_map_spawn(data) ≈ sum(data)

In [None]:
@btime sum_map_spawn($data);

Further interesting performance improvements are still possible. However, they are beyond the scope of this course. 😄

## Opt out of dynamic scheduling

For "traditional HPC", where you tell each thread what to do, you might in some cases want/need a **guaranteed task-thread mapping**. This is possible to achieve with the following tools.

### `@spawnat`

We can opt out of task migration and **spawn *sticky* tasks on specific threads**.

Base Julia doesn't have a built-in macro for this but many packages, including [ThreadPinning.jl](https://github.com/carstenbauer/ThreadPinning.jl), provide a variant.

In [None]:
using ThreadPinning: @spawnat

In [None]:
@spawnat 4 println("Task ", taskid(), " is running on thread ", threadid());

### `@threads :static`

For `@threads` there is the `:static` scheduling option to opt out of Julia's dynamic scheduling.

Syntax: `@threads :static for ...`

 * **statically** maps tasks/chunks to threads, specifically: task 1 → thread 1, task 2 → thread 2, and so on.
   * no task migration, i.e. **fixed task-thread mapping** 👍
   * only little overhead 👍
   * not composable / nestable 👎

In [None]:
@threads :dynamic for i in 1:2*nthreads() # :dynamic is the default
    println("Task ", taskid(), " is running iteration ", i, " on thread ", threadid());
end

In [None]:
@threads :static for i in 1:2*nthreads()
    println("Task ", taskid(), " is running iteration ", i, " on thread ", threadid());
end

For `@threads :static`, every thread handles precisely two iterations!
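
This can also be checked programmatically instead of by reading the printed output. The sketch below records the thread id of each iteration and counts the iterations per thread; the names `n`, `tids` and `counts` are illustrative. Using `threadid()` is acceptable here only because `:static` guarantees a fixed task-thread mapping.

```julia
using Base.Threads: @threads, nthreads, threadid

n = 2 * nthreads()
tids = zeros(Int, n)

@threads :static for i in 1:n
    tids[i] = threadid()   # each iteration writes to its own slot
end

# number of iterations handled by each participating thread
counts = [count(==(t), tids) for t in unique(tids)]
```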

## Additional comments

### Tools for multi-threading

* [OhMyThreads.jl](https://github.com/JuliaFolds2/OhMyThreads.jl): Simple tools for basic multithreading.
* [ThreadsX.jl](https://github.com/JuliaFolds2/ThreadsX.jl): Parallelized Base functions
* [Tullio.jl](https://github.com/mcabbott/Tullio.jl): Tullio is a very flexible einsum macro ([Einstein notation](https://en.wikipedia.org/wiki/Einstein_notation))
* [(LoopVectorization.jl)](https://github.com/JuliaSIMD/LoopVectorization.jl): Macro(s) for vectorizing loops.
* [(FLoops.jl)](https://github.com/JuliaFolds/FLoops.jl): Fast sequential, threaded, and distributed for-loops for Julia

#### [OhMyThreads.jl](https://github.com/JuliaFolds2/OhMyThreads.jl)

In [None]:
using OhMyThreads: treduce, tmap

In [None]:
treduce(+, data)

In [None]:
@btime treduce($+, $data);

In [None]:
tmap(sin, data)

### Pinning Julia threads to CPU threads/cores

A compute node has a complex topology (two sockets, multiple memory channels/domains). Placing the Julia threads systematically on CPU-threads matters for

* the computation performance of your Julia codes
* fluctuations/noise in benchmarks
* hardware-level performance monitoring

What about external tools like `numactl`, `taskset`, etc.? They don't work reliably because they often [can't distinguish](https://discourse.julialang.org/t/thread-affinitization-pinning-julia-threads-to-cores/58069/5) between Julia threads and other internal threads.

**Options:**

* `JULIA_EXCLUSIVE=1`
* [ThreadPinning.jl](https://github.com/carstenbauer/ThreadPinning.jl)

#### ThreadPinning.jl

<!-- <br>
<img src="imgs/threadpinning_pinthreads.svg" width=600px>
</br> -->

`pinthreads(strategy)`
* `:cputhreads`: pin to CPU threads (incl. "hyperthreads") one after another
* `:cores`: pin to CPU cores one after another
* `:numa`: alternate between NUMA domains so, e.g., 0, 16, 32, 48, 64, ... (if a NUMA domain has 16 cores)
* `:sockets`: alternate between sockets so, e.g., 0, 64, 1, 65, 2, 66, ... (if a socket has 64 cores)
* `:affinitymask`: pin according to an external affinity mask (e.g. set by SLURM)

(More? See my short talk at JuliaCon2023 @ MIT: https://youtu.be/6Whc9XtlCC0)

In [None]:
pinthreads(:affinitymask)
threadinfo(; slurm=true)

### Garbage collection

When the garbage collector (GC) gets triggered, it stops the world (all threads) to clean up memory.

Hence, when using multithreading, it is even more important to **avoid heap allocations!**

(If you can't avoid allocations, consider using multiprocessing instead.)
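
As a small sketch of what avoiding heap allocations can look like in practice, the snippet below compares two ways of computing the same quantity and measures their allocations with `@allocated` from Base. The function names `sqdist_alloc` and `sqdist_noalloc` are made up for illustration.

```julia
x, y = rand(1000), rand(1000)

# Allocating: the broadcast `abs2.(x .- y)` materializes a temporary array on the heap
sqdist_alloc(x, y) = sum(abs2.(x .- y))

# Non-allocating: the generator is consumed element by element, no temporary array
sqdist_noalloc(x, y) = sum(abs2(xi - yi) for (xi, yi) in zip(x, y))

# Call both once first so that compilation is not measured
sqdist_alloc(x, y); sqdist_noalloc(x, y)

bytes_alloc   = @allocated sqdist_alloc(x, y)
bytes_noalloc = @allocated sqdist_noalloc(x, y)
println("allocating version:     ", bytes_alloc, " bytes")
println("non-allocating version: ", bytes_noalloc, " bytes")
```

Inside a multithreaded hot loop, the first variant would generate garbage on every call and therefore make stop-the-world collections more likely.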

### Atomic operations and locks

See [Atomic Operations](https://docs.julialang.org/en/v1/manual/multi-threading/#Atomic-Operations) and/or [Data-race freedom](https://docs.julialang.org/en/v1/manual/multi-threading/#Data-race-freedom) in the Julia documentation for more information. In general, one should avoid atomics and locks as much as possible since they limit parallelism by serializing execution (and they are easy to get wrong if you don't know what you're doing). That said, locks can be an effective way to use data structures that aren't themselves thread-safe, e.g. `Dict`.
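
A minimal sketch of both tools, using only Base functionality (`Threads.Atomic`, `Threads.atomic_add!`, `ReentrantLock`). The variable names are illustrative, and the loops are deliberately trivial: in real code the serialized part should be a small fraction of the work.

```julia
using Base.Threads: @threads, Atomic, atomic_add!

# Atomic counter: the increment is performed as one indivisible operation
counter = Atomic{Int}(0)
@threads for i in 1:1000
    atomic_add!(counter, 1)
end
counter[]  # read the current value

# Lock-protected Dict: only one task at a time may execute the locked block
d = Dict{Int,Int}()
lk = ReentrantLock()
@threads for i in 1:1000
    lock(lk) do
        d[i] = i^2
    end
end
length(d)
```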

We'll explore the effect of thread pinning on performance in more detail later → **daxpy_cpu exercise**