# High-performance computing with Julia

In this notebook we'll be looking at Julia's functionality for distributing work over multiple shared-memory workers (threads/cores). Julia also has two main packages for doing distributed-memory parallelism:

- [Distributed.jl](https://docs.julialang.org/en/v1/stdlib/Distributed/) - this is a standard library that comes shipped with the language. 
- [MPI.jl](https://juliaparallel.org/MPI.jl/stable/) - a Julia wrapper for MPI.

Additionally, Julia supports GPU programming via the [CUDA.jl](https://cuda.juliagpu.org/stable/) package. There is also ongoing work for supporting AMD, Intel, and Apple GPUs. You can read more about that [here](https://github.com/JuliaGPU/KernelAbstractions.jl).

# Multithreading in Julia

Julia comes packaged with a standard library called `Threads` for working with multiple shared-memory workers. However, before starting we need to make sure that Julia is started with more than one thread. To check how many Julia threads are currently running, call

In [22]:
Threads.nthreads()

16

If you've just installed Julia without changing anything, this number will likely be one. There are multiple way to ensure that Julia starts with multiple threads.
- Set the environment variable `JULIA_NUM_THREADS` to some number, for example to "$(nproc)" to use all the cores in your computer.
- Start Julia with the `--threads` or `-t` option followed by the number of threads. I.e. `julia --threads 4`.
- In VS Code, set the `"julia.NumThreads": NUMBER` option in `settings.json`

We can readily check again how many threads we are running:

In [23]:
Threads.nthreads()

16

## What are threads?
Threads are **execution units within a process** that can run simultaneously. While processes are separate, threads run in a **shared memory** space (heap).

<!-- <img src="./imgs/what-are-threads.png" width=500px> -->

<br>
<img src="../figures/stack_heap_threads.svg" width=450px>
<br>

**It is currently not (easily) possible to change the number of threads at runtime!**

### User threads vs default threads

Technically, the Julia process is also spawning multiple threads already in "single-threaded" mode, like
* a thread for unix signal listening
* multiple OpenBLAS threads for BLAS/LAPACK operations
* GC threads

We call the threads that we can actually run computations on *user threads* or *Julia threads*.

In [24]:
using LinearAlgebra
BLAS.get_num_threads()

1

## Where are my threads running?

In [25]:
using ThreadPinning

In [26]:
threadinfo()

Hostname: 	CasparWB-API
CPU(s): 	1 x 12th Gen Intel(R) Core(TM) i7-1260P
CPU target: 	alderlake
Cores: 		8 (16 CPU-threads due to 2-way SMT)
NUMA domains: 	1 (8 cores each)

[32m[1mJulia threads: 	16[22m[39m

[36m[1mCPU socket 1[22m[39m
  [39m0,[95m[1m1[22m[39m, [31m[1m2[22m[39m,[95m[1m3[22m[39m, [33m[1m4[22m[39m,[95m[1m5[22m[39m, [33m[1m6[22m[39m,[95m[1m7[22m[39m, [33m[1m8[22m[39m,[90m9[39m, [33m[1m10[22m[39m,[95m[1m11[22m[39m, [33m[1m12[22m[39m,[95m[1m13[22m[39m, [33m[1m14[22m[39m,[95m[1m15[22m[39m


[33m[1m#[22m[39m = Julia thread, [95m[1m#[22m[39m = Julia thread on HT, [31m[1m#[22m[39m = >1 Julia thread

[90m(Mapping:[39m[90m 1 => 4,[39m[90m 2 => 2,[39m[90m 3 => 14,[39m[90m 4 => 5,[39m[90m 5 => 1,[39m[90m ...[39m)


## Task-based multithreading

In traditional HPC, one typically cares about threads directly. Using e.g. OpenMP, one essentially tells each thread what to do.

Conceptually, Julia takes a different approach and implements **task-based** multithreading. In this paradigm, a task - e.g. a computational piece of a code - is marked for **parallel** execution on **any** of the available Julia threads. Julia's **dynamic scheduler** will automatically put the task on one of the threads and trigger the execution of the task on said thread.

<br>
<!-- <img src="imgs/task-based-parallelism.png" width=768px> -->
<img src="../figures/tasks_threads_cores.svg" width=650px>
</br>

Generally speaking, the user should **think about tasks and not threads**.
* The scheduler is controlling on which thread a task will eventually run.
* It might even dynamically [migrate tasks](https://docs.julialang.org/en/v1/manual/multi-threading/#man-task-migration) between threads.

**Advantages:**
* high-level abstraction
* nestability / composability (especially important for libraries)

**Disadvantages:**
* scheduling overhead
* uncertain and potentially suboptimal task → thread assignment
  * **can get in the way when performance engineering** because
    * scheduler has limited information (e.g. about the system topology)
    * profiling tools often don't know anything about tasks but monitor threads (or even CPU-cores) instead (e.g. LIKWID).

### Tasks

By default, Julia waits for commands to finish ("**blocking**") and runs everything sequentially.

**Tasks** are a feature that allows (parts of) computations to be scheduled (suspended and resumed) in a flexible manner to implement **concurrency** and **parallelism**.

* Concurrency
    * Dealing with lots of things *in a time period* ("multi-tasking").
    * Can be used on a single thread.
* Parallelism
    * Doing lots of things *at the same instant*.
    * Needs multiple threads (or processes).

Example (concurrency): **asynchronous I/O** like
  * **multiple user input** (Why not already process some of the input?)
  * **data dumping to disk** (Maybe it's possible to continue a calculation?)
 
Example (parallelism): **multithreading, distributed computing**

### Spawning parallel tasks: `Threads.@spawn`
`Threads.@spawn` spawns a task to be run on any Julia thread. Specifically, it creates a `Task` and schedules it for execution on an available Julia thread (we don't control which one!).

Note that `Threads.@spawn` is **asynchronous** and **non-blocking**, that is, it doesn't wait for the task to actually run but immediately returns a `Task`.

In [27]:
using Base.Threads # afterwards we can just write @spawn instead of Threads.@spawn

In [28]:
@spawn 3+3

Task (done) @0x00007f78d5c26d60

We can fetch the result of a task with `fetch`.

In [29]:
t = @spawn 3+3
fetch(t)

6

While `@spawn` returns right away, `fetch` is **blocking** as it has to wait for the task to actually finish.

In [30]:
@time t = @spawn begin
    sleep(3)
    return 3+3
end
@time fetch(t)

  0.000393 seconds (1.25 k allocations: 62.406 KiB)
  2.930254 seconds (104 allocations: 4.188 KiB)


6

We can use the macro `@sync` to synchronize all encompassed asynchronous operations (`@spawn`).

In [31]:
@time @sync t = @spawn begin
    sleep(3)
    return 3+3
end
@time fetch(t)

  3.008581 seconds (11.64 k allocations: 605.484 KiB, 0.22% compilation time)
  0.000009 seconds


6

#### Example: multithreaded `map`

`tmap`: *threaded map*

In [32]:
function tmap(fn, itr)
    tasks = map(i -> @spawn(fn(i)), itr)  # for each i ∈ itr, spawn a task to compute fn(i)
    return fetch.(tasks)                  # fetch and return all the results
end

tmap (generic function with 1 method)

In [33]:
M = [rand(200,200) for i in 1:8];

In [34]:
tmap(svdvals, M)

8-element Vector{Vector{Float64}}:
 [100.20838520782826, 8.074398919833639, 7.9198275218987995, 7.747850965071262, 7.637500376625816, 7.610284365656103, 7.486828646174716, 7.405870161493915, 7.359989378697254, 7.275497028052682  …  0.27572533232943347, 0.24370274844701004, 0.20832024053757534, 0.18645182730288187, 0.16180357010162813, 0.14113474037036663, 0.09942087943232351, 0.07009203025062351, 0.053065402524928376, 0.011392981709151801]
 [100.11385819555302, 8.180389220880404, 7.908630658700188, 7.726741914252684, 7.687339405827408, 7.496328331426023, 7.410523868186568, 7.37484661285418, 7.3089415863336304, 7.273974314097562  …  0.29951460156031257, 0.2801283461701338, 0.24302044310150994, 0.22874156070567966, 0.19938947628434303, 0.1670554200328792, 0.11083854291073707, 0.05382323932997336, 0.040080836350062085, 0.002093292409368463]
 [100.22917249696874, 8.0338897377666, 7.832750252510585, 7.792933293413332, 7.74572093071225, 7.607049665751441, 7.440680341270445, 7.332727562141124

In [35]:
using BenchmarkTools

In [36]:
@btime tmap($svdvals, $M) samples=10 evals=3;
@btime map($svdvals, $M) samples=10 evals=3;

  3.945 ms (155 allocations: 3.38 MiB)
  17.142 ms (106 allocations: 3.37 MiB)


**performance issue**:

* Using Julia multithreading + BLAS multithreading
    - CPU cores may be *overscribed*, e.g. 256 total threads on 128 CPU cores! (red bars in `htop`)

If you use BLAS, it is important to carefully consider and configure the [interplay between Julia threads and BLAS threads](https://carstenbauer.github.io/ThreadPinning.jl/stable/explanations/blas/).

In [37]:
BLAS.set_num_threads(1)

In [38]:
@btime tmap($svdvals, $M) samples=10 evals=3;
@btime map($svdvals, $M) samples=10 evals=3;

  4.734 ms (155 allocations: 3.38 MiB)
  16.049 ms (106 allocations: 3.37 MiB)


#### Example: multithreading for-loops

In [39]:
using ThreadPinning.Utility: taskid

In [40]:
@sync for i in 1:2*nthreads()
    @spawn println("Task ", taskid(), " is running iteration ", i, " on thread ", threadid())
end

Task 11488916986724442474 is running iteration 1 on thread 2
Task 15234522458667997043 is running iteration 26 on thread 5
Task 957369398481928481 is running iteration 18 on thread 15
Task 3284671388240129601 is running iteration 24 on thread 1
Task 6811353691080217303 is running iteration 30 on thread 11
Task 7405514637394808702 is running iteration 29 on thread 11
Task 2403952435711093788 is running iteration 15 on thread 6
Task 14155726864905622120 is running iteration 28 on thread 5
Task 2187117897820264066 is running iteration 11 on thread 5
Task 2063126083167385598 is running iteration 16 on thread 1
Task 6368343915477012251 is running iteration 6 on thread 7
Task 8779815644241969064 is running iteration 9 on thread 3
Task 6401274383233523539 is running iteration 3 on thread 15
Task 12220232873917988113 is running iteration 31 on thread 13
Task 8556070160371430310 is running iteration 2 on thread 13
Task 17417801917760207650 is running iteration 25 on thread 16
Task 1048135248865

##### `@threads`

* **Splits up the iteration space into `nthreads()` contiguous chunks**
* Creates a task for each of them.

In [41]:
# creates nthreads() many tasks

@threads for i in 1:2*nthreads()
    println("Task ", taskid(), " is running iteration ", i, " on thread ", threadid())
end

Task 3265126592589086464 is running iteration 1 on thread 2
Task 1843399858258310311 is running iteration 21 on thread 10
Task 3265126592589086464 is running iteration 2 on thread 2
Task 15414827125662656820 is running iteration 25 on thread 16
Task 4639816724120330272 is running iteration 11 on thread 8
Task 9070916526061227709 is running iteration 23 on thread 7
Task 2446670846823866856 is running iteration 13 on thread 13
Task 4639816724120330272 is running iteration 12 on thread 8
Task 12709182576872726629 is running iteration 29 on thread 1
Task 2216687187093315125 is running iteration 15 on thread 9
Task 2446670846823866856 is running iteration 14 on thread 13
Task 9630774442592466691 is running iteration 5 on thread 14
Task 1843399858258310311 is running iteration 22 on thread 10
Task 15414827125662656820 is running iteration 26 on thread 16
Task 13489389001739313653 is running iteration 9 on thread 11
Task 13626781883161134767 is running iteration 3 on thread 15
Task 9630774442

### Nestability / Composability

#### Example: Recursive Fibonacci series

$$ F(n) = F(n-1) + F(n-2), \qquad F(1) = F(2) = 1$$

We can nest `@spawn` calls freely!

In [42]:
function fib(n)
    n < 2 && return n
    t = @spawn fib(n-2)
    return fib(n-1) + fetch(t)
end

fib (generic function with 1 method)

In [43]:
fib(20)

6765

(Note: Algorithmically, this is a highly inefficient implementation of the Fibonacci series, of course!)

## Multithreading: Things to be aware of

### Instructive example: parallel summation

In [44]:
data = rand(1_000_000 * Threads.nthreads());

#### Naive approach

In [45]:
function sum_threads_naive(data)
    s = zero(eltype(data))
    @threads for x in eachindex(data)
        s += x
    end
    return s
end

sum_threads_naive (generic function with 1 method)

In [46]:
@show sum(data);
@show sum_threads_naive(data);
@show sum_threads_naive(data);

sum(data) = 8.001388345539851e6
sum_threads_naive(data) = 1.3508063908667e13
sum_threads_naive(data) = 7.500400619616e12


**Wrong** result! Even worse, it's **non-deterministic** and different every time!

There is a [race condition](https://en.wikipedia.org/wiki/Race_condition) which typically appear when multiple tasks are modifying a shared value simultaneously.

→ **Don't modify shared "global" state!**

Sometimes things can be more subtle. Examples: random number generation, `Dict`. Note that not all of Julia and its packages in the ecosystem are thread-safe! In general, it is safer to assume that they're not unless documented/proven otherwise. (`rand()` is thread-safe, `Dict` isn't!)

## Additional comments

### Tools for multi-threading

* [OhMyThreads.jl](https://github.com/JuliaFolds2/OhMyThreads.jl): Simple tools for basic multithreading.
* [ThreadsX.jl](https://github.com/JuliaFolds2/ThreadsX.jl): Parallelized Base functions
* [Tullio.jl](https://github.com/mcabbott/Tullio.jl): Tullio is a very flexible einsum macro ([Einstein notation](https://en.wikipedia.org/wiki/Einstein_notation))
* [(LoopVectorization.jl)](https://github.com/JuliaSIMD/LoopVectorization.jl): Macro(s) for vectorizing loops.
* [(FLoops.jl)](https://github.com/JuliaFolds/FLoops.jl): Fast sequential, threaded, and distributed for-loops for Julia

# Exercises

Re-do the exercise `Counting nucleotides` from the `basics.ipynb` notebook by implementing a multithreaded version. Compare the performance with a single-threaded. Try generating your own strings with different lengths and compare the results. 