# High-performance computing with Julia

In this notebook we'll be looking at Julia's functionality for distributing work over multiple shared-memory workers (threads/cores). Julia also has two main packages for doing distributed-memory parallelism:

- [Distributed.jl](https://docs.julialang.org/en/v1/stdlib/Distributed/) - this is a standard library that comes shipped with the language. 
- [MPI.jl](https://juliaparallel.org/MPI.jl/stable/) - a Julia wrapper for MPI.

Additionally, Julia supports GPU programming via the [CUDA.jl](https://cuda.juliagpu.org/stable/) package. There is also ongoing work for supporting AMD, Intel, and Apple GPUs. You can read more about that [here](https://github.com/JuliaGPU/KernelAbstractions.jl).

# Multithreading in Julia

Julia ships with the `Threads` module (`Base.Threads`) for working with multiple shared-memory workers. Before using it, we need to make sure that Julia was started with more than one thread. To check how many Julia threads are currently available, call

In [None]:
Threads.nthreads()

If you've just installed Julia without changing anything, this number will likely be one. There are multiple ways to ensure that Julia starts with multiple threads:
- Set the environment variable `JULIA_NUM_THREADS` to some number, for example to "$(nproc)" to use all the cores in your computer.
- Start Julia with the `--threads` or `-t` option followed by the number of threads, e.g. `julia --threads 4`.
- In VS Code, set the `"julia.NumThreads": NUMBER` option in `settings.json`.

We can readily check again how many threads we are running:

In [None]:
Threads.nthreads()

## What are threads?
Threads are **execution units within a process** that can run simultaneously. While processes are separate, threads run in a **shared memory** space (heap).

<!-- <img src="./imgs/what-are-threads.png" width=500px> -->

<br>
<img src="../figures/stack_heap_threads.svg" width=450px>
<br>

**It is currently not (easily) possible to change the number of threads at runtime!**

### User threads vs default threads

Technically, even in "single-threaded" mode the Julia process already spawns several threads, for example:
* a thread for listening to Unix signals
* multiple OpenBLAS threads for BLAS/LAPACK operations
* GC threads

We call the threads that we can actually run computations on *user threads* or *Julia threads*.

In [None]:
using LinearAlgebra
BLAS.get_num_threads()

## Where are my threads running?

In [25]:
using ThreadPinning

In [None]:
threadinfo()

## Task-based multithreading

In traditional HPC, one typically cares about threads directly. Using e.g. OpenMP, one essentially tells each thread what to do.

Conceptually, Julia takes a different approach and implements **task-based** multithreading. In this paradigm, a task - e.g. a computational piece of a code - is marked for **parallel** execution on **any** of the available Julia threads. Julia's **dynamic scheduler** will automatically put the task on one of the threads and trigger the execution of the task on said thread.

<br>
<!-- <img src="imgs/task-based-parallelism.png" width=768px> -->
<img src="../figures/tasks_threads_cores.svg" width=650px>
<br>

Generally speaking, the user should **think about tasks and not threads**.
* The scheduler is controlling on which thread a task will eventually run.
* It might even dynamically [migrate tasks](https://docs.julialang.org/en/v1/manual/multi-threading/#man-task-migration) between threads.

**Advantages:**
* high-level abstraction
* nestability / composability (especially important for libraries)

**Disadvantages:**
* scheduling overhead
* uncertain and potentially suboptimal task → thread assignment
  * **can get in the way when performance engineering** because
    * scheduler has limited information (e.g. about the system topology)
    * profiling tools often don't know anything about tasks but monitor threads (or even CPU-cores) instead (e.g. LIKWID).

### Tasks

By default, Julia waits for commands to finish ("**blocking**") and runs everything sequentially.

**Tasks** are a feature that allows (parts of) computations to be scheduled (suspended and resumed) in a flexible manner to implement **concurrency** and **parallelism**.

* Concurrency
    * Dealing with lots of things *in a time period* ("multi-tasking").
    * Can be used on a single thread.
* Parallelism
    * Doing lots of things *at the same instant*.
    * Needs multiple threads (or processes).

Example (concurrency): **asynchronous I/O** like
  * **multiple user input** (Why not already process some of the input?)
  * **data dumping to disk** (Maybe it's possible to continue a calculation?)
 
Example (parallelism): **multithreading, distributed computing**
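
To make the distinction concrete, here is a minimal sketch of concurrency without parallelism. It assumes that `sleep` is an acceptable stand-in for waiting on I/O, and it uses `@async`, which keeps both tasks on the current thread:

```julia
# Two tasks that each "wait for I/O" (simulated with sleep) for 0.5 s.
# @async schedules each task on the current thread; while one task sleeps,
# the scheduler switches to the other, so the two waits overlap.
elapsed = @elapsed @sync begin
    @async sleep(0.5)
    @async sleep(0.5)
end
println("elapsed: ", round(elapsed; digits=2), " s")
```

Run sequentially, the two waits would take at least one second. Here they overlap even on a single thread, because a sleeping task yields to the scheduler.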

### Spawning parallel tasks: `Threads.@spawn`
`Threads.@spawn` spawns a task to be run on any Julia thread. Specifically, it creates a `Task` and schedules it for execution on an available Julia thread (we don't control which one!).

Note that `Threads.@spawn` is **asynchronous** and **non-blocking**, that is, it doesn't wait for the task to actually run but immediately returns a `Task`.

In [27]:
using Base.Threads # afterwards we can just write @spawn instead of Threads.@spawn

In [None]:
@spawn 3+3

We can fetch the result of a task with `fetch`.

In [None]:
t = @spawn 3+3
fetch(t)

While `@spawn` returns right away, `fetch` is **blocking** as it has to wait for the task to actually finish.

In [None]:
@time t = @spawn begin
    sleep(3)
    return 3+3
end
@time fetch(t)
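
The same behaviour can be observed without timing by asking the task whether it has finished. A minimal sketch using `istaskdone` from Base (the variable names are only illustrative):

```julia
using Base.Threads: @spawn

t = @spawn begin
    sleep(0.5)
    3 + 3
end
done_before = istaskdone(t)   # queried right after spawning, while the task still sleeps
result = fetch(t)             # blocks until the task has finished
done_after = istaskdone(t)    # fetch only returns once the task is done
@show done_before result done_after
```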

We can use the macro `@sync` to synchronize all encompassed asynchronous operations (`@spawn`).

In [None]:
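# Assign the task outside of @sync: a variable assigned inside the @sync block
# is local to that block and would not be visible to the fetch below.
@time t = @sync @spawn begin
    sleep(3)
    return 3+3
end
@time fetch(t)  # returns immediately, the task has already finished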

#### Example: multithreaded `map`

`tmap`: *threaded map*

In [None]:
function tmap(fn, itr)
    tasks = map(i -> @spawn(fn(i)), itr)  # for each i ∈ itr, spawn a task to compute fn(i)
    return fetch.(tasks)                  # fetch and return all the results
end

In [33]:
M = [rand(200,200) for i in 1:8];

In [None]:
tmap(svdvals, M)

In [35]:
using BenchmarkTools

In [None]:
@btime tmap($svdvals, $M) samples=10 evals=3;
@btime map($svdvals, $M) samples=10 evals=3;

**performance issue**:

* Using Julia multithreading + BLAS multithreading
    - CPU cores may be *oversubscribed*, e.g. 256 total threads on 128 CPU cores! (red bars in `htop`)

If you use BLAS, it is important to carefully consider and configure the [interplay between Julia threads and BLAS threads](https://carstenbauer.github.io/ThreadPinning.jl/stable/explanations/blas/).

In [37]:
BLAS.set_num_threads(1)

In [None]:
@btime tmap($svdvals, $M) samples=10 evals=3;
@btime map($svdvals, $M) samples=10 evals=3;
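
`BLAS.set_num_threads` changes process-wide state, so it is easy to forget to restore the previous value. One possible pattern is a small helper that changes the BLAS thread count only for the duration of a call. This is a sketch: `with_blas_threads` is a hypothetical name and not part of LinearAlgebra or ThreadPinning.

```julia
using LinearAlgebra

# Run f() with n BLAS threads and restore the previous setting afterwards.
# Note: the BLAS thread count is global, so this helper must not be called
# from several tasks at the same time.
function with_blas_threads(f, n)
    old = BLAS.get_num_threads()
    BLAS.set_num_threads(n)
    try
        return f()
    finally
        BLAS.set_num_threads(old)   # restored even if f() throws
    end
end

# Usage: compute singular values with a single BLAS thread.
vals = with_blas_threads(1) do
    svdvals(rand(50, 50))
end
```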

#### Example: multithreading for-loops

In [39]:
using ThreadPinning.Utility: taskid

In [None]:
@sync for i in 1:2*nthreads()
    @spawn println("Task ", taskid(), " is running iteration ", i, " on thread ", threadid())
end

##### `@threads`

* **Splits up the iteration space into `nthreads()` contiguous chunks**
* Creates a task for each of them.

In [None]:
# creates nthreads() many tasks

@threads for i in 1:2*nthreads()
    println("Task ", taskid(), " is running iteration ", i, " on thread ", threadid())
end
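
`@threads` is a natural fit when every iteration is independent of the others. A minimal sketch (the array `squares` is only an illustration) in which each iteration writes to its own element, so the tasks never touch the same memory location:

```julia
using Base.Threads: @threads, nthreads

squares = zeros(Int, 2 * nthreads())
@threads for i in eachindex(squares)
    squares[i] = i^2     # each iteration writes only to its own element
end
squares
```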

### Nestability / Composability

#### Example: Recursive Fibonacci series

$$ F(n) = F(n-1) + F(n-2), \qquad F(1) = F(2) = 1$$

We can nest `@spawn` calls freely!

In [None]:
function fib(n)
    n < 2 && return n
    t = @spawn fib(n-2)
    return fib(n-1) + fetch(t)
end

In [None]:
fib(20)

(Note: Algorithmically, this is of course a highly inefficient way to compute Fibonacci numbers!)
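
Spawning a task for every recursive call also means that scheduling overhead dominates for small `n`. A common mitigation, sketched here under the assumption that a serial cutoff is acceptable, is to spawn tasks only for large subproblems. The names `fib_serial`, `fib_cutoff` and the default `cutoff = 20` are hypothetical and not tuned:

```julia
using Base.Threads: @spawn

# Plain sequential version, used below the cutoff.
fib_serial(n) = n < 2 ? n : fib_serial(n - 1) + fib_serial(n - 2)

# Spawn tasks only for subproblems larger than `cutoff`.
function fib_cutoff(n; cutoff = 20)
    n <= cutoff && return fib_serial(n)
    t = @spawn fib_cutoff(n - 2; cutoff = cutoff)
    return fib_cutoff(n - 1; cutoff = cutoff) + fetch(t)
end

fib_cutoff(25)
```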

## Multithreading: Things to be aware of

### Instructive example: parallel summation

In [44]:
data = rand(1_000_000 * Threads.nthreads());

#### Naive approach

In [None]:
function sum_threads_naive(data)
    s = zero(eltype(data))
    @threads for i in eachindex(data)
        s += data[i]  # all tasks read and write the shared variable `s` (intentional race)
    end
    return s
end

In [None]:
@show sum(data);
@show sum_threads_naive(data);
@show sum_threads_naive(data);

**Wrong** result! Even worse, it's **non-deterministic** and different every time!

There is a [race condition](https://en.wikipedia.org/wiki/Race_condition), which typically appears when multiple tasks modify a shared value simultaneously.

→ **Don't modify shared "global" state!**
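
One way to avoid the shared state of `sum_threads_naive` is to give every task its own partial result and to combine the partial results at the end. A minimal sketch, in which `sum_threads_chunked` and its `ntasks` keyword are hypothetical names and the chunking strategy is just one of several possible choices:

```julia
using Base.Threads: @spawn, nthreads

function sum_threads_chunked(data; ntasks = nthreads())
    isempty(data) && return zero(eltype(data))
    # Split the indices into contiguous chunks, roughly one per task.
    chunksize = cld(length(data), ntasks)
    chunks = Iterators.partition(eachindex(data), chunksize)
    # Each task sums its own chunk into a task-local result: nothing is shared.
    tasks = map(chunks) do idcs
        @spawn sum(@view data[idcs])
    end
    # Combine the partial sums sequentially once the tasks have finished.
    return sum(fetch, tasks)
end

x = rand(100_000)
sum_threads_chunked(x) ≈ sum(x)
```

The comparison uses `≈` rather than `==` because the additions happen in a different order than in `sum`, so the floating-point results may differ in the last digits.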

Sometimes things can be more subtle. Examples: random number generation, `Dict`. Note that not all of Julia and its packages in the ecosystem are thread-safe! In general, it is safer to assume that they're not unless documented/proven otherwise. (`rand()` is thread-safe, `Dict` isn't!)
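
When a shared, non-thread-safe container such as a `Dict` really has to be updated from several tasks, access can be protected with a lock. A minimal sketch using `ReentrantLock` from Base (the function `count_mod10` is a made-up example):

```julia
using Base.Threads: @threads

# Count how many of the numbers 1:n fall into each residue class modulo 10.
function count_mod10(n)
    counts = Dict{Int,Int}()
    lk = ReentrantLock()
    @threads for i in 1:n
        key = i % 10
        lock(lk) do
            # Only one task at a time may read and update the Dict.
            counts[key] = get(counts, key, 0) + 1
        end
    end
    return counts
end

count_mod10(1000)
```

The lock makes the update correct but also serializes it, so this pattern is about safety rather than speed. Task-local results that are merged at the end usually scale better.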

## Additional comments

### Tools for multi-threading

* [OhMyThreads.jl](https://github.com/JuliaFolds2/OhMyThreads.jl): Simple tools for basic multithreading.
* [ThreadsX.jl](https://github.com/JuliaFolds2/ThreadsX.jl): Parallelized Base functions.
* [Tullio.jl](https://github.com/mcabbott/Tullio.jl): A very flexible einsum macro ([Einstein notation](https://en.wikipedia.org/wiki/Einstein_notation)).
* [(LoopVectorization.jl)](https://github.com/JuliaSIMD/LoopVectorization.jl): Macro(s) for vectorizing loops.
* [(FLoops.jl)](https://github.com/JuliaFolds/FLoops.jl): Fast sequential, threaded, and distributed for-loops for Julia.

# Exercises

Re-do the exercise `Counting nucleotides` from the `1-basics.ipynb` notebook by implementing a multithreaded version. Compare the performance with the single-threaded version. Try generating your own strings with different lengths and compare the results. 

_Hint: You can generate `N` random elements selected from a collection by calling `rand(collection, N)`._
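
As an illustration of the hint, a random test sequence could be generated as follows. This is a sketch only: the names `N`, `nucleotides` and `seq` are arbitrary, and the input format expected by the original exercise may differ.

```julia
N = 20
nucleotides = ['A', 'C', 'G', 'T']
seq = join(rand(nucleotides, N))   # a random String of length N
println(seq)
```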