# Distributed Computing: Message Passing Interface (MPI)

## MPI and MPI.jl

* **[MPI](https://www.mpi-forum.org/)**: A [standard](https://www.mpi-forum.org/docs/) with several specific implementations (e.g. [OpenMPI](https://www.open-mpi.org/), [IntelMPI](https://www.intel.com/content/www/us/en/developer/tools/oneapi/mpi-library.html#gs.73krlr), [MPICH](https://www.mpich.org/))
* **[MPI.jl](https://github.com/JuliaParallel/MPI.jl)**: Julia package and interface to (most) MPI implementations ([paper](https://proceedings.juliacon.org/papers/10.21105/jcon.00068))

### Getting MPI

By default, an appropriate MPI implementation is downloaded automatically when MPI.jl is added to a Julia environment (see e.g. MPICH_jll.jl). This works out of the box almost all of the time.

However, in particular on HPC clusters, one sometimes wants/needs to use a **system-wide MPI** installation. Potential reasons include:

* Vendor-specific MPI required for MPI to work at all.
* Fine-tuned MPI configuration necessary for best performance.
* CUDA-aware or ROCm-aware MPI

#### How to use a system MPI?

```julia
using MPIPreferences
MPIPreferences.use_system_binary()
```
If you do this before adding MPI.jl, no MPI will be downloaded.

For more, check out the [MPI.jl documentation](https://juliaparallel.org/MPI.jl/stable/configuration/#configure_system_binary).
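
To confirm which MPI library a Julia environment actually picks up, a quick check along the following lines can help. This is a minimal sketch, assuming a recent MPI.jl (v0.20 or newer); the printed text depends entirely on the machine and on the configured MPI.

```julia
using MPI

MPI.Init()

# Version string reported by the MPI library that was actually loaded
libversion = MPI.Get_library_version()
println(libversion)

# Summary of the MPI.jl configuration (selected binary, ABI, library info)
MPI.versioninfo()
```

If a system MPI was selected, the library version string should match the cluster's MPI module rather than a Julia-provided one.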

### MPI: Programming model

Single Program Multiple Data (SPMD):
* **all processes execute the same code** but have different IDs (ranks).
* **conditionals can be used to get different behavior** for different MPI ranks
* individual processes run at their own pace, **they can (and will) get out of sync**
* selecting the concrete number of processes is deferred to "runtime"
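
The points above can be sketched in a few lines. This is an illustrative example only; the `role` labels are made up for the demonstration.

```julia
using MPI

MPI.Init()
comm   = MPI.COMM_WORLD
rank   = MPI.Comm_rank(comm)   # 0-based ID of this process
nranks = MPI.Comm_size(comm)   # fixed when the program is launched

# Every rank runs this same file; the branch taken depends on the rank.
role = rank == 0 ? "coordinator" : "worker"
println("rank $rank of $nranks acts as $role")

# Ranks run independently, so the lines above may print in any order.
# A barrier is one explicit way to bring them back in sync.
MPI.Barrier(comm)
```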

#### Basic example: Hello World

```julia
# file: mpi_examples/mpi_hello.jl
using MPI

MPI.Init()

comm = MPI.COMM_WORLD

print("Hello world, I am rank $(MPI.Comm_rank(comm)) of $(MPI.Comm_size(comm))\n")

MPI.Finalize()
```

#### Fundamental MPI functions

* `MPI.Init()` and `MPI.Finalize()`: called at the very beginning and the very end of your MPI code, respectively.

* `MPI.COMM_WORLD`: default *communicator* that includes all processes (MPI ranks)

* `MPI.Comm_rank(comm)`: unique rank of the process calling this function

* `MPI.Comm_size(comm)`: total number of processes for the given communicator

#### Running an MPI code

MPI applications are typically launched with `mpirun` or `mpiexec` (or `srun` on SLURM clusters). These work fine with Julia.

However, MPI.jl provides a [`mpiexecjl` wrapper](https://juliaparallel.org/MPI.jl/stable/configuration/#Julia-wrapper-for-mpiexec) that you can/should use if you want to
* use an MPI that has been installed automatically by MPI.jl
* use different MPIs for different Julia environments

**In this course, you should use the following command to run your MPI application (both on the laptops and on Hawk):**

`mpiexecjl --project -n N julia mycode.jl`

Here, `N` is the desired number of MPI ranks.
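
If `mpiexecjl` is not yet available on a machine, MPI.jl can install the wrapper script itself. The following is a sketch, assuming the default destination (the `bin` folder of the first Julia depot, typically `~/.julia/bin`); that folder still has to be added to `PATH` by hand.

```julia
using MPI

# Destination folder, spelled out explicitly (this is assumed to be the default)
destdir = joinpath(DEPOT_PATH[1], "bin")

# Install the `mpiexecjl` wrapper script; `force=true` overwrites an existing copy
MPI.install_mpiexecjl(; destdir=destdir, force=true)

wrapper = joinpath(destdir, "mpiexecjl")
println("Wrapper installed at: ", wrapper)
```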

<img src="./imgs/julia_mpi_example.png">

## MPI.jl vs MPI in C

General rule: `MPI_*` in C → `MPI.*` in Julia, e.g.

  * `MPI_COMM_WORLD` → `MPI.COMM_WORLD`
  * `MPI_Comm_size` → `MPI.Comm_size`
  
In principle, you can use the "low-level" API under `MPI.API.*`, which is essentially identical to what you might know from C.
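
To make the difference concrete, here is a small comparison of the high-level and the low-level call for the communicator size. It is a sketch, assuming a recent MPI.jl (v0.20 or newer) where the C bindings live under `MPI.API`.

```julia
using MPI

MPI.Init()
comm = MPI.COMM_WORLD

# High-level call: returns the result directly
size_hi = MPI.Comm_size(comm)

# Low-level call, mirroring C's `int MPI_Comm_size(MPI_Comm comm, int *size)`:
# the result is written into a reference that we allocate ourselves.
size_ref = Ref{Cint}()
MPI.API.MPI_Comm_size(comm, size_ref)
size_lo = Int(size_ref[])

println("high-level: $size_hi, low-level: $size_lo")
```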

### Conveniences of MPI.jl

* Julia MPI functions can have **fewer function arguments** than their C counterparts if some of them are deducible/optional.
* MPI functions can often handle data of **built-in and custom Julia types** (i.e. custom `struct`s)
  * `MPI.Types.create_*` constructor functions (`create_vector`, `create_subarray`, `create_struct`, etc.) get automatically called under the hood.
* MPI functions can often handle **built-in and custom Julia functions**, e.g. as a reducer function in `MPI.Reduce`. (see `mpi_examples/6_mpi_custom.jl`)
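
As an illustration of the last two points, the sketch below broadcasts an array of a custom `struct` and reduces with a custom function. The type `Particle` and the function `absmax` are hypothetical names made up for this example. It assumes a recent MPI.jl (v0.20 or newer); custom reduction functions may not work out of the box on every CPU architecture.

```julia
using MPI

MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)

# Hypothetical custom type; it must be `isbits` (plain data, no pointers)
struct Particle
    position::Float64
    id::Int32
end

# Rank 0 holds the data, all other ranks provide an equally sized buffer
particles = rank == 0 ? [Particle(1.5, 1), Particle(-0.5, 2)] : Vector{Particle}(undef, 2)
MPI.Bcast!(particles, comm; root=0)   # no manual MPI datatype setup needed

# Hypothetical custom reduction function: keep the value of largest magnitude
absmax(a, b) = abs(a) >= abs(b) ? a : b
local_value = rank % 2 == 0 ? Float64(rank + 1) : -Float64(rank + 1)
extreme = MPI.Allreduce(local_value, absmax, comm)

rank == 0 && println("particles = $particles, extreme = $extreme")
```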

## Basic MPI functions

### Point-to-point communication (**blocking**)

* **Sending:** `MPI.Send(buf, comm; dest, tag=0)`
  * `buf`: data buffer, typically an array
  * `comm`: communicator
  * `dest`: destination, i.e. the rank that receives the data
  * `tag` (optional): integer label used to match sends with receives

<br>

* **Receiving:** `MPI.Recv!(recvbuf, comm; source=MPI.ANY_SOURCE, tag=MPI.ANY_TAG)`
  * `recvbuf`: buffer to store the received data in, typically an array
  * `comm`: communicator
  * `source`: source rank, i.e. the rank that sends the data
  * `tag` (optional): integer label used to match sends with receives
  
Basic example → see `mpi_examples/2_mpi_basic_communication.jl`
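
For orientation, a blocking exchange between two ranks might look as follows. This is a minimal sketch (not the course example file), assuming the keyword-based signatures listed above; the tag value `42` is arbitrary. Run it with at least two ranks to see a message being transferred.

```julia
using MPI

MPI.Init()
comm   = MPI.COMM_WORLD
rank   = MPI.Comm_rank(comm)
nranks = MPI.Comm_size(comm)

N = 4
recvbuf = zeros(Float64, N)

if nranks >= 2          # with a single rank there is nobody to talk to
    if rank == 0
        sendbuf = collect(1.0:N)
        MPI.Send(sendbuf, comm; dest=1, tag=42)      # returns once sendbuf may be reused
    elseif rank == 1
        MPI.Recv!(recvbuf, comm; source=0, tag=42)   # returns once the data has arrived
        println("rank 1 received $recvbuf")
    end
end
```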

**Blocking can lead to deadlocks!**

<img src="./imgs/MPI-deadlock.png" width=400px>
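
One common way to avoid such a deadlock is to order the calls so that every send meets a matching receive. The sketch below pairs up neighboring ranks and lets the lower rank send first; the pairing scheme via `xor` is just one possible choice for illustration.

```julia
using MPI

MPI.Init()
comm    = MPI.COMM_WORLD
rank    = MPI.Comm_rank(comm)
nranks  = MPI.Comm_size(comm)
partner = xor(rank, 1)            # pairs up ranks (0,1), (2,3), ...

sendbuf = fill(rank, 3)
recvbuf = fill(-1, 3)

if partner < nranks               # a trailing unpaired rank skips the exchange
    # Deadlock-prone: if BOTH partners call MPI.Send first, each may wait
    # forever for the other one to post a matching receive.
    # Fix: order the calls so that every Send meets a Recv!.
    if rank < partner
        MPI.Send(sendbuf, comm; dest=partner)
        MPI.Recv!(recvbuf, comm; source=partner)
    else
        MPI.Recv!(recvbuf, comm; source=partner)
        MPI.Send(sendbuf, comm; dest=partner)
    end
    println("rank $rank got $recvbuf from rank $partner")
end
```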

### Point-to-point communication (**non-blocking**)

(essentially the same function signatures, just different names and behavior)

* **Sending:** `MPI.Isend(buf, comm; dest, tag=0)`
  * `buf`: data buffer, typically an array
  * `comm`: communicator
  * `dest`: destination, i.e. the rank that receives the data
  * `tag` (optional): integer label used to match sends with receives
  * Returns a request object (which one may use for waiting operations)

<br>

* **Receiving:** `MPI.Irecv!(recvbuf, comm; source=MPI.ANY_SOURCE, tag=MPI.ANY_TAG)`
  * `recvbuf`: buffer to store the received data in, typically an array
  * `comm`: communicator
  * `source`: source rank, i.e. the rank that sends the data
  * `tag` (optional): integer label used to match sends with receives
  * Returns a request object (which one may use for waiting operations)
  
Basic example → see `mpi_examples/3_mpi_basic_communication_nonblocking.jl`
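
A non-blocking ring exchange could be sketched like this (again not the course example file). It assumes a recent MPI.jl (v0.20 or newer), where waiting on several requests is spelled `MPI.Waitall`.

```julia
using MPI

MPI.Init()
comm   = MPI.COMM_WORLD
rank   = MPI.Comm_rank(comm)
nranks = MPI.Comm_size(comm)

# Ring neighbors (with periodic wrap-around)
right = mod(rank + 1, nranks)
left  = mod(rank - 1, nranks)

sendbuf = fill(rank, 3)
recvbuf = fill(-1, 3)

# Both calls return immediately; the transfer proceeds in the background
rreq = MPI.Irecv!(recvbuf, comm; source=left)
sreq = MPI.Isend(sendbuf, comm; dest=right)

# ... independent work could overlap with the communication here ...

# The buffers must not be touched until the requests have completed
MPI.Waitall([rreq, sreq])
println("rank $rank received $recvbuf from rank $left")
```

Because neither call blocks, all ranks can post their send and receive in the same order without risking a deadlock.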

### Collective communication

* **Explicit synchronization**
  * `MPI.Barrier(comm)`: blocks until all ranks of the communicator have reached the barrier
* **Data movement**:
    * Example: Broadcasting, see [here](https://juliaparallel.org/MPI.jl/stable/reference/collective/#Broadcast) and exercise `mpi_bcast`
* **Collective computation**:
    * Example: Reduction, see [here](https://juliaparallel.org/MPI.jl/stable/reference/collective/#Reduce/Scan) and `mpi_examples/5_mpi_reduction.jl` for an example
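
The three kinds of collectives can be combined in one short sketch. It assumes the keyword-based signatures of a recent MPI.jl (v0.20 or newer); the data values are arbitrary.

```julia
using MPI

MPI.Init()
comm   = MPI.COMM_WORLD
rank   = MPI.Comm_rank(comm)
nranks = MPI.Comm_size(comm)

# Data movement: rank 0 fills the buffer, afterwards every rank has a copy
data = rank == 0 ? collect(1.0:4.0) : zeros(4)
MPI.Bcast!(data, comm; root=0)

# Collective computation: sum one contribution per rank onto rank 0
total = MPI.Reduce(rank + 1, +, comm; root=0)   # `nothing` on non-root ranks

# Explicit synchronization: nobody continues before everyone has arrived
MPI.Barrier(comm)

rank == 0 && println("data = $data, sum over ranks = $total")
```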

### Other

* **Time measurement**
    * `MPI.Wtime()`: the difference between two `MPI.Wtime()` calls is the elapsed wall-clock time
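
A typical timing pattern might look like the sketch below. The summation is only a placeholder workload, and taking the maximum over all ranks is one possible convention for reporting a single number.

```julia
using MPI

MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)

MPI.Barrier(comm)              # start all ranks at (roughly) the same time
t_start = MPI.Wtime()

work = sum(sqrt, 1:10^6)       # placeholder for the code to be timed

elapsed = MPI.Wtime() - t_start

# The slowest rank determines the overall runtime
slowest = MPI.Allreduce(elapsed, max, comm)
rank == 0 && println("elapsed wall-clock time: $slowest s")
```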

### High-level tools

* [PartitionedArrays.jl](https://github.com/fverdugo/PartitionedArrays.jl): Data-oriented parallel implementation of partitioned vectors and sparse matrices needed in FD, FV, and FE simulations.
* [Elemental.jl](https://github.com/JuliaParallel/Elemental.jl): A package for dense and sparse distributed linear algebra and optimization.
* [PETSc.jl](https://github.com/JuliaParallel/PETSc.jl): Julia interface to PETSc, a suite of data structures and routines for the scalable (parallel) solution of scientific applications modeled by partial differential equations. ([original website](https://petsc.org/release/))