# Lecture 6: Parallel Computing II: Message Passing Interface (MPI)

## NOTES

* For these practicals we will be using a different `conda environment`. When opening a notebook or a terminal make sure you are using the **CuPy Kernel**!!!
 
## IMPORTANT!!!

MPI code must typically be invoked from the **command line** and not from a notebook. However, there is a workaround. I will demonstrate both approaches.

## Practical 1 - Hello World!

This practical shows you the simplest possible MPI script and teaches you how to run it.

### The non-standard way

First, let's start with the non-standard way by using the notebook.

Adding the line `%%writefile FILENAME.py` at the beginning of a cell writes the rest of the cell to a file called `FILENAME.py`, creating it if needed and overwriting it if it already exists. We can thus write:


In [1]:
%%writefile hello_world.py
from mpi4py import MPI

comm = MPI.COMM_WORLD  # the standard MPI communicator (there are others): the WORLD communicator spans all processes and lets them communicate with each other
rank = comm.Get_rank()
size = comm.Get_size()

print("I am rank: %d. There are: %d of us." % (rank,size), flush=True)

Overwriting hello_world.py


Running the above will create a file called `hello_world.py` with the cell contents inside. Warning: if the file exists already it will be overwritten!

Once the file is created we can run it. However, **it must be run from the terminal**, not from Python. Jupyter has a trick that allows us to do so:
any command starting with an exclamation mark is passed to the system shell (e.g. bash):

In [2]:
!mpiexec -n 4 python3 hello_world.py

I am rank: 3. There are: 4 of us.
I am rank: 0. There are: 4 of us.
I am rank: 2. There are: 4 of us.
I am rank: 1. There are: 4 of us.


**AN IMPORTANT CATCH WITH THE NON-STANDARD WAY**

Do not run mpi4py code in Jupyter directly without using `writefile` and a separate bash command. If you really must, then please interrupt the kernel afterwards (but please avoid doing this in the first place).

The long explanation (you do not have to read this):

Any MPI execution must be initialised and terminated with matching `MPI_Init` and `MPI_Finalize` calls. You never see this in Python since `mpi4py` makes these calls for you under the hood. However, the moment you run `from mpi4py import MPI` interactively in Jupyter, the kernel calls `MPI_Init` with a single process (you are not using `mpiexec`, so it is as if you were running with `-n 1`), and `MPI_Finalize` won't be called until the kernel is interrupted. I cannot fully explain why, but I noticed that if you then try to run parallel code with `mpiexec`, this seems to cause trouble and a deadlock.

### The standard way - A premise

Frankly, for the purposes of these lectures I advise you to use the non-standard way, but let it be clear that **you should be using the standard way in general**.

The reason is that Jupyter was designed for serial work and it interacts a bit weirdly with MPI (e.g., see above note). Depending on Jupyter for coding with MPI is therefore silly.

**Remember:** If you ever need to use MPI again, invoke `mpiexec` (or `mpirun`) directly from a terminal, not from Jupyter.

### The standard way

The standard way works as follows:

1- Create and save a file called `hello_world.py` with the content of the above cell, but without the `%%writefile` line.

2- Open a terminal (in Jupyter press on the $+$ sign).

3- Make sure you are in the correct conda environment:
```bash
    conda init
    conda activate cupy
```

4- Run the script in the terminal using `mpiexec`:
```bash
    mpiexec -n 4 python3 hello_world.py
```

**Note:** From now on I will be using the non-standard way since for teaching purposes it is just too convenient. However, I warned you!

### Explanation of the `hello_world.py` code and a few comments

* `comm` is an MPI communicator. It allows communication between processes (we will see how in the next slides). It also holds the rank, the size and other information.
* Here we pass `flush=True` to `print` so that stdout is flushed as soon as `print` is called. Otherwise the output may sit in a buffer and only appear after the program has finished.
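
To see what buffering means in practice, here is a minimal sketch that does not involve MPI at all. It wraps an in-memory byte buffer (standing in for the terminal or pipe) in a text stream and checks how much of the printed text has actually been delivered before and after a flush. The names `sink` and `stream` are illustrative only, and the behaviour assumes the standard CPython `io` implementation.

```python
import io

# In-memory stand-in for the terminal or pipe that receives the output.
sink = io.BytesIO()
# A buffered text stream on top of it, similar in spirit to sys.stdout
# when the output of a script is redirected.
stream = io.TextIOWrapper(sink, encoding="utf-8", newline="\n")

# Without flush=True the text is expected to wait in the stream's buffer.
print("first message", file=stream)
delivered_before_flush = sink.getvalue()

# flush=True pushes everything buffered so far to the sink.
print("second message", file=stream, flush=True)
delivered_after_flush = sink.getvalue()

print("before flush:", delivered_before_flush)
print("after flush: ", delivered_after_flush)
```

If you prefer not to add `flush=True` to every `print`, running the script with `python3 -u` (unbuffered stdout and stderr) is another option.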

## Interlude - Which MPI installation am I using?

You can check which MPI version and installation you are using by querying the version of the MPI executable:

In [3]:
!mpiexec --version

HYDRA build details:
    Version:                                 4.0
    Release Date:                            Fri Jan 21 10:42:29 CST 2022
    CC:                              gcc -Wdate-time -D_FORTIFY_SOURCE=2 -g -O2 -ffile-prefix-map=/build/mpich-0xgrG5/mpich-4.0=. -flto=auto -ffat-lto-objects -flto=auto -ffat-lto-objects -fstack-protector-strong -Wformat -Werror=format-security  -Wl,-Bsymbolic-functions -flto=auto -ffat-lto-objects -flto=auto -Wl,-z,relro 
    Configure options:                       '--with-hwloc-prefix=/usr' '--with-device=ch4:ofi' 'FFLAGS=-O2 -ffile-prefix-map=/build/mpich-0xgrG5/mpich-4.0=. -flto=auto -ffat-lto-objects -flto=auto -ffat-lto-objects -fstack-protector-strong -fallow-invalid-boz -fallow-argument-mismatch' '--prefix=/usr' 'CFLAGS=-g -O2 -ffile-prefix-map=/build/mpich-0xgrG5/mpich-4.0=. -flto=auto -ffat-lto-objects -flto=auto -ffat-lto-objects -fstack-protector-strong -Wformat -Werror=format-security' 'LDFLAGS=-Wl,-Bsymbolic-functions -flto=auto

## Practical 2 - Distributed memory and process management

Remember the rules of the MPI club: each process reads and executes the same code and will store a full independent copy of each variable:

In [4]:
%%writefile temp.py
import numpy as np
from mpi4py import MPI

# Useful to always include this in your code
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

a = np.arange(1000)
b = np.linalg.norm(a)

print("Rank: %d. I hold a copy of a of size %d and of norm %.3f" % (rank,a.size, b), flush=True)

Overwriting temp.py


In [5]:
!mpiexec -n 4 python3 temp.py

Rank: 0. I hold a copy of a of size 1000 and of norm 18243.725
Rank: 3. I hold a copy of a of size 1000 and of norm 18243.725
Rank: 1. I hold a copy of a of size 1000 and of norm 18243.725
Rank: 2. I hold a copy of a of size 1000 and of norm 18243.725


As you can see, each process will have its own copy and all copies will be the same!

**WARNING!** Keep this in mind: if you create or load a very large array on many processes, every process holds its own full copy, and the computer may run out of memory!!!
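
A rough back-of-the-envelope estimate makes the warning concrete. The sketch below assumes 8 bytes per element (as for `float64` or `int64`), and the array length and process count are made-up numbers for illustration; `replicated_memory_gib` is a hypothetical helper, not part of `mpi4py` or NumPy.

```python
def replicated_memory_gib(n_elements, n_processes, bytes_per_element=8):
    """Total memory in GiB when every process holds its own full copy."""
    return n_elements * bytes_per_element * n_processes / 2**30

# Hypothetical example: 10**9 values of 8 bytes each.
per_copy = replicated_memory_gib(10**9, 1)
total = replicated_memory_gib(10**9, 16)

print("One copy:  %.2f GiB" % per_copy)
print("16 copies: %.2f GiB" % total)
```

A single copy of about 7.45 GiB fits on many machines, but the same array replicated on 16 processes needs 16 times as much.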

**Note:** You may have noticed already that the order in which the processes print their messages is arbitrary and can change from run to run. By default you do not control which CPU core is assigned which rank either!

### Question: How can I tell different processes to do different things then?

###

<details>
    <summary> <b>Want a hint?</b> Click here!</summary>
    Think about it: There is only one variable that has a different value across processes!
</details>

###

<details>
    <summary> <b>Solution</b></summary>
    Different processes have different ranks, so we can use that single piece of information!
</details>
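
As a minimal sketch of this idea, the script below picks a role from the rank alone. The role names and the helper `role_for` are hypothetical. The `try`/`except` fallback to a single serial "process" is an assumption added only so that the snippet also runs where `mpi4py` is not installed; a real MPI script would import `mpi4py` unconditionally.

```python
try:
    from mpi4py import MPI
    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()
except ImportError:
    # Assumption for this sketch only: behave like a single serial process.
    rank, size = 0, 1


def role_for(rank):
    """Choose what to do based only on the rank (hypothetical roles)."""
    return "coordinator" if rank == 0 else "worker"


print("Rank %d of %d acts as %s." % (rank, size, role_for(rank)), flush=True)
```

Saved to a file and launched with `mpiexec -n 4 python3 FILENAME.py`, rank 0 should report itself as the coordinator and the other ranks as workers, although every process executes exactly the same code.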

In [6]:
%%writefile temp.py
import numpy as np
from mpi4py import MPI

# Useful to always include this in your code
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

M = 10
a = np.arange(M*size)[rank*M:(rank+1)*M]
b = np.linalg.norm(a)

print("Rank: %d. I hold a copy of a of size %d and of norm %.3f" % (rank, a.size, b), flush=True)

Overwriting temp.py


In [7]:
!mpiexec -n 4 python3 temp.py

Rank: 0. I hold a copy of a of size 10 and of norm 16.882
Rank: 1. I hold a copy of a of size 10 and of norm 46.744
Rank: 2. I hold a copy of a of size 10 and of norm 78.006
Rank: 3. I hold a copy of a of size 10 and of norm 109.476


Now each process holds only its own smaller chunk of the array. This sort of approach is very useful to prevent running out of memory and is the basis of the domain decomposition methods used in scientific computing to solve partial differential equations.
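
Note that the cell above still builds the full `np.arange(M*size)` on every rank before slicing it, and it assumes that the total length is a multiple of the number of processes. A possible refinement, sketched below, is to compute each rank's index range first and only then create the local chunk. The helper `block_range` is a hypothetical name, and the sizes in the serial check are made up for illustration.

```python
def block_range(rank, size, n):
    """Return (start, stop) of the contiguous block owned by `rank`.

    The n items are split as evenly as possible: the first n % size
    ranks receive one extra item.
    """
    base, extra = divmod(n, size)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop


# Serial check with hypothetical numbers: 10 items over 4 ranks.
n, size = 10, 4
blocks = [block_range(rank, size, n) for rank in range(size)]
for rank, (start, stop) in enumerate(blocks):
    print("Rank %d would own indices [%d, %d)" % (rank, start, stop))
```

In an MPI script each rank would call `block_range(rank, size, n)` with its own `rank` and then build only its chunk, for example with `np.arange(start, stop)`, so the full array is never allocated.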