# High-performance and parallel computing for AI - Practical 7: MPI

## NOTES

* For these practicals we will be using a different `conda environment`. When opening a notebook or a terminal make sure you are using the **CuPy Kernel**!!!
* MPI code must typically be invoked from the command line and not from a notebook. However, for these lectures I will be using a workaround. Both approaches are explained in the first MPI lecture Jupyter notebook.
* mpi4py documentation [here](https://mpi4py.readthedocs.io/en/stable/index.html).

## Question 1 - Finding what I need in the MPI documentation.

In this question you will learn how to use the mpi4py documentation. Your task is to open the [mpi4py docs](https://mpi4py.readthedocs.io/en/stable/index.html) and see whether you can find a list of all possible communication functions. Once you find the list, choose $2$-$3$ functions and click on them: the documentation will show you the syntax for calling them and what they return. Finally, for each of these functions, open the official [MPI docs](https://www.mpi-forum.org/) and search for their syntax in the doc pdf. Alternatively, you can google "MPI parallel FUNCTION_NAME" and you will likely find useful results. Note that typically all MPI info is provided in either C/C++ or Fortran syntax, but it should be easy enough to understand.

**Hint:** In the `mpi4py` docs you do not need to click on search. Remember that communication functions are member functions of the Communicator class, which is part of the `mpi4py.MPI` module, which you can find under "Contents" (the menu on the left in the documentation).

**Note:** Remembering all MPI functions by heart is difficult if you do not use them often. I typically open the list of available functions and their documentation to remind myself.

## Practical 1 - Some reading

Read and understand the following section about random number generation in a parallel environment.

## Random number generation in a parallel environment

Random numbers are used ubiquitously in AI:
* Before training. Initialisation of neural network biases and weights requires random numbers.
* During training. Some applications require selecting a random batch for stochastic gradient descent.
* As part of the method, e.g., if you work with Gaussian mixture models or stochastic processes.

In a parallel environment there is an additional complication arising when generating random numbers.

**Computer-generated random numbers are not random**

To begin with, note that random numbers are generated on your computer by a specific class of software called random number generators (RNGs). Numbers sampled from RNGs are not really random (non-quantum computers typically work deterministically so they cannot generate truly random sequences): they are only pseudo-random, a term which indicates that they are actually deterministic, yet carefully designed to behave as closely as possible like actual random numbers.

**Generators and random seeds**

RNGs are typically initialised with a *seed* that defines the initial *state* of the generator. From this initial state, RNGs generate a pseudo-random sequence which depends on the method used. These sequences are typically periodic and have a *period* which indicates after how many numbers the sequence will repeat itself. The period is typically very large, e.g., $2^{128}$. Specifying a seed is very important for the reproducibility of results.
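As a small illustration of why the seed matters for reproducibility, here is a minimal sketch using `numpy.random.default_rng`. The seed values `1234` and `4321` are arbitrary choices made for this example. Two generators built from the same seed produce the same numbers, while a different seed gives a different sequence.

```python
import numpy as np

# Two generators initialised with the same (arbitrary) seed start from the same state...
rng_a = np.random.default_rng(1234)
rng_b = np.random.default_rng(1234)
# ...while a different seed gives a different initial state.
rng_c = np.random.default_rng(4321)

x_a = rng_a.standard_normal(5)
x_b = rng_b.standard_normal(5)
x_c = rng_c.standard_normal(5)

same_seed_same_numbers = np.array_equal(x_a, x_b)
different_seed_same_numbers = np.array_equal(x_a, x_c)
print(same_seed_same_numbers, different_seed_same_numbers)
```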

**Why you need to be careful with MPI and RNGs**

Here comes the problem with MPI: each process will have its own identical copy of every variable, including the initial RNG state, which means that every rank will typically have the same RNG generating exactly the same numbers. This is a big issue since the numbers generated on different ranks will be identical, and hence perfectly correlated, which is very bad for any method based on independent samples!
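The issue can be reproduced without MPI. The sketch below simulates the ranks with a plain loop (the names `n_ranks`, `n_samples` and the seed value are assumptions made for illustration). Every simulated rank builds its generator from the same seed, as would happen if every MPI process ran the same script.

```python
import numpy as np

n_ranks = 4        # hypothetical number of MPI processes, simulated serially
n_samples = 1000   # samples drawn per simulated rank
seed = 1234        # the same seed is (wrongly) used on every rank

# Each simulated rank builds its own generator from the same seed,
# which mimics every MPI process running the same script.
draws = np.stack([np.random.default_rng(seed).standard_normal(n_samples)
                  for _ in range(n_ranks)])

all_ranks_identical = all(np.array_equal(draws[0], draws[r]) for r in range(n_ranks))
n_distinct = np.unique(draws).size  # distinct values among all n_ranks * n_samples draws
print(all_ranks_identical, draws.size, n_distinct)
```

Although `n_ranks * n_samples` numbers were generated, they contain no more information than the `n_samples` numbers of a single rank.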

**Possible generic solutions (software-dependent)**

* **Skipahead.** This is the safest option. Some generators allow skipping ahead in the sequence by a chunk of numbers. This can be exploited by having each rank access the same sequence from a different initial state so that the generated numbers never overlap (provided the period is large enough and the ranks are not too many).
* **Change seed.** Potentially unsafe (software-dependent), but typically fine. This is not the optimal solution, but it is often good enough. If the RNG is initialised with a different seed for each rank, then the generated numbers are likely to be uncorrelated (but not guaranteed, hence why the other method is to be preferred).

**Specific `numpy.random` solutions**

The numpy team has thought long and hard about random number generation and the `numpy.random` module is now quite advanced. However, one needs to be aware of these functionalities and know how to use them.

**Important:** Do not use `numpy.random.rand` and `numpy.random.randn` functions when using MPI. These are not supposed to be used in a parallel environment.

When working with numpy you have multiple options:

* **Recommended:** Using `spawn` to generate independent RNGs:
```python
    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    size = comm.Get_size()

    seed = 1234 # whatever integer.
    parent_RNG = np.random.default_rng(seed)
    children_RNG = parent_RNG.spawn(size) # will create n=size independent RNGs (Generator.spawn needs NumPy >= 1.25).
    RNG_to_use = children_RNG[rank] # only use the one corresponding to your rank.
```

* Using seeds which are guaranteed to lead to independent RNGs:
```python
    from mpi4py import MPI
    import numpy as np
    from secrets import randbits # generates a high-quality random integer seed


    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    size = comm.Get_size()
    
    # Draw the root seed on rank 0 only and broadcast it, so that all ranks share the same one.
    root_seed = comm.bcast(randbits(128) if rank == 0 else None, root=0) # this option requires generating high-quality seeds
    RNG = np.random.default_rng([rank, root_seed]) # With high probability the RNGs will now be independent.
```

* Using skipahead through the use of the `jumped` functionality:
```python
    from mpi4py import MPI
    import numpy as np
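    from secrets import randbits # generates a high-quality random integer seed


    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    size = comm.Get_size()
    
    # All ranks must start from the same root seed: draw it on rank 0 and broadcast it.
    root_seed = comm.bcast(randbits(128) if rank == 0 else None, root=0) # this option requires generating high-quality seeds
    base_bitgen = np.random.PCG64(root_seed) # `jumped` is a method of the bit generator, not of the Generator.
    RNG = np.random.Generator(base_bitgen.jumped(rank)) # Each rank skips ahead by `rank` jumps, so the streams do not overlap.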
```
This option requires understanding what happens under the hood, so I would probably not recommend it to a beginner.
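The three options above can also be tried without MPI by simulating the ranks with a loop. The following is only a sketch under some assumptions: `size` and `root_seed` are arbitrary illustrative values, `SeedSequence.spawn` is used in place of `Generator.spawn` so that the code also runs on older NumPy versions, and `PCG64` is assumed as the bit generator for the skipahead variant. Checking that the first draws differ is a sanity check only, not a proof of independence.

```python
import numpy as np

size = 4           # hypothetical number of ranks, simulated here with a loop
root_seed = 1234   # arbitrary seed, shared by all simulated ranks

# 1) Spawning: child seed sequences derived from one parent seed sequence.
children = np.random.SeedSequence(root_seed).spawn(size)
spawned = [np.random.default_rng(child) for child in children]

# 2) Seeding each generator with the pair [rank, root_seed].
pair_seeded = [np.random.default_rng([rank, root_seed]) for rank in range(size)]

# 3) Skipahead: jump a copy of the PCG64 bit generator `rank` times.
base = np.random.PCG64(root_seed)
jumped = [np.random.Generator(base.jumped(rank)) for rank in range(size)]

# Draw one uniform number per simulated rank and check that they all differ.
strategies = {"spawn": spawned, "[rank, seed]": pair_seeded, "jumped": jumped}
first_draws = {name: np.array([rng.random() for rng in rngs])
               for name, rngs in strategies.items()}
all_distinct = {name: np.unique(d).size == size for name, d in first_draws.items()}
print(all_distinct)
```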

**Further reading:** Numpy parallel random number generation [documentation](https://numpy.org/doc/stable/reference/random/parallel.html).

**General advice:** When using MPI with whatever RNG it is always *very important* to read the documentation of the random number generator to understand well how it works!

## Question 2 - Point-to-point communication: Ping pong

Write a simulation of Ping Pong according to the following rules:
* Ranks 0 and 1 participate
* Rank 0 starts with the ball
* The rank with the ball sends it to the other rank
* With probability 0.1 a rank misses and the other rank scores a point.
* The rank that scores a point starts with the ball next.
* The first rank to score 21 points wins.

**Hint:** I did it as follows, but you can try with other strategies. You can send and receive a `ball` variable which takes value 1 if the ball is moving, 0 whenever the ball is missed, and -1 if one of the ranks has reached 21.

## Question 3 - Sharing the workload across workers

**Note:** For this question you do not need `mpi4py` nor `mpiexec`. Only use `numpy`.

Let $a\in\mathbb{R}^n$ be given by `a = np.arange(n)`. Write a Python script that divides the entries of $a$ as equally as possible into $m$ chunks (think of each chunk as being held on one rank). Then sum the entries of each of the $m$ chunks of $a$ and store the result into the variables $s_i$ for $i=1,\dots,m$ (in MPI there would be only one $s_i$ on each rank). Finally, sum the $s_i$ together and double check that the answer you obtain is correct (in MPI this final sum would be a reduce operation). Code this up with $n=59$ and $m=4$. If you can, avoid ever creating the full array $a$.

**Note:** This is a very common operation. If you need to divide $n$ tasks equally across $m$ processors this is what you have to do.
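One common way of computing the chunk boundaries is sketched below. The helper name `chunk_bounds` and the values of `n` and `m` are hypothetical (and deliberately not those of the question). The idea is that every chunk gets `n // m` items and the first `n % m` chunks get one extra item.

```python
def chunk_bounds(n, m, i):
    """Return (start, stop) of chunk i when range(n) is split into m nearly equal chunks."""
    base, extra = divmod(n, m)  # every chunk gets `base` items; the first `extra` get one more
    start = i * base + min(i, extra)
    stop = start + base + (1 if i < extra else 0)
    return start, stop

n, m = 10, 3  # small illustrative values
bounds = [chunk_bounds(n, m, i) for i in range(m)]
sizes = [stop - start for start, stop in bounds]
print(bounds, sizes)
```

In an MPI code, each rank would call such a helper with `i` set to its own rank, so that no rank ever needs to know the full list of tasks.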

## Question 4 - Monte Carlo simulation

Let $z$ be an $n$-dimensional standard Gaussian vector. Use MPI to estimate $\mathbb{E}[\lVert z \rVert_2^2]$ for $n=10$ by using the Monte Carlo method (what is the exact answer?). Use 4 processes and take $1000$ samples per process so that the overall Monte Carlo estimator will be an average of $M=4000$ samples. Implement this by using collective MPI communication. Do it in two ways: 1) So that each process has the value of the final Monte Carlo mean. 2) So that only rank 0 has it instead. Think (but do not do it) about how you would change the code if the question had asked for a total of $4003$ samples (we saw this in the previous question).

## Question 5

In this question you will use 4 processes. Have rank 1 create a random array of length $1000$ and then use the buffer version of scatter to divide this vector across all $4$ processes. Then use gather (buffer version) to reconstruct the array on rank 3. Finally, use send and recv to have rank 3 send the gathered array directly to rank 1 and have rank 1 check that it matches the original array.

## Question 6

Use $n$ processes with $n=3$. Create a list $a$ of $n$ Python objects of your choice on rank 0, then use scatter to send each object to a different rank. Then, implement an allgather operation "by hand" by combining gather with bcast on root 0 and check that the answer is correct.

## Question 7

This is to show that sum reduction operations in object mode will apply the Python `+` operation, whatever it means for that object.

Use $n$ processes with $n=3$. On each rank, create a list with $1$-$2$ objects of your choice, then use allreduce with `MPI.SUM` so that each process obtains a list with all objects. This works because in Python the `+` operator applied to two lists concatenates them.
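The behaviour of `+` on lists can be checked in plain Python, without MPI. In the sketch below the per-rank lists are hypothetical placeholders, and `functools.reduce` with `operator.add` plays the role of the sum reduction by applying `+` pairwise.

```python
from functools import reduce
import operator

# Hypothetical per-rank lists (in MPI each list would live on a different process).
per_rank_lists = [["a", 1], [2.5], [("x", "y"), None]]

# Folding with `+` mimics a sum reduction in object mode: for lists, `+` concatenates.
combined = reduce(operator.add, per_rank_lists)
print(combined)
```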

## Question 8 - Parallel matrix-vector product

For $n=100$ let $A$ be an $n$-by-$n$ square matrix and $b$ be a length $n$ vector. For this question only use $2$ processes.

Do the following:
* Construct $A$ and $b$ on rank 0, then use scatter to divide the rows of $A$ and the entries of $b$ equally across the two ranks.
* Write a function that implements a parallel matrix-vector product between $A$ and $b$ so that the resulting vector is also divided equally across the two processors. To minimize memory movement, never communicate matrix entries between ranks.
* Check that what you have computed is correct (perhaps by keeping a copy of the correct matvec done in serial on rank 0).

Finally, repeat the exercise by dividing the columns of $A$ instead. You can do this in (at least) two ways: either with an `Allreduce` or an `Alltoall`. Try both.
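Before writing the MPI version, it can help to check the underlying block identities serially. The sketch below uses only `numpy`, with small illustrative sizes and an arbitrary seed (all assumptions, not the values of the question), and simulates the ranks as blocks of a single array. It is not an MPI implementation: it only shows which data each rank would need and how the partial results combine.

```python
import numpy as np

n, p = 6, 2                      # small illustrative sizes; p simulated ranks
rng = np.random.default_rng(0)   # arbitrary seed
A = rng.standard_normal((n, n))
b = rng.standard_normal(n)
y_serial = A @ b

# Row split: rank r owns a block of rows of A and computes its slice of y,
# but it needs all entries of b to do so.
row_blocks = np.split(A, p, axis=0)
y_rows = np.concatenate([A_r @ b for A_r in row_blocks])

# Column split: rank r owns a block of columns of A and the matching entries of b.
# Each local product is a full-length partial result; the partial results must be summed.
col_blocks = np.split(A, p, axis=1)
b_blocks = np.split(b, p)
y_cols = sum(A_c @ b_c for A_c, b_c in zip(col_blocks, b_blocks))

rows_ok = np.allclose(y_rows, y_serial)
cols_ok = np.allclose(y_cols, y_serial)
print(rows_ok, cols_ok)
```

In the row split the communication concerns the entries of $b$, whereas in the column split it concerns the partial result vectors. In both cases the entries of $A$ stay where they are.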