# MPI4py Hands On

What is MPI4py? In short, Python bindings for the MPI standard. (See the presentation of HC.)

It is a bit more object-oriented than the C API (e.g. `MPI_ -> MPI.` or `comm_ -> comm.`), but conceptually close.

## Prerequisite for this course

In the Merlin Jupyter notebook, open a terminal and run:

In [None]:
$ module load anaconda/2024.08
$ conda create -p /data/user/$USER/conda-envs/MPI numpy scipy scikit-learn matplotlib mpi4py -c conda-forge
$ conda activate /data/user/$USER/conda-envs/MPI

All examples are run from the terminal (not within the notebook).

You can also run the examples on your local computer.

## Hello MPI

As an initial test, try to run the following Python script (named `helloMPI.py`)

In [None]:
from mpi4py import MPI
# Initialize MPI
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()
# Print a message from each process
print(f"Hello from process {rank} of {size}")

On a terminal, run the program with two processes, e.g. `mpiexec -n 2 python helloMPI.py`.

If everything is installed properly, you should see one line per process, `Hello from process 0 of 2` and `Hello from process 1 of 2`, in arbitrary order. (If not, make sure you loaded the proper conda env in your terminal.)

So here we started two Python processes, each simply printing its ID (rank) and the total number of processes started. No IPC so far.

## Basic Communication Patterns

General note: We will only use `numpy` arrays for inter-process communication (IPC). This is the most common use case in HPC and data science.

In principle, you can exchange any Python data between processes.

(This is less efficient than using numpy arrays, since the data needs to be pickled/serialized under the hood, whereas a numpy array can be treated like a C pointer.)
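To make the difference concrete, here is a small sketch in plain numpy and `pickle` (no MPI involved). In mpi4py, the lowercase methods (e.g. `comm.send`, `comm.recv`) pickle generic Python objects, while the uppercase methods (e.g. `comm.Send`, `comm.Recv`) used throughout this course operate on buffer-like objects such as numpy arrays. The array size is an arbitrary choice for illustration.

```python
import pickle
import numpy as np

data = np.arange(10, dtype=np.float64)

# Buffer view: no copy, exposes exactly the raw bytes of the array
raw_view = memoryview(data)

# Pickled form: a new bytes object holding the raw data plus metadata (dtype, shape, ...)
pickled = pickle.dumps(data)

print("raw buffer size :", raw_view.nbytes, "bytes")
print("pickled size    :", len(pickled), "bytes")

# Unpickling builds a second, independent array (an extra copy on the receiver side)
restored = pickle.loads(pickled)
print("same values     :", np.array_equal(restored, data))
print("shares memory   :", np.shares_memory(restored, data))
```

The pickled form is always somewhat larger than the raw buffer, and both pickling and unpickling create copies that the buffer-based path avoids.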

### Point-to-Point

We send a 1D numpy array (10 items, datatype `float64`) from process 0 to process 1

In [None]:
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    data = np.arange(10, dtype=np.float64)
    comm.Send(data, dest=1, tag=13)
    print ("0 sent:", data)
elif rank == 1:
    data = np.empty(10, dtype=np.float64)
    comm.Recv(data, source=0, tag=13)
    print ("1 recv:", data)

Run with two processes

### <span style="color:Blue">_Exercise_ </span> 

Send a 2d array (size 3x3, datatype `int64`) from rank 0 to rank 1 

(In production applications, these would typically be detector images)

_Hint_: You can treat metadata (i.e. image size) as hard coded (global)

_Proposed solution_

In [None]:
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

imageSize = 3

if rank == 0:
    #We emulate a (small) detector image, typically rank 0 would load it from file
    detImage_2BeSent = np.arange(imageSize**2, dtype=np.int64).reshape(imageSize,imageSize)
    comm.Send(detImage_2BeSent, dest=1, tag=13)
    print ("0 sent:\n", detImage_2BeSent)
elif rank == 1:
    #We allocate a 1d array as receive buffer which will be reshaped at the end
    data = np.empty(imageSize**2, dtype=np.int64)
    comm.Recv(data, source=0, tag=13)
    detImage_Recv = data.reshape(imageSize,imageSize)
    print ("1 recvd:\n", detImage_Recv)

### <span style="color:Blue">_Exercise_ </span>

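Now we make the code more generic.
Assume the receiver does not know the metadata (i.e. the image size).
Hence the sender must send the metadata in addition to the data.

_Hint_: Use MPI tags to distinguish metadata and data

PseudoCode:

   * Rank 0: send the image size (metadata tag), then send the image (data tag)
   * Rank 1: receive the image size (metadata tag), allocate the receive buffer, then receive the image (data tag)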


_Proposed solution_

In [None]:
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

dataTag = 13
metaDataTag = 14

if rank == 0:
    #We emulate a (small) detector image, typically rank 0 would load it from file
    #Hence only rank 0 knows its size (might be varying). Datatype is assumed fixed
    imageSize = 3 #this information would e.g be derived from hdf5 metadata of image directly
    #imageSize = 50 #Larger image: fine here, but may deadlock once the send order is reversed (MPI-implementation dependent)
    #Send metadata (a single number here; for a non-square image it would be an array)
    comm.Send(np.array(imageSize, dtype=np.int64), dest=1, tag=metaDataTag)
    detImage_2BeSent = np.arange(imageSize**2, dtype=np.int64).reshape(imageSize,imageSize)
    #Send data
    comm.Send(detImage_2BeSent, dest=1, tag=dataTag)
    print ("0 sent:\n", detImage_2BeSent)

elif rank == 1:
    #Receive metadata first (we must know how large a buffer to allocate)
    imageSize_asArray = np.empty(1, dtype=np.int64)
    comm.Recv(imageSize_asArray, source=0, tag=metaDataTag)
    imageSize = imageSize_asArray[0]
    #Now that we know what to expect we can allocate receive buffer
    data = np.empty(imageSize**2, dtype=np.int64)
    comm.Recv(data, source=0, tag=dataTag)
    detImage_Recv = data.reshape(imageSize, imageSize)
    print ("1 recvd:\n", detImage_Recv)

### <span style="color:Blue">_Exercise_ </span> 

Now make the sender rank send the data first, such that it is out of order with the receiver.

_Question_: Why is this code working? 

The `Send` and `Recv` calls are blocking, but the sequence of `Send`s does not match the sequence of `Recv`s (the order is reversed).

## Blocking versus Non-Blocking Communication

### What does blocking mean? Why deadlocks depend on data size

When you use blocking sends (`comm.Send`) and blocking receives (`comm.Recv`), the order of operations matters, especially for large data.

If the order of `comm.Recv` calls on one process does not match the order of `comm.Send` calls on the other, a deadlock can occur. Here’s why:

#### How Blocking Communication Works

   * Blocking send (`comm.Send`): The call does not return until the message buffer can be reused (i.e. the message has either been copied out or been handed over to the matching receive).
   * Blocking receive (`comm.Recv`): The call does not return until the message is received.

#### Why Small Data Works

For small messages, MPI implementations often use an eager protocol:

   * The data is copied into a temporary buffer and sent immediately.
   * The send call returns as soon as the data is buffered, even if the receive is not yet posted.
   * This can hide a mismatched order for small messages.

#### Why Large Data Fails

For large messages, MPI typically uses a rendezvous protocol:

   * The send call waits until the matching receive is posted before transferring data.
   * If the receive order is wrong, both processes can end up waiting for each other, causing a deadlock.
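The argument above can be condensed into a toy model. This is plain Python, not MPI: the function `blocking_send_returns` and the constant `EAGER_LIMIT_BYTES` are hypothetical names, and the 4 KiB value is an arbitrary assumption (the real limit depends on the MPI implementation, the transport and its configuration). The model asks whether the sender's first `Send` (the data) can return while the receiver is still blocked in its first `Recv` (the metadata).

```python
# Toy model only: real MPI libraries choose their eager limit themselves
EAGER_LIMIT_BYTES = 4 * 1024  # assumed value for illustration

def blocking_send_returns(message_bytes, matching_recv_posted):
    """Return True if a standard-mode blocking send could complete."""
    if message_bytes <= EAGER_LIMIT_BYTES:
        # Eager protocol: the library buffers the message, the send returns
        return True
    # Rendezvous protocol: the send needs the matching receive to be posted
    return matching_recv_posted

# Out-of-order scenario: the sender starts with the data message,
# while the receiver is blocked in the receive for the metadata.
for image_size in (3, 50):
    data_bytes = image_size**2 * 8  # int64 pixels
    ok = blocking_send_returns(data_bytes, matching_recv_posted=False)
    outcome = "continues" if ok else "deadlocks"
    print(f"{image_size}x{image_size} image ({data_bytes} bytes): sender {outcome}")
```

In this model the same program is correct or deadlocked depending only on the message size, which is exactly why such bugs tend to surface late.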

### Making blocking/non-blocking more explicit

Takeaway from above, and motivation:

Blocking send (`comm.Send`): The call does not return until the <span style="color:red">send message buffer can be reused</span>.

Note that nothing is said about the state of the receiver.
This is confusing, or at least vague.

And <span style="color:red">it is also nasty</span>: if you test your MPI application with a tiny dummy dataset on the login node, all works well;
if you then submit your large real-world job to the cluster, it deadlocks.

But this is how it is defined in the MPI standard. (And the standard is always and by definition correct.)

#### Synchronous Send and Receive

You can use `comm.Ssend` to enforce a synchronous send. `comm.Ssend` returns only after a handshake with the corresponding receiver.

### <span style="color:Blue">_Exercise_ </span> 

Edit the above script and replace a single `comm.Send` by a `comm.Ssend` such that the application deadlocks for sure (also for small datasets)

#### Asynchronous Send and Receive
You can use `comm.Isend` for an asynchronous (non-blocking) send.
`comm.Isend` is typically used together with `comm.Irecv`, the non-blocking version of `comm.Recv`.

Example:

In [None]:
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    # Non-blocking send
    data = np.asarray([1, 2, 3], dtype=np.int64) #dtype must match the receive buffer
    req_send = comm.Isend(data, dest=1, tag=77)
    # Do some computation while the send is in progress
    result = 42**2
    # Wait for the send to complete
    req_send.Wait()
    print ("Sent:", data)
elif rank == 1:
    #Allocate memory for the receive buffer
    data = np.empty(3, np.int64)
    # Non-blocking receive
    req_recv = comm.Irecv(data, source=0, tag=77)
    # Do some computation while the receive is in progress
    result = 100/26
    # Wait for the receive to complete
    req_recv.Wait()
    print("Received:", data)

The above example also highlights that non-blocking calls can be used to interleave communication with computation.

This may improve performance as an additional benefit, but that is not the focus here. (The focus is on correct code.)

### <span style="color:Blue">_Exercise_ </span> 

Modify the metadata/data script from the previous exercise such that it never deadlocks, no matter the order of sends/receives and no matter how big the data is.

#### _Proposed Solution_

In [None]:
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

dataTag = 13
metaDataTag = 14

if rank == 0:
    #We emulate a (small) detector image, typically rank 0 would load it from file
    #Hence only rank 0 knows its size (might be varying). Datatype is assumed fixed
    imageSize = 3 #this information would e.g be derived from hdf5 metadata from image directly
    detImage_2BeSent = np.arange(imageSize**2, dtype=np.int64).reshape(imageSize,imageSize)
    #Send data first (*non-blocking*)
    req_data = comm.Isend(detImage_2BeSent, dest=1, tag=dataTag)
    print ("0 sent:\n", detImage_2BeSent)
    #Send metadata (executes immediately since the Isend above is non-blocking)
    req_metadata = comm.Isend(np.array(imageSize, dtype=np.int64), dest=1, tag=metaDataTag)
    req_data.Wait()
    req_metadata.Wait()
elif rank == 1:
    #Receive metadata first (we must know how large a buffer to allocate)
    imageSize_asArray = np.empty(1, dtype=np.int64)
    req_metadata = comm.Irecv(imageSize_asArray, source=0, tag=metaDataTag)
    req_metadata.Wait() #make sure we have data here (comment out this line and see what happens)
    imageSize = imageSize_asArray[0]
    #Now that we know what to expect we can allocate receive buffer
    data = np.empty(imageSize**2, dtype=np.int64)
    req_data = comm.Irecv(data, source=0, tag=dataTag)
    req_data.Wait()
    detImage_Recv = data.reshape(imageSize, imageSize)
    print ("1 recvd:\n", detImage_Recv)

### <span style="color:Blue">_Exercise_ </span> :

Comment out line `req_metadata.Wait()` (in rank 1) and see what happens

## Basic Communication Patterns Continued

### Broadcast
As the name suggests, `comm.Bcast` is used to fan out data from one rank to all the others.

The broadcasting rank is typically rank 0, whereas all other ranks (> 0) receive the data.

Example (taken from official documentation)

In [None]:
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

#Rank 0 allocates and initializes/define data to be broadcast
if rank == 0:
    data = np.arange(100, dtype='i')
#All other ranks allocate receive buffer
else:
    data = np.empty(100, dtype='i')
#Note 'collective' operation (*every* rank calls Bcast, not just rank 0) 
comm.Bcast(data, root=0)
#Check for all ranks that they have received the Broadcaster's data
for i in range(100):
    assert data[i] == i

### Scatter

Data from rank 0 is split into pieces, and these pieces are then distributed across the ranks (rank 0 keeps one piece as well). Like `Bcast`, this is a collective operation.

Difference between Scatter and Broadcast

<img src="MPI_ScatterBcast.png" alt="Alt Text" width="500"/>

_Example_

In [None]:
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

matrixSize = 6 #Square matrix

if matrixSize % size != 0:
    if rank == 0:
        print ("ERROR: number of rows is not a multiple of the number of processes")
        comm.Abort(1) #Scatter below requires equally sized pieces

# Only rank 0 creates the full array
if rank == 0:
    A = np.arange(matrixSize**2, dtype=np.float64).reshape(matrixSize, matrixSize)
    print(f"Root process ({rank}) created the Matrix:\n{A}")
else:
    A = None #This declaration is mandatory

# Scatter the rows
# Calculate how many rows each process gets (above we made sure that no remainder)
matrixRowsPerProcess = matrixSize//size
#Allocate memory for row-slices (also for process 0)
A_rows = np.empty((matrixRowsPerProcess, matrixSize), dtype=np.float64)

# Scatter the rows (collective operation)
comm.Scatter(A, A_rows, root=0)

# Print the result on each process
print(f"Process {rank} received:\n{A_rows}")
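The example above requires the number of rows to be a multiple of the number of processes. When that does not hold, the variable-size variant `Scatterv` is the usual alternative; it needs per-rank counts and displacements. The sketch below only computes these (plain numpy, no MPI call), and `row_partition` is a hypothetical helper name introduced for illustration.

```python
import numpy as np

def row_partition(number_of_rows, number_of_ranks):
    """Distribute rows as evenly as possible; the first ranks get one extra row."""
    base, remainder = divmod(number_of_rows, number_of_ranks)
    rows = np.full(number_of_ranks, base, dtype=np.int64)
    rows[:remainder] += 1
    # Index of the first row owned by each rank
    first_row = np.concatenate(([0], np.cumsum(rows)[:-1]))
    return rows, first_row

matrixSize = 6
for size in (2, 3, 4):
    rows, first_row = row_partition(matrixSize, size)
    # Scatterv expects counts/displacements in elements, not rows
    counts = rows * matrixSize
    displacements = first_row * matrixSize
    print(f"{size} ranks: rows={rows}, counts={counts}, displacements={displacements}")
```

On the root, these arrays would then be passed as `comm.Scatterv([A, counts, displacements, MPI.DOUBLE], A_rows, root=0)`, with each rank allocating `A_rows` according to its own entry in `rows`.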

### Gather

Data from all ranks is aggregated at rank 0.


<img src="MPI_Gather.png" alt="Alt Text" width="500"/>

_Example_

In [None]:
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

#Each rank initializes its share
vectorLengthLocal = 4
vectorLocal = np.arange(vectorLengthLocal, dtype=np.float64)
print(f"Process {rank} created local vector: {vectorLocal}")

# Root process prepares to receive all data
if rank == 0:
    vector = np.empty(size*vectorLengthLocal, dtype=np.float64)
else:
    vector = None

# Gather all local arrays to root
comm.Gather(vectorLocal, vector, root=0)
# Root process prints the gathered array
if rank == 0:
    print(f"Root process ({rank}) gathered vector: {vector}")


### <span style="color:Blue">_Exercise_ </span> 

Implement the parallel matrix-vector multiplication `Ax` for a (4 x 4) matrix `A` and a (4 x 1) vector `x` using `Bcast`, `Scatter` and `Gather`.

The code should work for 2 and 4 cores.


At the end, rank 0 (who collects the results) makes sure that the parallel version is correct (by doing the calculation itself)

<img src="Ax.jpg" alt="Ax" width="500"/>

#### _Proposed solution_

In [None]:
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

matrixSize = 4 #Square matrix
matrixRowsPerRank = matrixSize//size
vectorLength = matrixSize #Mandatory for well defined multiplication

if matrixSize % size != 0:
    if rank == 0:
        print ("WARNING: Number of matrix rows is not multiple of number of cores")
        comm.Abort(1)

#Allocate and/or initialize/define data (rank dependent)
#--------------------------------------------------------
if rank == 0:
    A = np.arange(matrixSize**2, dtype=np.float64).reshape(matrixSize, matrixSize)
    x = np.arange(vectorLength, dtype=np.float64)
    b = np.empty(vectorLength, dtype=np.float64)
else:
    A = None
    x = np.empty(vectorLength, dtype=np.float64)
    b = None

# Scatter Matrix
# --------------
A_local = np.empty((matrixRowsPerRank, matrixSize), dtype=np.float64)
comm.Scatter(A, A_local, root=0)

# Broadcast vector
# ----------------
comm.Bcast(x, root=0)

# Do parallel computation
# ------------------------
b_local = np.dot(A_local, x)

# Gather local results
# --------------------
comm.Gather(b_local, b, root=0)

# Unit Test (Possible since process 0 holds entire A in memory)
# -------------------------------------------------------------
if rank == 0:
    print ("Parallel computation equals sequential :", np.allclose(np.dot(A, x), b))
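The decomposition relies on the fact that each block of rows of `A` yields the corresponding block of `b` independently of the other blocks. A minimal single-process check of this identity, using numpy only, with `np.split` standing in for `Scatter` and `np.concatenate` for `Gather` (an emulation, not MPI):

```python
import numpy as np

matrixSize = 4
A = np.arange(matrixSize**2, dtype=np.float64).reshape(matrixSize, matrixSize)
x = np.arange(matrixSize, dtype=np.float64)

for size in (2, 4):
    # Stand-in for Scatter: one block of rows per (emulated) rank
    row_blocks = np.split(A, size, axis=0)
    # Each rank multiplies its block with the full (broadcast) vector
    partial_results = [block @ x for block in row_blocks]
    # Stand-in for Gather: concatenate in rank order
    b = np.concatenate(partial_results)
    print(f"{size} ranks: equals sequential result:", np.allclose(b, A @ x))
```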


### Any receive
Using `MPI.ANY_SOURCE`, it is possible to receive data from an unspecified rank.

Using `MPI.ANY_TAG`, it is possible to receive data with an unspecified tag.

_Example_

Rank 0 sends tagged data to rank 1, rank 1 is accepting any data

In [None]:
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    data = np.array([1.0, 2.0, 3.0], dtype=np.float64)
    # Send the data to rank 1 with tag 10
    comm.Send(data, dest=1, tag=10)
    print(f"[Rank 0] Sent data: {data} to Rank 1")

elif rank == 1:
    # Prepare a NumPy array to receive the data
    data = np.empty(3, dtype=np.float64)  # Same size as sender's array
    #Status object is filled by Recv with information about the received message (source, tag)
    status = MPI.Status()
    #Use ANY_SOURCE and ANY_TAG to receive from any rank/tag
    comm.Recv(data, source=MPI.ANY_SOURCE, tag=MPI.ANY_TAG, status=status)
    print(f"[Rank 1] Received data: {data}")
    print(f"[Rank 1] Message received from rank {status.Get_source()} with tag {status.Get_tag()}")
else:
    pass

### <span style="color:Blue">_Exercise_ </span> 

The first n-1 ranks send data (a numpy array) to the last rank, which receives all of it.

   * Use a random tag to send
   * The receiver rank (rank=size -1) only allocates one common buffer for all data
   * The receiver rank prints data, sender rank and tag of the received data

#### _Proposed solution_

In [None]:
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

if rank < size -1:
    data = rank*np.ones(3, dtype=np.float64)
    comm.Send(data, dest=size-1, tag=np.random.randint(size))
    print(f"[Rank {rank}] Sent data: {data} to Rank {size-1}")
else: #rank size -1
    numberOfReceivedMessages = 0
    data = np.empty(3, dtype=np.float64)  # Same buffer for all receives
    while numberOfReceivedMessages < size -1:
        status = MPI.Status()
        comm.Recv(data, source=MPI.ANY_SOURCE, tag=MPI.ANY_TAG, status=status)
        print(f"[Rank {rank}] Received data: {data} from rank {status.Get_source()} with tag {status.Get_tag()}")
        numberOfReceivedMessages += 1

## When to use MPI
### When not to use MPI

If you have a single program that runs sequentially on many (small) input files -> use Slurm job arrays.
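Inside a job array, Slurm sets the environment variable `SLURM_ARRAY_TASK_ID` for each array task, which is enough to pick the input file. A minimal sketch: the file names and the helper `select_input` are hypothetical, and the fallback to index 0 is only there so that the snippet also runs outside Slurm.

```python
import os

def select_input(input_files, task_id):
    """Return the input file handled by the array task with the given index."""
    return input_files[task_id]

# Hypothetical list of small input files, one per array task
input_files = [f"image_{i:03d}.h5" for i in range(100)]

# Set by Slurm for jobs submitted with e.g. `sbatch --array=0-99 job.sh`
task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", "0"))

print("This task processes:", select_input(input_files, task_id))
```

No communication between the tasks is needed, which is why MPI adds nothing in this situation.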

### Use for embarrassingly parallel algorithms ? Maybe

MPI may be worthwhile if you have large input data (e.g. a huge detector image) whose processing should finish as soon as possible, even if the algorithm is embarrassingly parallel.

What do we mean by embarrassingly parallel?

_Example_: 

Normalize a detector image such that all intensities lie in the interval [0,1)

This calculation can be done independently for each pixel.


In [None]:
import numpy as np
import matplotlib.pyplot as plt

bitDepth = 8
detectorImage = np.random.randint(2**bitDepth, size=(220, 200))
detectorImageNormalized = detectorImage.astype(np.float64) / 2**bitDepth
plt.imshow(detectorImage); plt.colorbar()
plt.show()
plt.imshow(detectorImageNormalized); plt.colorbar()
plt.show()

### <span style="color:Blue">_Exercise_ </span> 

Parallelize the above normalization

  * Allocate the random image on rank 0
  * Only rank 0 needs to hold the complete result (the full normalized image)
  * Use Scatter and Gather to distribute raw data and collect processed data (reuse the matrix Scatter code)
  * Compare the result with the single-process computation (done on rank 0 only)

_Proposed solution_

In [None]:
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

numberOfRows = 220
numberOfColumns = 200
bitDepth = 8

if rank == 0:
    detectorImage = np.random.randint(2**bitDepth, size=(numberOfRows,numberOfColumns)).astype(np.float64)
else:
    detectorImage = None

numberOfRowsPerCore = numberOfRows//size

detectorImageSlice = np.empty((numberOfRowsPerCore, numberOfColumns), dtype=np.float64)

comm.Scatter(detectorImage, detectorImageSlice, root=0)

detectorImageSliceNormalized = detectorImageSlice / 2**bitDepth

if rank == 0:
    detectorImageNormalized = np.empty((numberOfRows,numberOfColumns) , dtype=np.float64)
else:
    detectorImageNormalized = None

comm.Gather(detectorImageSliceNormalized, detectorImageNormalized, root=0)

if rank == 0:
    print ("Parallel computation equals sequential :", np.allclose(detectorImage/2**bitDepth, detectorImageNormalized))


### Use for non-embarrassingly parallel algorithms

What does non-embarrassingly mean in the context of image analysis? <BR>
Calculations cannot be done fully independently for each pixel; neighbouring pixels are involved <BR>

<img src="LaplaceStencil.png" alt="Alt Text" width="500"/>

As an example, we apply a Laplace filter to a random image

In [None]:
import numpy as np
from scipy.ndimage import laplace

import matplotlib.pyplot as plt

bitDepth = 8
detectorImage = np.random.randint(2**bitDepth, size=(220, 200)).astype(np.float64)
filteredImage = laplace(detectorImage, mode='constant', cval=0)
plt.imshow(detectorImage); plt.colorbar()
plt.show()
plt.imshow(filteredImage); plt.colorbar()
plt.show()

### <span style="color:Blue">_Exercise_ </span> 

Parallelize above code, first use Scatter + Gather as above. 

You only need to change a single line, i.e. the actual image processing, but maybe you also want to give better variable names

Does the unit test pass? Why not? Plot the difference (sequential result - parallel result) to get a hint.

_Proposed (Anti-)Solution_:

In [None]:
from mpi4py import MPI
import numpy as np

from scipy.ndimage import laplace

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

numberOfRows = 220
numberOfColumns = 200

bitDepth = 8
if rank == 0:
    detectorImage = np.random.randint(2**bitDepth, size=(numberOfRows,numberOfColumns)).astype(np.float64)
else:
    detectorImage = None

numberOfRowsPerCore = numberOfRows//size

detectorImageSlice = np.empty((numberOfRowsPerCore, numberOfColumns), dtype=np.float64)

comm.Scatter(detectorImage, detectorImageSlice, root=0)

#This is the only line that needs to be changed
detectorImageSliceFiltered = laplace(detectorImageSlice, mode='constant', cval=0)

if rank == 0:
    detectorImageFiltered = np.empty((numberOfRows,numberOfColumns) , dtype=np.float64)
else:
    detectorImageFiltered = None

comm.Gather(detectorImageSliceFiltered, detectorImageFiltered, root=0)

if rank == 0:
    print ("Parallel computation equals sequential :", np.allclose(laplace(detectorImage, mode='constant', cval=0), detectorImageFiltered))

### <span style="color:Blue">_Exercise_ </span> : 

Make the above code work

   * Assume 2 ranks only
   * Both ranks are allowed to allocate the initial 2D array (inefficient for a real application)
   * The 2D array is split into an upper and a lower part, as above, one for each rank
   * Rank 1 sends its (partial) filter result to rank 0 (point-to-point, no Gather/Scatter)
   * After having received the image stripe from rank 1, rank 0 does the unit test
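Why the plain Scatter/Gather version fails, and what extra information each stripe needs, can be checked in a single process before writing any MPI code. The sketch below is an emulation under stated assumptions: `laplace_zero_boundary` is a hand-written 5-point stencil standing in for `scipy.ndimage.laplace(..., mode='constant', cval=0)`, and the two "ranks" are just array slices.

```python
import numpy as np

def laplace_zero_boundary(u):
    """5-point Laplace stencil with zero padding outside the array."""
    p = np.pad(u, 1, mode="constant", constant_values=0.0)
    return p[:-2, 1:-1] + p[2:, 1:-1] + p[1:-1, :-2] + p[1:-1, 2:] - 4.0 * u

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(220, 200)).astype(np.float64)
reference = laplace_zero_boundary(image)
nh = image.shape[0] // 2

# Naive split: each stripe is filtered without knowing its neighbour
naive = np.vstack([laplace_zero_boundary(image[:nh]), laplace_zero_boundary(image[nh:])])
wrong_rows = np.unique(np.nonzero(~np.isclose(naive, reference))[0])
print("Rows that differ without overlap:", wrong_rows)

# Split with one extra row of overlap: filter the enlarged stripes, then drop that row
upper = laplace_zero_boundary(image[:nh + 1])[:-1]
lower = laplace_zero_boundary(image[nh - 1:])[1:]
with_overlap = np.vstack([upper, lower])
print("With overlap equals sequential:", np.allclose(with_overlap, reference))
```

Only the two rows adjacent to the cut are affected in the naive version, so one extra row per stripe is sufficient for this stencil.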

### The 2D Heat equation

The 2D heat equation reads

\begin{equation}
\frac{\partial u}{\partial t} = \alpha \Delta u ;
\quad  \quad
\Delta u = \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2u}{\partial y^2};
\quad \quad
u = u(x,y,t)
\end{equation}

We integrate this equation with a simple explicit Euler scheme to get the time evolution

\begin{equation}
u(x,y,k+1) = u(x,y,k) + dt \cdot \alpha \cdot \Delta u(x,y,k)
\end{equation}
where $u(x,y,k)$ is the 2D heat distribution at discretized time $k$ and $dt$ labels a time step.
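
For reference, on a grid with unit spacing the Laplacian is commonly approximated by the 5-point stencil, which is assumed here to be what the filter-based code in this section evaluates. The notation $u_{i,j}^{k}$ (value at grid point $(i,j)$ and time step $k$) is introduced for this note only:

\begin{equation}
\Delta u_{i,j}^{k} \approx u_{i+1,j}^{k} + u_{i-1,j}^{k} + u_{i,j+1}^{k} + u_{i,j-1}^{k} - 4 u_{i,j}^{k}
\end{equation}

so that a single Euler step reads

\begin{equation}
u_{i,j}^{k+1} = u_{i,j}^{k} + dt \cdot \alpha \cdot \left( u_{i+1,j}^{k} + u_{i-1,j}^{k} + u_{i,j+1}^{k} + u_{i,j-1}^{k} - 4 u_{i,j}^{k} \right)
\end{equation}

For this explicit scheme and unit grid spacing, the usual stability condition is $\alpha \cdot dt \le 1/4$. Each update of a grid point needs its four direct neighbours, which is why ranks that own adjacent stripes have to exchange their boundary rows.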

In [None]:
import numpy as np
from scipy.ndimage import laplace

numberOfGridPoints = 50
dt = 0.1
alpha = 1
numberOfIterations = 100000
u = np.zeros((numberOfGridPoints, numberOfGridPoints))
#initial condition (heat peak in center)
u[numberOfGridPoints//2, numberOfGridPoints//2] = 100
#boundary condition (Dirichlet boundary condition)
b = 1.0

def update(u, alpha, dt, b):
    return u + dt*alpha*laplace(u, mode='constant', cval=b)

for _ in range(numberOfIterations):
    u = update(u, alpha, dt, b)
'''
#Plotting only
import pylab as plt
from matplotlib.animation import FuncAnimation
fig, ax = plt.subplots()

def animate(i):
    global u
    u = update(u, alpha, dt, b)
    ax.imshow(u)
    ax.set_axis_off()

anim = FuncAnimation(fig, animate, frames=numberOfIterations, interval=10)

plt.show()
'''

### <span style="color:Blue">_Exercise_ </span> : 

Parallelize the above solver
   * Use the same structure as above (for the 2D filtering)
   * You must exchange boundary (HALO) between ranks after every iteration step


Illustration of the HALO concept:


<img src="halo.svg" alt="Alt Text" width="500"/>

_Proposed Solution_

In [None]:
from mpi4py import MPI
import numpy as np
from scipy.ndimage import laplace

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

assert size == 2

nr = 220
nc = 200
assert nr % 2 == 0
nh = nr//2

dt = 0.1
alpha = 1
bound = 1
numberOfIterations = 100

doExchangeHalo = True #Set to False to make the unit test fail

def update(u):
    return u + dt*alpha*laplace(u, mode='constant', cval=bound)

u_0 = np.arange(nr*nc, dtype = np.float64).reshape(nr,nc)

if rank == 0:
    u_upper = u_0[:nh +1, :]
else:
    u_lower = u_0[nh -1:, :]

for _ in range(numberOfIterations):
    if rank == 0:
        u_upper = update(u_upper)
        if doExchangeHalo:
            comm.Send(u_upper[-2,:], dest=1, tag=12)
            comm.Recv(u_upper[-1,:], source=1, tag=13)
    else:
        u_lower = update(u_lower)
        if doExchangeHalo:
            comm.Recv(u_lower[0,:], source=0, tag=12)
            comm.Send(u_lower[1,:], dest=0, tag=13)

if rank == 0:
    u_f_total = np.zeros_like(u_0)
    u_f_total[:nh, :] = u_upper[:-1, :]
    comm.Recv(u_f_total[nh:,:], source=1, tag=14)
else: #<- only since we know that we are running on 2 cores
    comm.Send(u_lower[1:,:], dest=0, tag=14)


if rank == 0:
    u = u_0
    for _ in range(numberOfIterations):
        u = update(u)
    u_f_total_seq = u
    print ("Parallel integration equals sequential :", np.allclose(u_f_total, u_f_total_seq))

## Unit tests

If possible, always compare the result of the parallelized algorithm with the result of the sequential algorithm.

(Use a dummy example with small input data and low computational burden, e.g. a reduced number of iteration steps.)

If the unit test fails, this does not necessarily mean that there is a bug in the parallel code. The parallel algorithm may simply not be mathematically equivalent to the sequential one, as the following example shows.

_Example_ (Machine Learning)

We use some (synthetic) data to train a (random forest) classifier


In [None]:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Generate or load your dataset (here, a synthetic dataset)
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a local model
model = RandomForestClassifier(n_estimators=10, random_state=42)
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)
print (y_pred)
accuracy = np.mean(y_pred == y_test)
print(f"Accuracy: {accuracy:.2f}")

### <span style="color:Blue">_Exercise_ </span> 

Parallelize above code

  * Every rank reads its portion of the synthetic overall data (e.g. in a modulo fashion)
  * Every rank trains its own model on its 'rank-local' data (embarrassingly parallel)
  * Every rank makes its rank-local prediction (on the full test set)
  * Rank 0 gathers all local predictions and does a majority vote on the overall predictions
  * Hint: You can use a one-liner: `all_preds = comm.gather(y_pred_local, root=0)`
    This returns a list (list index equals rank ID)
    

Is the Accuracy the same as for the sequential algorithm ? (Is the splitting of the model 'linear'?)
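What distributing the data "in a modulo fashion" means can be tried out without MPI. In the sketch below, the ranks are emulated by a loop, and the number of samples and ranks are arbitrary small values chosen for illustration:

```python
import numpy as np

number_of_samples = 10
size = 3  # emulated number of ranks (assumption for this illustration)
indices = np.arange(number_of_samples)

# Each emulated rank takes every size-th sample, starting at its own rank
local_indices = [indices[rank::size] for rank in range(size)]
for rank, local in enumerate(local_indices):
    print(f"rank {rank}: {local}")

# Every sample is assigned to exactly one rank
all_assigned = np.sort(np.concatenate(local_indices))
print("Complete and disjoint:", np.array_equal(all_assigned, indices))
```

The strided slices cover the data without overlap, and the local sizes differ by at most one sample.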

_Proposed Solution_

In [None]:
from mpi4py import MPI
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Initialize MPI
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Generate or load your dataset (a synthetic dataset). Every process holds all data, which is
# clearly inefficient; in a production application, each rank would read its share from e.g. an hdf5 file
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Split training data across processes (in a modulo fashion)
X_train_local = X_train[rank::size]
y_train_local = y_train[rank::size]

# Train a local model (on the local training set)
model = RandomForestClassifier(n_estimators=10, random_state=42)
model.fit(X_train_local, y_train_local)

# Make local predictions (On full test set)
y_pred_local = model.predict(X_test)

# Gather predictions from all processes to rank 0
# This syntax gathers a list (index = rank). (For gathering non-equal size arrays)
all_preds = comm.gather(y_pred_local, root=0)

if rank == 0:
    # Stack predictions and compute ensemble prediction
    all_preds = np.vstack(all_preds)
    # For binary classification (labels 0/1): majority vote
    y_pred_ensemble = np.round(np.mean(all_preds, axis=0)).astype(int)
    print (y_pred_ensemble)
    # For regression: average
    #y_pred_ensemble = np.mean(all_preds, axis=0)
    # Evaluate ensemble performance
    accuracy = np.mean(y_pred_ensemble == y_test)
    print(f"Ensemble accuracy: {accuracy:.2f}")
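Note that rounding the mean is a majority vote only for binary labels 0/1, and that `np.round` rounds an exact tie (mean 0.5) to the nearest even value, i.e. to class 0. A more general sketch for non-negative integer class labels follows. It uses a small made-up prediction matrix (rows = ranks, columns = test samples); `majority_vote` is a hypothetical helper, and ties go to the smallest label because `argmax` returns the first maximum.

```python
import numpy as np

def majority_vote(all_preds, number_of_classes):
    """Column-wise majority vote over integer labels in [0, number_of_classes)."""
    all_preds = np.asarray(all_preds)
    # votes[c, j] = how many ranks predicted class c for sample j
    votes = np.stack([(all_preds == c).sum(axis=0) for c in range(number_of_classes)])
    return votes.argmax(axis=0)

# Made-up predictions of 3 ranks for 5 test samples (3 classes)
all_preds = np.array([[0, 1, 2, 2, 1],
                      [0, 2, 2, 1, 1],
                      [1, 2, 0, 2, 1]])

print(majority_vote(all_preds, number_of_classes=3))
```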

## Integration with Slurm
So far all examples were launched from the command line and executed on the login node.

Real-world jobs (HPC use case with large memory consumption and intense number crunching) must be submitted to the cluster.

_Example_

In [None]:
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=1
#SBATCH -J ringComm
#SBATCH -o ringComm.out
#SBATCH -e ringComm.err
export PATH=/psi/home/studer_a1/data/condaforge/miniforge/bin:$PATH
source activate mpi
#It is not needed to specify number of processes here, done by Slurm
mpiexec python ringComm.py

The ring communication example highlights again the blocking/non-blocking aspects of the different send/receive modes

In [None]:
'''
This python program sends data (a numpy array) in a ring topology using mpi4py.
The main purpose is to demonstrate the effect of different send/receive modes,
especially blocking versus non-blocking. We need at least two ranks, i.e.
$ mpiexec -n 2 python ringComm.py
'''

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

sendFunction = comm.Send #default send, blocking (returns after application buffer can be reused[*])
recvFunction = comm.Recv #default recv, blocking (returns after application buffer is filled)
#sendFunction = comm.Ssend #Synchronous send (returns after handshake with receiver), deadlock [**]
#sendFunction = comm.Isend #Non-blocking send, recommended to avoid deadlock
#recvFunction = comm.Irecv #Non-blocking recv (not needed; if used, wait for completion before reading the buffer,
#like: req = comm.Irecv(...); req.Wait() )


arraySize = 1

#Create numpy array with data unique to this rank/Process
data2beSent = np.ones(arraySize, dtype='i')*rank

#Allocate buffer for data to be received, should match with what is sent
data2beRecv = np.empty(arraySize, dtype='i')

#Define source and destination for this rank (nearest neighbours)
leftNeighbor = (rank -1 + size) % size
rightNeighbor = (rank + 1) % size

#Send
sendFunction([data2beSent, MPI.INT], dest=rightNeighbor, tag=0)
#Recv
recvFunction([data2beRecv, MPI.INT], source=leftNeighbor, tag=0)

#Send-And-Recv (Should always work, communication patterns hidden, so not suited for demonstration purpose)
#comm.Sendrecv(sendbuf=data2beSent, dest=rightNeighbor, sendtag=0,
#              recvbuf=data2beRecv, source=leftNeighbor,recvtag=0)

#Check if algorithm output matches our expectations
print(f"Process {rank} received data {data2beRecv} from process {leftNeighbor}")

'''
[*] Might deadlock for large numpy arrays: if the MPI library decides that buffering the data
in library space is inefficient (i.e. copying data from application space to library space -> memory is doubled),
MPI might do a comm.Ssend under the hood; then the receiver must be ready at the time of sending.
[**] If you hear the fan kicking in: this indicates 'busy-waiting' (worst case scenario)
'''
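
The output to expect from the ring exchange can be predicted without MPI, which gives a reference for checking a run of `ringComm.py`. The loop below emulates the ranks; the number of ranks is an assumption matching the `--ntasks-per-node=4` of the job script above.

```python
size = 4  # emulated number of ranks (assumption for this illustration)

expected_received = []
for rank in range(size):
    leftNeighbor = (rank - 1 + size) % size
    rightNeighbor = (rank + 1) % size
    # Every rank sends its own rank number to the right,
    # hence it receives the rank number of its left neighbour
    expected_received.append(leftNeighbor)
    print(f"Process {rank}: sends to {rightNeighbor}, expects {leftNeighbor} from process {leftNeighbor}")
```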

## Outlook: Running Jobs on a GPU cluster