# MPI Guide
In this guide, we will run some multi-node jobs using the Message Passing Interface (MPI). Because Slurm is a popular scheduler on HPC clusters and supercomputers, mainstream MPI implementations have built-in support for it. When you launch MPI software inside a Slurm job, it recognizes the Slurm environment and starts accordingly: it launches the right number of parallel processes on the nodes allocated to the job. You therefore don't need to write a machinefile/hostfile or pass the `-np` option manually.  
For more information, check out the [Slurm - MPI User Guide](/doc/mpi_guide.html).

## Example: calculate $\pi$
In this example, we will estimate the value of $\pi$ using the [Monte Carlo method](https://en.wikipedia.org/wiki/Monte_Carlo_method) with [OpenMPI](https://www.open-mpi.org/). 
In this lab environment, some of the software and libraries are managed with [Environment Modules](https://modules.readthedocs.io/en/latest/), a convenient way to switch between multiple libraries and tools, including different or even conflicting versions of them.
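
As a serial point of reference for the Monte Carlo estimate described above, here is a minimal sketch in shell and `awk`. It assumes the usual quarter-circle formulation, where points are sampled uniformly in the unit square and $\pi \approx 4 \cdot \text{hits} / N$; the function name `estimate_pi` is hypothetical, and `mpi-pi/parallel-pi.c` may organize the computation differently.

```shell
# Serial Monte Carlo estimate of pi (illustrative sketch, not the lab code).
# Usage: estimate_pi <samples> [seed]
estimate_pi() {
    awk -v n="$1" -v seed="${2:-1}" 'BEGIN {
        srand(seed)
        hits = 0
        for (i = 0; i < n; i++) {
            x = rand(); y = rand()
            # count the points that fall inside the quarter circle
            if (x * x + y * y <= 1) hits++
        }
        printf "%.4f\n", 4 * hits / n
    }'
}

estimate_pi 100000
```

An MPI version would typically split the samples among the processes and combine the per-process hit counts at the end.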

In [None]:
# check available modules
module avail

In [None]:
# loading the mpi module
module load mpi

In [None]:
# list loaded modules
module list

Next, we can take a look at the code and build it. 

In [None]:
cat mpi-pi/parallel-pi.c

In [None]:
make --directory mpi-pi

Now we are ready to run the code. Slurm provides several ways of running an MPI program. One of them is the `--mpi` option of `srun`. With this option, you can launch the MPI program directly from the submission host and see its stdout right there, while the actual execution happens on the compute nodes. For more details on the option, check out the man page for [`srun --mpi`](/doc/srun.html#OPT_mpi).  
For starters, use the option `--ntasks <N>` to specify how many MPI processes you would like to run. If you have more specific requirements for the number of nodes, processes, or memory, you could use a combination of the `--nodes`, `--ntasks-per-node`, `--cpus-per-task`, and `--mem` options. 
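
To see how a given combination of these options lays tasks out, it can help to have every task report where it runs. The sketch below relies on the `SLURM_PROCID`, `SLURM_NTASKS`, and `SLURM_NODEID` variables that `srun` sets for each task; the function name `task_info` and the script name used in the note below are hypothetical.

```shell
# Report the rank and location of the current task (illustrative sketch).
# Outside Slurm the SLURM_* variables are unset, so the defaults keep the
# function runnable on any machine.
task_info() {
    echo "task ${SLURM_PROCID:-0} of ${SLURM_NTASKS:-1} on node ${SLURM_NODEID:-0} ($(uname -n))"
}

task_info
```

Saved as, say, `task-info.sh`, it could be launched with `srun --nodes=2 --ntasks-per-node=2 bash task-info.sh` to print one line per task.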

In [None]:
# 2 parallel processes on 1 node
srun --nodes=1 --ntasks-per-node=2 --mpi=pmix mpi-pi/parallel-pi

In [None]:
# 8 parallel processes, possibly spread across nodes
srun --ntasks=8 --mpi=pmix mpi-pi/parallel-pi

In [None]:
# request 4 nodes, 2 processes on each node
srun --nodes=4 --ntasks-per-node=2 --mem=0 --mpi=pmix mpi-pi/parallel-pi

You might be surprised to see a multi-node run finish much more slowly than a single-node run. That is because this example calls `MPI_Reduce` far more often than necessary. `MPI_Reduce` is a collective operation: every process has to take part and exchange data each time it is called, so the processes effectively wait for one another, and that is very costly across nodes.  
In the next section, we will run the HPL benchmark, which doesn't have such an issue and even offers an OpenMP multithreading option to further reduce cross-node synchronization and communication. 
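
A rough cost model makes the effect of frequent `MPI_Reduce` calls in the $\pi$ example easier to see. The symbols are assumptions introduced here for illustration: $N$ is the total number of samples, $p$ the number of processes, $t_s$ the time to draw and test one sample, $k$ the number of `MPI_Reduce` calls, and $t_r(p)$ the time of one reduction over $p$ processes.

$$
T(p) \approx \frac{N}{p}\, t_s + k \, t_r(p)
$$

With a single reduction at the end ($k = 1$), the second term is negligible and the run time shrinks roughly as $1/p$. If $k$ grows with the number of samples, the second term can dominate, and it gets worse once the processes span several nodes, because $t_r(p)$ then includes network latency.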

## HPL Benchmark
The [High-Performance Linpack (HPL)](https://netlib.org/benchmark/hpl/) is a common benchmark in HPC/Supercomputing. It rates a cluster's computational power by measuring how many floating-point operations per second (FLOPS) it sustains while solving a dense linear system. HPL is commonly used to rank the best supercomputers in the world, for user acceptance testing (UAT) of new clusters/hardware, and as a stress test after hardware replacement in an HPC environment. In this section, we will build and run the HPL benchmark via Slurm.

### Install Spack
[Spack](https://spack.io/) is an HPC software package manager. It offers many compilers and HPC software packages, which are built from source locally when you install them. It is one of the 10 initial projects in the [High Performance Software Foundation](https://hpsfoundation.github.io/#projects), formed by the [Linux Foundation](https://www.linuxfoundation.org/press/linux-foundation-announces-intent-to-form-high-performance-software-foundation-hpsf). We are going to install Spack into our container lab cluster and then use it to build the HPL benchmark.

In [None]:
git clone -c feature.manyFiles=true https://github.com/spack/spack.git ~/.local/spack
git -C ~/.local/spack checkout v1.0.2

# add this line to set up spack on login
ansible -m lineinfile -a "path=${HOME}/.bashrc line='source ~/.local/spack/share/spack/setup-env.sh'" localhost

# activate spack
source ~/.local/spack/share/spack/setup-env.sh
which spack
# quote the argument so the shell does not treat [lmod] as a glob pattern
spack config add "modules:default:enable:[lmod]"

# detect available compilers
spack compiler find
spack compilers

### Install HPL with Spack

In [None]:
spack list hpl

List and confirm the configuration Spack is going to use for installing HPL.

In [None]:
spack spec hpl

Build HPL and all its dependencies in parallel across 4 nodes. Make sure Spack is installed on a shared file system that supports `flock`, and that you haven't turned off Spack's default locking mechanism; otherwise the concurrent builds can interfere with one another and things could go terribly wrong.

In [None]:
srun --nodes=4 --ntasks-per-node=1 --exclusive spack install hpl

# verify hpl has been installed & set up the modules
spack find hpl

# register every directory that holds generated lmod module files
# (sort -u removes duplicates even when they are not adjacent)
for mod_path in $( find ~/.local/spack/share/spack/lmod -iname "*.lua" | xargs dirname | xargs dirname | sort -u ); do
    module use "$mod_path"
    ansible -m lineinfile -a "path=${HOME}/.bashrc line='module use $mod_path'" localhost
done

module avail

### Run HPL
To run the HPL benchmark, we need to prepare an `HPL.dat` file that describes the problem size and the configuration for running the benchmark. We can also choose between running it in pure multi-process MPI or a hybrid MPI + OpenMP execution. 
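
The problem size is usually the first parameter to settle in `HPL.dat`. A widely used rule of thumb, which is an assumption here and not something this guide prescribes, is to pick N so that the N x N matrix of 8-byte doubles fills roughly 80% of the total memory of the allocated nodes, rounded down to a multiple of the block size NB. The helper name `hpl_problem_size` and the default NB of 192 are hypothetical.

```shell
# Hypothetical helper: suggest an HPL problem size N.
# Usage: hpl_problem_size <total_mem_GiB> [NB] [percent_of_memory]
hpl_problem_size() {
    awk -v mem_gib="$1" -v nb="${2:-192}" -v pct="${3:-80}" 'BEGIN {
        bytes = mem_gib * 1024 * 1024 * 1024 * pct / 100
        n = int(sqrt(bytes / 8))   # 8 bytes per double-precision element
        n -= n % nb                # round down to a multiple of NB
        printf "%d\n", n
    }'
}

# Example: 4 nodes with 16 GiB each, NB = 192
hpl_problem_size 64 192
```

Treat the result as a starting point only; the best N and NB for a given machine are found by experiment.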

In [None]:
# load hpl from module or spack
module load hpl || spack load hpl

Below is an example `HPL.dat` file. For details and tuning of these parameters, please refer to the [HPL Tuning Guide](https://www.netlib.org/benchmark/hpl/tuning.html).

In [None]:
cat ./HPL.dat

In [None]:
# MPI + OpenMP hybrid run
OMP_NUM_THREADS=2 srun --nodes=4 --ntasks-per-node=1 --cpus-per-task=2 --mpi=pmix xhpl

Next is an example sbatch job script for HPL that generates an `HPL.dat` file from environment variables provided by Slurm. This allows for a bigger, longer run when more resources are requested for the job.
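
To give a feel for how such a script can derive its settings, here is a minimal sketch. It is not the `hpl-job.sh` used in this lab. It assumes that `SLURM_NTASKS` and `SLURM_CPUS_PER_TASK` are set by the job (the latter only when `--cpus-per-task` is given) and follows the common advice of choosing a process grid P x Q that is close to square with P no larger than Q. The function name `hpl_grid` is hypothetical.

```shell
# Illustrative sketch only; NOT the hpl-job.sh shipped with this lab.

# Outside a Slurm job these variables are unset, so fall back to 1.
ntasks="${SLURM_NTASKS:-1}"
export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"

# Choose P x Q = ntasks with P <= Q and P as close to sqrt(ntasks) as possible.
hpl_grid() {
    n=$1
    p=1
    i=1
    while [ $((i * i)) -le "$n" ]; do
        if [ $((n % i)) -eq 0 ]; then
            p=$i   # largest divisor found so far that is <= sqrt(n)
        fi
        i=$((i + 1))
    done
    echo "$p $((n / p))"
}

read -r P Q <<EOF
$(hpl_grid "$ntasks")
EOF

echo "tasks=$ntasks threads=$OMP_NUM_THREADS grid=${P}x${Q}"
```

A real job script would then write these values into the matching lines of `HPL.dat` before launching `xhpl` with `srun`.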

In [None]:
# Example sbatch job script
cat ./hpl-job.sh

In [None]:
# MPI + OpenMP Hybrid run with sbatch job script
sbatch --nodes=4 --ntasks-per-node=1 --cpus-per-task=2 ./hpl-job.sh