MPI is a complicated beast, conjuring thoughts of grey-beards sitting in front of amber screens, typing Fortran code that runs in distant, dark computing halls.
But it's not like that! MPI is something that anyone can use! If you have
code that would normally
clone()) itself, but your single
machine does not have enough CPU cores to meet your needs, MPI is a way to run
the same code across multiple machines, without needing to think about setting
up or coordinating IPC.
The code and docs in this repository are targeted at Python developers, and will take you from zero to running a simple MPI Python script!
Start with an Ubuntu Bionic (18.04) system, and run these commands to install software and run this code!
git clone https://github.com/akkornel/mpi4py.git sudo -s apt install python3-mpi4py echo 'btl_base_warn_component_unused = 0' > /etc/openmpi/openmpi-mca-params.conf exit cd mpi4py mpirun -n 20 python3 mpi4.py
The Python code is simple: It has one 'controller', and one or more 'workers':
The controller generates a random number, and broadcasts it to everyone.
Each worker has a unique ID (its "rank"), and the hostname of the machine where it is running. The worker receives the random number, adds its rank, and sends the result (the new number) and its hostname back to the controller.
The controller "gathers" all of the results (which come to it sorted by rank), and displays them. The output includes an "OK" if the math (random number + rank) is correct.
The Python code is simple: it does not have to deal with launching workers, and it does not have to set up communications between them (inter-process communications, or IPC).
The above example showed 20 copies of mpi4.py, all running on the local system. But, if you have two identical machines, you can spread the workers out across all hosts, without any changes:
If you have access to a compute environment running SLURM, you can let SLURM do all of the work for you:
Read on for information on the technologies that make all this work, and how to try it for yourself!
Behind the Scenes
Behind the above example, there are many moving pieces. Here is an explanation of each one.
MPI is an API specification for inter-process communication. The specification defines bindings for both C/C++ and Fortran, which are used as the basis for other languages (in our case, Python). MPI provides for many IPC needs, including…
Bi-directional communication, both blocking and non-blocking
Scatter-gather and broadcast communication
Shared access to files, and to pools ("windows") of memory
The latest published version of the standard is MPI-3.1, published June 4, 2015. MPI specification versions have an "MPI-" prefix to distinguish them from implementation versions. Specifications are developed by the MPI Forum.
MPI is interesting in that it only specifies an API; it is up to other groups to provide the implementations of the API. In fact, MPI does not specify things like encodings, or wire formats, or even transports. The implementation has full reign to use (what it feels is) the most efficient encoding, and to support whatever transport it wants. Most MPI implementations support communications over Infiniband (using native Infiniband verbs) or TCP (typically over Ethernet). Newer implementations also support native communication over modern 'converged' high-performance platforms, like RoCE.
Because each MPI implementation has control over how the MPI API is implemented, a given "world" of programs (or, more accurately, a world of copies of a single program) must all use the same MPI implementation.
OpenMPI is an implementation of the MPI API, providing libraries and headers for C, C++, Fortran, and Java; and supporting interfaces from TCP to Infiniband to RoCE. OpenMPI is fully open-source, with contributors from adademic and research fields (IU, UT, HPCC) to commercial entities (Cisco, Mellanox, and nVidia). It is licensed under the 3-clause BSD license. Your Linux distribution probably has it packaged already.
NOTE: When running OpenMPI for the first time, you may get a warning which says "A high-performance Open MPI point-to-point messaging module was unable to find any relevant network interfaces". That is OpenMPI complaining that it is not able to find anything other than a basic Ethernet interface.
To suppress the warning about no high-performance network interfaces, edit the
file at path
/etc/openmpi/openmpi-mca-params.conf, and insert the line
btl_base_warn_component_unused = 0.
As this is a Python demonstration, you will need Python! MPI has been around for a long time, and so many versions of Python are supported. The code in this repo works with Python 2.7, and also Python 3.5 and later.
The Python code in this repository includes type annotations. To run in Python 2.7, you will need to install the typing package (it comes as part of the standard library in 3.5 and later).
There are no MPI implementations written specifically for Python. Instead, the
mpi4py package provides a Python wrapper
around your choice of MPI implementation. In addition, mpi4py provides two
ways to interface with MPI: You can provide binary data (as a
bytes string or
bytearray); or you can provide a Python object, which mpi4py will serialize
for you (using
pickle). This adds a massive convenience for Python
programmers, because you do not have to deal with serialization (other than
ensuring your Python objects are pickleable).
WARNING: Although mpi4py will handle serialization for you, you are still responsible for checking what data are being received. It is your code running at the other end, but you may not be sure exactly what state the other copy is in right now.
NOTE: If you plan on building mpi4py yourself, you should be aware that MPI
was created at a time when
pkg-config did not
exist. Instead of storing things like compiler flags in a
.pc file, it
provides a set of wrapper scripts (
mpicc, and the like), which you should
call instead of calling the normal compiler programs. The commands are built
as part of a normal OpenMPI installation, and mpi4py's
setup.py script knows
to use them, and so
pip install --user mpi4py should work.
The code in this repository was tested by the author on the following OS, OpenMPI, and mpi4py configurations, on an x64 platform:
OpenMPI 3.0.0 (built from upstream) was used in all tests.
mpi4py version 3.0.2 (built from upstream) was used in all tests. The Ubuntu-supplied version (1.10.2-8ubuntu1) was tested, but reliably crashed with a buffer overflow.
The above software was tested in the following Python versions:
Python 2.7 (2.7.12-1~16.04)
Python 3.5 (3.5.1-3)
Python 3.6.8 from upstream
Python 3.7.3 from upstream
OpenMPI 2.1.1 (2.1.1-8) was used in all tests.
mpi4py 2.0.0 (2.0.0-3) was used in all tests.
The above software was tested in the following Python versions:
Python 2.7 (2.7.15~rc1-1).
Python 3.6 (3.6.7-1~18.04).
You can run this demonstration code multiple ways. This documentation explains
how to run directly (with
mpirun). This works automatically on one machine,
and can be configured to run on multiple machines. Information is also
provided for people with access to a SLURM environment.
The easiest way to run your code is with the
mpirun command. This command,
run in a shell, will launch multiple copies of your code, and set up
communications between them.
Here is an example of running
mpirun on a single machine, to launch five
copies of the script:
akkornel@machine1:~/mpi4py$ mpirun -n 5 python3 mpi4.py Controller @ MPI Rank 0: Input 773507788 Worker at MPI Rank 1: Output 773507789 is OK (from machine1) Worker at MPI Rank 2: Output 773507790 is OK (from machine1) Worker at MPI Rank 3: Output 773507791 is OK (from machine1) Worker at MPI Rank 4: Output 773507792 is OK (from machine1)
-np) option specifies the number of MPI tasks to run.
Note above how all of the MPI tasks ran on the same machine. If you want to run your code across multiple machines (which is probably the reason why you came here!), there are some steps you'll need to take:
All machines should be running the same architecture, and the same versions of OS, MPI implementation, Python, and mpi4py.
Minor variations in the version numbers may be tolerable (for example, one machine running Ubuntu 18.04.1 and one machine running Ubuntu 18.04.2), but you should examine what is different between versions. For example, if one machine is running a newer mpi4py, and that version fixed a problem that applies to your code, you should make sure mpi4py is at least that version across all machines.
There should be no firewall block: Each host needs to be able to communicate with each other host, on any TCP or UDP port. You can limit the ports used, but that is out of the scope of this quick-start.
SSH should be configured so that you can get an SSH prompt on each machine without needing to enter a password or answer a prompt. In other words, you'll need public-key auth configured.
The code should either live on shared storage, or should be identical on all nodes.
mpirunwill not copy your code to each machine, so it needs to be present at the same path on all nodes. If you do not have shared storage available, one way to ensure this is to keep all of your code in a version-controlled repository, and check the commit ID (or revision number, etc.) before running.
Assuming all of those conditions are met, you just need to tell
hosts to use, like so:
akkornel@machine1:~/mpi4py$ cat host_list machine1 machine2 akkornel@machine1:~/mpi4py$ mpirun -n 5 --hostfile host_list python3 mpi4.py Controller @ MPI Rank 0: Input 1421091723 Worker at MPI Rank 1: Output 1421091724 is OK (from machine1) Worker at MPI Rank 2: Output 1421091725 is OK (from machine1) Worker at MPI Rank 3: Output 1421091726 is OK (from machine2) Worker at MPI Rank 4: Output 1421091727 is OK (from machine2)
OpenMPI will spread out the MPI tasks evenly across nodes. If you would like
to limit how many tasks a host may run, add
slots=X to the file, like so:
machine1 slots=2 machine2 slots=4
In the above example, machine1 has two cores (or, one hyper-threaded core) and machine2 has four cores.
NOTE: If you specify a slot count on any machine,
mpirun will fail if
you ask for more slots that what is available.
And that's it! You should now be up and running with MPI across multiple hosts.
Many compute environments, especially in HPC, use the SLURM job scheduler. SLURM is able to communicate node and slot information to programs that use MPI, and in some MPI implementations SLURM is able to do all of the communications setup on its own.
With SLURM, there are two ways of launching our MPI job. The first is to use
srun, launching the job in a synchronous fasion (that was shown in the
example at the top of this page). The second is to use
sbatch, providing a
batch script that will be run asynchronously.
Regardless of the method of execution (that is, regardless of the SLURM command we run), we will need to tell SLURM what resources we will require:
Time. Our job takes less than five minutes to complete.
Number of Tasks. This is the equivalent to the
mpirun. We need to specify how many MPI tasks to run.
CPUs/cores. Our code is single-threaded, so this is "1 per task".
Memory. We are running a single Python interpreter in each task, but let's overestimate and say "512 MB per core". Note that SLURM specifies memory either per core or per node, not per task.
Note that in our request we never specified how many machines to run on. As long as all machines have access to the same storage (which is normal in a compute environment), we do not care exactly where our program runs. All of our tasks may run on a single machine, or they may be spread out across multiple machines.
Note also that SLURM has support for many different MPI implementations, but in
some cases, special care may need to be taken. For OpenMPI, nothing special is
required: SLURM supports OpenMPI natively, and OpenMPI supports SLURM by
default. More information on other MPI implementations is available from
SLURM. To see which MPIs are
supported by your SLURM environment, run the command
srun --mpi list. You
can check the default by looking at the environment's
slurm.conf file (which
is normally located in
/etc on each machine). The setting is named
In the examples which follow, we will always be requesting at least two nodes.
Synchronous SLURM with salloc
Running "synchronously" means you submit the work to SLURM, and you wait for results to come back. Any messages output by your code (for example, via standard output or standard error) will be displayed in the terminal.
To run synchronously, we use the
srun command, like so:
akkornel@rice15:~/mpi4py$ module load openmpi/3.0.0 akkornel@rice15:~/mpi4py$ srun --time 0:05:00 --ntasks 10 --cpus-per-task 1 --mem-per-cpu 512M --nodes 2-10 python3 mpi4.py Controller @ MPI Rank 0: Input 3609809428 Worker at MPI Rank 1: Output 3609809429 is OK (from wheat09) Worker at MPI Rank 2: Output 3609809430 is OK (from wheat10) Worker at MPI Rank 3: Output 3609809431 is OK (from wheat11) Worker at MPI Rank 4: Output 3609809432 is OK (from wheat12) Worker at MPI Rank 5: Output 3609809433 is OK (from wheat13) Worker at MPI Rank 6: Output 3609809434 is OK (from wheat14) Worker at MPI Rank 7: Output 3609809435 is OK (from wheat15) Worker at MPI Rank 8: Output 3609809436 is OK (from wheat16) Worker at MPI Rank 9: Output 3609809437 is OK (from wheat17)
In the above example, OpenMPI is provided via an Lmod module, instead of a system package. We have already built mpi4py using that version of OpenMPI. So, before running our code, we first load the same OpenMPI module used to build mpi4py.
(We don't need to run
module load on the remote system, because Lmod modules
do their work by manipulating environment variables, which SLURM copies over to
the compute nodes.)
Once that is complete, we use the
srun command to run our job. SLURM finds
enough nodes to run our job, runs it, and prints the contents of standard
output and standard error.
Note that the above example assumes OpenMPI is configured as SLURM's default
MPI. If not , you can provide
--mpi=openmpi to the
srun command, to force
it to use native OpenMPI.
SLURM batch jobs with sbatch
When running as a batch job, most of the resource options should be specified
in the script that is provided to
sbatch. That script should include (via
##SBATCH special comments) all of the resource-setting options that you
previously included in the
srun command line, except for
Since the code supports any number of workers (as is typical for an MPI
program), you should provide
--ntasks on the
sbatch command-line, so that
you can adjust it to fit your current needs.
This example uses the sbatch script included in this repository. Ten workers are requested.
akkornel@rice15:~/mpi4py$ sbatch --ntasks 10 mpi4.py.sbatch Submitted batch job 1491736 akkornel@rice15:~/mpi4py$ sleep 10 akkornel@rice15:~/mpi4py$ squeue -j 1491736 JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON) akkornel@rice15:~/mpi4py$ cat slurm-1491736.out Controller @ MPI Rank 0: Input 3761677288 Worker at MPI Rank 1: Output 3761677289 is OK (from wheat08) Worker at MPI Rank 2: Output 3761677290 is OK (from wheat08) Worker at MPI Rank 3: Output 3761677291 is OK (from wheat08) Worker at MPI Rank 4: Output 3761677292 is OK (from wheat08) Worker at MPI Rank 5: Output 3761677293 is OK (from wheat08) Worker at MPI Rank 6: Output 3761677294 is OK (from wheat08) Worker at MPI Rank 7: Output 3761677295 is OK (from wheat08) Worker at MPI Rank 8: Output 3761677296 is OK (from wheat08) Worker at MPI Rank 9: Output 3761677297 is OK (from wheat08)
When SLURM launches the job, the batch script is executed on one of the nodes
allocated to the job. The batch script is responsible for setting up the
environment (in this case, loading the OpenMPI module), and then executing
srun. Note that in the batch script,
srun does not have any command-line
options (except for
srun is able to recognize that it is running
inside a batch job, and will automatically scale itself to use all of the
That's it! Without having to change any Python code, your work is now being run via a job scheduler, with all of the resource scheduling and IPC facilities manged for you.
Where to go next?
This was only a dip of the toe into the world of MPI. To continue on, you should first access the mpi4py documentation. As most MPI publications provide their code in C, the mpi4py docs will guide you on how to convert that C code into Python calls. You can also use the built-in help system, like so:
akkornel@rice13:~$ python3.5 Python 3.5.2 (default, Nov 12 2018, 13:43:14) [GCC 5.4.0 20160609] on linux Type "help", "copyright", "credits" or "license" for more information. >>> from mpi4py import MPI >>> mpi_comm = MPI.COMM_WORLD >>> help(mpi_comm.Get_rank) Help on built-in function Get_rank: Get_rank(...) method of mpi4py.MPI.Intracomm instance Comm.Get_rank(self) Return the rank of this process in a communicator
The mpi4py docs also has a tutorial, which you should read!
Now that is done, the next thing to do is to think of a problem, and to try solving it using MPI! Focus on a problem that can be parallelized easily.
The Lawrence Livermore National Lab has a tutorial on MPI, featuring some exercises that you can try. For example, try going to the tutorial and skipped ahead to the Point to Point Communication Routines section, which guides you through one of the ways of calculating the digits of Pi.
Other tutorials include Rajeev Thakur's Introduction to MPI slide deck, which has been condenced by Brad Chamberlain of the University of Washington (for a course there); and the MPI for Dummies slide deck, created for the PPoPP conference.
If you are interested in spreading your MPI code out across multiple machines, you might want to set up a SLURM environment. First, get some shared storage by setting up an NFS server (in Ubuntu, or you could use Amazon EFS or Google Cloud Filestore). Then, check out the SLURM Quick Start. On RPM-based systems, the quick-start walks you through creating RPMs; on Debian-based systems, the packages are already available under the name "slurm-llnl".
Once you are done with all that, start looking at nearby universities and research institutions. They may have a job opening for you!
Copyright & License
The contents of this repository are Copyright © 2019 the Board of Trustees of the Leland Stanford Junior University.
The code and docs (with the exception of the
.gitignore file) are licensed
under the GNU General Public License, version 3.0.