### Absolute Beginner's Tutorial for MPI and DeepSpeed

Today we'll walk through an absolute beginner's use of MPI and DeepSpeed.  For this tutorial we'll use OpenMPI, an open-source implementation of the Message Passing Interface (MPI) standard that is widely used at supercomputing centers for high performance computing.  It suits problems that need parallelism as well as distributed computing, where processes require careful coordination.  OpenMPI on Domino Data Lab is easy to use.

Here are a few tips on hardware choices:

1. Make sure to choose the correct hardware.  For small to medium-sized data (under a gigabyte), a small or medium hardware tier is often sufficient.
2. For large datasets (larger than a GB), it is often useful to select a high-memory hardware tier, either CPU or GPU.
3. If using a complex calculation with very large data (larger than a few GB) then hardware 

For DeepSpeed use cases, the models and/or data are typically large, so a high-memory GPU tier is most helpful.

### What is OpenMPI?

OpenMPI is an open-source message passing interface library.  It's a popular choice for supercomputing centers.  The Open MPI Project is an MPI implementation that is created and maintained by a group of academic, research and industry partners.  The High Performance Computing community contributes code and technologies to the OpenMPI library.  Some of the features of OpenMPI include:

- Conforms to MPI-3.1 standards
- Thread safety and concurrency
- Support of all networks
- Supports most job schedulers
- High Performance on all platforms
- Open source license (BSD license)

### How MPI is used in High Performance Computing

High Performance Computing (HPC) simply means performing complex calculations at high speed.  It runs both parallel and distributed problems.  The best-known example of HPC is a supercomputer.  Supercomputers are made up of many CPUs or GPUs working together to perform both parallel and distributed tasks.  MPI aids in this by coordinating messaging among CPUs, GPUs and nodes.

### Tutorial and Objectives

Upon completion of part one of this tutorial the learner will be able to:
    
- Describe OpenMPI and its use cases
- Identify appropriate hardware
- Perform a distributed calculation of pi using MPI workers
- Understand how to use Python files to run programs on an MPI-managed cluster

#### A 'Hello World' Example and Calculating Pi

In this example we will use a naïve method (without distributed compute or parallelism) to calculate pi. We will compare this to using OpenMPI to calculate pi from a Python file, running the code on a set of workers.  First, though, let's confirm the cluster works with a 'hello world' run.  Nota bene: make sure to sync your MPI cluster before running new code.

In order to run a program or Python file using MPI, the code in the file needs to be set up to run on the cluster.  In this exercise we will look at the number of processes running and the rank of each.  This ``hello world`` example is run from a separate Python file.  One can run many Python programs in the same manner.

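```
from __future__ import print_function
from mpi4py import MPI

# COMM_WORLD is the default communicator containing every launched process
comm = MPI.COMM_WORLD
print("Hello! I'm rank %d from %d running in total..." % (comm.rank, comm.size))
# Block until every rank has reached this point
comm.Barrier()
```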

Notice in the code above we set up the communicator.  ``comm.size`` is the 'world' size, the number of processes launched across the cluster's CPUs or GPUs, and ``comm.rank`` is this process's index within it.  The ``comm.Barrier()`` call blocks until every process in the communicator has reached it.  The program can then be run at the command line using ``mpirun`` along with the appropriate options.  Try this for yourself, changing the number of processes.  You can also add to or change the 'hello world' code to run a different Python function.
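
As one way to try a different function, here is a minimal sketch of a variant script. The file name ``hello_hosts.py`` and the helper ``greeting`` are hypothetical, not part of the original tutorial. Each rank builds its message, and ``comm.gather`` collects the messages on rank 0 so they print in rank order rather than in whatever order the processes happen to write. It assumes mpi4py is installed on every worker and falls back to a single process otherwise.

```python
# hello_hosts.py (hypothetical file name)
import socket


def greeting(rank, size, host):
    """Build the message one rank reports."""
    return "Hello! I'm rank %d of %d on host %s" % (rank, size, host)


def main():
    host = socket.gethostname()
    try:
        from mpi4py import MPI
    except ImportError:
        # Assumption: without mpi4py, behave like a one-process job
        print(greeting(0, 1, host))
        return
    comm = MPI.COMM_WORLD
    # gather returns a list ordered by rank on the root, and None elsewhere
    messages = comm.gather(greeting(comm.Get_rank(), comm.Get_size(), host), root=0)
    if comm.Get_rank() == 0:
        for message in messages:
            print(message)


if __name__ == "__main__":
    main()
```

It would be launched the same way as the original script, for example ``mpirun -np 3 python hello_hosts.py``.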

In [2]:
import mpi4py

In [7]:
mpi4py.__version__

'3.1.3'

In [2]:
!mpirun -np 3  python hello_world.py

Hello! I'm rank 0 from 3 running in total...
Hello! I'm rank 1 from 3 running in total...
Hello! I'm rank 2 from 3 running in total...


In [3]:
!mpiexec --hostfile /domino/mpi/hosts -np 3 python hello_world.py 

Hello! I'm rank 1 from 3 running in total...
Hello! I'm rank 0 from 3 running in total...
Hello! I'm rank 2 from 3 running in total...


Calculating pi with 3 MPI processes in parallel (``-np 3``).

In [5]:
%%time

!mpirun -np 3 --hostfile /domino/mpi/hosts --bind-to core --map-by slot python calculate_pi.py

pi ~= 3.14146568
total time in seconds  77.99676060676575
pi ~= 3.14164336
total time in seconds  78.48352980613708
pi ~= 3.14160336
total time in seconds  79.8351399898529
CPU times: user 853 ms, sys: 226 ms, total: 1.08 s
Wall time: 1min 22s
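
``calculate_pi.py`` is not shown in this tutorial. In the output above every rank prints its own estimate with its own timing, which suggests the ranks may be working independently rather than splitting one job. The following is a minimal sketch of one way to divide the samples across ranks and combine the counts with ``comm.reduce``. It is not the author's script, and the names ``count_inside``, ``share_of`` and ``distributed_pi`` are hypothetical. It falls back to a single process if mpi4py is not importable.

```python
import random


def count_inside(num_samples, seed):
    """Count random points in [-1, 1] x [-1, 1] that fall inside the unit circle."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(num_samples):
        x, y = rng.uniform(-1, 1), rng.uniform(-1, 1)
        if x * x + y * y <= 1:
            inside += 1
    return inside


def share_of(total, rank, size):
    """Number of samples given to `rank` when `total` is split across `size` ranks."""
    base, extra = divmod(total, size)
    return base + (1 if rank < extra else 0)


def distributed_pi(total_samples):
    """Return the pi estimate on rank 0 and None on every other rank."""
    try:
        from mpi4py import MPI
        comm = MPI.COMM_WORLD
        rank, size = comm.Get_rank(), comm.Get_size()
    except ImportError:
        # Assumption: without mpi4py, run as a single process
        comm, rank, size = None, 0, 1
    # Seeding by rank gives each process a different, reproducible stream
    local = count_inside(share_of(total_samples, rank, size), seed=rank)
    if comm is None:
        total_inside = local
    else:
        total_inside = comm.reduce(local, op=MPI.SUM, root=0)
    if rank == 0:
        return 4 * total_inside / total_samples
    return None
```

With this layout each rank handles roughly ``total_samples / size`` points, so the per-rank time should shrink as processes are added, apart from start-up and communication overhead.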


In [4]:
### how to reach the mpi help file

!mpirun --help

mpirun (Open MPI) 4.1.2a1

Usage: mpirun [OPTION]...  [PROGRAM]...
Start the given program using Open RTE

-c|-np|--np <arg0>       Number of processes to run
-h|--help <arg0>         This help message
   -n|--n <arg0>         Number of processes to run
-q|--quiet               Suppress helpful messages
-v|--verbose             Be verbose
-V|--version             Print version and exit

For additional mpirun arguments, run 'mpirun --help <category>'

The following categories exist: general (Defaults to this option), debug,
    output, input, mapping, ranking, binding, devel (arguments useful to OMPI
    Developers), compatibility (arguments supported for backwards compatibility),
    launch (arguments to modify launch options), and dvm (Distributed Virtual
    Machine arguments).

Report bugs to http://www.open-mpi.org/community/help/


#### Calculating Pi with and without Distributed Compute

Below let's look at an example of how to calculate pi without MPI (naïve method) and with MPI (distributed method).

In [6]:
### create function for calculation
import math
import random


def sample(num_samples):
    # Count random points in the square [-1, 1] x [-1, 1] that land inside the unit circle
    num_inside = 0
    for _ in range(num_samples):
        x, y = random.uniform(-1, 1), random.uniform(-1, 1)
        if math.hypot(x, y) <= 1:
            num_inside += 1
    return num_inside

def approximate_pi(num_samples):
    # The fraction of points inside the circle approximates pi/4;
    # timing is left to the %%time cell magic
    num_inside = sample(num_samples)
    print("pi ~= {}".format((4*num_inside)/num_samples))

In [7]:
%%time

## Perform calculation on 100 million data points

approximate_pi(10**8)

pi ~= 3.1414826
CPU times: user 1min 3s, sys: 0 ns, total: 1min 3s
Wall time: 1min 3s
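
Why ``4*num_inside/num_samples`` approximates pi: the points are drawn uniformly from the square [-1, 1] x [-1, 1], which has area 4, and the unit circle inside it has area pi. A short derivation follows. The binomial standard-error estimate is added as a sanity check and is not from the original tutorial.

```latex
\Pr(\text{inside}) = \frac{\pi}{4} = p,
\qquad
\hat{\pi} = \frac{4\,N_{\text{in}}}{N},
\qquad
\operatorname{SE}(\hat{\pi}) = 4\sqrt{\frac{p(1-p)}{N}} \approx \frac{1.64}{\sqrt{N}}
```

For N = 10^8 this gives a standard error of about 1.6e-4, in line with the estimate 3.1414826 printed above, which is off from pi by roughly 1.1e-4.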


Now let's compare that calculation to calculating pi using OpenMPI.  For this function we will use mpi4py, a Python wrapper library around MPI.  We will keep our calculation code in a separate file so we can run the function over MPI.

In [4]:
## prepare for distributed calculation using MPI

maxprocs = input('How many GPU or CPU workers do you have?')

How many GPU or CPU workers do you have? 3


In [5]:
%%time

from mpi4py import MPI
import numpy
import sys



comm = MPI.COMM_SELF.Spawn(sys.executable,
                           args=['cpi.py'],
                           maxprocs=int(maxprocs))

N = numpy.array(10**8, 'i')
comm.Bcast([N, MPI.INT], root=MPI.ROOT)
PI = numpy.array(0.0, 'd')
comm.Reduce(None, [PI, MPI.DOUBLE],
            op=MPI.SUM, root=MPI.ROOT)
print(PI)

comm.Disconnect()

3.1415926535901537
CPU times: user 17 s, sys: 7.33 s, total: 24.4 s
Wall time: 26.3 s
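
``cpi.py`` itself is not reproduced here. The parent cell above matches the compute-pi example in the mpi4py documentation's dynamic process management tutorial, so a worker along the following lines would pair with it. Treat this as a sketch modeled on that example, not as the author's file. The helper ``partial_pi`` is a hypothetical refactoring added so the arithmetic can be checked without MPI. Each worker sums every ``size``-th slice of a midpoint-rule integral of 4/(1+x^2) over [0, 1], and the parent's ``Reduce`` adds the pieces.

```python
# Sketch of a spawned worker (an assumption about what a file like cpi.py contains)


def partial_pi(rank, size, n):
    """This rank's share of the midpoint-rule integral of 4/(1+x^2) on [0, 1]."""
    h = 1.0 / n
    total = 0.0
    for i in range(rank, n, size):  # ranks interleave: rank, rank+size, ...
        x = h * (i + 0.5)
        total += 4.0 / (1.0 + x * x)
    return total * h


def worker_main():
    from mpi4py import MPI
    import numpy

    comm = MPI.Comm.Get_parent()  # intercommunicator back to the spawning process
    if comm == MPI.COMM_NULL:
        return  # not started via Spawn, so there is no parent to talk to
    n = numpy.array(0, dtype='i')
    comm.Bcast([n, MPI.INT], root=0)  # receive N from the parent (remote rank 0)
    part = numpy.array(partial_pi(comm.Get_rank(), comm.Get_size(), int(n)), dtype='d')
    comm.Reduce([part, MPI.DOUBLE], None, op=MPI.SUM, root=0)  # sum lands on the parent
    comm.Disconnect()


if __name__ == "__main__":
    try:
        worker_main()
    except ImportError as exc:
        print("worker cannot start: %s" % exc)
```

Unlike the random-sampling approach, this integral is deterministic, which is why the spawned run prints a value that agrees with pi to many more digits than the Monte Carlo estimates.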


In [10]:
%%time

!mpirun -np 3 --hostfile /domino/mpi/hosts python cpi2.py

CPU times: user 9.36 ms, sys: 129 µs, total: 9.48 ms
Wall time: 671 ms


In [None]:
%%time

comm = MPI.COMM_SELF.Spawn(sys.executable,
                           args=['cpi2.py'],
                           maxprocs=int(maxprocs))

N = numpy.array(10**8, 'i')
comm.Bcast([N, MPI.INT], root=MPI.ROOT)
PI = numpy.array(0.0, 'd')
comm.Reduce(None, [PI, MPI.DOUBLE],
            op=MPI.SUM, root=MPI.ROOT)
print(PI)

comm.Disconnect()

Traceback (most recent call last):
  File "cpi2.py", line 4, in <module>
    import cupy as cp
ModuleNotFoundError: No module named 'cupy'
Traceback (most recent call last):
  File "cpi2.py", line 4, in <module>
    import cupy as cp
ModuleNotFoundError: No module named 'cupy'
Traceback (most recent call last):
  File "cpi2.py", line 4, in <module>
    import cupy as cp
ModuleNotFoundError: No module named 'cupy'
--------------------------------------------------------------------------
Child job 3 terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
(null) detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[56347,3],1]
  Exit code:    1
------------------------------------------
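
The tracebacks above show every spawned worker failing on ``import cupy``, so ``cpi2.py`` (not shown here) apparently depends on CuPy, which is not installed in the workers' environment. A small pre-flight check along these lines can catch that before spawning. It is a sketch that assumes the workers use the same Python environment as the notebook, which may not hold on a multi-node cluster. The helper name ``missing_modules`` and the ``required`` list are hypothetical.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


required = ["mpi4py", "numpy", "cupy"]  # assumed dependencies of the worker script
missing = missing_modules(required)
if missing:
    print("Install these before spawning workers:", ", ".join(missing))
else:
    print("All worker dependencies are importable here.")
```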