## <center>Introduction to Message Passing Interface (MPI)</center>
### <center> Linh B. Ngo </center>
### <center> CPSC 3620 </center>

#### <center>Message Passing

- Processes communicate via messages
- Messages can be
    - Raw data to be used in actual calculations
    - Signals and acknowledgements for the receiving processes regarding the workflow

#### <center>History of MPI

**Early 80s:**
- Various message passing environments were developed
- Many similar fundamental concepts
- N-cube (Caltech), P4 (Argonne), PICL and PVM (Oak Ridge), LAM (Ohio SC)

**1992:**
- More than 80 researchers from different institutions in the US and Europe agreed to develop and implement a common standard for message passing
- First meeting colocated with Supercomputing 1992

**After finalization:**
- MPI becomes the *de facto* standard for distributed memory parallel programming
- Available on every popular operating system and architecture
- Interconnect manufacturers commonly provide MPI implementations optimized for their hardware
- MPI standard defines interfaces for C, C++, and Fortran (the C++ bindings were deprecated in MPI-2.2 and removed in MPI-3.0)
    - Language bindings available for many popular languages (quality varies)
    - MPI4PY: Bindings for Python

**1994: MPI-1**
- Communicators
    - Information about the runtime environments
    - Creation of customized topologies
- Point-to-point communication
    - Send and receive messages
    - Blocking and non-blocking variations
- Collectives
    - Broadcast and reduce
    - Gather and scatter

**1998: MPI-2**
- One-sided communication (non-blocking)
    - Get & Put (remote memory access)
- Dynamic process management
    - Spawn
- Parallel I/O
    - Multiple readers and writers for a single file
    - Requires file-system level support (LustreFS, PVFS)

**2012: MPI-3**
- Revised remote-memory access semantics
- Fault tolerance model (discussed, but the proposals were not adopted into the MPI-3.0 standard)
- Non-blocking collective communication
- Access to internal variables, states, and counters for performance evaluation purposes

#### <center> Set up MPI on Palmetto for C/C++

**Interactive mode:**

`qsub -I -l select=1:ncpus=8:mpiprocs=8:mem=10gb,walltime=01:00:00`

`module load gcc/5.3.0 openmpi/1.10.3`

*The module load command can be placed in a script that is then sourced from inside .bashrc to automate module loading. Calling module load directly from inside .bashrc is not recommended.*

- Create a file named **hello.c** inside directory **cpsc3620** with the following content
```
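#include <stdio.h>
#include <sys/utsname.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    /* Start the MPI runtime; this must precede any other MPI call. */
    MPI_Init(&argc, &argv);

    /* Ask the operating system for the name of the node this process runs on. */
    struct utsname uts;
    uname(&uts);
    printf("My process is on node %s.\n", uts.nodename);

    /* Shut down the MPI runtime before exiting. */
    MPI_Finalize();
    return 0;
}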
```

- Compile hello.c
```
mpicc hello.c -o hello
```
- Run the compiled program hello
```
mpirun -np 2 ./hello
```

#### <center> Set up MPI on Palmetto for Python (Interactive via Jupyter Notebook)

**Before launching JupyterHub**
- Make sure that you have the command ``module load gcc/5.3.0 openmpi/1.10.3`` in your .jhubrc file. If you are using JupyterHub to edit the file, the server will need to be stopped and started again. 

**Before launching a Jupyter notebook (only needs to be done once)**
- Install mpi4py by executing ``pip install --user mpi4py`` from a terminal. This needs to be done prior to launching a Jupyter notebook. 

**Before launching a Jupyter notebook (any time you wish to run interactive MPI inside Jupyter Notebook)**
- Inside a terminal (that must be kept open), execute the following command:
```
ipcluster start --n <Number of total possible cores> --profile=mpicluster
```

- In the first cell of your Jupyter notebook, type the following:
```
import ipyparallel
c = ipyparallel.Client(profile="mpicluster")
```

- Test the attached cluster by running the following in a cell:
```
print(c.ids)
```

- Any cell that contains MPI code must start with ``%%px``

In [5]:
import ipyparallel
c=ipyparallel.Client(profile="mpicluster")
print(c.ids)

[0, 1, 2, 3]


In [6]:
%%px
from mpi4py import MPI
import socket
print ("My process is on node %s" % (socket.gethostname()))

[stdout:0] My process is on node node1925
[stdout:1] My process is on node node1925
[stdout:2] My process is on node node1925
[stdout:3] My process is on node node1925


#### <center> The working of MPI in a nutshell

- All processes are launched at the beginning of the program execution
    - The number of processes is user-specified
    - Typically, this number is matched to the total number of cores available across the entire cluster
- All processes have their own memory space and have access to the same source code

**Basic parameters available to individual processes:**
```
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()
name = MPI.Get_processor_name()
```

- MPI defines **communicator** groups for point-to-point and collective communications
    - Unique IDs (**rank**) are defined for individual processes within a communicator group
    - Communications are performed based on these IDs
    - The default **global communicator** (COMM_WORLD) contains all processes
    - For $N$ processes, ranks go from $0$ to $N-1$

In [3]:
%%px
from mpi4py import MPI
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()
name = MPI.Get_processor_name()
print ("Hello world from process %s running on host %s out of %s processes" % 
       (rank, name, size))

[stdout:0] Hello world from process 0 running on host node1925 out of 4 processes
[stdout:1] Hello world from process 2 running on host node1925 out of 4 processes
[stdout:2] Hello world from process 1 running on host node1925 out of 4 processes
[stdout:3] Hello world from process 3 running on host node1925 out of 4 processes


- Ranks are used to enforce execution/exclusion of code segments within the original source code

In [43]:
%%px
from mpi4py import MPI
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()
name = MPI.Get_processor_name()
if (rank % 2 == 0):
    print ("Process %s is even" % (rank))
else:
    print ("Process %s is odd" % (rank))

[stdout:0] Process 0 is even
[stdout:1] Process 2 is even
[stdout:2] Process 1 is odd
[stdout:3] Process 3 is odd


- Ranks can be used as a means to calculate and distribute the workload (data) among the processes

In [10]:
%%px
from mpi4py import MPI
import random
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()
name = MPI.Get_processor_name()
A = [2,13,4,3,5,1,0,12,10,8,7,9,11,6,15,14]
#print ("Elements %s and %s are assigned to process %s" 
#       % (A[rank%15], A[1+rank%15], rank))
if (rank < 4):
    print ("Process %s has elements %s" % (rank, A[(4*rank):(4*rank+4)]))

[stdout:1] Process 2 has elements [10, 8, 7, 9]
[stdout:2] Process 0 has elements [2, 13, 4, 3]
[stdout:9] Process 1 has elements [5, 1, 0, 12]
[stdout:15] Process 3 has elements [11, 6, 15, 14]
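
The slicing used in the cell above assumes exactly four elements per process. A minimal sketch of a more general block distribution follows, written in plain Python so it runs without MPI. The helper name `block_range` is hypothetical, and the loop over `rank` stands in for the value each process would get from `comm.Get_rank()`.

```python
def block_range(rank, size, n):
    """Return the half-open index range [start, stop) owned by `rank`
    when n items are split as evenly as possible over `size` processes."""
    base, extra = divmod(n, size)
    # The first `extra` ranks each take one additional element.
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop


A = [2, 13, 4, 3, 5, 1, 0, 12, 10, 8, 7, 9, 11, 6, 15, 14]
size = 4  # stand-in for comm.Get_size()
for rank in range(size):  # stand-in for each process's comm.Get_rank()
    start, stop = block_range(rank, size, len(A))
    print("Process %s has elements %s" % (rank, A[start:stop]))
```

With 16 elements and 4 processes this reproduces the slices shown above; when the length is not divisible by the number of processes, the lower ranks receive one extra element each.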


- Individual processes rely on communication (message passing) to enforce workflow
    - Point-to-point Communication
    - Collective Communication

#### <center> Point-to-Point: Send and Receive

```
comm = MPI.COMM_WORLD
```
- Sender process:
```
comm.send(data, dest_rank)
```
- Receiver process:   
```
data = comm.recv(source_rank)
```

**Original MPI C Syntax: MPI_Send**
```
int MPI_Send(void *buf, 
	int count, 
	MPI_Datatype datatype, 
	int dest, 
	int tag, 
	MPI_Comm comm)
```

- MPI_Datatype may be MPI_BYTE, MPI_PACKED, MPI_CHAR, MPI_SHORT, MPI_INT, MPI_LONG, MPI_FLOAT, MPI_DOUBLE, MPI_LONG_DOUBLE, MPI_UNSIGNED_CHAR
- *dest* is the rank of the process the message is sent to
- *tag* is an integer that identifies the message. The programmer is responsible for managing tags


**Original MPI C Syntax: MPI_Recv**
```
int MPI_Recv(
	void *buf, 
	int count, 
	MPI_Datatype datatype, 
	int source, 
	int tag, 
	MPI_Comm comm,
	MPI_Status *status)
```

- MPI_Datatype may be MPI_BYTE, MPI_PACKED, MPI_CHAR, MPI_SHORT, MPI_INT, MPI_LONG, MPI_FLOAT, MPI_DOUBLE, MPI_LONG_DOUBLE, MPI_UNSIGNED_CHAR
- *source* is the rank of the process from which the message was sent.
- *tag* is an integer that identifies the message. MPI_Recv will only place data in the buffer if its tag matches the tag given to MPI_Send. The constant MPI_ANY_TAG may be used when the tag is unknown or not important.
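
The matching rule for *source* and *tag* can be illustrated with a small model in plain Python. This is only a sketch of the rule described above, not how an MPI library implements matching: `match_first`, the `pending` list, and the `ANY_TAG` constant are hypothetical names introduced for illustration.

```python
ANY_TAG = -1  # hypothetical stand-in for MPI_ANY_TAG


def match_first(pending, source, tag):
    """Remove and return the payload of the first pending message whose
    (source, tag) matches the request, or None if nothing matches."""
    for i, (msg_source, msg_tag, payload) in enumerate(pending):
        if msg_source == source and (tag == ANY_TAG or msg_tag == tag):
            del pending[i]
            return payload
    return None  # a real blocking receive would keep waiting here


# Messages that have arrived at one receiver: (source rank, tag, payload)
pending = [(0, 7, "signal"), (0, 1, 0.25), (2, 1, "other")]

print(match_first(pending, source=0, tag=1))        # 0.25
print(match_first(pending, source=0, tag=ANY_TAG))  # signal
print(match_first(pending, source=0, tag=1))        # None
```

The first request skips the message with tag 7 because the tags differ, while the wildcard request accepts it.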


In [4]:
%%px
from mpi4py import MPI
import random
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
if (rank == 0):
    send_pkg = random.random()
    print (send_pkg)
    comm.send(send_pkg, dest = 1, tag = 1)
if (rank == 1):
    recv_pkg = 0
    recv_pkg = comm.recv(source = 0, tag = 1)
    print (recv_pkg)

[stdout:0] 0.655381079134236
[stdout:2] 0.655381079134236


**Blocking risks**
- Blocking send: if the message is larger than the available network buffer, the send may not return until the receiver posts a matching receive
- Blocking receive: lost data (or a missing sender) leaves the receiver hanging indefinitely

**Data types**
- MPI4PY supports all of MPI's default data types
- MPI4PY uses *pickle* to facilitate sending and receiving of complex data

In [12]:
%%px
from mpi4py import MPI
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
if rank == 0:
    data = {'class': 'cpsc3620', 'semester': 'Fall 2016', 
            'enrollments': 40}
    comm.send(data, dest=1, tag=11)
elif rank == 1:
    data = comm.recv(source=0, tag=11)
    print(data)

[stdout:9] {'class': 'cpsc3620', 'enrollments': 40, 'semester': 'Fall 2016'}


In [13]:
%%px
from mpi4py import MPI
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
if rank == 0:
    data = [1,2,3,4]
    comm.send(data, dest=1, tag=11)
elif rank == 1:
    data = comm.recv(source=0, tag=11)
    print(data)

[stdout:9] [1, 2, 3, 4]
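
As a rough illustration of what *pickle* contributes here, the sketch below serializes the dictionary from the earlier cell into bytes and rebuilds it, using only the standard `pickle` module. It assumes nothing about MPI4PY's internals beyond the statement above that *pickle* is used. The variable names `wire_bytes` and `received` are made up for the example.

```python
import pickle

data = {'class': 'cpsc3620', 'semester': 'Fall 2016', 'enrollments': 40}

wire_bytes = pickle.dumps(data)      # sender side: Python object -> bytes
received = pickle.loads(wire_bytes)  # receiver side: bytes -> new Python object

print(type(wire_bytes).__name__)  # bytes
print(received == data)           # True: same content
print(received is data)           # False: an independent copy
```

The receiver ends up with an equal but separate object, which is consistent with each process having its own memory space.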


#### <center> Collective Communication

- Must involve ALL processes within the scope of a communicator
- Unexpected behavior, including program failure, if even one process does not participate
- Types of collective communications:
    - Synchronization: barrier
    - Data movement: broadcast, scatter/gather
    - Collective computation (aggregate data to perform computation): Reduce

<center> <img src="pictures/mpi-collective.png" width="700"/> 
<sub> *https://computing.llnl.gov/tutorials/mpi/* </sub>
</center>
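
The data flow of these collectives can be modeled without MPI as a rough mental aid. In the sketch below, each function takes a list indexed by rank holding what every process passes in, and returns a list indexed by rank holding what every process gets back. The functions `bcast`, `scatter`, `gather`, and `reduce` are plain Python written for illustration; they are not the MPI4PY calls.

```python
from functools import reduce as fold
import operator


def bcast(values, root):
    """Every rank receives a copy of the root's value."""
    return [values[root] for _ in values]


def scatter(values, root):
    """Rank i receives element i of the sequence held by the root."""
    chunks = values[root]
    assert len(chunks) == len(values), "one element per process"
    return list(chunks)


def gather(values, root):
    """The root receives all values ordered by rank; other ranks get None."""
    return [list(values) if rank == root else None
            for rank in range(len(values))]


def reduce(values, op, root):
    """The root receives all values combined with op; other ranks get None."""
    combined = fold(op, values)
    return [combined if rank == root else None
            for rank in range(len(values))]


print(bcast([2, -1, -1, -1], root=0))                  # [2, 2, 2, 2]
print(scatter([[1, 4, 9, 16], None, None, None], 0))   # [1, 4, 9, 16]
print(gather([1, 4, 9, 16], root=0))                   # [[1, 4, 9, 16], None, None, None]
print(reduce([0, 1, 2, 3], operator.add, root=0))      # [6, None, None, None]
```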

**broadcast:**
```
comm = MPI.COMM_WORLD
```
- All processes:
```
<buffer at receiving process> = comm.bcast(<original data>, root=<root process>)
```
- *root* process is the one that has the original data initially. 

```
int MPI_Bcast(
	void *buf, 
	int count, 
	MPI_Datatype datatype, 
	int root, 
	MPI_Comm comm);
```
- Don’t need to specify a TAG or DESTINATION
- Must specify the SENDER (root)
- Blocking call for all processes

In [10]:
%%px
from mpi4py import MPI
comm = MPI.COMM_WORLD; rank = comm.Get_rank();
if rank == 0:
    data = 2
elif (rank != 1):
    data = -1
else:
    data = -2
print ("%s: %s" % (rank, data))
if (rank != 1):
    data = comm.bcast(data, root=0)
    print ("%s: %s" % (rank, data))
else:
    data1 = comm.bcast(data, root=0)
print ("%s: %s" % (rank, data))
if (rank == 1):
    print("%s: %s" % (rank, data1))

[stdout:0] 
0: 2
0: 2
0: 2
[stdout:1] 
2: -1
2: 2
2: 2
[stdout:2] 
1: -2
1: -2
1: 2
[stdout:3] 
3: -1
3: 2
3: 2


In [12]:
%%px
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    data = {'key1' : [7, 2.72, 2+3j],
            'key2' : ( 'abc', 'xyz')}
else:
    data = None
data = comm.bcast(data, root=0)
print ("process %s" % (rank))
print (rank,data)

[stdout:0] 
process 0
0 {'key2': ('abc', 'xyz'), 'key1': [7, 2.72, (2+3j)]}
[stdout:1] 
process 2
2 {'key1': [7, 2.72, (2+3j)], 'key2': ('abc', 'xyz')}
[stdout:2] 
process 1
1 {'key1': [7, 2.72, (2+3j)], 'key2': ('abc', 'xyz')}
[stdout:3] 
process 3
3 {'key1': [7, 2.72, (2+3j)], 'key2': ('abc', 'xyz')}


**How to save broadcast data into different variables on different processes?**

In [50]:
%%px
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    data = 2
else:
    data = -1
print ("%s: %s" % (rank, data))
data = comm.bcast(data, root=0)
print ("%s: %s" % (rank, data))

[stdout:0] 
0: 2
0: 2
[stdout:1] 
2: -1
2: 2
[stdout:2] 
1: -1
1: 2
[stdout:3] 
3: -1
3: 2


**scatter:**
```
comm = MPI.COMM_WORLD
```
- All processes:
```
<buffer at receiving process> = comm.scatter(<original array>, root=<root process>)
```
- *root* process is the one that has the original data array initially. 
- Data are divided according to rank

**Original MPI C Syntax: MPI_Scatter**
```
int MPI_Scatter(
	void *sendbuf, 
	int sendcount, 
	MPI_Datatype sendtype, 
	void *recvbuf,
	int recvcnt,
	MPI_Datatype recvtype,
	int root, 
	MPI_Comm comm);
```

In [15]:
%%px
from mpi4py import MPI
comm = MPI.COMM_WORLD
size = comm.Get_size();rank = comm.Get_rank(); print(rank)

if rank == 0:
    data = [(i+1)**2 for i in range(size)]
    print (data)
else:
    data = None
partial_data = comm.scatter(data, root=0)
print (data)
print (partial_data)

[stdout:0] 
0
[1, 4, 9, 16]
[1, 4, 9, 16]
1
[stdout:1] 
2
None
9
[stdout:2] 
1
None
4
[stdout:3] 
3
None
16


In [7]:
%%px
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    data = [{'key1' : [7, 2.72, 2+3j]},
            {'key2' : ( 'abc', 'xyz')},
            {'key3' : ( 'abc', 'xyz')},
            {'key4' : ( 'cde', 'xyz')}]
else:
    data = None
data = comm.scatter(data, root=0)
print ("%s: %s" % (rank, data))

[stdout:0] 2: {'key3': ('abc', 'xyz')}
[stdout:1] 0: {'key1': [7, 2.72, (2+3j)]}
[stdout:2] 1: {'key2': ('abc', 'xyz')}
[stdout:3] 3: {'key4': ('cde', 'xyz')}


**gather:**
```
comm = MPI.COMM_WORLD
```
- All processes:
```
<gathered list at root process> = comm.gather(<local data>, root=<root process>)
```
- *root* process is the one that receives the gathered data; every other process gets `None` back from the call
- Data are ordered at *root* by the rank of the sending process

**Original MPI C Syntax: MPI_Gather**
```
int MPI_Gather(
	void *sendbuff, 
	int sendcount, 
	MPI_Datatype sendtype, 
	void *recvbuff,
	int recvcnt,
	MPI_Datatype recvtype,
	int root, 
	MPI_Comm comm);
```

In [9]:
%%px
from mpi4py import MPI

comm = MPI.COMM_WORLD
size = comm.Get_size()
rank = comm.Get_rank()

data = (rank+1)**2
print ("%s: %s" % (rank, data))
all_data = comm.gather(data, root=0)
if rank == 0:
    print ("%s: %s" % (rank, data))
    print (all_data)
else:
    print ("%s: %s" % (rank, data))

[stdout:0] 
2: 9
2: 9
[stdout:1] 
0: 1
0: 1
[1, 4, 9, 16]
[stdout:2] 
1: 4
1: 4
[stdout:3] 
3: 16
3: 16


**What happens to variable *data* at the non-root processes?**

In [54]:
%%px
from mpi4py import MPI

comm = MPI.COMM_WORLD
size = comm.Get_size()
rank = comm.Get_rank()

data = (rank+1)**2
print ("%s: %s" % (rank, data))
data = comm.gather(data, root=0)
if rank == 0:
    print ("%s: %s" % (rank, data))
else:
    print ("%s: %s" % (rank, data))

[stdout:0] 
0: 1
0: [1, 4, 9, 16]
[stdout:1] 
2: 9
2: None
[stdout:2] 
1: 4
1: None
[stdout:3] 
3: 16
3: None


**reduce**
```
comm = MPI.COMM_WORLD
```
- All processes:
```
<final result at root process> = comm.reduce(<data to be reduced>, op=MPI.<operation>, root=<root process>)
```
- *root* process is the one that receives the final reduced data.

**Original MPI C Syntax: MPI_Reduce**
```
int MPI_Reduce(
	void *sendbuf, 
	void *recvbuff,
	int count, 
	MPI_Datatype datatype, 
	MPI_Op op,
	int root, 
	MPI_Comm comm);
```
- MPI_Op may be MPI_MIN, MPI_MAX, MPI_SUM, MPI_PROD (twelve total)
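- Programmers may define additional operations (via MPI_Op_create); they must be associative, and whether they are commutative is declared when the operation is created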
- If count > 1, then operation is performed element-wise
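
The element-wise behavior for count > 1 can be sketched in plain Python, with one list per rank standing in for each process's send buffer. The helper `elementwise_reduce` is a hypothetical name used only for this illustration, not an MPI or MPI4PY function.

```python
import operator


def elementwise_reduce(buffers, op):
    """Combine equal-length buffers (one per rank) position by position."""
    result = list(buffers[0])
    for buf in buffers[1:]:
        result = [op(a, b) for a, b in zip(result, buf)]
    return result


# count = 3 on each of 4 ranks: rank r contributes [r, 10*r, r*r]
buffers = [[r, 10 * r, r * r] for r in range(4)]

print(elementwise_reduce(buffers, operator.add))  # [6, 60, 14]
print(elementwise_reduce(buffers, max))           # [3, 30, 9]
```

Position 0 of the result depends only on position 0 of every buffer, and likewise for the other positions.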


In [10]:
%%px
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# named total to avoid shadowing Python's built-in sum()
total = comm.reduce(rank, op=MPI.SUM, root=0)

if rank == 0:
    print ("The reduction is %s" % (total))

[stdout:1] The reduction is 6
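
As a quick check on this output: with $N$ processes the ranks run from $0$ to $N-1$, so a sum reduction over the ranks gives

$$\sum_{r=0}^{N-1} r = \frac{N(N-1)}{2}, \qquad N = 4 \;\Rightarrow\; 0 + 1 + 2 + 3 = 6 .$$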
