In [None]:
%%sh

# reset all programs
rm -rf debug*

# MPI Collectives

Collective Communications with **Message Passing Interface** (MPI)

Communications involving a **group** or **groups** of processes are called **collectives**.

<div class="alert alert-block alert-warning"><code>MPI 1.0-2.0</code> collective calls are blocking. <code>MPI-3</code> introduced <b>non-blocking</b> collectives. <code>MPI-4</code> introduced <b>persistent</b> collectives.</div>

They have the following characteristics:
* **Every** process in the communicator must call the collective function;
* **No tag** argument exists: calls are matched by the order in which they are issued on the communicator.

## More efficient...

<div class="alert alert-block alert-success"> Designed to replace loops of point-to-point calls to be <b>more efficient</b>. </div>

Is this guaranteed by the standard? Not strictly: the standard only notes that vendors *may* optimize collectives.

> ...While vendors may write optimized collective routines matched to their architectures, a complete library of the collective communication routines can be written entirely using the MPI point-to-point communication functions and a few auxiliary functions...
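
To make the quoted statement concrete, the sketch below plans a broadcast as rounds of point-to-point transfers arranged in a binomial tree. It is a plain C++ model with no MPI calls, and the names `Round` and `binomial_bcast_schedule` are hypothetical. It shows one possible scheme, not what any particular MPI library does.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// One round = list of (sender, receiver) pairs that can proceed concurrently.
using Round = std::vector<std::pair<int, int>>;

// Hypothetical helper: plan a binomial-tree broadcast over `size` ranks.
// In each round, every rank that already holds the data (relative rank < step)
// sends it to the rank `step` positions further on, if that rank exists.
std::vector<Round> binomial_bcast_schedule(int size, int root)
{
    assert(size > 0 && root >= 0 && root < size);
    std::vector<Round> rounds;
    for (int step = 1; step < size; step *= 2)
    {
        Round round;
        for (int rel = 0; rel < step && rel + step < size; rel++)
        {
            // relative ranks are shifted so that the root is relative rank 0
            const int sender = (rel + root) % size;
            const int receiver = (rel + step + root) % size;
            round.emplace_back(sender, receiver);
        }
        rounds.push_back(round);
    }
    return rounds;
}
```

Each pair would correspond to one send on the sender matched by one receive on the receiver. In this scheme the number of rounds grows with the logarithm of the number of processes, whereas a loop on the root needs one send per receiving process.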

## Intra or Inter Communicator

Two types of **communicators**:

* **intra-communicator**: identifier for a single group of MPI processes linked with a context
* **inter-communicator**: identifies **two** distinct groups of MPI processes linked with a context


## Collectives Categories

MPI intra-communicator collective operations (and most inter-communicator ones) fit into **4** categories:

* **All-To-All**: All MPI processes contribute to the result. All MPI processes receive the result.
* **All-To-One**: All MPI processes contribute to the result. One MPI process (**root**) receives the result.
* **One-To-All**: One MPI process contributes to the result. All MPI processes receive the result.
* **Other**: Collective operations that do not fit into one of the above categories.


## Collectives in an image...

<img src="./Images/collectives.png" width=30% style="margin-left:auto; margin-right:auto">

## Barriers

* _Intra-Comm_: blocks each caller until **all** processes of the group have entered the call, so that they are **synchronized**.
* _Inter-Comm_: the call may return for the processes of group **A** once all members of group **B** have entered the call (and vice versa).

> ...An MPI process may return from the call before all MPI processes in its own group have entered the call...

`MPI_Barrier`: see https://www.open-mpi.org/doc/v4.1/man3/MPI_Barrier.3.php

<img src="./Images/barrier.png" width=50% style="margin-left:auto; margin-right:auto">

<div class="alert alert-block alert-danger"> <b>Severe</b> performance impact if used too often. </div>

## Broadcast

* _Intra-Comm_: broadcasts a message from the **root** process to **all other processes** of the group.
* _Inter-Comm_: group **A** defines the **root**. All MPI processes in group **B** pass the rank of the root in group **A** in the argument `root`. The root itself passes the value `MPI_ROOT` in `root`. All other MPI processes in group **A** pass the value `MPI_PROC_NULL` in `root`. Data is broadcast from the **root** to all MPI processes in group **B**.

`MPI_Bcast`: https://www.open-mpi.org/doc/v4.1/man3/MPI_Bcast.3.php

<img src="./Images/bcast.png" width=40% style="margin-left:auto; margin-right:auto">

In [None]:
%%writefile main_bcast.cpp

#include <iostream>
#include <mpi.h>

int main(int argc, char **argv) 
{
    MPI_Init(&argc, &argv);
    
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    
    // only the root holds meaningful values before the broadcast
    double a[2] = { 0.0, 0.0 };
    if ( rank == 0 ) 
    {
        a[0] = 2.1; 
        a[1] = 4.3;
    }
    
    // send the information to all the other processes
    MPI_Bcast(a, 2, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    
    std::cout<< "Process "<< rank<< " ";
    std::cout<< "a "<< a[0]<< ", "<< a[1]<< std::endl; 
    
    MPI_Finalize();
    return 0;
}

In [None]:
%%sh

# compile program
mkdir -p ./debug_bcast
cd debug_bcast
cmake -DSOURCES="main_bcast.cpp" ..
make

In [None]:
%%sh

# run program
cd debug_bcast
mpirun -np 4 3_Collectives

## Gather and Scatter (1)

* _Intra-Comm_: the **root** process collects data elements from all the processes and stores them in rank order (**Gather**), or splits its buffer into equal chunks and sends them to the processes in rank order (**Scatter**).
* _Inter-Comm_: group **A** defines the **root**. All MPI processes in group **B** pass the same value in the argument `root`, which is the rank of the root in group **A**. The root passes the value `MPI_ROOT` in `root`. All other MPI processes in group **A** pass the value `MPI_PROC_NULL` in `root`. Data is gathered from all MPI processes in group **B** to the root (and vice versa for scatter).

## Gather and Scatter (2)
  
* `MPI_Gather`: https://www.open-mpi.org/doc/v4.1/man3/MPI_Gather.3.php
* `MPI_Scatter`: https://www.open-mpi.org/doc/v4.1/man3/MPI_Scatter.3.php

<img src="./Images/gather.png" width=48% style="float: left;">  
<img src="./Images/scatter.png" width=48% style="float: right;">  

In [None]:
%%writefile main_gather_scatter.cpp

#include <iostream>
#include <mpi.h>

int main(int argc, char **argv) 
{
    MPI_Init(&argc, &argv);
    
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    
    // the full array lives on the root; every process works on a 2-element chunk
    int a[8] = { 0 };
    int chunk[2] = { 0, 0 };
    if (rank == 0) 
    {
        for (int i = 0; i < 8; i++)
            a[i] = i + 1; 
    }
    
    // distribute 2 elements to each process (assumes exactly 4 processes);
    // send and receive buffers must not overlap, hence the separate chunk
    MPI_Scatter(a, 2, MPI_INT, chunk, 2, MPI_INT, 0, MPI_COMM_WORLD);
    
    std::cout<< "After scatter, process "<< rank<< " ";
    std::cout<< "chunk: "<< chunk[0]<< ", "<< chunk[1]<< std::endl;
        
    chunk[0] *= 2;
    chunk[1] *= 2;
    
    // collect the chunks back on the root, in rank order
    MPI_Gather(chunk, 2, MPI_INT, a, 2, MPI_INT, 0, MPI_COMM_WORLD);
    
    // separate the two groups of prints (output order is still not guaranteed)
    MPI_Barrier(MPI_COMM_WORLD);
    
    // only the root holds the gathered array; the other processes still print zeros
    std::cout<< "After gather, process "<< rank<< " ";
    std::cout<< "a: ";
    for (int i = 0; i < 8; i++)
        std::cout<< (i == 0 ? "" : ", ")<< a[i];
    std::cout<< std::endl;
    
    MPI_Finalize();
    return 0;
}

In [None]:
%%sh

# compile program
mkdir -p ./debug_gather_scatter
cd debug_gather_scatter
cmake -DSOURCES="main_gather_scatter.cpp" ..
make

In [None]:
%%sh

# run program
cd debug_gather_scatter
mpirun -np 4 3_Collectives

## GatherV and ScatterV

More **general** gather and scatter calls in which each process can send or receive a **different number** of elements, specified through per-rank counts and displacements.

* `MPI_Gatherv`: https://www.open-mpi.org/doc/v4.1/man3/MPI_Gatherv.3.php
* `MPI_Scatterv`: https://www.open-mpi.org/doc/v4.1/man3/MPI_Scatterv.3.php

<img src="./Images/scatterv.png" width=50% style="margin-left:auto; margin-right:auto"> 
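
The per-rank counts and displacements are usually computed before the call. The following is a minimal sketch, assuming the data is split into contiguous blocks whose sizes differ by at most one element. The names `BlockLayout` and `block_layout` are hypothetical and the code contains no MPI calls.

```cpp
#include <cassert>
#include <vector>

// Hypothetical container for the two integer arrays the "v" variants expect.
struct BlockLayout
{
    std::vector<int> counts; // number of elements owned by each rank
    std::vector<int> displs; // offset of each rank's block in the root buffer
};

// Split n elements over `size` ranks as evenly as possible.
BlockLayout block_layout(int n, int size)
{
    assert(n >= 0 && size > 0);
    BlockLayout layout;
    layout.counts.resize(size);
    layout.displs.resize(size);
    int offset = 0;
    for (int rank = 0; rank < size; rank++)
    {
        // the first n % size ranks take one extra element
        layout.counts[rank] = n / size + (rank < n % size ? 1 : 0);
        layout.displs[rank] = offset;
        offset += layout.counts[rank];
    }
    return layout;
}
```

For 10 elements on 4 processes this gives counts `3, 3, 2, 2` and displacements `0, 3, 6, 8`. On the root, `layout.counts.data()` and `layout.displs.data()` can then be passed as the `sendcounts` and `displs` arguments of `MPI_Scatterv`, while each process passes `layout.counts[rank]` as its receive count.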

## Other collectives

* `MPI_Allgather`: https://www.open-mpi.org/doc/v4.1/man3/MPI_Allgather.3.php
* `MPI_Alltoall`: https://www.open-mpi.org/doc/v4.1/man3/MPI_Alltoall.3.php
* `MPI_Allgatherv`: https://www.open-mpi.org/doc/v4.1/man3/MPI_Allgatherv.3.php
* `MPI_Alltoallv`: https://www.open-mpi.org/doc/v4.1/man3/MPI_Alltoallv.3.php
* ...

<div class="alert alert-block alert-danger"><b>NOTE</b>: expensive calls - use only when needed.</div>

### `MPI_Allgather` - Intra-Comm

<img src="./Images/allgather.png" width=48% style="margin-left:auto; margin-right:auto">

### `MPI_Allgather` - Inter-Comm

<img src="./Images/all_gather_inter.png" width=48% style="margin-left:auto; margin-right:auto">  


### One-Way Inter-Communication

> ...When a complete exchange is executed in the inter-communicator case, then the number of data items sent from MPI processes in group **A** to MPI processes in group **B** need **NOT** equal the number of items sent in the reverse direction.
In particular, one can have unidirectional communication by specifying `sendcount = 0` in the reverse direction...

## Reductions

<img src="./Images/reduce.png" width=23% style="float: right;">  

A **reduction** combines values from all the processes of a group into a **single result** with some operation (e.g. `sum`, `min`, `max`).

* collect data from different processes;
* store the result on a single process or distribute it to all processes.

<div class="alert alert-block alert-success"> Designed to <b>avoid race-conditions</b>.</div>

* `MPI_Reduce`: https://www.open-mpi.org/doc/v4.1/man3/MPI_Reduce.3.php
* `MPI_Allreduce`: https://www.open-mpi.org/doc/v4.1/man3/MPI_Allreduce.3.php

In [None]:
%%writefile main_reduce.cpp

#include <iostream>
#include <mpi.h>

int main(int argc, char **argv) 
{
    MPI_Init(&argc, &argv);
    
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    
    // each process contributes two values
    int a[2] = { rank, rank + 1 };
    int sum[2] = { 0, 0 };
    
    std::cout<< "Before Process "<< rank<< " ";
    std::cout<< "a: "<< a[0]<< ", "<< a[1]<< std::endl;
    
    // element-wise sum over all processes; the result is stored on rank 0 only
    MPI_Reduce(a, sum, 2, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
            
    // on ranks other than 0, sum is not filled in by the reduction
    std::cout<< "After Process "<< rank<< " ";
    std::cout<< "sum: "<< sum[0]<< ", "<< sum[1]<< std::endl;
    
    MPI_Finalize();
    return 0;
}

In [None]:
%%sh

# compile program
mkdir -p ./debug_reduce
cd debug_reduce
cmake -DSOURCES="main_reduce.cpp" ..
make

In [None]:
%%sh

# run program
cd debug_reduce
mpirun -np 4 3_Collectives
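
As a cross-check of the run above, the sketch below models in plain C++ (no MPI calls) the element-wise sum that the root should receive when every process contributes `{ rank, rank + 1 }`. The function name `modeled_reduce_sum` is hypothetical.

```cpp
#include <array>
#include <cassert>

// Model of an element-wise sum reduction: each of the `size` ranks
// contributes a = { rank, rank + 1 } and the root gets the two column sums.
std::array<int, 2> modeled_reduce_sum(int size)
{
    assert(size > 0);
    std::array<int, 2> sum = { 0, 0 };
    for (int rank = 0; rank < size; rank++)
    {
        const int a[2] = { rank, rank + 1 };
        sum[0] += a[0];
        sum[1] += a[1];
    }
    return sum;
}
```

With 4 processes the model returns `6, 10`, which is what rank 0 should print. The receive buffer of a rooted reduction is significant only at the root, so the other ranks typically still print the initial `0, 0`.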

### Reduction Operations
<img src="./Images/reduction.png" width=50% style="margin-left:auto; margin-right:auto"> 

### Reduction Operations (2)

* To use `MPI_MINLOC` and `MPI_MAXLOC` in a reduce operation, one must provide a datatype argument that represents a **pair** (value and index), such as `MPI_DOUBLE_INT`.
* Users can define their own **Reduction Operations** with the function `MPI_OP_CREATE`.

<div class="alert alert-block alert-danger"> <b>REMEMBER</b>: use <code>MPI_OP_FREE</code> to release a user-defined operation when it is no longer needed. </div>
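
The value and index pair can be pictured as a small struct. The sketch below is a plain C++ model (no MPI calls) of the combine rule applied by `MPI_MINLOC`; the names `ValueIndex`, `minloc` and `reduce_minloc` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Same member order as the pair described by MPI_DOUBLE_INT: a double, then an int.
struct ValueIndex
{
    double value;
    int index;
};

// Combine rule of MPI_MINLOC: keep the smaller value; on a tie keep the smaller index.
ValueIndex minloc(const ValueIndex &x, const ValueIndex &y)
{
    if (x.value < y.value) return x;
    if (y.value < x.value) return y;
    return (x.index <= y.index) ? x : y;
}

// Fold the contributions of all ranks with the rule above.
ValueIndex reduce_minloc(const std::vector<ValueIndex> &contributions)
{
    assert(!contributions.empty());
    ValueIndex result = contributions.front();
    for (std::size_t i = 1; i < contributions.size(); i++)
        result = minloc(result, contributions[i]);
    return result;
}
```

A common choice is to store the rank of the calling process in `index`, so that the result reports both the global minimum and the process that holds it.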

* `MPI_SCAN` computes an inclusive prefix reduction: process `i` receives the reduction of the values contributed by processes `0` to `i`. See the example:

```
 * +---------------+   +---------------+   +---------------+   +---------------+
 * | MPI process 0 |   | MPI process 1 |   | MPI process 2 |   | MPI process 3 |
 * +---------------+   +---------------+   +---------------+   +---------------+
 * |       0       |   |       1       |   |       2       |   |       3       |
 * +-------+-------+   +-------+-------+   +-------+-------+   +-------+-------+
 *         |                   |                   |                   |
 *         |                +--+--+                |                   |
 *         +----------------| SUM |                |                   |
 *         |                +--+--+                |                   |
 *         |                   |                +--+--+                |
 *         |                   +----------------| SUM |                |
 *         |                   |                +--+--+                |
 *         |                   |                   |                +--+--+
 *         |                   |                   +----------------| SUM |
 *         |                   |                   |                +--+--+
 *         |                   |                   |                   |
 * +-------+-------+   +-------+-------+   +-------+-------+   +-------+-------+
 * |       0       |   |       1       |   |       3       |   |       6       |
 * +---------------+   +---------------+   +---------------+   +---------------+
 * | MPI process 0 |   | MPI process 1 |   | MPI process 2 |   | MPI process 3 |
 * +---------------+   +---------------+   +---------------+   +---------------+
```
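
The diagram can be reproduced with a plain C++ model of an inclusive scan with a sum. This is only a sketch without MPI calls: element `i` of the input stands for the value contributed by process `i`, and the function name `modeled_scan_sum` is hypothetical.

```cpp
#include <numeric>
#include <vector>

// Element i of the result is the sum of the inputs 0..i,
// i.e. what process i would receive from an inclusive scan with a sum.
std::vector<int> modeled_scan_sum(const std::vector<int> &values)
{
    std::vector<int> result(values.size());
    std::partial_sum(values.begin(), values.end(), result.begin());
    return result;
}
```

For the inputs `0, 1, 2, 3` the model returns `0, 1, 3, 6`, matching the bottom row of the diagram. In an MPI program each process would pass only its own value to `MPI_Scan` and receive only its own element of this result.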

### Inter-Comm Reduction Operations

* For the reduce operation:
> ...If comm is an inter-communicator, then the call involves all MPI processes in the inter-communicator, but with group **A** defining the root. All MPI processes in the group **B** pass the same value in argument root, which is the rank of the **root** in group A. The root passes the value `MPI_ROOT` in root. All other MPI processes in group **A** pass the value `MPI_PROC_NULL` in root. Only send buffer arguments are significant in group **B** and only receive buffer arguments are significant at the root....
* For the all-reduce operation:
> ...If comm is an inter-communicator, then the result of the reduction of the data provided by MPI processes in group **A** is stored at each MPI process in group **B**, and vice versa. Both groups should provide count and datatype arguments that specify the same type signature...

## Collectives and Performance

MPI vendors work **hard** to optimise collectives for parallel hardware.

_Latency measurements_ (minimum time needed to transfer data) on different machines (image taken from https://doi.org/10.1002/cpe.6769):
<img src="./Images/performance.png" width=80% style="margin-left:auto; margin-right:auto"> 

<div class="alert alert-block alert-warning"> <b>NOTE</b>: parallel scaling is often dictated by MPI collectives. </div>

### Nonblocking and Persistent Collectives

> ...The nonblocking collective communication model is similar to the model used for non-blocking point-to-point communication...

> ...It is erroneous to call `MPI_REQUEST_FREE` or `MPI_CANCEL` for a request associated with a nonblocking collective operation. Nonblocking collective requests created using the APIs described in this section are not persistent. However, persistent collective requests can be created using persistent collective operations...

> ...Nonblocking collective operations do not match with blocking collective operations, and collective operations do not have a **tag** argument. All MPI processes must call collective operations (blocking and nonblocking) in the same order per communicator. In particular, once an MPI process calls a collective operation, all other MPI processes in the communicator must eventually call the same collective operation, and no other collective operation with the same communicator in between. This is consistent with the ordering rules for blocking collective operations in threaded environments...