

Prof. Dr. D. Kranzlmüller, Dr. K. Fürlinger

# Parallel Computing WS 2017/18

Session 6: MPI Recap, Stencil Operations

Tobias Fuchs, M.Sc. tobias.fuchs@nm.ifi.lmu.de







### Two-sided vs. One-sided





#### Two-sided

- Memory is private to each process.
- The sender calls MPI\_Send and the receiver calls MPI\_Recv: data in the sender's memory is copied to a buffer, sent over the network, and copied into the receiver's memory.
- **Drawback:** the sender has to wait until the receiver is ready to receive the data before it can send it.
- Both sender and receiver have to issue an explicit, matching call for the communication.

#### → Coupled program flow





#### One-sided

- Sections in local memory are made accessible among processes.
- Only one process is required to transfer data; this decouples data transfer from synchronization.
- MPI 3.0 supports one-sided passive target communication without the intervention of the remote process via Remote Direct Memory Access (RDMA).
  - That is: data is written to or read from a process's memory without any action by that process.
- → Decoupled program flow





In the real world, passive RDMA in most MPI implementations is buggy, inefficient (barrier spin-locks and other delightful hacks), or both, but it's getting better.





### **One-sided Operations**

| Standard        | Request-based    |
|-----------------|------------------|
| MPI\_Put        | MPI\_Rput        |
| MPI\_Get        | MPI\_Rget        |
| MPI\_Accumulate | MPI\_Raccumulate |

- All data movement operations are non-blocking.
- Requires explicit synchronization call to ensure completion, e.g.:

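```
MPI_Wait(&req, MPI_STATUS_IGNORE); // completes one request-based operation (e.g. MPI_Rput)
MPI_Win_fence(0, win);             // collective; completes all operations on the window
MPI_Win_flush(trank, win);         // passive target; completes operations issued to rank trank
```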







### MPI\_Get / MPI\_Put

Origin: calling (i.e. local) process

Target: remote process

```
MPI_Get(oaddr, ocount, otype,
        trank, tdisp, tcount, ttype, window)
```

Transfer tcount elements, starting at displacement tdisp in the window of the target, into the local buffer oaddr at the origin.

```
MPI_Put(oaddr, ocount, otype,
        trank, tdisp, tcount, ttype, window)
```

Transfer elements from the local buffer oaddr at the origin into the window of the target, filling tcount elements starting at displacement tdisp.







### (Blocking / Nonblocking) x (One-sided / Two-sided)

|           | Blocking | Nonblocking |
|-----------|----------|-------------|
| Two-sided | MPI_Send | MPI_Isend   |
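| One-sided | MPI_Put  | MPI_Rput    |

Strictly speaking, MPI\_Put is non-blocking as well: it is only guaranteed to be complete after a synchronization call on the window. Unlike MPI\_Rput, it returns no request handle.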

One-sided communication can be used to implement collective operations.

Pop quiz: How would you implement a reduce operation using one-sided communication?

**Example:** Find the minimum value in a distributed array.





### One-sided true passive RMA Example





### One-sided true passive RMA Example ctd.

What could go wrong?






Data races, as known from multi-threading.



## **Stencil Operations**









| rank 0, block [0,0] | rank 1, block [0,1] |
|---------------------|---------------------|
| rank 2, block [1,0] | rank 3, block [1,1] |



#### **Stencil Algorithms using MPI**



Computing the innermost values in a local block is straightforward (local-only).
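A minimal sketch of the local-only part, assuming a 5-point averaging stencil and a row-major block of doubles. The function name `stencil_inner` and the choice of averaging are assumptions for illustration; the slides do not fix a particular stencil kernel.

```c
#include <stddef.h>

/* Apply a 5-point average to all inner cells of a bx-by-by block
 * stored in row-major order. Cells in the first and last row or column
 * are skipped because they depend on halo values owned by neighbors.
 * stencil_inner is a hypothetical helper for illustration. */
void stencil_inner(const double *in, double *out, size_t bx, size_t by)
{
    for (size_t y = 1; y + 1 < by; y++) {
        for (size_t x = 1; x + 1 < bx; x++) {
            out[y * bx + x] = 0.25 * (in[(y - 1) * bx + x]     /* north */
                                    + in[(y + 1) * bx + x]     /* south */
                                    + in[y * bx + (x - 1)]     /* west  */
                                    + in[y * bx + (x + 1)]);   /* east  */
        }
    }
}
```

For a 3 x 3 block only the center cell is an inner cell; all other cells have to wait for the halo exchange.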















Assuming a 1-nn stencil:

- block size $B = b_X \times b_Y$
- field size $N = n_X \times n_Y$

Elements exchanged with all neighbors per block: $2b_X + 2b_Y$ (one boundary row or column per side)

**Surface-to-volume ratio?** 







### Surface/Volume Ratio

- For a high degree of parallelism: select a small block size (more processes → more blocks → smaller block size).
- But: a small block size affects border exchanges; the surface/volume ratio increases.
- As the size of a block increases, its **volume grows faster than its surface area**.
  - Square-Cube Law: $O(n^3)$ vs. $O(n^2)$
- High ratio → more time spent on communication per iteration, less time left for actual computation.
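The effect can be checked numerically. The sketch below assumes a two-dimensional block with a halo of width 1 and no corner exchange, so the surface is taken as $2(b_X + b_Y)$ elements and the volume as $b_X \cdot b_Y$ elements; the function names are hypothetical.

```c
/* Halo elements exchanged per block and iteration.
 * Assumption: halo width 1, four edges, corners not counted. */
long block_surface(long bx, long by)
{
    return 2 * (bx + by);
}

/* Elements computed per block and iteration. */
long block_volume(long bx, long by)
{
    return bx * by;
}

/* Communication per unit of computation; lower is better. */
double surface_to_volume(long bx, long by)
{
    return (double)block_surface(bx, by) / (double)block_volume(bx, by);
}
```

Under these assumptions a 4 x 4 block has a ratio of 1.0, a 16 x 16 block 0.25, and a 64 x 64 block 0.0625: every factor of 4 in edge length reduces the relative communication volume by a factor of 4.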



## **Domain Decomposition vs. Process Topology vs. Hardware Locality**





## Decomposition vs. Process Topology vs. Hardware



Data distribution, hardware and communication patterns are highly interdependent with respect to performance.







Number of slow boundaries in halo exchange: 12

Number of distinct neighbor ranks per process: 2



A high number of different neighbor processes is advantageous, assuming communication with different processes can be parallelized.



2 distinct neighbor ranks, 12 regions



3 distinct neighbor ranks, 12 regions








### How many slow boundaries for this mapping scheme?







... or for this one?

Number of slow boundaries in halo exchange: 4












The second mapping is hierarchically structured just like the underlying tree topology.

It is usually a good idea to reflect the physical structure of hardware in the data mapping scheme.
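Counting slow boundaries for a given mapping is easy to automate. The sketch below assumes a two-dimensional grid of blocks, a hypothetical array `node_of_block` that assigns each block to a hardware node, and that a boundary is "slow" whenever two adjacent blocks live on different nodes. The mappings in the usage note are made-up examples, not the ones shown in the figures.

```c
/* Count boundaries between horizontally or vertically adjacent blocks
 * that are mapped to different hardware nodes.
 * node_of_block holds rows * cols node ids in row-major order.
 * Both the function and the array layout are hypothetical. */
int count_slow_boundaries(const int *node_of_block, int rows, int cols)
{
    int slow = 0;
    for (int r = 0; r < rows; r++) {
        for (int c = 0; c < cols; c++) {
            int here = node_of_block[r * cols + c];
            /* boundary to the right-hand neighbor */
            if (c + 1 < cols && node_of_block[r * cols + c + 1] != here) {
                slow++;
            }
            /* boundary to the neighbor below */
            if (r + 1 < rows && node_of_block[(r + 1) * cols + c] != here) {
                slow++;
            }
        }
    }
    return slow;
}
```

For a 4 x 4 grid of blocks on four nodes, assigning one row of blocks to each node yields 12 slow boundaries, while assigning one 2 x 2 quadrant to each node yields 8. Compact, hierarchical mappings keep more of the halo exchange inside a node.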






#### **BUT your intuition is easily deceived.**

Would a more interconnected topology help here?







It should! But the domain decomposition scheme prevents us from exploiting it here.












Just for fun, slow boundaries are even identical for either topology in this example.





#### **Further Reading**



### The one MPI tutorial you all want to read:

Basics: https://cvw.cac.cornell.edu/MPI/

P2P: https://cvw.cac.cornell.edu/MPIP2P/

RMA: https://cvw.cac.cornell.edu/MPIoneSided/

Advanced: https://cvw.cac.cornell.edu/MPIAdvTopics/

Official MPI 3.1 documentation (Index):

http://www.mpi-forum.org/docs/mpi-3.1/mpi31-report/mpi31-report.htm#Node0

Again, a collection of documented MPI examples:

http://www.mcs.anl.gov/~thakur/sc14-mpi-tutorial/



Tobias Fuchs
tobias.fuchs@nm.ifi.lmu.de
www.mnm-team.org/~fuchst

