# Optimizing communications in multi-grid methods using GPUDirect Async

Elena Agostini University "La Sapienza" Rome, Italy Email: elena.ago@gmail.com Davide Rossetti NVIDIA Santa Clara, CA, USA Email: drossetti@nvidia.com Nikolay Sakharnykh NVIDIA Santa Clara, CA, USA Email: nsakharnykh@nvidia.com

Abstract—HPGMG is an HPC benchmark based on geometric multi-grid methods. A multi-grid solver works on levels of different sizes so that the amount of computation and communication varies a lot within the same application.

NVIDIA has developed a CUDA version of HPGMG-FV (Finite Volume) where, according to a threshold, finer (higher) levels are computed on the GPU while coarser (lower) levels on the CPU: this hybrid scheme takes advantage of each processor strengths. During the HPGMG-FV execution there are three main communication phases: boundaries exchange (among a variable number of cluster nodes), interpolation (lower to higher level) and restriction (higher to lower level). By default communications use the Message Passing Interface (MPI).

GPUDirect Async is a new technology introduced by NVIDIA in CUDA 8.0 which allows mutual direct control between the GPU and the interconnect device, a Mellanox Infiniband HCA in our case.

In this paper, we propose two optimized communication schemes involving the GPUDirect Async technology, where the burden of the communication is off-loaded onto the GPU itself. When running a properly modified version of HPGMG-FV, where Async technology is used for only those aforementioned communication phases, we measured a communication performance increase of up to 13%.

#### I. INTRODUCTION

Linear solvers are probably the most common tool in scientific computing applications and can be divided in two basic classes: direct and iterative. Multi-grid methods are iterative methods that can deliver linear complexity by solving elliptic PDEs Ax = b using a hierarchical approach, i.e. the solution to an hard problem (finer grid of elements) is expressed as solution to an easier problem (coarser grid of element). There are two different types of Multi-grid methods: algebric multi-grid (AMG), where operator A is a sparse matrix, and geometric multi-grid (GMG), where operator A is a stencil. While AMG method is a more general approach using an arbitrary graph, GMG method is more efficient on structured problems, since is can take advantage of the additional information from the geometric representation of the problem.

An example of GMG is HPGMG [3], an HPC benchmarking effort developed by Lawrence Berkeley National Laboratory. In particular, HPGMG-FV solves variable-coefficient elliptic problems on isotropic Cartesian grids using the finite volume method (FV) [5] and Full Multigrid (FMG) [6]. It supports high-order discretizations, and is extremely fast and accurate;

in case of multi-process execution, the workload is fairly distributed among processes because, in order to improve the parallelization, each problem level is divided into several same-size boxes.

NVIDIA improved the HPGMG-FV implementation [4] using a hybrid solution (involving both CPU and GPU), achieving a great enhancement in performances. In Section II we explain the most important changes carried out by NVIDIA useful to understand the rest of the paper.

Considering the NVIDIA hybrid implementation, in case of multi-GPU (multi-process) execution, communications play an important role in the algorithm. Section IV, focus on the communication enhancement obtained by the use of GPUDirect Async, a new technology introduced by NVIDIA in CUDA 8.0 (see Section III)

#### II. HPGMG-FV WITH CUDA

At the core of every multi-grid solver there is a V-cycle (Figure 1) computation pattern: starting from the finest structured grid, where a smoothing operation is applied to reduce high-frequency errors, the residual is calculated and propagated to the coarser grid. This process is repeated until the bottom level is reached, at which point the coarsest problem is solved directly, and then the solution is propagated back to the finest grid. The V-cycle is mainly dominated by the smoothing (GSRB smoother in our case), and residual operations at the top levels. These are usually 3D stencil operations on a structured grid.



Fig. 1: V-Cycle

HPGMG-FV implements an F-cycle, which starts at the bottom of the multi-grid hierarchy and performs multiple V-cycles, gradually adding finer levels. HPGMG-FV takes as input the amount and the  $log_2(size)$  of the finest level boxes, calculating the level total size; then it obtains the size of all the other (smaller) levels. The F-cycle is considered a state-of-the-art in multi-grid methods and converges much faster than a conventional V-cycle.

NVIDIA analyzed the difference among levels in an F-Cycle: Top (fine) levels have lots of grid points and can run efficiently on throughput-oriented parallel architectures like GPUs, while bottom (coarse) levels will be latency limited on a GPU because there is not enough work to make efficient use of all the parallel cores. During an F-Cycle, coarse levels are visited progressively more often than the fine levels therefore their cost is significant in an F-Cycle. For those reasons, coarse grids are better suited for latency-optimized processors like CPUs. Thus, for optimal performance, an hybrid scheme is required to guarantee that each level is executed on the suitable architecture: if the size of a level is over a certain threshold (empirically set to 10000 elements), then it runs on GPU, otherwise on CPU (Figure 2).



Fig. 2: F-Cycle with CPU-GPU threshold

To enable GPU accelleration, the simplest way was to add corresponding GPU kernels for low-level stencil operators and update memory allocations using *cudaMallocManaged()* instead of *malloc()*, in order to use Unified Memory [7] for memory used by GPU only, and use host pinned memory if it must be used by both CPU and GPU (i.e. communication buffers).

In Figure 3 there is a simplified operations timeline in case of coarse level (CPU) to fine level (GPU) and again coarse level (CPU).

In Figure 4 we report the enhancement obtained by NVIDIA using the hybrid solution in a benchmark on the ORNL Titan supercomputer [8]. For further details, please refer to [4]

# A. Communication periods

As described in Section II, in case of *Multi-grid* methods the smoother is a stencil. According to a stencil-like code, in case of multi-GPU execution the smoother must exchange the boundary ghost regions with the other processes (*intralevel* communication). On the contrary, (Figure 3) restriction and interpolation play a role in case of moving from a level



Fig. 3: F-Cycle: moving from a coarse level to a finer level and then go back to the coarse level



Fig. 4: Performance of GPU-accelerated HPGMG on the ORNL Titan supercomputer. Results obtained by Sam Williams from Lawrence Berkeley National Lab.

to another (*inter-level* communication). *inter-level* exchanges play a bigger role at scale where we do consolidation of N to N processes (so we have lots of data moved around)

1) Exchange Boundaries: This function implements the boundary region exchange doing a {pack}, {send}, {interior\_compute}, {receive}, {unpack} sequence among processes working on the same level. See Algorithm 1 for the pseudo-code.

A *cudaDeviceSynchronize()* is required between the CUDA kernel pack operation and the *MPI\_Isend()* to guarantee correctly updated *sendBuffers*. The *exchangeBoundaries()* is the most used communication function during an HPGMG-FV execution.

- 2) Restriction: This function occurs when moving from a finer level to a coarser level; in case of GPU-to-CPU level, it is quite similar to exchangeBoundaries(): GPU kernels works on sendBuffers, then a cudaDeviceSynchronize() is needed before the MPI\_Isend(). Moreover, if the coarsest level is on CPU, an additional cudaDeviceSynchronize() is required because GPU kernels are launched asynchronously and we need to guarantee completion before we start CPU tasks.
- 3) Interpolation: It occurs when moving from a coarser level to a finer level. It requires a cudaDeviceSynchronize() before the MPI\_Isend() but it doesn't needs to synchronize

# Algorithm 1 Exchange Boundaries function

```
1: for i = 1 to PROCESSES do
       cudaMallocHost(sendBuffers[i])
 3:
       cudaMallocHost(receiveBuffers[i])
 4: end for
5: ...
 6: function EXCHANGEBOUNDARIES()
       for i = 1 to PROCESSES do
          MPI_Irecv(receiveBuffers[i], &reqs_recv[i])
 8:
 9:
       end for
10:
       cuda_pack(sendBuffers)
       cudaDeviceSynchronize()
11:
12:
       for i = 1 to PROCESSES do
13:
          MPI_Isend(sendBuffers[i])
14:
       end for
       cuda interior compute(localBuffers)
15:
16:
       MPI_Waitall(reqs_recv)
       cuda_unpack(receiveBuffers)
17:
18: end function
```

if the coarsest level is on CPU because CPU tasks are synchronous and they will be completed before we launch GPU kernels.

Although communications are not the most expensive part of the algorithm, profiling the GPU levels execution we noticed that:

- the CPU launched a lot of CUDA kernels for residual and smoothing and each kernel launch required a lot of time, leaving sometime the GPU idle, waiting for an other kernel
- the high number of cudaDeviceSynchronize() slowed the performances

To hide the kernel launches and remove as many *cudaDeviceSynchronize()* as possible, we used GPUDirect Async (see Section III) to improve performances of all the communications periods.

In Figure 5 there is a simplified timeline to clarify how *exchangeBoundaries()*, *restriction()* and *interpolation()* are used during a GPU level processing.



Fig. 5: Synchronous exchangeBoundaries() timeline

#### III. GPUDIRECT ASYNC

GPUDirect [2] is a family of technologies aimed at optimizing data movement respectively among GPUs (P2P) and with third-party devices (RDMA). In particular, GPUDirect RDMA reduces latency and improves bus utilization by enabling a direct data path over the PCI Express (PCIe) bus between the GPU and a third-party device, in our case a network interface controller (NIC).

GPUDirect Async is a new GPUDirect technology introduced by NVIDIA in CUDA 8.0; it allows mutual direct

control between the GPU and the third party device, a recent generation Infiniband HCA in our case.

Although an in-depth explaination of the GPUDirect Async implementation is beyond the scope of this paper, in the following we will breafly describe how it works and how it can be leveraged in HPGMG-FV. With Async the GPU is able to trigger communications on HCA, while at the same time HCA is able to unblock CUDA tasks; the CPU is only needed to prepare and queue both the compute and communication tasks on GPU. More specifically:

- The CPU allocates communication buffers (device or host pinned memory).
- It maps some HCA specific data structures, like command queues (IB Verbs QPs) and completion queues (IB Verbs CQs), onto the GPU by using cuMemHostRegister().
- Prepares the send/receive requests descriptors.
- Converts those descriptors into a sequence of basic operations to be executed by the GPU, for example by the GPU front-end unit which is in charge of the CUDA stream abstraction. Examples of those operations are:
  - Writing an HW mailbox register onto the HCA, i.e. ringing the HCA doorbell.
  - Triggering a pre-launched CUDA kernel.
  - Waiting (polling) for communication task completion (IB Verbs CQ entries, or CQEs).

For example, after having prepared and queued all the necessary tasks onto the GPU, the CPU can go back and do other useful work. By leveraging this mechanism a whole parallel computation phase can be offloaded onto a CUDA stream. Still when necessary, the CPU can query the CUDA stream for the status of the outstanding operations. The same mechanism is expected to be useful to efficiently scale up in combination with a low performance CPU, effectively removing its cohordinating work from the critical path.



Fig. 6: GPUDirect Async Software stack

The GPUDirect software stack (Figure 6) comprises:

- 4) libmlx5: (Vendor/device specific) libmlx5 is a low-level device driver for Mellanox Connect-IB InfiniBand host channel adapters (HCAs). This allows userspace processes to access Mellanox HCA hardware directly with low latency and low overhead. The standard implementation has been extended for needs of GPUDirect Async.
- 5) libibverbs: (Verbs APIs) libibverbs library implements the OpenFabrics Infiniband Verbs API. The standard implementation has been extended with new Verbs specific to GPUDirect Async.
- 6) LibGDSync: (NVIDIA open-source) Developed by NVIDIA, it consist of a set of hybrid APIs where both IB Verbs and CUDA GPU stream are merged. It is responsible to create IB tracking data structures respecting the constraints of GPUDirect Async, to register host memory when needed, to post send instructions and completion waiting directly on the GPU stream.
- 7) LibMP: (NVIDIA open-source) It is at the top level and is a messaging library (similar to MPI) developed as a tecnology demostrator to easily deploy the GPUDirect Async technology on MPI applications. It leverages LibGDSync APIs and offer basic communications like mp\_isend\_on\_stream(), mp\_wait\_on\_stream(), mp\_iput\_on\_stream(), etc..

In addition to the Stream Async communication model previously described, where communications are synchronous to the CUDA streams, in this paper we are going to explore a *kernel-initiated* model, where the Simultaneous Multiprocessors (SMs), which are in charge of executing the CUDA kernels, can directly issue communication primitives, i.e. sending messages and waiting for completions.

In the following sections, we will show that kernel-initiated mode can be faster than Stream Async mode, i.e. by allowing kernel fusion techniques thereby exposing more concurrency to the highly parallel GPU HW units. The downside of it is that it is more complicated to use, mainly because the programmer needs to manually schedule different sub-tasks (send, receive or compute) to separate CUDA kernel thread blocks respecting the constraints of the algorithm.

## IV. ASYNCHRONOUS COMMUNICATIONS

We tried both the asynchronous modes in HPGMG-FV, to evaluate performances considering execution times of GPU levels only.

## A. Stream Async mode

See Algorithm 2 for the *exchangeBoundariesAsync()* pseudo-code.

The *cudaDeviceSynchronize()* between *cuda\_pack()* and *mp\_isend\_on\_stream()* function is no more required. The GPU stream will:

- 1) Write on sendBuffers using the cuda\_pack() kernel
- 2) Post the send requests
- 3) Execute the *cuda\_interior\_compute()* kernel
- 4) Wait the receive completion
- 5) Read the received data (receiveBuffers) with the cuda\_unpack() kernel

# Algorithm 2 Exchange Boundaries Stream Async function

```
1: for i = 1 to PROCESSES do
       cudaMallocHost(sendBuffers[i])
3:
       cudaMallocHost(receiveBuffers[i])
4: end for
5: ...
6: function EXCHANGEBOUNDARIESSTREAMASYNC(STREAM)
7:
       for i = 1 to PROCESSES do
          mp\_irecv(receiveBuffers[i], & receiveDescriptors[i])
8:
9:
       end for
10:
       cuda_pack(sendBuffers, stream)
       for i = 1 to PROCESSES do
11:
          mp_isend_on_stream(sendBuffers[i],
12:
   &sendDescriptors[i], stream)
13:
       end for
       cuda interior compute(localBuffers)
14:
15:
       mp_wait_all_on_stream(receiveDescriptors)
16:
       cuda_unpack(receiveBuffers)
       mp_wait_all_on_stream(sendDescriptors)
17:
18: end function
```

Similar considerations can be done for *restriction()* and *interpolation()* functions.

Stream Async is used only if the level is a GPU level: this means that only a *cudaDeviceSynchronize()* is needed during the *restriction()* function from the last GPU (higher) level to the first CPU (lower) level. During GPU levels, CPU can launch all the CUDA kernels without waiting, hiding the kernel launches times (Figure 7).



Fig. 7: exchangeBoundariesStreamAsync() timeline

#### B. kernel-initiated mode

The algorithm from the CPU point of view is extremely simple (Figure 8a): it must prepare send/receive descriptors and launch a single CUDA kernel in which GPU will overlap, as much as possible, all the *exchangeBoundaries()* operations (see Algorithm 3).

The most difficult lies in the part cuda compute exchange kernel(). According to previous observations about stencils, we can distringuish three different groups of independent operations: [pack, send], [interior compute], [receive, unpack]; then we can execute them in parallel using different CUDA kernel blocks. cuda\_compute\_exchange\_kernel() Basically, the N+M+1 blocks in a mono-dimensional grid, where N is the number of blocks required by the cuda\_pack() and cuda\_interior\_compute() and M is the number of blocks required by cuda\_unpack() plus 1 block, used to receive data as explained in Figure 8b.

The receive is the most expensive one, then the first incoming kernel block waits to receive all data (each block's



exchange boundaries Remember () time inc

Fig. 8: kernel-initiated mode

#### **Algorithm 3** Exchange Boundaries kernel-initiated function

```
1: for i = 1 to PROCESSES do
       cudaMallocHost(sendBuffers[i])
 3:
       cudaMallocHost(receiveBuffers[i])
 4: end for
 5: ...
 6: function EXCHANGEBOUNDARIESKERNELINITIATED()
       for i = 1 to PROCESSES do
 7:
          mp_irecv(receiveBuffers[i], &receiveDescriptors[i])
 8.
 9:
       end for
10:
       for i = 1 to PROCESSES do
11:
          mp_isend_prepare(&sendDescriptors[i])
12:
       end for
13:
       cuda_compute_exchange_kernel(receiveDescriptors, sendDe-
14:
       mp_wait_all_on_stream(sendDescriptors)
15: end function
```

thread receives from a single peer process); all the blocks from the second to the N-th work on the [pack, send] group of operations plus the [interior compute]. Finally the remaining M blocks wait for the the completion of the first block receive and then they will unpack received data, finishing the *exchangeBoundaries()* execution.



Fig. 9: Global Lock among kernel blocks

To force the last blocks to wait for the receive completion, we used a global lock as explained in Figure 9. All threads in wait blocks move to the \_\_syncthreads() except for the first one: the Thread0 of all the wait blocks is waiting for the Thread0 of the receive block to set a global memory variable to 1. When this happens (after the receive completion), all the Thread0 in the wait blocks will reach the \_\_syncthreads() and

then they start to unpack received data.

When using kernel-initiated mode, it is very important that the receive (or wait for receive) operation is not preventing the send, otherwise you will have a deadlock. For example, having a GPU with 15 SM, if the first one is waiting on the receive and the others 14 are waiting for the receive completion, the [pack, send] operations will never occur.

#### V. BENCHMARKS AND RESULTS

As described in Section II, HPGMG-FV takes as input the size and number of the boxes in the highest level; during our benchmarks we used 8 boxes varying the size in [4,5,6,7]. According to Section II, the threshold size used during NVIDIA tests was 10000; considering 4 as minimum  $log_2$  size for boxes, this means that the 3 smallest levels are always executed on CPU and all the others on GPU.

We tried different threshold values (i.e. changing the number of GPU levels for each execution) during our asynchronous benchmarks but we found that the best values is always 10000.

#### A. Two node benchmarks

For the first round of benchmarks we used two standard 2U Xeon based servers, each one with a Mellanox dual-port FDR Connect-IB HCA and a single Tesla K40m (boost clocks set to 875 MHz), running RHEL 6.6 and a pre-release version of the GPU display driver. In Figure 10 the Y axis shows the performance increase (the larger the better) of the GPU levels only for both Stream Async and kernel-initiated mode over the standard MPI mode. The bigger the boxe size, the more performance gain decreases, because communication message size grows with the box size, so therefore overheads become less important. Here we also note the the kernel-initiated mode always wins over the Stream Async one.

#### B. Scaling up

For this second benchmark, we used the Wilkes HPC cluster at the Cambridge University, UK)[9]. The system consists of 128 Dell T620 servers, 256 NVIDIA Tesla K20c GPUs interconnected by 256 Mellanox Connect IB cards. For technical reasons we were only able to use 16 cluster nodes, one GPU each. Figure 11 shows performance increase on two nodes, i.e. the same setup as in the previous section.

The substantial gain obtained (80% in case of box size 7) with Async modes over standard MPI has been tracked back to a subperforming implementation of the Unified Memory



Fig. 10: 2 processes, Asynchronous time gain, NVIDIA servers



Fig. 11: 2 processes, Asynchronous time gain, Wilkes HPC

manager in the CUDA driver, which does not affect the runtime in the Async case. On the contrary, execution times of the Async versions were similar to the first round of benchmarks.

We believe that a software update will soon be deployed to fix the Unified Memory performance on the Wilkes cluster.

## VI. CONCLUSION

In this paper we presented the performance of a GPU accelerate multi-node implementation of the HPGMG-FV benchamrk using the recently introduced NVIDIA GPUDi-rect Async technology. In particular, we developed an asynchronous version of a stencil operator, that is highly used in the context of scientific and engineering applications. Althought communications aren't the most relevant part in the HPGMG-FV algorithm, we reached a time gain of about 13%. Unfortunately for the moment, we had some problem during large-scale benchmarks related to the display driver released by NVIDIA along with the new CUDA 8.0 RC Toolkit. The next step is to perform all benchmarks again up to 64 nodes on

the Wilkes HPC using the most updated (and recently released) display driver 361.77.

#### ACKNOWLEDGMENT

The authors would like to thank Sreeram Potluri for useful discussions, and Filippo Spiga for making the Wilkes cluster available to us.

#### REFERENCES

- H. Kopka and P. W. Daly, A Guide to <u>ETEX</u>, 3rd ed. Harlow, England: Addison-Wesley, 1999.
- [2] GPUDirect family: https://developer.nvidia.com/gpudirect
- [3] HPGMG https://hpgmg.org
- [4] N. Sakharnykh High-Performance Geometric Multi-Grid with GPU Acceleration. https://devblogs.nvidia.com/parallelforall/highperformance-geometric-multi-grid-gpu-acceleration
- [5] Finite Volume method. https://en.wikipedia.org/wiki/Finite\_volume\_method
- [6] Full MultiGrid method. https://en.wikipedia.org/wiki/Multigrid\_method
- [7] Unified Memory. https://devblogs.nvidia.com/parallelforall/unifiedmemory-in-cuda-6
- [8] ORNL Titan supercomputer. https://www.olcf.ornl.gov/titan
- [9] Wilkes HPC Cambridge, UK. www.hpc.cam.ac.uk