
Add RCCL and OFFLOAD profiling support for AMD GPUs #101

Closed
gsitaram wants to merge 1 commit

Conversation

gsitaram

This PR adds support for rcclReduceScatter using device buffers, in order to improve the performance of this collective operation across processes on multiple nodes. In addition, support for roctx ranges has been added so that rocprof timelines can show both CPU and GPU activity. This code is more a proof of concept than a final product; more work is needed to make it production ready. I list the missing pieces below:

  • An explicit device buffer allocation and HtoD copy are performed before the rcclReduceScatter call, and a DtoH transfer is done after it (see the sketch after this list). These redundant allocations and HtoD copies should be eliminated by keeping the data in device buffers before the rcclReduceScatter; the extra device buffer for storing the result could be allocated just once at the beginning.
  • The roctx range macros (COSMA_RE, COSMA_RL) could be extended to emit nvtx calls as well. Currently they are used only when running on AMD GPUs.
  • One optimization when using RCCL is to create the RCCL communicator only once. This works only if the sequence of PDGEMM calls uses the same set of processes in the communicator ring; it does not work when we try to use RCCL across PDGEMM calls with vastly different MPI communicators. When running CP2K's RPA benchmark, for instance, some initial PDGEMM calls have to be routed to ScaLAPACK via the COSMA_DIM_THRESHOLD environment variable so that performance is gained in the RI-RPA calculation stage. It would be nice to fix this bug so that users do not have to worry about which PDGEMM calls to avoid.
  • I would like the maintainers of COSMA to choose how to name the various variables, such as COSMA_WITH_RCCL and COSMA_WITH_OFFLOAD_PROFILING, and to decide whether they should be implemented as CMake variables or environment variables.
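
For reference, a minimal sketch of the first two items (this is not the PR's code: the mapping of COSMA_RE/COSMA_RL onto roctx range push/pop and all names below are assumptions):

    // Hedged sketch, not this PR's implementation.
    #include <hip/hip_runtime.h>
    #include <rccl/rccl.h>  // header location varies across RCCL versions
    #include <roctx.h>

    #define COSMA_RE(name) roctxRangePushA(name)  // assumed: range enter
    #define COSMA_RL()     roctxRangePop()        // assumed: range leave

    // Staged reduce-scatter, showing the redundant alloc + HtoD copy that
    // the first bullet says should be eliminated.
    void staged_reduce_scatter(const double *host_send, double *host_recv,
                               size_t recv_count, int nranks,
                               ncclComm_t comm, hipStream_t stream) {
        double *d_send = NULL, *d_recv = NULL;

        COSMA_RE("HtoD staging");
        hipMalloc((void **)&d_send, nranks * recv_count * sizeof(double));
        hipMalloc((void **)&d_recv, recv_count * sizeof(double));
        hipMemcpy(d_send, host_send, nranks * recv_count * sizeof(double),
                  hipMemcpyHostToDevice);
        COSMA_RL();

        COSMA_RE("rcclReduceScatter");
        // RCCL keeps the nccl* API names
        ncclReduceScatter(d_send, d_recv, recv_count, ncclDouble, ncclSum,
                          comm, stream);
        hipStreamSynchronize(stream);
        COSMA_RL();

        COSMA_RE("DtoH result");
        hipMemcpy(host_recv, d_recv, recv_count * sizeof(double),
                  hipMemcpyDeviceToHost);
        COSMA_RL();

        hipFree(d_send);
        hipFree(d_recv);
    }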

@gsitaram
Copy link
Author

Hi @kabicm, what is the status of my PR? Do you intend to merge it at some point? It works for CP2K at scale, and I would like to see this code formally added to your repo.

@kabicm
Copy link
Collaborator

kabicm commented May 13, 2022

Hi Gina (@gsitaram),

You might have noticed that we made a new PR for this: #102, which in addition brings the following optimizations:

  • adds the NCCL backend for both AMD and NVIDIA GPUs;
  • caches and reuses all MPI communicators (see the sketch below);
  • caches and reuses all device memory (including the memory needed for NCCL);
  • avoids double reshuffling of data, i.e. the data from NCCL buffers is immediately copied into the right layout, without additional reshuffling.
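
For illustration, a rough sketch of the communicator caching (a hedged sketch with made-up names, not the actual #102 code):

    // Hypothetical sketch: derive an NCCL/RCCL communicator from an MPI
    // communicator once and reuse it on subsequent calls, instead of
    // rebuilding it for every pdgemm invocation.
    #include <mpi.h>
    #include <rccl/rccl.h>
    #include <map>

    ncclComm_t get_cached_nccl_comm(MPI_Comm comm) {
        // caveat: keying on the raw handle assumes the MPI communicator
        // outlives its cache entry (freed handles may be recycled)
        static std::map<MPI_Comm, ncclComm_t> cache;
        auto it = cache.find(comm);
        if (it != cache.end()) return it->second;

        int rank, nranks;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &nranks);

        // standard NCCL bootstrap: rank 0 creates the id, everyone joins
        ncclUniqueId id;
        if (rank == 0) ncclGetUniqueId(&id);
        MPI_Bcast(&id, sizeof(id), MPI_BYTE, 0, comm);

        ncclComm_t nccl_comm;
        ncclCommInitRank(&nccl_comm, nranks, id, rank);
        return cache[comm] = nccl_comm;
    }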

All these changes required substantial refactoring of the code, which is why it was easier for us to implement it in a new PR.

The only thing left to make this PR work on AMD is to fix the build, i.e. to add the CMake scripts for finding hipblas and other HIP-related libraries. The code itself is already AMD-ready.

It would be great to talk one of these days, as we are planning to merge this soon.

Cheers,
Marko

@alazzaro

Hello @kabicm, I'm chiming in here to ask whether the new COSMA will have the possibility of keeping the GPU buffers, to enable G2G. Currently, I'm hacking the library by replacing this call with the following code:

    // stage the send buffer on the GPU
    Scalar *d_send_pointer = NULL, *d_receive_pointer = NULL;
    int nranks;
    MPI_Comm_size(comm, &nranks);

    hipMalloc((void **)&d_send_pointer, nranks * recvcnts[0] * sizeof(Scalar));
    hipMalloc((void **)&d_receive_pointer, recvcnts[0] * sizeof(Scalar));
    hipMemcpy(d_send_pointer, send_pointer,
              nranks * recvcnts[0] * sizeof(Scalar), hipMemcpyHostToDevice);

    // run the collective on GPU pointers (requires GPU-aware MPI)
    MPI_Reduce_scatter_block(d_send_pointer,
                             d_receive_pointer,
                             recvcnts[0],
                             mpi_type,
                             MPI_SUM,
                             comm);

    // copy the result back and release the staging buffers
    hipMemcpy(receive_pointer, d_receive_pointer,
              recvcnts[0] * sizeof(Scalar), hipMemcpyDeviceToHost);
    hipFree(d_send_pointer);
    hipFree(d_receive_pointer);

Basically, I copy the data in and out of the device so that the MPI call runs on the GPU (which is the winning solution). Clearly, we could avoid at least one of the copies (and memory allocations) if you provided the buffers already allocated on the GPU (assuming that is possible); see the sketch below. Is that something the new COSMA will have?
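
For comparison, a minimal sketch of that variant, assuming COSMA handed over device-resident buffers (d_send and d_recv are hypothetical names):

    // Hypothetical: with GPU-aware MPI and device-resident buffers provided
    // by COSMA, the collective runs directly on GPU memory, and both staging
    // copies (and allocations) above disappear.
    MPI_Reduce_scatter_block(d_send,   // already on the GPU
                             d_recv,   // already on the GPU
                             recvcnts[0],
                             mpi_type,
                             MPI_SUM,
                             comm);
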
Do you think I can start testing the new PR with the HIP backend?

kabicm pushed a commit that referenced this pull request Jun 29, 2022
This PR enables COSMA to take advantage of fast GPU-to-GPU interconnects like NVLink, to efficiently utilize modern multi-GPU systems. This is achieved in 2 ways:
- **Using `NCCL/RCCL` Libraries:** by specifying `-DCOSMA_WITH_NCCL=ON` (for NVIDIA GPUs) or `-DCOSMA_WITH_RCCL=ON` (for AMD GPUs) cmake options.
- **Using GPU-aware MPI:** by specifying `-DCOSMA_WITH_GPU_AWARE_MPI=ON` cmake option, as proposed [here](#101 (comment)).
See [README](https://github.com/eth-cscs/COSMA/blob/master/README.md) and [INSTALL](https://github.com/eth-cscs/COSMA/blob/master/INSTALL.md) for more info on how to build.
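
For example (hypothetical configure lines; only the `COSMA_WITH_*` flags come from this PR, everything else is illustrative):

    # hypothetical configure lines, pick one backend
    cmake -S COSMA -B build -DCOSMA_WITH_NCCL=ON           # NVIDIA, NCCL backend
    cmake -S COSMA -B build -DCOSMA_WITH_RCCL=ON           # AMD, RCCL backend
    cmake -S COSMA -B build -DCOSMA_WITH_GPU_AWARE_MPI=ON  # GPU-aware MPI instead
    cmake --build build --parallel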

In addition, the following performance improvements have been made:
- **Improved Caching:**
    - all NCCL buffers, MPI comms, and NCCL comms are cached and reused when appropriate.
    - all device memory is cached and reused.
- **Reduced Data Transfers:** the GPU backend of COSMA, called [Tiled-MM](https://github.com/eth-cscs/Tiled-MM), is extended to offer the user the possibility to leave the resulting matrix C on the GPU. In that case, there is no need to transfer matrix C from device to host, which not only reduces the communication, but also speeds up the whole cpu->gpu pipeline, as no additional synchronizations are needed. Furthermore, the reduce_scatter operation does not have to wait for C to be transferred back to the host, but is immediately invoked with GPU pointers, thus utilizing fast inter-gpu links. This way, there are no unnecessary data transfers between cpu<->gpu.
- **All collectives updated:** both `all-gather` and `reduce-scatter` collectives are improved.
- **Reduced Data Reshuffling:** avoids double reshuffling of data, i.e. the data from NCCL/RCCL GPU buffers is immediately copied into the right layout, without additional reshuffling.
- **Works for variable blocks:** NCCL/RCCL's reduce_scatter operation assumes that all blocks are of the same size and is hence not completely equivalent to the `MPI_Reduce_scatterv` we previously used. We pad all blocks to overcome this issue (see the sketch below).
- **Portability:** Supports both NVIDIA and AMD GPUs.
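
A hedged sketch of the padding described under "Works for variable blocks" (names and signature are illustrative, not the actual COSMA code):

    // ncclReduceScatter requires one fixed per-rank count, so every block is
    // padded up to the largest block size before the collective; only the
    // first recvcnts[rank] elements of the result are meaningful afterwards.
    #include <hip/hip_runtime.h>
    #include <rccl/rccl.h>
    #include <algorithm>
    #include <vector>

    void padded_reduce_scatter(const double *d_send_padded,
                               double *d_recv_padded,
                               const std::vector<size_t> &recvcnts,
                               ncclComm_t comm, hipStream_t stream) {
        // equal count per rank: the maximum over all variable block sizes
        size_t max_count = *std::max_element(recvcnts.begin(), recvcnts.end());
        // d_send_padded holds nranks blocks, each zero-padded to max_count
        ncclReduceScatter(d_send_padded, d_recv_padded, max_count,
                          ncclDouble, ncclSum, comm, stream);
    }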

Therefore, this fixes the limitations of #101 and brings above-mentioned improvements.

Thanks to @alazzaro and @gsitaram for their great feedback and contribution to this PR!
@kabicm
Collaborator

kabicm commented Jun 29, 2022

These issues have been resolved in #102, so we are closing this PR. Thanks @gsitaram for your contribution!

@kabicm kabicm closed this Jun 29, 2022
kabicm added a commit that referenced this pull request Jul 8, 2022
The biggest improvement in this release is that it enables COSMA to take advantage of fast GPU-to-GPU interconnects like NVLink, to efficiently utilize modern multi-GPU systems. This is achieved in 2 ways:
- **Using `NCCL/RCCL` Libraries:** by specifying `-DCOSMA_WITH_NCCL=ON` cmake option.
- **Using GPU-aware MPI:** by specifying `-DCOSMA_WITH_GPU_AWARE_MPI=ON` cmake option, as proposed [here](#101 (comment)).
See [README](https://github.com/eth-cscs/COSMA/blob/master/README.md) and [INSTALL](https://github.com/eth-cscs/COSMA/blob/master/INSTALL.md) for more info on how to build.
