Add RCCL and OFFLOAD profiling support for AMD GPUs #101
Conversation
Hi @kabicm, what is the status of my PR? Do you intend to merge it at some point? It works for CP2K at scale, and I would like to see this code formally added to your repo.
Hi Gina (@gsitaram), You might have noticed that we made a new PR for this: #102, which in addition, brings the following optimizations:
All these changes required substantial refactoring of the code, which is why it was easier for us to implement them in a new PR. The only thing left to do to make this PR work for AMD is to fix the build, i.e. to add the cmake scripts for finding hipblas and the other hip-related libraries. The code itself is already AMD-ready. It would be great to talk one of these days, as we are planning to merge this soon. Cheers,
Hello @kabicm I'm chiming in here to ask if the new COSMA will have the possibility to keep the GPU buffers to enable G2G. Currently, I'm hacking the library by replacing this call with the following code:
Basically, I copy the data in/out of the device so that the MPI call runs on GPU buffers (which is the winning solution). Clearly, we could avoid at least one of the copies (and one memory allocation) if you provided the buffers of the data allocated on the GPU (assuming that is possible). Is this something the new COSMA will have?
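The copy-in/copy-out workaround described above can be sketched as follows. This is a hypothetical illustration, not the actual hack from the comment: `device_malloc`, `device_memcpy`, and `gpu_aware_allreduce` are stubs standing in for `hipMalloc`, `hipMemcpy`, and an `MPI_Allreduce` on device pointers, so the staging pattern is self-contained without ROCm or MPI.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Stub: "device" allocation (plain host memory here, for illustration only).
static double* device_malloc(std::size_t n) { return new double[n]; }
static void device_free(double* p) { delete[] p; }
static void device_memcpy(double* dst, const double* src, std::size_t n) {
    std::memcpy(dst, src, n * sizeof(double));  // stands in for hipMemcpy
}
// Stub: a GPU-aware collective operating on device pointers
// (a no-op here; a real build would call MPI_Allreduce on the device buffer).
static void gpu_aware_allreduce(double* /*dev_buf*/, std::size_t /*n*/) {}

// Stage host data to the device, run the collective on the device pointer,
// and copy the result back -- the pattern the comment above describes.
void reduce_on_device(std::vector<double>& host_buf) {
    double* dev = device_malloc(host_buf.size());
    device_memcpy(dev, host_buf.data(), host_buf.size());   // copy in
    gpu_aware_allreduce(dev, host_buf.size());              // MPI on GPU buffer
    device_memcpy(host_buf.data(), dev, host_buf.size());   // copy out
    device_free(dev);
}
```

If the library exposed its device buffers directly, both the allocation and the copy-in step would disappear, which is exactly the saving the comment asks about.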
This PR enables COSMA to take advantage of fast GPU-to-GPU interconnects like NVLink, to efficiently utilize modern multi-GPU systems. This is achieved in two ways:
- **Using `NCCL/RCCL` libraries:** by specifying the `-DCOSMA_WITH_NCCL=ON` (for NVIDIA GPUs) or `-DCOSMA_WITH_RCCL=ON` (for AMD GPUs) cmake options.
- **Using GPU-aware MPI:** by specifying the `-DCOSMA_WITH_GPU_AWARE_MPI=ON` cmake option, as proposed [here](#101 (comment)).

See [README](https://github.com/eth-cscs/COSMA/blob/master/README.md) and [INSTALL](https://github.com/eth-cscs/COSMA/blob/master/INSTALL.md) for more info on how to build.

In addition, the following performance improvements have been made:
- **Improved caching:** all NCCL buffers, MPI communicators, and NCCL communicators are cached and reused when appropriate; all device memory is cached and reused.
- **Reduced data transfers:** the GPU backend of COSMA, called [Tiled-MM](https://github.com/eth-cscs/Tiled-MM), is extended to offer the user the possibility to leave the resulting matrix C on the GPU. In that case, there is no need to transfer matrix C from device to host, which not only reduces communication, but also speeds up the whole cpu->gpu pipeline, as no additional synchronizations are needed. Furthermore, the reduce_scatter operation does not have to wait for C to be transferred back to the host, but is immediately invoked with GPU pointers, thus utilizing fast inter-GPU links. This way, there are no unnecessary data transfers between cpu<->gpu.
- **All collectives updated:** both the `all-gather` and `reduce-scatter` collectives are improved.
- **Reduced data reshuffling:** avoids double reshuffling of data, i.e. the data from the NCCL/RCCL GPU buffers is immediately copied into the right layout, without additional reshuffling.
- **Works for variable blocks:** NCCL/RCCL's reduce_scatter operation assumes that all blocks are of the same size and is hence not completely equivalent to the `MPI_Reduce_scatterv` we previously used. We pad all blocks to overcome this issue.
- **Portability:** supports both NVIDIA and AMD GPUs.

Therefore, this fixes the limitations of #101 and brings the above-mentioned improvements. Thanks to @alazzaro and @gsitaram for their great feedback and contributions to this PR!
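The variable-block padding mentioned above can be illustrated with a short sketch. This is not COSMA's actual code, only the arithmetic behind the idea: `ncclReduceScatter` requires every rank to receive the same element count, so blocks of varying size are padded up to the largest block before the collective is issued.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Given the per-rank block sizes that MPI_Reduce_scatterv would use,
// return the uniform (padded) block size that ncclReduceScatter requires.
std::size_t padded_block_size(const std::vector<std::size_t>& block_sizes) {
    return *std::max_element(block_sizes.begin(), block_sizes.end());
}

// Total send-buffer size after padding: one padded block per rank.
// The extra (padded - original) elements per block are zero-filled, so the
// reduction result in the "real" part of each block is unchanged.
std::size_t padded_sendbuf_size(const std::vector<std::size_t>& block_sizes) {
    return padded_block_size(block_sizes) * block_sizes.size();
}
```

For example, with per-rank blocks of 3, 5, and 2 elements, each block is padded to 5 elements and the send buffer grows from 10 to 15 elements; the padding cost is the price of using the faster NCCL/RCCL collective.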
This PR adds support for rcclReduceScatter using device buffers, in order to improve the performance of this collective operation across processes on multiple nodes. In addition, support for roctx ranges has been added, so that rocprof timelines can show both CPU and GPU activity. This code is more a proof of concept than a final product; more work is needed to make it production ready. I list the missing pieces below:
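The roctx-range instrumentation mentioned above follows a simple push/pop pattern. A hedged sketch, with `roctxRangePush`/`roctxRangePop` stubbed out so it runs without ROCm (real code would include `<roctx.h>` and link against the roctx library); the stub only tracks nesting depth so the pattern can be checked:

```cpp
// Stub state and stubs standing in for the real roctx API (illustration only).
static int roctx_depth = 0;
static void roctxRangePush(const char* /*name*/) { ++roctx_depth; }
static void roctxRangePop() { --roctx_depth; }

// Annotate a CPU-side phase so a rocprof timeline can show it alongside
// the GPU activity: push a named range, do the work, pop the range.
int annotated_phase() {
    roctxRangePush("cosma: reduce_scatter");  // open a named range
    int work = 42;                            // placeholder for real work
    roctxRangePop();                          // close the range
    return work;
}
```

Every push must be matched by a pop on the same thread, otherwise the ranges in the rocprof timeline become unbalanced.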