Integrating FTorch with a distributed, CPU-based solver can lead to a scenario where there are N (`--ntasks-per-node`) MPI ranks and M (`torch::cuda::device_count()`) GPUs per node, with **M** <= **N**. The current implementation of FTorch appears to use only GPU:0 for all N MPI ranks. Giving the user the ability to decide which GPU each rank targets would ensure that all available GPUs are used.
An initial discussion regarding this potential feature: #84.
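As a rough illustration of the kind of control this would enable, here is a sketch (in C++ against the libtorch API and MPI, not current FTorch Fortran API) that maps each node-local MPI rank to a GPU round-robin via `rank % device_count`. The model path `"model.pt"` and the node-local communicator setup are assumptions for the example; FTorch would need to expose an equivalent device-index argument from Fortran.

```cpp
// Sketch: round-robin assignment of MPI ranks to GPUs on a node.
#include <mpi.h>
#include <torch/script.h>

int main(int argc, char* argv[]) {
  MPI_Init(&argc, &argv);

  // Split the world communicator so ranks are numbered per node.
  MPI_Comm node_comm;
  MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                      MPI_INFO_NULL, &node_comm);
  int local_rank;
  MPI_Comm_rank(node_comm, &local_rank);

  // M GPUs on this node; map the N local ranks onto them round-robin.
  const int num_devices = static_cast<int>(torch::cuda::device_count());
  const int device_index = local_rank % num_devices;
  torch::Device device(torch::kCUDA, device_index);

  // Load the TorchScript model directly onto this rank's GPU.
  torch::jit::script::Module model = torch::jit::load("model.pt", device);

  // ... build input tensors on `device` and call model.forward(...) ...

  MPI_Comm_free(&node_comm);
  MPI_Finalize();
  return 0;
}
```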
Furthermore, there may still be multiple MPI ranks per GPU even after distributing the ranks uniformly across the available GPUs. In that case the GPU will most likely execute the per-rank copies of the ML model serially. CUDA MPS could be used to run these copies concurrently. An alternative would be to gather the inputs to a single task, run the ML model from that task, and scatter the results back to the respective tasks inside the Fortran code; a sketch of that pattern follows.
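For illustration only, the gather-infer-scatter alternative could look roughly like the following (shown in C++/MPI rather than Fortran, and assuming every rank owns `n_local` float inputs that the model maps one-to-one to outputs; the function name and data layout are hypothetical):

```cpp
// Sketch: gather inputs to rank 0, run one batched forward pass, scatter results.
#include <mpi.h>
#include <torch/script.h>
#include <algorithm>
#include <vector>

void gathered_inference(torch::jit::script::Module& model,
                        const std::vector<float>& local_in,
                        std::vector<float>& local_out,
                        MPI_Comm comm) {
  int rank, size;
  MPI_Comm_rank(comm, &rank);
  MPI_Comm_size(comm, &size);
  const int n_local = static_cast<int>(local_in.size());

  // Gather every rank's inputs onto rank 0.
  std::vector<float> all_in(rank == 0 ? n_local * size : 0);
  MPI_Gather(local_in.data(), n_local, MPI_FLOAT,
             all_in.data(), n_local, MPI_FLOAT, 0, comm);

  std::vector<float> all_out(rank == 0 ? n_local * size : 0);
  if (rank == 0) {
    // Run a single batched forward pass on GPU:0 for all ranks' data.
    auto input = torch::from_blob(all_in.data(),
                                  {size * n_local}, torch::kFloat)
                     .to(torch::Device(torch::kCUDA, 0));
    auto output = model.forward({input}).toTensor()
                      .to(torch::kCPU).contiguous();
    std::copy_n(output.data_ptr<float>(), all_out.size(), all_out.begin());
  }

  // Scatter the results back to the owning ranks.
  local_out.resize(n_local);
  MPI_Scatter(all_out.data(), n_local, MPI_FLOAT,
              local_out.data(), n_local, MPI_FLOAT, 0, comm);
}
```

This trades the extra MPI communication for a single, larger batch on one GPU, which may or may not be faster than per-rank inference depending on message sizes and model cost.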