User ability to decide GPU device number #85

Closed
siddanib opened this issue Jan 2, 2024 · 3 comments · Fixed by #96
siddanib commented Jan 2, 2024

Integrating FTorch with a distributed CPU-based solver can lead to a scenario where there are N MPI ranks (`--ntasks-per-node`) and M GPUs (`torch::cuda::device_count()`) per node, with M <= N. The current implementation of FTorch appears to use only GPU:0 for all N MPI ranks. Giving the user the ability to decide which GPU each rank targets would ensure that all available GPUs are used.

An initial discussion regarding this potential feature: #84.

Furthermore, there may still be multiple MPI ranks per GPU even after uniformly distributing the ranks among the available GPUs. The GPU presumably executes these ML model copies serially; CUDA MPS could be used to run them concurrently. An alternative might be to gather the inputs to a single task, run the ML model from that task, and finally scatter the results back to the respective tasks, all inside the Fortran code, as sketched below.
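
For illustration, a minimal sketch of that gather/infer/scatter alternative, assuming equal-sized per-rank buffers; `run_model` is a hypothetical placeholder for an FTorch-backed forward pass:

```fortran
subroutine gathered_inference(local_in, local_out, n_local, comm)
   use mpi
   implicit none
   integer, intent(in) :: n_local, comm
   real, intent(in)  :: local_in(n_local)
   real, intent(out) :: local_out(n_local)
   real, allocatable :: all_in(:), all_out(:)
   integer :: rank, nranks, ierr

   call mpi_comm_rank(comm, rank, ierr)
   call mpi_comm_size(comm, nranks, ierr)

   if (rank == 0) then
      allocate(all_in(n_local*nranks), all_out(n_local*nranks))
   else
      allocate(all_in(1), all_out(1))   ! gather/scatter buffers only matter on the root
   end if

   ! Collect every rank's inputs on rank 0
   call mpi_gather(local_in, n_local, mpi_real, all_in, n_local, mpi_real, 0, comm, ierr)

   ! Rank 0 runs a single batched forward pass on its GPU
   ! (run_model is a hypothetical FTorch-backed wrapper, not part of FTorch itself)
   if (rank == 0) call run_model(all_in, all_out)

   ! Return each rank's slice of the results
   call mpi_scatter(all_out, n_local, mpi_real, local_out, n_local, mpi_real, 0, comm, ierr)

   deallocate(all_in, all_out)
end subroutine gathered_inference
```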

@TomMelt TomMelt self-assigned this Jan 15, 2024
@TomMelt TomMelt added the enhancement New feature or request label Jan 15, 2024
@jatkinson1000 jatkinson1000 assigned jwallwork23 and unassigned TomMelt Mar 22, 2024
@jatkinson1000 (Member) commented

@siddanib PR #96 has been opened to address this. It builds on your suggestions (thank you!) in a way that is consistent with our overall approach.

Please do feel free to take a look and see what you think.

@jatkinson1000 (Member) commented

@siddanib this is now implemented in the main code.
You can see information here and a worked example here.

Thank you for raising this as an improvement to the code, and for all the help you provided in pointing us to the relevant information.
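
For reference, a minimal sketch of the resulting usage pattern, assuming the optional `device_index` argument added by #96 and a round-robin mapping of MPI ranks to GPUs; exact interface names and argument orders may differ between FTorch versions, so the library's documentation and its MultiGPU worked example are authoritative:

```fortran
program multigpu_sketch
   use, intrinsic :: iso_fortran_env, only : sp => real32
   use mpi
   use ftorch

   implicit none

   integer :: ierr, rank, device_index
   integer, parameter :: num_devices = 4          ! GPUs per node (assumed; query at run time in practice)
   integer, parameter :: tensor_layout(1) = [1]
   type(torch_model) :: model
   type(torch_tensor), dimension(1) :: in_tensors, out_tensors
   real(sp), dimension(5), target :: in_data, out_data

   call mpi_init(ierr)
   call mpi_comm_rank(mpi_comm_world, rank, ierr)

   ! Round-robin mapping of MPI ranks to the GPUs on a node
   device_index = mod(rank, num_devices)

   in_data = real(rank, sp)

   ! Place the input tensor and the model on this rank's assigned GPU;
   ! the output tensor lives on the CPU
   call torch_tensor_from_array(in_tensors(1), in_data, tensor_layout, torch_kCUDA, &
                                device_index=device_index)
   call torch_tensor_from_array(out_tensors(1), out_data, tensor_layout, torch_kCPU)
   call torch_model_load(model, "saved_model.pt", torch_kCUDA, device_index=device_index)

   call torch_model_forward(model, in_tensors, out_tensors)

   call torch_delete(model)
   call torch_delete(in_tensors(1))
   call torch_delete(out_tensors(1))
   call mpi_finalize(ierr)
end program multigpu_sketch
```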

@siddanib (Author) commented

Thank you very much for providing this capability, @jatkinson1000 & @jwallwork23! I will look into this new feature and get back to you.
