Fix to issue 235. If max_gpuprocs is specified, use that value even … #236

craigwarner-ufastro
Contributor

…if it results in a crash. If not, attempt to find the number of GPUs and set max_gpuprocs equal to that instead of the number of ranks. When using MPI, cupy.cuda.runtime.getDeviceCount() will not work if gpu-bind is used, so we first attempt to check /proc. If that fails, we gather the device PCI bus IDs to count devices and issue a warning, because this is slow (~2s for 64 ranks). Looking at /proc should work on all Linux systems.
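The fallback order described above could be sketched roughly as follows. This is not the code merged in this PR; it is a minimal illustration under stated assumptions. The function names, the `proc_dir` default of `/proc/driver/nvidia/gpus` (the description only says "/proc"), and the `gather_bus_ids` callable (a stand-in for whatever MPI gather of per-rank PCI bus IDs is actually used) are all hypothetical.

```python
import os
import warnings


def count_gpus_from_proc(proc_dir="/proc/driver/nvidia/gpus"):
    """Count GPU entries under proc_dir; return None if it cannot be read.

    The default path is an assumption for illustration; the PR description
    only says that /proc is checked.
    """
    try:
        entries = os.listdir(proc_dir)
    except OSError:
        return None
    # An empty directory is treated the same as "could not determine".
    return len(entries) or None


def resolve_max_gpuprocs(max_gpuprocs=None,
                         proc_dir="/proc/driver/nvidia/gpus",
                         gather_bus_ids=None):
    """Pick the number of GPU processes following the order in the description.

    gather_bus_ids is a hypothetical callable returning one PCI bus id per
    rank (in real use, something like an MPI gather across ranks).
    """
    # 1. An explicit user value always wins, even if it later causes a crash.
    if max_gpuprocs is not None:
        return max_gpuprocs

    # 2. Fast path: count devices by looking at /proc.
    ngpu = count_gpus_from_proc(proc_dir)
    if ngpu is not None:
        return ngpu

    # 3. Slow path: count unique PCI bus ids gathered from all ranks.
    if gather_bus_ids is not None:
        warnings.warn("Could not count GPUs via /proc; "
                      "gathering PCI bus ids instead (slow)")
        return len(set(gather_bus_ids()))

    # Nothing worked: report no GPUs.
    return 0
```

Passing the gather step in as a callable keeps the sketch runnable without MPI or a GPU; with binding active, each rank would typically report a single device, which is why the unique IDs are counted across ranks rather than per rank.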

@coveralls

Coverage Status

Coverage: 37.913% (-0.4%) from 38.326% when pulling 32e0192 on 235-rrdesi_mpi-gpu-out-of-memory-errors-unless-specifying-max-gpuprocs-4 into 2ac2a0f on main.

@sbailey
Collaborator

sbailey commented Mar 14, 2023

Thanks. I think this wins the award for the longest DESI branch name yet. Thanks for all the off-list discussion and checks about what is both robust and performance optimal. I checked that the final branch works on both GPU and non-GPU machines with default params. Looks good. Merging.

@sbailey sbailey merged commit 3c0ec64 into main Mar 14, 2023
10 checks passed
@sbailey sbailey deleted the 235-rrdesi_mpi-gpu-out-of-memory-errors-unless-specifying-max-gpuprocs-4 branch March 14, 2023 00:24
@craigwarner-ufastro
Contributor Author

Haha! I thought the same - I clicked the button on this page to fork a new branch for this issue and it created that super long name for it :)

Development

Successfully merging this pull request may close these issues.

rrdesi_mpi --gpu out of memory errors unless specifying --max-gpuprocs 4