GPU memory and rank allocation #1899
Conversation
Recently discovered: the logs from this PR test run have messages like the GTL warnings discussed below. This does not seem to be causing any tangible problems, but also doesn't seem good.
This looks good overall, just one major issue with the contiguous rank/device mapping method.
I'm also a bit puzzled by the `sys.stdout.flush` issue and the GTL messages. They don't seem to be blocking for now, but I'll see if I can figure out what is going on there.
py/desispec/gpu.py
```python
if method == 'round-robin':
    device_id = comm.rank % ngpu
elif method == 'contiguous':
    device_id = int(comm.rank / ngpu)
```
There is a bug here for cases where `int(comm.size / ngpu) >= ngpu`. For example, rank 31 in a comm of size 32 would get assigned `device_id=7` on a node with only 4 GPUs.

This also gets a little more complicated if the MPI comm spans multiple nodes.
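For illustration only, here is a minimal sketch of one way to keep a contiguous mapping in bounds, including the multi-node case. The function name, the use of an mpi4py shared-memory sub-communicator, and the ceiling-division block size are assumptions for this example, not the actual change made in this PR.

```python
from mpi4py import MPI

def contiguous_device_id(comm, ngpu):
    """Sketch: assign ranks to GPUs in contiguous per-node blocks, always within 0..ngpu-1."""
    # Group the ranks that share this node so the mapping is per-node rather than global
    node_comm = comm.Split_type(MPI.COMM_TYPE_SHARED)
    # Ceiling division so every local rank maps to a valid device id
    ranks_per_gpu = max(1, (node_comm.size + ngpu - 1) // ngpu)
    return min(node_comm.rank // ranks_per_gpu, ngpu - 1)
```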
```python
device_id = 0
cupy.cuda.Device(device_id).use()
log.info(f'No MPI communicator; assigning process to GPU {device_id}/{ngpu}')
```
I double-checked that cupy's 0-based indexing is compatible with using the `CUDA_VISIBLE_DEVICES` environment variable to set GPU device visibility.

```console
dmargala@nid001065:~> srun -n 1 --gpus-per-node 4 bash -c 'CUDA_VISIBLE_DEVICES=3 python -c "import cupy; device = cupy.cuda.Device(); print(device.id, device.pci_bus_id)"'
0 0000:C1:00.0
dmargala@nid001065:~> srun -n 1 --gpus-per-node 4 bash -c 'CUDA_VISIBLE_DEVICES=0 python -c "import cupy; device = cupy.cuda.Device(); print(device.id, device.pci_bus_id)"'
0 0000:03:00.0
```

This looks good to me. cupy reports device index zero in both cases, but we can see that the PCI bus id changes when we choose a device via the `CUDA_VISIBLE_DEVICES` environment variable.
```python
desispec.gpu.free_gpu_memory()
desispec.gpu.redistribute_gpu_ranks(comm)
```
There is a risk of clearing the wrong device's memory pool if these functions are called in the opposite order. Perhaps it would be worth adding an opt-out option like `desispec.gpu.redistribute_gpu_ranks(..., free_gpu_memory=True)` that calls `desispec.gpu.free_gpu_memory()` before switching to a new device?
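A minimal sketch of what that opt-out keyword could look like, assuming cupy's default memory pool is what holds the cached allocations; the `_assign_device_id` helper and the exact signature are hypothetical, not the desispec implementation.

```python
import cupy

def redistribute_gpu_ranks(comm, method='contiguous', free_gpu_memory=True):
    """Sketch: optionally free this process's cached GPU memory before switching devices."""
    if free_gpu_memory:
        # Release cached blocks held on the currently active device
        cupy.get_default_memory_pool().free_all_blocks()
    device_id = _assign_device_id(comm, method)  # hypothetical helper for the rank:GPU mapping
    cupy.cuda.Device(device_id).use()
    return device_id
```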
@dmargala thanks for catching the method='contiguous' bug. I have fixed that, including handling the multi-node case, albeit with some assumptions about MPI rank distribution documented in the docstring. Please double check me. Good point about freeing memory before re-allocating ranks to GPUs. I couldn't think of a reason why we would want to re-allocate rank:gpu without freeing the unused memory, so I added a call to `free_gpu_memory()` within `redistribute_gpu_ranks()`.
@sbailey the method='contiguous' logic looks good now. The main reason to not free the memory pool would be performance considerations, depending on the situation. I think it's fine to leave as is until a performance study reveals an opportunity there.
This PR fixes the GPU memory and rank:GPU allocation problems reported in issue #1897, using the methods suggested by @dmargala in that issue. This PR provides a new `desispec.gpu` module with helper functions (see the usage sketch after this list):

- `is_gpu_available()`: standardizes checking whether GPU+cupy+numba.cuda is available (will be used more in a future PR that will remove the other places we do the same sorts of calculations)
- `free_gpu_memory()`: free unused GPU memory if possible, no-op if we're not using GPUs
- `redistribute_gpu_ranks(comm)`: redistribute the MPI rank:GPU mapping, handling the case of `comm=None` and the case of no GPUs
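For context, a typical call pattern with these helpers might look like the sketch below, based on the function names listed above and the diff in this PR; exact signatures may differ.

```python
from mpi4py import MPI
import desispec.gpu

comm = MPI.COMM_WORLD

if desispec.gpu.is_gpu_available():
    # Free cached GPU memory held by this process, then remap MPI ranks to GPUs
    desispec.gpu.free_gpu_memory()
    desispec.gpu.redistribute_gpu_ranks(comm)
```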
Testing was bogged down by one odd issue: calling `sys.stdout.flush` immediately after reassigning the GPUs resulted in the error `srun: error: eio_handle_mainloop: Abandoning IO 60 secs after job shutdown initiated`, but flushing before reassigning did not cause this. The current version does not include those `sys.stdout.flush()` commands, and hasn't failed in multiple test runs. I'm mainly documenting this for the record in case the error pops up again with larger test runs.