
GPU memory and rank allocation #1899

Merged
merged 4 commits into main from gpu_mem Nov 8, 2022

Conversation

sbailey
Contributor

@sbailey sbailey commented Nov 7, 2022

This PR fixes the GPU memory and rank:GPU allocation problems reported in issue #1897, using the methods suggested by @dmargala in that issue. It provides a new desispec.gpu module with the following helper functions (a usage sketch follows the list):

  • is_gpu_available(): a standardized check for whether a GPU plus cupy and numba.cuda are available (this will be used more in a future PR that removes the other places where we make the same sort of check)
  • free_gpu_memory(): free unused GPU memory if possible; a no-op if we're not using GPUs
  • redistribute_gpu_ranks(comm): redistribute the MPI rank:GPU mapping, handling both comm=None and the no-GPU case
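
For orientation, a minimal usage sketch based on the descriptions above (illustrative only; the actual signatures and behavior are defined in desispec.gpu, and the comm setup is just an example):

    from mpi4py import MPI
    import desispec.gpu

    comm = MPI.COMM_WORLD

    print('GPU available:', desispec.gpu.is_gpu_available())

    # Map this MPI rank to a GPU (no-op without GPUs; also handles comm=None)
    desispec.gpu.redistribute_gpu_ranks(comm)

    # ... GPU-accelerated work here ...

    # Release unused GPU memory back to the device (no-op without GPUs)
    desispec.gpu.free_gpu_memory()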

Testing was bogged down by:

  • typos resulting in NameError and ValueError exceptions whose messages were swallowed by stdout buffering before the job was killed
  • a surprising feature/bug where calling sys.stdout.flush() immediately after reassigning the GPUs resulted in the error srun: error: eio_handle_mainloop: Abandoning IO 60 secs after job shutdown initiated, while flushing before reassigning did not
    • This remains surprising enough that I'm not entirely convinced the flush placement was the actual relevant factor, rather than just a tweak to the timing of some other race condition that I wasn't understanding while trying to debug where my stdout messages were going. The current branch doesn't have the debugging sys.stdout.flush() calls and hasn't failed in multiple test runs. I'm mainly documenting this for the record in case the error pops up again in larger test runs.

@sbailey sbailey requested a review from dmargala November 7, 2022 05:08
@sbailey sbailey added this to In progress in Himalayas via automation Nov 7, 2022
@sbailey
Contributor Author

sbailey commented Nov 7, 2022

Recently discovered: the logs from this PR test run have messages like

[6] This process is not using the same device it did during gtl_init.
 Current device_id 2 original device_id 0.
 Use of IPC is disabled from this point. IPC PUT protocol may be used instead 

This does not seem to be causing any tangible problems, but also doesn't seem good.


@dmargala dmargala left a comment


This looks good overall, just one major issue with the contiguous rank/device mapping method.

I'm also a bit puzzled by the sys.stdout.flush issue and the GTL messages. They don't seem to be blocking anything for now, but I'll see if I can figure out what is going on there.

if method == 'round-robin':
    device_id = comm.rank % ngpu
elif method == 'contiguous':
    device_id = int(comm.rank / ngpu)
@dmargala
Contributor

There is a bug here for ranks where int(comm.rank / ngpu) >= ngpu, which happens on a single node whenever comm.size > ngpu**2. For example, rank 31 in a comm of size 32 would be assigned device_id=7 on a node with only 4 GPUs.

This also gets a little more complicated if the MPI comm spans multiple nodes.
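
For illustration, one way to keep a contiguous mapping in range on a single node is to assign blocks of ceil(comm.size / ngpu) consecutive ranks per GPU. This is only a sketch of the idea, not the code that was eventually merged (which also handles the multi-node case under documented assumptions about rank placement):

    # Hypothetical single-node contiguous mapping that stays within [0, ngpu)
    ranks_per_gpu = (comm.size + ngpu - 1) // ngpu    # ceil(comm.size / ngpu)
    device_id = min(comm.rank // ranks_per_gpu, ngpu - 1)

With comm.size=32 and ngpu=4 this assigns ranks 0-7 to device 0, 8-15 to device 1, 16-23 to device 2, and 24-31 to device 3.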

Comment on lines +67 to +69
device_id = 0
cupy.cuda.Device(device_id).use()
log.info(f'No MPI communicator; assigning process to GPU {device_id}/{ngpu}')
@dmargala
Contributor

I double-checked that cupy's 0-based device indexing is compatible with using CUDA_VISIBLE_DEVICES to set GPU device visibility.

dmargala@nid001065:~> srun -n 1 --gpus-per-node 4 bash -c 'CUDA_VISIBLE_DEVICES=3 python -c "import cupy; device = cupy.cuda.Device(); print(device.id, device.pci_bus_id)"'
0 0000:C1:00.0
dmargala@nid001065:~> srun -n 1 --gpus-per-node 4 bash -c 'CUDA_VISIBLE_DEVICES=0 python -c "import cupy; device = cupy.cuda.Device(); print(device.id, device.pci_bus_id)"'
0 0000:03:00.0

This looks good to me. cupy reports device index zero in both cases, but we can see that the PCI bus ID changes when we choose a device via the CUDA_VISIBLE_DEVICES environment variable.

Comment on lines 186 to 187
desispec.gpu.free_gpu_memory()
desispec.gpu.redistribute_gpu_ranks(comm)
@dmargala
Contributor

There is a risk of clearing the wrong device's memory pool if these functions are called in the opposite order. Perhaps it would be worth adding an opt-out option like desispec.gpu.redistribute_gpu_ranks(..., free_gpu_memory=True) so that desispec.gpu.free_gpu_memory() is called before switching to a new device?
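
The ordering matters because the memory-pool cleanup applies to the currently selected device, so freeing after switching devices would release blocks from the new device's (likely empty) pool rather than from the device that actually holds the allocations. A sketch of what free_gpu_memory() presumably does, not necessarily the exact implementation in this PR:

    import cupy

    def free_gpu_memory():
        """Release unused blocks held by cupy's default memory pool on the current device."""
        cupy.get_default_memory_pool().free_all_blocks()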

@sbailey
Contributor Author

sbailey commented Nov 8, 2022

@dmargala thanks for catching the method='contiguous' bug. I have fixed that, including handling the multi-node case, albeit with some assumptions about MPI rank distribution documented in the docstring. Please double check me.

Good point about freeing memory before re-allocating ranks to GPUs. I couldn't think of a reason why we would want to re-allocate rank:gpu without freeing the unused memory, so I added a call to free_gpu_memory inside redistribute_gpu_ranks, without an option to toggle that behavior. If we find we need to reallocate ranks without freeing memory, we can add an option for that in the future. Or if you know of a reason why we need that now, let me know and I can add it. In the meantime, the default does the right thing (free memory before re-distributing) and avoids options that we may never use.
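
For reference, the resulting call order inside redistribute_gpu_ranks is roughly as follows. This is an illustrative sketch of the round-robin path only, using the module's own is_gpu_available() and free_gpu_memory() helpers; it is not the merged code, which also implements method='contiguous' and the multi-node handling mentioned above:

    import cupy

    def redistribute_gpu_ranks(comm):
        """Reassign this MPI rank to a GPU, freeing unused memory on the current device first."""
        if not is_gpu_available():
            return
        free_gpu_memory()                        # free the pool on the device we are about to leave
        ngpu = cupy.cuda.runtime.getDeviceCount()
        rank = 0 if comm is None else comm.rank
        device_id = rank % ngpu                  # round-robin rank:GPU mapping
        cupy.cuda.Device(device_id).use()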

@dmargala
Contributor

dmargala commented Nov 8, 2022

@sbailey the method='contiguous' logic looks good now.

The main reason not to free the memory pool would be performance, depending on the situation. I think it's fine to leave it as is until a performance study reveals an opportunity there.

@sbailey sbailey merged commit b97ca67 into main Nov 8, 2022
Himalayas automation moved this from In progress to Done Nov 8, 2022
@sbailey sbailey deleted the gpu_mem branch November 8, 2022 21:29