
GPU memory and rank allocation #1899

Merged
merged 4 commits into main from gpu_mem Nov 8, 2022

Conversation

sbailey
Contributor

@sbailey sbailey commented Nov 7, 2022

This PR fixes the GPU memory and rank:GPU allocation problems reported in issue #1897, using the methods suggested by @dmargala in that issue. It provides a new desispec.gpu module with the following helper functions (a usage sketch follows the list):

  • is_gpu_available(): a standardized check for whether a GPU plus cupy and numba.cuda are available (this will be used more in a future PR that removes the other places where we make the same sort of check)
  • free_gpu_memory(): free unused GPU memory if possible; a no-op if we're not using GPUs
  • redistribute_gpu_ranks(comm): redistribute the MPI rank:GPU mapping, handling both comm=None and the no-GPU case
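
For orientation, a minimal usage sketch based on the descriptions above (illustrative only; the actual signatures and behavior are defined in desispec.gpu, and the comm setup is just an example):

    from mpi4py import MPI
    import desispec.gpu

    comm = MPI.COMM_WORLD

    print('GPU available:', desispec.gpu.is_gpu_available())

    # Map this MPI rank to a GPU (no-op without GPUs; also handles comm=None)
    desispec.gpu.redistribute_gpu_ranks(comm)

    # ... GPU-accelerated work here ...

    # Release unused GPU memory back to the device (no-op without GPUs)
    desispec.gpu.free_gpu_memory()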

Testing was bogged down by:

  • typos resulting in NameError and ValueError exceptions whose messages were swallowed by stdout buffering before the job was killed
  • a surprising feature/bug where calling sys.stdout.flush() immediately after reassigning the GPUs resulted in the error srun: error: eio_handle_mainloop: Abandoning IO 60 secs after job shutdown initiated, while flushing before reassigning did not
    • This remains surprising enough that I'm not entirely convinced the flush placement was the actual relevant factor, rather than just a tweak to the timing of some other race condition that I wasn't understanding while trying to debug where my stdout messages were going. The current branch doesn't have the debugging sys.stdout.flush() calls and hasn't failed in multiple test runs. I'm mainly documenting this for the record in case the error pops up again in larger test runs.

@sbailey sbailey requested a review from dmargala November 7, 2022 05:08
@sbailey sbailey added this to In progress in Himalayas via automation Nov 7, 2022
@sbailey
Contributor Author

sbailey commented Nov 7, 2022

Recently discovered: the logs from this PR test run have messages like

[6] This process is not using the same device it did during gtl_init.
 Current device_id 2 original device_id 0.
 Use of IPC is disabled from this point. IPC PUT protocol may be used instead 

This does not seem to be causing any tangible problems, but also doesn't seem good.


@dmargala dmargala left a comment


This looks good overall, just one major issue with the contiguous rank/device mapping method.

I'm also a bit puzzled by the sys.stdout.flush issue and the GTL messages. They don't seem to be blocking anything for now, but I'll see if I can figure out what is going on there.

if method == 'round-robin':
    device_id = comm.rank % ngpu
elif method == 'contiguous':
    device_id = int(comm.rank / ngpu)
@dmargala
Contributor

There is a bug here for ranks where int(comm.rank / ngpu) >= ngpu, which happens on a single node whenever comm.size > ngpu**2. For example, rank 31 in a comm of size 32 would be assigned device_id=7 on a node with only 4 GPUs.

This also gets a little more complicated if the MPI comm spans multiple nodes.
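
For illustration, one way to keep a contiguous mapping in range on a single node is to assign blocks of ceil(comm.size / ngpu) consecutive ranks per GPU. This is only a sketch of the idea, not the code that was eventually merged (which also handles the multi-node case under documented assumptions about rank placement):

    # Hypothetical single-node contiguous mapping that stays within [0, ngpu)
    ranks_per_gpu = (comm.size + ngpu - 1) // ngpu    # ceil(comm.size / ngpu)
    device_id = min(comm.rank // ranks_per_gpu, ngpu - 1)

With comm.size=32 and ngpu=4 this assigns ranks 0-7 to device 0, 8-15 to device 1, 16-23 to device 2, and 24-31 to device 3.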

Comment on lines +67 to +69
device_id = 0
cupy.cuda.Device(device_id).use()
log.info(f'No MPI communicator; assigning process to GPU {device_id}/{ngpu}')
@dmargala
Contributor

I double-checked that cupy's 0-based device indexing is compatible with using CUDA_VISIBLE_DEVICES to set GPU device visibility.

dmargala@nid001065:~> srun -n 1 --gpus-per-node 4 bash -c 'CUDA_VISIBLE_DEVICES=3 python -c "import cupy; device = cupy.cuda.Device(); print(device.id, device.pci_bus_id)"'
0 0000:C1:00.0
dmargala@nid001065:~> srun -n 1 --gpus-per-node 4 bash -c 'CUDA_VISIBLE_DEVICES=0 python -c "import cupy; device = cupy.cuda.Device(); print(device.id, device.pci_bus_id)"'
0 0000:03:00.0

This looks good to me. cupy reports device index zero in both cases, but we can see that the PCI bus ID changes when we choose a device via the CUDA_VISIBLE_DEVICES environment variable.

Comment on lines 186 to 187
desispec.gpu.free_gpu_memory()
desispec.gpu.redistribute_gpu_ranks(comm)
@dmargala
Contributor

There is a risk of clearing the wrong device's memory pool if these functions are called in the opposite order. Perhaps it would be worth adding an opt-out option like desispec.gpu.redistribute_gpu_ranks(..., free_gpu_memory=True) so that desispec.gpu.free_gpu_memory() is called before switching to a new device?
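
The ordering matters because the memory-pool cleanup applies to the currently selected device, so freeing after switching devices would release blocks from the new device's (likely empty) pool rather than from the device that actually holds the allocations. A sketch of what free_gpu_memory() presumably does, not necessarily the exact implementation in this PR:

    import cupy

    def free_gpu_memory():
        """Release unused blocks held by cupy's default memory pool on the current device."""
        cupy.get_default_memory_pool().free_all_blocks()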

@sbailey
Contributor Author

sbailey commented Nov 8, 2022

@dmargala thanks for catching the method='contiguous' bug. I have fixed that, including handling the multi-node case, albeit with some assumptions about MPI rank distribution documented in the docstring. Please double check me.

Good point about freeing memory before re-allocating ranks to GPUs. I couldn't think of a reason why we would want to re-allocate rank:gpu without freeing the unused memory, so I added a call to free_gpu_memory inside redistribute_gpu_ranks, without an option to toggle that behavior. If we find we need to reallocate ranks without freeing memory, we can add an option for that in the future. Or if you know of a reason why we need that now, let me know and I can add it. In the meantime, the default does the right thing (free memory before re-distributing) and avoids options that we may never use.
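
For reference, the resulting call order inside redistribute_gpu_ranks is roughly as follows. This is an illustrative sketch of the round-robin path only, using the module's own is_gpu_available() and free_gpu_memory() helpers; it is not the merged code, which also implements method='contiguous' and the multi-node handling mentioned above:

    import cupy

    def redistribute_gpu_ranks(comm):
        """Reassign this MPI rank to a GPU, freeing unused memory on the current device first."""
        if not is_gpu_available():
            return
        free_gpu_memory()                        # free the pool on the device we are about to leave
        ngpu = cupy.cuda.runtime.getDeviceCount()
        rank = 0 if comm is None else comm.rank
        device_id = rank % ngpu                  # round-robin rank:GPU mapping
        cupy.cuda.Device(device_id).use()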

@dmargala
Contributor

dmargala commented Nov 8, 2022

@sbailey the method='contiguous' logic looks good now.

The main reason not to free the memory pool would be performance, depending on the situation. I think it's fine to leave it as is until a performance study reveals an opportunity there.

@sbailey sbailey merged commit b97ca67 into main Nov 8, 2022
Himalayas automation moved this from In progress to Done Nov 8, 2022
@sbailey sbailey deleted the gpu_mem branch November 8, 2022 21:29