Collaborator

@msimberg msimberg commented Jul 2, 2025

hwloc 2.11.0 started exposing the Grace-Hopper GPU NUMA nodes by default (https://github.com/open-mpi/hwloc/blob/41030697179b16f96f7e169f4530061c5fe6803f/NEWS#L164):

Version 2.11.0
--------------
...
* GPU support
  + Don't hide the GPU NUMA node on NVIDIA Grace Hopper.

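This means that the MPS wrapper script, which derives the GPU index for a rank from the NUMA nodes of its binding, would pick up the GPU NUMA nodes as well and end up with invalid values (GPU NUMA nodes are indexed >= 4). For example: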

$ hwloc-calc --physical --intersect NUMAnode $(hwloc-bind socket:0 -- hwloc-bind --get --taskset) # default
0,4
$ HWLOC_KEEP_NVIDIA_GPU_NUMA_NODES=0 hwloc-calc --physical --intersect NUMAnode $(hwloc-bind socket:0 -- hwloc-bind --get --taskset) # explicitly ignoring GPU NUMA nodes
0

This is not yet a problem with the system hwloc (on daint at least), which is at version 2.9.0. However, if the system hwloc is updated, or if a newer hwloc is visible in an environment, the MPS wrapper script would misbehave in the way shown above.

This PR explicitly sets HWLOC_KEEP_NVIDIA_GPU_NUMA_NODES=0 when getting the GPU index for a rank. Setting it doesn't hurt on older hwloc versions.
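
As a rough sketch of the idea (not the actual wrapper script), the lookup could be wrapped as below. The function name `numa_nodes_ignoring_gpus` is hypothetical and only used for illustration; the `hwloc-bind` and `hwloc-calc` invocations are the same ones used in the example above.

```shell
# Hypothetical helper (not part of the actual wrapper script): print the
# physical NUMA node indices that intersect the current CPU binding,
# with the GPU NUMA nodes hidden.
numa_nodes_ignoring_gpus() {
    # Current binding of this process, in taskset format.
    taskset=$(hwloc-bind --get --taskset) || return 1
    # The variable is set for this single command only, so it is not
    # exported to whatever the wrapper launches afterwards.
    HWLOC_KEEP_NVIDIA_GPU_NUMA_NODES=0 \
        hwloc-calc --physical --intersect NUMAnode "$taskset"
}
```

Calling `numa_nodes_ignoring_gpus` from a process bound to socket 0 should then print only the CPU NUMA node, as in the second command of the example above.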

@msimberg msimberg requested review from RMeli and bcumming as code owners July 2, 2025 14:17

github-actions bot commented Jul 2, 2025

preview available: https://docs.tds.cscs.ch/179

Member

@RMeli RMeli left a comment

LGTM, but maybe worth adding an annotation with a bit of explanation?

Collaborator Author

msimberg commented Jul 3, 2025

> LGTM, but maybe worth adding an annotation with a bit of explanation?

Yep, fair point. I added a comment in 2e406a2. What do you think?

github-actions bot commented Jul 3, 2025

preview available: https://docs.tds.cscs.ch/179

@RMeli RMeli merged commit 4e5afdc into main Jul 3, 2025
1 check passed
@RMeli RMeli deleted the msimberg-patch-1 branch July 3, 2025 14:51