shell: support -o gpu-affinity=map:LIST
#5356
Commits on Jul 28, 2023
shell: export cpuset array functions from affinity.c
Problem: We'd like to support a map:LIST option for the gpu affinity shell plugin that matches the cpu affinity support, but the functions to create, parse, and destroy lists of hwloc_cpuset_t objects are internal to the shell CPU affinity plugin.

Export the cpuset array functions from affinity.c for use by other builtin shell plugins.
Commit b3c27c6
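The create/parse/destroy trio described in b3c27c6 can be pictured with a minimal, self-contained sketch. Everything here is an assumption made for illustration: `cpuset_t` is a 64-bit mask standing in for `hwloc_cpuset_t`, the names `cpuset_array_parse` and `cpuset_array_destroy` are hypothetical rather than the names exported from affinity.c, LIST is assumed to be semicolon-delimited, and only the simple list syntax (`0-3,7`) is handled.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for hwloc_cpuset_t: one bit per id, ids 0-63 only (assumption). */
typedef uint64_t cpuset_t;

/* Hypothetical name, not the function exported from affinity.c. */
void cpuset_array_destroy (cpuset_t *sets)
{
    free (sets);
}

/* Parse one entry such as "0-3,7" into a bitmask.
 * Returns 0 on success, -1 on malformed or out-of-range input.
 */
static int parse_entry (const char *s, cpuset_t *out)
{
    cpuset_t set = 0;

    if (*s == '\0')
        return -1;
    while (*s) {
        char *end;
        unsigned long lo, hi;

        if (*s < '0' || *s > '9')
            return -1;
        lo = hi = strtoul (s, &end, 10);
        if (*end == '-') {
            if (end[1] < '0' || end[1] > '9')
                return -1;
            hi = strtoul (end + 1, &end, 10);
        }
        if (lo > hi || hi > 63)
            return -1;
        for (unsigned long i = lo; i <= hi; i++)
            set |= (cpuset_t)1 << i;
        /* A comma must be followed by another id; anything else is an error */
        if (*end == ',' && end[1] != '\0')
            end++;
        else if (*end != '\0')
            return -1;
        s = end;
    }
    *out = set;
    return 0;
}

/* Split LIST on ';' and parse each entry into its own set.
 * Returns a malloc'd array with *count entries, or NULL on error.
 */
cpuset_t *cpuset_array_parse (const char *list, int *count)
{
    size_t n = 1;
    char *copy = NULL;
    char *entry;
    cpuset_t *sets = NULL;

    for (const char *p = list; *p; p++)
        if (*p == ';')
            n++;
    copy = malloc (strlen (list) + 1);
    sets = calloc (n, sizeof (*sets));
    if (!copy || !sets)
        goto error;
    strcpy (copy, list);
    entry = copy;
    for (size_t i = 0; i < n; i++) {
        char *sep = strchr (entry, ';');
        if (sep)
            *sep = '\0';
        if (parse_entry (entry, &sets[i]) < 0)
            goto error;
        if (sep)
            entry = sep + 1;
    }
    free (copy);
    *count = (int)n;
    return sets;
error:
    free (copy);
    free (sets);
    return NULL;
}
```

With this sketch, `cpuset_array_parse ("0-1;2,5;7", &n)` yields three sets, one per task, which is the shape both the cpu and gpu plugins would need in order to share the code.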
shell: gpubind: use cpuset array for per-task support
Problem: The gpubind plugin's "per-task" option works by allocating gpu ids from a shared gpus idset within the task.init callback, but this differs from how the cpu affinity plugin works, so code can't easily be shared.

Rewrite the gpubind plugin to use an array of hwloc_cpuset_t objects to track which gpus should be assigned to each task. To ease memory management, add a context object that contains relevant data for the plugin, and create this object at initialization.

The gpubind plugin now assigns GPUs to tasks during initialization, as does the affinity plugin. This allows the cpuset array code to be reused, and allows the gpubind plugin to support a "map:LIST" option in the future.
Commit 2e121cb
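The idea in 2e121cb of computing every task's GPU set once, at initialization, and keeping it in a context object might look roughly like the sketch below. It is not the plugin's code: the struct, its field names, and both function names are hypothetical, a 64-bit mask stands in for `hwloc_cpuset_t`, GPU ids are assumed to be 0..ngpus-1, and an even split of `ngpus / ntasks` ids per task is assumed.

```c
#include <stdint.h>
#include <stdlib.h>

/* Stand-in for hwloc_cpuset_t: one bit per GPU id, ids 0-63 only. */
typedef uint64_t gpuset_t;

/* Hypothetical context object; field names are illustrative only. */
struct gpubind_ctx {
    int ntasks;          /* number of local tasks */
    int ngpus;           /* GPUs assigned to the job on this node */
    gpuset_t *gpusets;   /* one set per local task */
};

void gpubind_ctx_destroy (struct gpubind_ctx *ctx)
{
    if (ctx) {
        free (ctx->gpusets);
        free (ctx);
    }
}

/* Create the context and assign GPU ids 0..ngpus-1 to tasks in
 * contiguous blocks of ngpus/ntasks ids each (assumed policy).
 * Returns NULL on invalid arguments or allocation failure.
 */
struct gpubind_ctx *gpubind_ctx_create (int ntasks, int ngpus)
{
    struct gpubind_ctx *ctx;
    int per_task;
    int next = 0;

    if (ntasks <= 0 || ngpus < 0 || ngpus > 64)
        return NULL;
    if (!(ctx = calloc (1, sizeof (*ctx)))
        || !(ctx->gpusets = calloc ((size_t)ntasks, sizeof (gpuset_t)))) {
        gpubind_ctx_destroy (ctx);
        return NULL;
    }
    ctx->ntasks = ntasks;
    ctx->ngpus = ngpus;
    per_task = ngpus / ntasks;
    for (int i = 0; i < ntasks; i++)
        for (int j = 0; j < per_task; j++)
            ctx->gpusets[i] |= (gpuset_t)1 << next++;
    return ctx;
}
```

Because all sets exist before any task starts, a later per-task callback only has to look up its own entry, and a single destroy function releases everything.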
shell: affinity: do not error if mapped cpuset outside of job
Problem: The usefulness of the `-o cpu-affinity=map:LIST` option is reduced because the assigned cores must be contained within the cpu set assigned to the job. However, if a user is using the `map:LIST` option then they are presumably agreeing to take complete control of core assignments, and the option should not limit them to only the assigned cores.

Drop the check in the affinity plugin that cores are contained within the job cpuset. Don't even check if the cores are valid (this would presumably fail later when the affinity is actually applied).

Fixes flux-framework#5352
Commit 35557b4
shell: gpubind: always register task.init handler
Problem: Conditional registration of the task.init callback may cause code duplication if the callback is required in other situations.

Always register the task.init callback. Do nothing if ctx->gpusets has not been assigned, since this means there are no per-task GPUs.
Commit da0db0d
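The "always register, then do nothing" pattern from da0db0d can be sketched as a handler that returns early when no per-task sets exist. This is only an illustration: the struct, the function name `task_init_sketch`, and its return convention are hypothetical, and a 64-bit mask stands in for `hwloc_cpuset_t`.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for hwloc_cpuset_t: one bit per GPU id (assumption). */
typedef uint64_t gpuset_t;

/* Hypothetical context; only the fields this sketch needs. */
struct gpubind_ctx {
    int ntasks;          /* number of local tasks */
    gpuset_t *gpusets;   /* NULL when there are no per-task GPUs */
};

/* Hypothetical task.init handler body.
 * Returns 0 if there is nothing to do, 1 if *out was set to the
 * task's GPU set, and -1 if localid is out of range.
 */
int task_init_sketch (const struct gpubind_ctx *ctx, int localid,
                      gpuset_t *out)
{
    /* Handler is always registered, so the no-op case is decided here */
    if (!ctx || !ctx->gpusets)
        return 0;
    if (localid < 0 || localid >= ctx->ntasks)
        return -1;
    *out = ctx->gpusets[localid];
    return 1;
}
```

Keeping the decision inside the handler means any code path that later fills in `gpusets` gets per-task behavior without a second registration site.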
shell: gpubind: don't exit early from plugin if no gpus assigned
Problem: The gpubind plugin returns early if no GPUs were allocated to the job, but some future gpu-affinity options may want to override the default binding in this case.

Let the plugin fall through to the if/else block even if no GPUs are assigned to the job. Change the final `else` to `else if (ngpus > 0)` though, so that the plugin still does nothing (besides setting CUDA_VISIBLE_DEVICES=-1) if ngpus == 0.
Commit ff5f6e0
shell: gpubind: support gpu-affinity=map:LIST
Problem: The gpubind shell plugin doesn't support explicit mapping of GPUs to tasks.

Support a `-o gpu-affinity=map:LIST` option which works similarly to the cpu-affinity option of the same name. This option allows explicit assignment of GPUs to tasks without regard for the actual GPU ids assigned to a job. It is mainly meant for testing, benchmarks, or when default GPU assignment is not working for a particular situation.

Fixes flux-framework#5350
Commit e6d0c33
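To make the effect of `map:LIST` in e6d0c33 concrete, the sketch below turns an option value into the string one task might see in CUDA_VISIBLE_DEVICES. It rests on several assumptions: LIST is taken to be semicolon-delimited with one entry per local task, only the simple list syntax (`0-1,3`) is handled, ids are exported as a comma-separated list, and `gpu_map_value` is a hypothetical helper, not a function in the gpubind plugin.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not the gpubind plugin's API.
 * Given opt such as "map:0-1;3", write the comma-separated GPU id list
 * for local task `taskid` into buf (e.g. "0,1" for task 0).
 * Returns 0 on success, -1 on a bad option, a missing entry for this
 * task, malformed input, or insufficient buffer space.
 */
int gpu_map_value (const char *opt, int taskid, char *buf, size_t len)
{
    const char *p;
    size_t used = 0;

    if (strncmp (opt, "map:", 4) != 0 || taskid < 0 || len == 0)
        return -1;
    p = opt + 4;
    /* Advance to the entry for this task */
    for (int i = 0; i < taskid; i++) {
        p = strchr (p, ';');
        if (!p)
            return -1;
        p++;
    }
    buf[0] = '\0';
    /* Expand "lo-hi" ranges and single ids until ';' or end of string */
    while (*p && *p != ';') {
        char *end;
        unsigned long lo, hi;

        if (*p < '0' || *p > '9')
            return -1;
        lo = hi = strtoul (p, &end, 10);
        if (*end == '-') {
            p = end + 1;
            if (*p < '0' || *p > '9')
                return -1;
            hi = strtoul (p, &end, 10);
        }
        if (lo > hi)
            return -1;
        for (unsigned long id = lo; id <= hi; id++) {
            int n = snprintf (buf + used, len - used, "%s%lu",
                              used ? "," : "", id);
            if (n < 0 || (size_t)n >= len - used)
                return -1;
            used += (size_t)n;
        }
        p = end;
        if (*p == ',')
            p++;
        else if (*p != '\0' && *p != ';')
            return -1;
    }
    return used > 0 ? 0 : -1;
}
```

Note that nothing in this sketch consults the GPUs actually allocated to the job, which mirrors the commit's statement that the mapping is applied without regard for the assigned GPU ids.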
testsuite: test -o gpu-affinity=map:LIST option
Problem: No tests in the testsuite ensure proper operation of the job shell gpu-affinity=map: option.

Add a simple test that ensures the gpu-affinity `map:` option is working for a basic scenario.
Commit 7dfbb59
doc: document gpu-affinity=map in flux-shell(1)
Problem: The `map:` argument of the `gpu-affinity` shell option is not documented.

Add a short description of this option to flux-shell(1).
Commit bbeb9ef
doc: fix link in flux-shell(1) cpu-affinity documentation
Problem: The documentation of the cpu-affinity shell option in the flux-shell(1) manpage references the `hwloc_cpuset_t(3)` manpage, but the link does not exist.

Reference the `hwlocality_bitmap(3)` manpage instead, and use a direct link to the hwloc docs v2.9.0 since hwlocality_bitmap(3) doesn't exist on the default manpage site (linux.die.net).

Update the spelling dictionary.
Commit 1e311b9