shell: support -o gpu-affinity=map:LIST #5356

Merged 9 commits on Jul 28, 2023

Commits on Jul 28, 2023

  1. shell: export cpuset array functions from affinity.c

    Problem: We'd like to support a map:LIST option for the gpu affinity
    shell plugin that matches the cpu affinity support, but the functions
    to create, parse, and destroy lists of hwloc_cpuset_t objects are
    internal to the shell CPU affinity plugin.
    
    Export the cpuset array functions from affinity.c for use by other
    builtin shell plugins.
    grondo committed Jul 28, 2023
    b3c27c6
  2. shell: gpubind: use cpuset array for per-task support

    Problem: The gpubind plugin "per-task" option works by allocating gpu
    ids from a shared gpus idset within the task.init callback, but this
    differs from how the cpu affinity plugin works, and thus code can't
    be easily shared.
    
    Rewrite the gpubind plugin to use an array of hwloc_cpuset_t objects
    to maintain which gpus should be assigned to each task. To ease memory
    management, add a context object that contains relevant data for the
    plugin, and create this object at initialization.
    
    The gpubind plugin now assigns GPUs to tasks during initialization,
    as does the affinity plugin. This allows the cpuset array code to be
    reused, and for the gpubind plugin to support a "map:LIST" option in
    the future.
    grondo committed Jul 28, 2023
    2e121cb
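
The precomputed per-task assignment described in commit 2 can be pictured with a short Python sketch. This is an illustration only: the plugin itself is written in C and stores each set as an hwloc bitmap. The helper name `assign_gpus_per_task` and the even, in-order split are assumptions, not details taken from the commit.

```python
def assign_gpus_per_task(gpu_ids, ntasks):
    """Return one list of GPU ids per local task, computed up front.

    Hypothetical helper: assumes GPUs are divided evenly and in order,
    and that any remainder is left unassigned.
    """
    if ntasks <= 0:
        raise ValueError("ntasks must be positive")
    per_task = len(gpu_ids) // ntasks
    return [
        gpu_ids[i * per_task:(i + 1) * per_task]
        for i in range(ntasks)
    ]


# The array is built once at initialization; a per-task handler then
# only needs to index it by local task rank.
gpusets = assign_gpus_per_task([0, 1, 2, 3], ntasks=2)
```

Building the whole array first, rather than handing out ids from a shared set as each task starts, is what lets the same array code serve both the cpu and gpu plugins.
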
  3. shell: affinity: do not error if mapped cpuset outside of job

    Problem: The usefulness of the `-o cpu-affinity=map:LIST` option is
    reduced because the assigned cores must be contained within the cpu
    set assigned to the job. However, if a user is using the `map:LIST`
    option then they are presumably agreeing to take complete control of
    core assignments, and the option should not limit them to only the
    assigned cores.
    
    Drop the check in the affinity plugin that cores are contained within
    the job cpuset. Don't even check if the cores are valid (this would
    presumably fail later when the affinity is actually applied.)
    
    Fixes flux-framework#5352
    grondo committed Jul 28, 2023
    35557b4
  4. shell: gpubind: always register task.init handler

    Problem: Conditional registration of the task.init callback may cause
    code duplication if the callback is required in other situations.
    
    Always register the task.init callback. Do nothing if ctx->gpusets
    has not been assigned, since this means there are no per-task GPUs.
    grondo committed Jul 28, 2023
    da0db0d
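
The "always register, then return early" pattern from commit 4 might look like the following Python sketch. The dictionary-based `ctx` and `env`, the function name `task_init`, and the choice to export the ids through `CUDA_VISIBLE_DEVICES` are assumptions made for illustration; the real handler is a C callback.

```python
def task_init(ctx, local_rank, env):
    """Hypothetical task.init handler that is always registered.

    If no per-task GPU sets were computed at initialization, the
    handler leaves the task environment untouched.
    """
    gpusets = ctx.get("gpusets")
    if gpusets is None:
        # No per-task GPUs: nothing to do for this task.
        return env
    ids = gpusets[local_rank]
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in ids)
    return env
```

Moving the condition inside the handler keeps registration unconditional, so later options can populate the per-task sets without duplicating the registration logic.
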
  5. shell: gpubind: don't exit early from plugin if no gpus assigned

    Problem: The gpubind plugin returns early if no GPUs were allocated
    to the job, but some future gpu-affinity options may want to override
    the default binding in this case.
    
    Let the plugin fall through to the if/else block even if no GPUs are
    assigned to the job. Change the final `else` to `else if (ngpus >
    0)` though, so that the plugin still does nothing (besides setting
    CUDA_VISIBLE_DEVICES=-1) if ngpus == 0.
    grondo committed Jul 28, 2023
    ff5f6e0
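
The control-flow change in commit 5 can be sketched in Python as follows. Only the `ngpus > 0` guard on the final branch and the `CUDA_VISIBLE_DEVICES=-1` setting come from the commit message; the function name, the branch labels, and the contents of the default branch are assumptions, and what the per-task branch does with zero GPUs is not described, so the sketch only records which branch ran.

```python
def gpubind_init(opt, gpu_ids):
    """Hypothetical model of the plugin's init-time branching."""
    env = {}
    if not gpu_ids:
        # Before this commit the plugin returned right after this step.
        env["CUDA_VISIBLE_DEVICES"] = "-1"
    if opt == "per-task":
        branch = "per-task"
    elif len(gpu_ids) > 0:
        # Final branch: formerly a plain `else`, now guarded by ngpus > 0.
        env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
        branch = "default"
    else:
        # ngpus == 0 and no overriding option: nothing further is done.
        branch = None
    return env, branch
```

With the guard in place, a job without GPUs still ends up with only `CUDA_VISIBLE_DEVICES=-1`, while an option placed earlier in the chain gets a chance to run.
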
  6. shell: gpubind: support gpu-affinity=map:LIST

    Problem: The gpubind shell plugin doesn't support explicit mapping
    of GPUs to tasks.
    
    Support a `-o gpu-affinity=map:LIST` option which works similarly to
    the cpu-affinity option of the same name. This option allows explicit
    assignment of GPUs to tasks without regard for the actual GPU ids
    assigned to a job. It is mainly meant for testing, benchmarks, or when
    default GPU assignment is not working for a particular situation.
    
    Fixes flux-framework#5350
    grondo committed Jul 28, 2023
    e6d0c33
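
A rough Python sketch of how a `map:LIST` value could be turned into per-task GPU sets is given here. Several details are assumptions rather than facts from the commit: that entries are separated by semicolons, that each entry uses a simple list syntax such as `0-1,3`, and that entries are reused in order when there are more tasks than entries. The function names are hypothetical, and the plugin itself parses into hwloc bitmaps in C.

```python
def parse_idlist(text):
    """Parse a list such as '0-1,3' into a sorted list of ids."""
    ids = set()
    for part in text.split(","):
        lo, sep, hi = part.partition("-")
        first = int(lo)
        last = int(hi) if sep else first
        if last < first:
            raise ValueError(f"invalid range: {part!r}")
        ids.update(range(first, last + 1))
    return sorted(ids)


def parse_map(option, ntasks):
    """Return one GPU id list per task for a 'map:LIST' option value.

    No check is made against the GPUs actually assigned to the job,
    mirroring the "without regard for the actual GPU ids" behavior.
    """
    prefix = "map:"
    if not option.startswith(prefix):
        raise ValueError("expected an option of the form map:LIST")
    entries = [parse_idlist(e) for e in option[len(prefix):].split(";")]
    # Assumption: reuse entries cyclically if tasks outnumber entries.
    return [entries[i % len(entries)] for i in range(ntasks)]
```

For example, under these assumptions `parse_map("map:2-3;0-1", 2)` gives the first task GPUs 2 and 3 and the second task GPUs 0 and 1.
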
  7. testsuite: test -o gpu-affinity=map:LIST option

    Problem: No tests in the testsuite ensure proper operation of the
    job shell gpu-affinity=map: option.
    
    Add a simple test that ensures the gpu-affinity `map:` option is working
    for a basic scenario.
    grondo committed Jul 28, 2023
    7dfbb59
  8. doc: document gpu-affinity=map in flux-shell(1)

    Problem: The `map:` argument of the `gpu-affinity` shell option is not
    documented.
    
    Add a short description of this option to flux-shell(1).
    grondo committed Jul 28, 2023
    bbeb9ef
  9. doc: fix link in flux-shell(1) cpu-affinity documentation

    Problem: The documentation of the cpu-affinity shell option in the
    flux-shell(1) manpage references the `hwloc_cpuset_t(3)` manpage,
    but the link does not exist.
    
    Reference the `hwlocality_bitmap(3)` manpage instead, and use a
    direct link to the hwloc docs v2.9.0 since hwlocality_bitmap(3)
    doesn't exist on the default manpage site (linux.die.net).
    
    Update the spelling dictionary.
    grondo committed Jul 28, 2023
    1e311b9