shell: add gpu affinity support #2406

Merged
merged 7 commits into flux-framework:master from shell-gpu-support on Sep 29, 2019

Conversation

@grondo (Contributor) commented Sep 28, 2019

This PR adds support for parsing gpu resources from R, if assigned, into the shell's internal rcalc class. The gpu resources are then exposed to plugins via flux_shell_get_rank_info(), and a gpubind.c builtin plugin is introduced to read the gpu list and set CUDA_VISIBLE_DEVICES appropriately.

The plugin can be controlled via the gpu-affinity shell option: gpu-affinity="off" disables plugin operation, gpu-affinity="on" is the default and sets CUDA_VISIBLE_DEVICES once per job, and gpu-affinity="per-task" divides the available GPUs evenly among local tasks. (If GPUs do not divide evenly among tasks, the current behavior is undefined 😉. Probably should fix this, but I don't think it could actually happen in the current system.)
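
For what it's worth, per-task assignment is conceptually just an even split of the shell's assigned GPU ids across its local tasks. A rough Python sketch of the idea (the actual plugin is C, and the helper name below is made up for illustration):

def gpus_for_task(gpus, ntasks, taskid):
    """Return the slice of assigned GPU ids for one local task.

    gpus is the list of GPU ids assigned to this shell (from R), ntasks is
    the number of local tasks, and taskid is the local task index. Assumes
    len(gpus) is a multiple of ntasks.
    """
    per_task = len(gpus) // ntasks
    return gpus[taskid * per_task:(taskid + 1) * per_task]

# e.g. 4 GPUs and 2 local tasks: task 0 gets "0,1", task 1 gets "2,3"
for taskid in range(2):
    ids = gpus_for_task([0, 1, 2, 3], 2, taskid)
    print("CUDA_VISIBLE_DEVICES=" + ",".join(str(i) for i in ids))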

I couldn't directly port the cuda_visible_devices.lua plugin from 0.11 because it depended on the lua cpu_set_t bindings, which were removed in the wreck purge of '19.

@grondo (Contributor Author) commented Sep 28, 2019

Rebased and switched from jq to flux mini run --dry-run in tests.

@garlick (Member) commented Sep 28, 2019

I guess we also need a --gpus-per-task=N option for flux mini run huh?

@grondo (Contributor Author) commented Sep 28, 2019

Yeah, there's an issue open on that: #2403.
I was wondering if we could do something more... flexible than --gpus-per-task. But I guess that is what maxi run is for... 😉

@grondo (Contributor Author) commented Sep 29, 2019

> I guess we also need a --gpus-per-task=N option for flux mini run huh?

Ah, I see. Were you hinting here that this PR should add that option now that flux-mini run is merged? That actually makes sense and I could do that if you're not already tackling it.

grondo added commits:

- Support "gpu" resource type in R v1 children list. Save and return this list in struct rcalc_rankinfo. Add a total count of gpu for informational purposes.
- Support GPUs in the shell rcalc test program.
- Add an R input for rcalc tests with GPUs to sanity check rcalc gpu support.
- Add gpus list to flux_shell_get_rank_info() JSON output.
- Add a builtin gpu affinity plugin for the shell which sets CUDA_VISIBLE_DEVICES (optionally per-task), as well as CUDA_DEVICE_ORDER=PCI_BUS_ID. This builtin plugin can be overridden with the name 'gpu-affinity'. The plugin reads the shell option "gpu-affinity: string", and supports values "on", "off", and "per-task". Invalid options are ignored and the default is "on".
- t2604-job-shell-affinity no longer uses jq. Remove the jq test and requirement.
- Ensure basic operation of the shell builtin gpu affinity plugin.
@garlick (Member) commented Sep 29, 2019 via email

@grondo (Contributor Author) commented Sep 29, 2019

I was playing with something like this? 🤷‍♂️

diff --git a/src/cmd/flux-mini.py b/src/cmd/flux-mini.py
index e53539dc8..35ac78de7 100755
--- a/src/cmd/flux-mini.py
+++ b/src/cmd/flux-mini.py
@@ -34,7 +34,7 @@ from datetime import timedelta
 
 
 class JobSpec:
-    def __init__(self, command, num_tasks=1, cores_per_task=1, num_nodes=None):
+    def __init__(self, command, num_tasks=1, cores_per_task=1, gpus_per_task=0, num_nodes=None):
         """
         Constructor builds the minimum legal v1 jobspec.
         Use setters to assign additional properties.
@@ -45,12 +45,16 @@ class JobSpec:
             raise ValueError("task count must be a integer >= 1")
         if not isinstance(cores_per_task, int) or cores_per_task < 1:
             raise ValueError("cores per task must be an integer >= 1")
+        if not isinstance(gpus_per_task, int) or gpus_per_task < 1:
+            raise ValueError("gpus per task must be an integer >= 0")
         if num_nodes is not None:
             if not isinstance(num_nodes, int) or num_nodes < 1:
                 raise ValueError("node count must be an integer >= 1 (if set)")
             if num_nodes > num_tasks:
                 raise ValueError("node count must not be greater than task count")
-        core = self.__create_resource("core", cores_per_task)
+        children = [self.__create_resource("core", cores_per_task)]
+        if gpus_per_task > 0:
+            children.append (self.__create_resource("gpu", gpus_per_task))
         if num_nodes is not None:
             num_slots = int(math.ceil(num_tasks / float(num_nodes)))
             if num_tasks % num_nodes != 0:
@@ -58,11 +62,11 @@ class JobSpec:
                 task_count_dict = {"total": num_tasks}
             else:
                 task_count_dict = {"per_slot": 1}
-            slot = self.__create_slot("task", num_slots, [core])
+            slot = self.__create_slot("task", num_slots, children)
             resource_section = self.__create_resource("node", num_nodes, [slot])
         else:
             task_count_dict = {"per_slot": 1}
-            slot = self.__create_slot("task", num_tasks, [core])
+            slot = self.__create_slot("task", num_tasks, children)
             resource_section = slot
 
         self.jobspec = {
@@ -173,7 +177,6 @@ class JobSpec:
     def dumps(self):
         return json.dumps(self.jobspec)
 
-
 class SubmitCmd:
     """
     SubmitCmd submits a job, displays the jobid on stdout, and returns.
@@ -208,6 +211,14 @@ class SubmitCmd:
             default=1,
             help="Number of cores to allocate per task",
         )
+        parser.add_argument(
+            "-g",
+            "--gpus-per-task",
+            type=int,
+            metavar="N",
+            default=0,
+            help="Number of GPUs to allocate per task",
+        )
         parser.add_argument(
             "-t",
             "--time-limit",
@@ -275,6 +286,7 @@ class SubmitCmd:
             args.command,
             num_tasks=args.ntasks,
             cores_per_task=args.cores_per_task,
+            gpus_per_task=args.gpus_per_task,
             num_nodes=args.nodes,
         )
         jobspec.set_cwd(os.getcwd())

flux mini run -g 1 -c 2 -n 2 produces:

  "resources": [
    {
      "count": 2,
      "with": [
        {
          "count": 2,
          "type": "core"
        },
        {
          "count": 1,
          "type": "gpu"
        }
      ],
      "type": "slot",
      "label": "task"
    }
  ]
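
As an aside, the gpu-affinity shell option itself would presumably be requested through jobspec attributes rather than the resource section. A rough sketch of what a per-task binding request might look like, assuming shell options live under attributes.system.shell.options (that path is my assumption here, not something this PR defines):

import json

# Hypothetical jobspec fragment: the slot/core/gpu resource section from
# above, plus the gpu-affinity shell option. The attributes.system.shell.options
# path is assumed for illustration only.
jobspec = {
    "version": 1,
    "resources": [
        {
            "type": "slot",
            "label": "task",
            "count": 2,
            "with": [
                {"type": "core", "count": 2},
                {"type": "gpu", "count": 1},
            ],
        }
    ],
    "attributes": {
        "system": {"shell": {"options": {"gpu-affinity": "per-task"}}}
    },
}
print(json.dumps(jobspec, indent=2))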

@grondo (Contributor Author) commented Sep 29, 2019

However, since the scheduler in core ignores any resource besides node and core, there isn't a way to do an end-to-end test of GPU support (in flux-core anyway).

@grondo (Contributor Author) commented Sep 29, 2019

I went ahead and pushed the change to add flux-mini -g, --gpus-per-task=N (no tests yet). That would allow this PR to close #2402. If it is wrong it can easily be backed out.

"--gpus-per-task",
type=int,
metavar="N",
default=0,
@garlick (Member) commented on these lines:

Cool! Just reading through this: won't this fail if -g is not specified, since the value will be zero and the < 1 check raises an exception?

You might want to follow the pattern of --nodes / num_nodes where it is None unless the user provided the option.
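
Something along those lines, roughly (a standalone sketch with a made-up helper name, not the code that actually landed):

def validate_gpus_per_task(gpus_per_task):
    # Follow the num_nodes pattern: None means "not requested"; otherwise
    # require an integer >= 1, so an explicit 0 or negative value is rejected.
    if gpus_per_task is None:
        return None
    if not isinstance(gpus_per_task, int) or gpus_per_task < 1:
        raise ValueError("gpus per task must be an integer >= 1 (if set)")
    return gpus_per_task

# Omitting -g is fine, and a valid count passes through unchanged:
assert validate_gpus_per_task(None) is None
assert validate_gpus_per_task(2) == 2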

@grondo (Contributor Author) replied:

Yeah, that sounds better. The assertion was a typo and should allow 0, but use of None feels more right to me. I think I've got it wrong anyway, since the builds are failing in Travis. I won't have more time to work on this until this afternoon, if you'd rather get it done.

@garlick (Member) replied:

Sure, I'll have a go.

Otherwise, are you feeling like this PR is ready?

@garlick (Member) commented Sep 29, 2019

I'll add the new option to the flux-mini(1) man page in #2409 if that works for you.

Edit: 0d1e625 if you prefer to cherry-pick, or redo.

@grondo (Contributor Author) commented Sep 29, 2019

@garlick, feel free to force push this branch without that last commit. Then, I think this is merge ready. (Maybe after another spot check)

@codecov-io

Codecov Report

Merging #2406 into master will increase coverage by <.01%.
The diff coverage is 82.35%.

@@            Coverage Diff             @@
##           master    #2406      +/-   ##
==========================================
+ Coverage   81.12%   81.12%   +<.01%     
==========================================
  Files         224      225       +1     
  Lines       36022    36099      +77     
==========================================
+ Hits        29223    29287      +64     
- Misses       6799     6812      +13
Impacted Files Coverage Δ
src/shell/builtins.c 95.23% <ø> (ø) ⬆️
src/shell/shell.c 85.75% <100%> (-0.24%) ⬇️
src/shell/gpubind.c 79.68% <79.68%> (ø)
src/shell/rcalc.c 87.24% <88.88%> (+0.83%) ⬆️
src/common/libsubprocess/local.c 79.86% <0%> (-0.35%) ⬇️
src/common/libflux/message.c 80.22% <0%> (-0.14%) ⬇️
src/common/libutil/veb.c 98.95% <0%> (+0.52%) ⬆️
src/modules/job-info/guest_watch.c 74.71% <0%> (+0.56%) ⬆️

@garlick (Member) left a comment

LGTM!

@mergify (bot) merged commit 62a92b3 into flux-framework:master on Sep 29, 2019
@grondo deleted the shell-gpu-support branch on September 29, 2019 20:00
@grondo (Contributor Author) commented Sep 29, 2019

Thanks for fixing it up for me! (Off doing kid stuff all day)
