wreck: set CUDA_VISIBLE_DEVICES when gpus are in R_lite #1599
Does this need modifications to
Codecov Report
@@ Coverage Diff @@
## master #1599 +/- ##
=======================================
Coverage 79.25% 79.25%
=======================================
Files 171 171
Lines 31341 31341
=======================================
Hits 24840 24840
Misses 6501 6501
Oops, sorry, I didn't push a fully functional PR (and I left debug code in the script). Fixes coming shortly.
5298ae2 to dfe969c
Add a new wreck plugin to set CUDA_VISIBLE_DEVICES when there are GPU resources set in R_lite. The plugin also sets CUDA_DEVICE_ORDER=PCI_BUS_ID so that CUDA uses the same GPU ids as are understood by hwloc and the Flux scheduler.

By default, CUDA_VISIBLE_DEVICES is set to all locally allocated GPUs in all local tasks. The list of GPUs can be partitioned and assigned per-task with the wreck option `-o gpubind=per-task`. This plugin may also be disabled with the option `-o gpubind=off`.

Fixes flux-framework#1562
Fixes flux-framework#1598
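The binding modes described in this commit message can be illustrated with a small Python sketch. This is not the plugin's code (the plugin itself is a Lua script, `cuda_devices.lua`); the function name `gpu_env_per_task`, the `"on"` default label, and the choice to split the GPU list into equal contiguous shares are all assumptions made for illustration.

```python
def gpu_env_per_task(gpu_ids, ntasks, gpubind="on"):
    """Return one dict of environment variables per local task.

    gpu_ids : GPU ids allocated on this node (hypothetical input)
    ntasks  : number of local tasks
    gpubind : "on" (assumed name for the default), "per-task", or "off"
    """
    if ntasks < 1:
        raise ValueError("ntasks must be at least 1")
    # Disabled, or no GPUs allocated: leave the environment untouched.
    if gpubind == "off" or not gpu_ids:
        return [{} for _ in range(ntasks)]
    if gpubind == "per-task":
        # Assumption: equal contiguous shares; leftover GPUs are unused.
        per_task = len(gpu_ids) // ntasks
        if per_task == 0:
            raise ValueError("fewer GPUs than tasks")
        shares = [gpu_ids[i * per_task:(i + 1) * per_task]
                  for i in range(ntasks)]
    else:
        # Default: every local task sees all locally allocated GPUs.
        shares = [list(gpu_ids)] * ntasks
    return [
        {
            # PCI_BUS_ID ordering keeps CUDA ids aligned with hwloc ids.
            "CUDA_DEVICE_ORDER": "PCI_BUS_ID",
            "CUDA_VISIBLE_DEVICES": ",".join(str(g) for g in share),
        }
        for share in shares
    ]
```

With four GPUs and two tasks, the default mode gives both tasks `0,1,2,3`, while `per-task` gives the first task `0,1` and the second `2,3` under the equal-split assumption.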
dfe969c to 8bec7ad
Ok, this version should be better, though there is no CI testing yet.
I've added support in This feature is then used to sanity check the I also added the stanza to allow the If Travis and peer review pass, this should be ready to go in.
Something in the new test isn't working in Travis. I'll have to run that down tonight.
bbe8564 to 4c34f47
LGTM! Some slick shell-fu going on in the tests. TIL that `:` is a thing and how to generate files inline with `cat`.
Hm, I guess my script-fu has failed though. Still failing in Travis but not on my other systems.
Add cuda_devices.lua to distribution
Add "gpubind" to list of accepted options for the wreckrun and submit `-o` option.
When the -g, --gpus-per-task option is used and no scheduler is loaded, add "fake" GPU resources to the generated R_lite. This will be helpful in testing plugins and other parts of flux-core that look for allocated GPUs in R_lite.
Add basic sanity checks for proper operation of the CUDA_VISIBLE_DEVICES plugin for wreck jobs.
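To make the "fake" GPU resources concrete, here is a hedged Python sketch of how code might look up the GPUs allocated to one rank. The R_lite layout shown (a list of per-rank entries whose `children` map holds id-range strings such as `"0-1"`) is an assumption about the format, and `parse_idset` and `local_gpus` are hypothetical helper names, not flux-core functions.

```python
def parse_idset(text):
    """Expand an id-range string such as "0-1,3" into [0, 1, 3]."""
    ids = []
    for part in text.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-", 1)
            ids.extend(range(int(lo), int(hi) + 1))
        else:
            ids.append(int(part))
    return ids


def local_gpus(r_lite, rank):
    """Return the GPU ids listed for `rank`, or [] if none are allocated."""
    for entry in r_lite:
        if entry.get("rank") == rank:
            return parse_idset(entry.get("children", {}).get("gpu", ""))
    return []


# Assumed shape of an R_lite with GPU resources on rank 0 only.
example_r_lite = [
    {"rank": 0, "children": {"core": "0-3", "gpu": "0-1"}},
    {"rank": 1, "children": {"core": "0-3"}},
]
```

A test harness could build an `example_r_lite` like this one and check that a rank with no `gpu` entry yields an empty list, which is the case where no CUDA variables should be set.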
4c34f47 to 6eda6db
Oh, duh. Forgot to add
Great! Merging.
I thought I'd throw the current version of the CUDA_VISIBLE_DEVICES plugin into a PR.
This version also sets CUDA_DEVICE_ORDER as @dongahn instructs in #1598.
The per-task behavior can be requested with `-o gpubind=per-task`, and the plugin can be disabled with `-o gpubind=off`.
Probably should figure out some way to test this in CI, but we first need a way to simulate GPU resources, I guess...