Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

wreck: set CUDA_VISIBLE_DEVICES when gpus are in R_lite #1599

Merged
merged 5 commits into from Jul 25, 2018

Conversation

grondo
Copy link
Contributor

@grondo grondo commented Jul 24, 2018

I thought I'd throw the current version of the CUDA_VISIBLE_DEVICES plugin into a PR.
This version also sets CUDA_DEVICE_ORDER as @dongahn instructs in #1598.

The per-task behavior can be requested with -o gpubind=per-task and the plugin can be disabled with -o gpubind=off.

Probably should figure out some way to test this in CI, but we first need a way to simulate gpu resources I guess...

@coveralls
Copy link

coveralls commented Jul 24, 2018

Coverage Status

Coverage decreased (-0.02%) to 79.412% when pulling 6eda6db on grondo:cuda_visible_devices into 55d47d5 on flux-framework:master.

@SteVwonder
Copy link
Member

Does this need modifications to wreckrun to handle the gpubind flag properly?

flux wreckrun -l -N1 -c10 -n4 -g1 -o gpubind=per-task printenv CUDA_VISIBLE_DEVICES
wreckrun: Failed to process cmdline args

@codecov-io
Copy link

codecov-io commented Jul 24, 2018

Codecov Report

Merging #1599 into master will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master    #1599   +/-   ##
=======================================
  Coverage   79.25%   79.25%           
=======================================
  Files         171      171           
  Lines       31341    31341           
=======================================
  Hits        24840    24840           
  Misses       6501     6501
Impacted Files Coverage Δ
src/common/libflux/content.c 86.66% <0%> (-3.34%) ⬇️
src/common/libutil/base64.c 95.07% <0%> (-0.71%) ⬇️
src/cmd/flux-module.c 85.06% <0%> (-0.31%) ⬇️
src/common/libflux/message.c 81.14% <0%> (ø) ⬆️
src/common/libflux/handle.c 83.91% <0%> (ø) ⬆️
src/common/libkvs/kvs_watch.c 90.12% <0%> (+0.42%) ⬆️
src/common/libflux/response.c 83.55% <0%> (+0.65%) ⬆️
src/broker/modservice.c 81.55% <0%> (+0.97%) ⬆️

@grondo
Copy link
Contributor Author

grondo commented Jul 24, 2018

Oops, sorry I didn't push a fully functional PR (and I left debug code in the script)

Fixes coming shortly.

Add a new wreck plugin to set CUDA_VISIBLE_DEVICES when there
are GPU resources set in R_lite. The plugin also sets
CUDA_DEVICE_ORDER=PCI_BUS_ID so that CUDA uses the same GPU ids
as are understood by hwloc and the Flux scheduler.

By default, CUDA_VISIBLE_DEVICES is set to all locally allocated
GPUs in all local tasks. The list of GPUs can be partitioned and
assigned per-task with the wreck option `-o gpubind=per-task`.

This plugin may also be disabled with the option `-o gpubind=off`

Fixes flux-framework#1562
Fixes flux-framework#1598
@grondo
Copy link
Contributor Author

grondo commented Jul 24, 2018

Ok, this version should be better, though there is no CI testing yet.

@grondo
Copy link
Contributor Author

grondo commented Jul 24, 2018

I've added support in wreckrun to fake gpus in R_lite if a scheduler is not loaded and the -g, --gpus-per-task option is used.

This feature is then used to sanity check the CUDA_VISIBLE_DEVICES support added by the Lua plugin.

I also added the stanza to allow the -o gpubind to wreckrun/submit.

If travis and peer review passes, this should be ready to go in.

@grondo grondo requested a review from SteVwonder July 24, 2018 23:15
@grondo
Copy link
Contributor Author

grondo commented Jul 24, 2018

Something in the new test isn't working in Travis. I'll have to run that down tonight.

Copy link
Member

@SteVwonder SteVwonder left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Some slick shell-fu going on in the tests. TIL that : is a thing and how to generate files inline with cat.

@grondo
Copy link
Contributor Author

grondo commented Jul 25, 2018

Hm, I guess my script-fu has failed though. Still failing in Travis but not on my other systems. ☹️

Add cuda_devices.lua to distribution
Add "gpubind" to list of accepted options for the wreckrun and
submit `-o` option.
When the -g, --gpus-per-task option is used and no scheduler is
loaded, add "fake" GPU resources to the generated R_lite. This
will be helpful in testing plugins and other parts of flux-core
that look for allocated GPUs in R_lite.
Add basic sanity checks for proper operation of the CUDA_VISIBLE_DEVICES
plugin for wreck jobs.
@grondo
Copy link
Contributor Author

grondo commented Jul 25, 2018

Oh, duh. Forgot to add cuda_devices.lua to Makefile.am.

@garlick
Copy link
Member

garlick commented Jul 25, 2018

Great! Merging.

@garlick garlick merged commit c91ac60 into flux-framework:master Jul 25, 2018
@grondo grondo deleted the cuda_visible_devices branch July 27, 2018 21:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants