Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

wreck: set CUDA_VISIBLE_DEVICES when gpus are in R_lite #1599

Merged
merged 5 commits into from Jul 25, 2018

Conversation

Projects
None yet
5 participants
@grondo
Copy link
Contributor

grondo commented Jul 24, 2018

I thought I'd throw the current version of the CUDA_VISIBLE_DEVICES plugin into a PR.
This version also sets CUDA_DEVICE_ORDER as @dongahn instructs in #1598.

The per-task behavior can be requested with -o gpubind=per-task and the plugin can be disabled with -o gpubind=off.

Probably should figure out some way to test this in CI, but we first need a way to simulate gpu resources I guess...

@coveralls

This comment has been minimized.

Copy link

coveralls commented Jul 24, 2018

Coverage Status

Coverage decreased (-0.02%) to 79.412% when pulling 6eda6db on grondo:cuda_visible_devices into 55d47d5 on flux-framework:master.

@SteVwonder

This comment has been minimized.

Copy link
Member

SteVwonder commented Jul 24, 2018

Does this need modifications to wreckrun to handle the gpubind flag properly?

flux wreckrun -l -N1 -c10 -n4 -g1 -o gpubind=per-task printenv CUDA_VISIBLE_DEVICES
wreckrun: Failed to process cmdline args
@codecov-io

This comment has been minimized.

Copy link

codecov-io commented Jul 24, 2018

Codecov Report

Merging #1599 into master will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master    #1599   +/-   ##
=======================================
  Coverage   79.25%   79.25%           
=======================================
  Files         171      171           
  Lines       31341    31341           
=======================================
  Hits        24840    24840           
  Misses       6501     6501
Impacted Files Coverage Δ
src/common/libflux/content.c 86.66% <0%> (-3.34%) ⬇️
src/common/libutil/base64.c 95.07% <0%> (-0.71%) ⬇️
src/cmd/flux-module.c 85.06% <0%> (-0.31%) ⬇️
src/common/libflux/message.c 81.14% <0%> (ø) ⬆️
src/common/libflux/handle.c 83.91% <0%> (ø) ⬆️
src/common/libkvs/kvs_watch.c 90.12% <0%> (+0.42%) ⬆️
src/common/libflux/response.c 83.55% <0%> (+0.65%) ⬆️
src/broker/modservice.c 81.55% <0%> (+0.97%) ⬆️
@grondo

This comment has been minimized.

Copy link
Contributor Author

grondo commented Jul 24, 2018

Oops, sorry I didn't push a fully functional PR (and I left debug code in the script)

Fixes coming shortly.

@grondo grondo force-pushed the grondo:cuda_visible_devices branch from 5298ae2 to dfe969c Jul 24, 2018

wreck: set CUDA_VISIBLE_DEVICES for allocated gpus
Add a new wreck plugin to set CUDA_VISIBLE_DEVICES when there
are GPU resources set in R_lite. The plugin also sets
CUDA_DEVICE_ORDER=PCI_BUS_ID so that CUDA uses the same GPU ids
as are understood by hwloc and the Flux scheduler.

By default, CUDA_VISIBLE_DEVICES is set to all locally allocated
GPUs in all local tasks. The list of GPUs can be partitioned and
assigned per-task with the wreck option `-o gpubind=per-task`.

This plugin may also be disabled with the option `-o gpubind=off`

Fixes #1562
Fixes #1598

@grondo grondo force-pushed the grondo:cuda_visible_devices branch from dfe969c to 8bec7ad Jul 24, 2018

@grondo

This comment has been minimized.

Copy link
Contributor Author

grondo commented Jul 24, 2018

Ok, this version should be better, though there is no CI testing yet.

@grondo

This comment has been minimized.

Copy link
Contributor Author

grondo commented Jul 24, 2018

I've added support in wreckrun to fake gpus in R_lite if a scheduler is not loaded and the -g, --gpus-per-task option is used.

This feature is then used to sanity check the CUDA_VISIBLE_DEVICES support added by the Lua plugin.

I also added the stanza to allow the -o gpubind to wreckrun/submit.

If travis and peer review passes, this should be ready to go in.

@grondo grondo requested a review from SteVwonder Jul 24, 2018

@grondo

This comment has been minimized.

Copy link
Contributor Author

grondo commented Jul 24, 2018

Something in the new test isn't working in Travis. I'll have to run that down tonight.

@grondo grondo force-pushed the grondo:cuda_visible_devices branch from bbe8564 to 4c34f47 Jul 24, 2018

@SteVwonder
Copy link
Member

SteVwonder left a comment

LGTM! Some slick shell-fu going on in the tests. TIL that : is a thing and how to generate files inline with cat.

@grondo

This comment has been minimized.

Copy link
Contributor Author

grondo commented Jul 25, 2018

Hm, I guess my script-fu has failed though. Still failing in Travis but not on my other systems. ☹️

grondo added some commits Jul 24, 2018

wreck: support -o gpubind option
Add "gpubind" to list of accepted options for the wreckrun and
submit `-o` option.
wreckrun: add gpus to fake R_lite with -g option
When the -g, --gpus-per-task option is used and no scheduler is
loaded, add "fake" GPU resources to the generated R_lite. This
will be helpful in testing plugins and other parts of flux-core
that look for allocated GPUs in R_lite.
testsuite: wreck: test CUDA_VISIBLE_DEVICES support
Add basic sanity checks for proper operation of the CUDA_VISIBLE_DEVICES
plugin for wreck jobs.
build: add cuda_devices.lua plugin to Makefile
Add cuda_devices.lua to distribution

@grondo grondo force-pushed the grondo:cuda_visible_devices branch from 4c34f47 to 6eda6db Jul 25, 2018

@grondo

This comment has been minimized.

Copy link
Contributor Author

grondo commented Jul 25, 2018

Oh, duh. Forgot to add cuda_devices.lua to Makefile.am.

@garlick

This comment has been minimized.

Copy link
Member

garlick commented Jul 25, 2018

Great! Merging.

@garlick garlick merged commit c91ac60 into flux-framework:master Jul 25, 2018

2 checks passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details
coverage/coveralls Coverage decreased (-0.02%) to 79.412%
Details

@grondo grondo deleted the grondo:cuda_visible_devices branch Jul 27, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.