
submit/job/jsc: propagate gpu request information #1480

Merged
merged 2 commits from dongahn:gpu_support into flux-framework:master on Apr 20, 2018

@dongahn
Contributor

dongahn commented Apr 20, 2018

I would appreciate it if this could be reviewed and merged soon. I'm having issues with some of the tests in my GPU support PR at flux-sched (flux-framework/flux-sched#313), and having this merged into master should help me narrow that down.

This is a slightly modified version of @twrs' change.

Add -g to wreck to allow requesting GPUs.
Propagate that request information to the scheduler
through the job module and jsc.

@coveralls


coveralls commented Apr 20, 2018

Coverage Status

Coverage decreased (-0.05%) to 79.023% when pulling 07d6f2a on dongahn:gpu_support into f8aae6c on flux-framework:master.

@codecov-io


codecov-io commented Apr 20, 2018

Codecov Report

Merging #1480 into master will decrease coverage by 0.05%.
The diff coverage is 81.25%.

@@            Coverage Diff             @@
##           master    #1480      +/-   ##
==========================================
- Coverage   78.77%   78.72%   -0.06%     
==========================================
  Files         164      164              
  Lines       30328    30355      +27     
==========================================
+ Hits        23891    23896       +5     
- Misses       6437     6459      +22
Impacted Files Coverage Δ
src/modules/wreck/job.c 73.19% <50%> (ø) ⬆️
src/common/libjsc/jstatctl.c 74.26% <83.33%> (+0.25%) ⬆️
src/common/libflux/content.c 86.66% <0%> (-3.34%) ⬇️
src/common/libflux/ev_flux.c 97.56% <0%> (-2.44%) ⬇️
src/common/libflux/attr.c 90% <0%> (-2%) ⬇️
src/common/libflux/rpc.c 92.56% <0%> (-1.66%) ⬇️
src/cmd/flux-event.c 67.74% <0%> (-1.08%) ⬇️
src/connectors/local/local.c 87.4% <0%> (-0.75%) ⬇️
src/bindings/lua/flux-lua.c 81.58% <0%> (-0.69%) ⬇️
src/common/libflux/future.c 88.31% <0%> (-0.47%) ⬇️
... and 11 more
@grondo
Contributor

grondo left a comment

Just a couple of necessary changes. I'm surprised the issue with the pack/unpack format wasn't caught by any tests.

If you'd like, I'll throw together a basic sanity test for the --ngpus option and push it onto this PR.

@@ -289,6 +290,11 @@ function wreck:parse_cmdline (arg)
self.ncores = self.ntasks
end

self.ngpus = 0


@grondo

grondo Apr 20, 2018

Contributor

I don't think this line is necessary due to

ngpus = self.ngpus or 0

below. However, it doesn't hurt, so it can stay for now.

@@ -180,6 +180,7 @@ static flux_future_t *send_create_event (flux_t *h, struct wreck_job *job)
"ntasks", job->ntasks,
"ncores", job->ncores,
"nnodes", job->nnodes,
"ngpus", job->ngpus,


@grondo

grondo Apr 20, 2018

Contributor

The extra s:i conversion for the "ngpus", job->ngpus pair is missing from the flux_msg_pack() format string here.
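
For illustration, a minimal sketch of the corrected pack call, assuming the surrounding fields shown in the diff context (the wrapper function is hypothetical):

#include <flux/core.h>

/* Sketch only: each packed key needs its own s:i conversion, so the
 * new "ngpus", job->ngpus pair requires a fourth s:i in the format. */
static int pack_create_fields (flux_msg_t *msg, struct wreck_job *job)
{
    return flux_msg_pack (msg, "{s:i s:i s:i s:i}",
                          "ntasks", job->ntasks,
                          "ncores", job->ncores,
                          "nnodes", job->nnodes,
                          "ngpus", job->ngpus);
}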

@@ -399,6 +400,7 @@ static void job_create_cb (flux_t *h, flux_msg_handler_t *w,
"ntasks", &job->ntasks,
"nnodes", &job->nnodes,
"ncores", &job->ncores,
"ngpus", &job->ngpus,


@grondo

grondo Apr 20, 2018

Contributor

An extra s?i conversion (or s?:i for consistency) is required for the "ngpus", &job->ngpus pair in the flux_request_unpack() format string here, as above.
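
A corresponding sketch of the unpack side, with s?:i marking the new key optional (field order follows the diff; whether the other keys are required in the actual format string is an assumption):

#include <flux/core.h>

/* Sketch: s?:i makes "ngpus" optional, so requests that omit it
 * still parse, and job->ngpus keeps its existing value. */
static int unpack_create_request (const flux_msg_t *msg,
                                  struct wreck_job *job)
{
    return flux_request_unpack (msg, NULL, "{s:i s:i s:i s?:i}",
                                "ntasks", &job->ntasks,
                                "nnodes", &job->nnodes,
                                "ncores", &job->ncores,
                                "ngpus", &job->ngpus);
}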


@dongahn

dongahn Apr 20, 2018

Author Contributor

Heh... I think this explains the weird sched failure I was getting yesterday.

I have to ask, though: why s? and not s?


@grondo

grondo Apr 20, 2018

Contributor

In the past these request messages only included members that were actually set by the user; if not set, they were assumed to be zero.

It may make sense now to make these members of the message required.
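
To illustrate the distinction (a sketch, not code from this PR): in these jansson-style format strings, s names a required key while s? names an optional one.

int ngpus = 0;  /* default retained when an optional key is absent */

/* "{s?:i}": optional -- unpack succeeds even if "ngpus" is missing
 * from the payload, leaving ngpus at its initial value of 0. */
int rc_opt = flux_request_unpack (msg, NULL, "{s?:i}", "ngpus", &ngpus);

/* "{s:i}": required -- unpack fails (errno EPROTO) if "ngpus" is
 * missing, so every client would have to send it. */
int rc_req = flux_request_unpack (msg, NULL, "{s:i}", "ngpus", &ngpus);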

@@ -43,6 +43,7 @@ local default_opts = {
['help'] = { char = 'h' },
['verbose'] = { char = 'v' },
['ntasks'] = { char = 'n', arg = "N" },
['ngpus'] = { char = 'g', arg = "g" },


@grondo

grondo Apr 20, 2018

Contributor

Unfortunately, the options structure in the Lua code is not auto-documenting like optparse, so documentation of the --ngpus option should also be added to wreck:usage().

Also consider adding it to the wreckrun and submit manpages, though unfortunately those need a full editing pass, so we can save that for later if you'd like.

@dongahn

Contributor Author

dongahn commented Apr 20, 2018

Thanks @grondo for the review. The pack/unpack issue is my bad -- a copy-and-paste error. I'll get to those suggestions.

@grondo

Contributor

grondo commented Apr 20, 2018

@dongahn, it may be helpful to add these sanity-check tests to t2000-wreck.t along with this PR. If you want, I can push this commit directly on top of your branch here:

diff --git a/t/t2000-wreck.t b/t/t2000-wreck.t
index bac86ac..a5c780d 100755
--- a/t/t2000-wreck.t
+++ b/t/t2000-wreck.t
@@ -209,6 +209,18 @@ test_expect_success 'wreckrun: -t2 -N${SIZE} sets correct ntasks in kvs' '
        n=$(flux kvs get --json ${LWJ}.ntasks) &&
        test "$n" = $((${SIZE}*2))
 '
+test_expect_success 'wreckrun: ngpus is 0 by default' '
+       flux wreckrun -n 2 /bin/true &&
+       LWJ=$(last_job_path) &&
+       n=$(flux kvs get --json ${LWJ}.ngpus) &&
+       test "$n" = "0"
+'
+test_expect_success 'wreckrun: -g, --ngpus sets ngpus in kvs' '
+       flux wreckrun -n 2 -g 4 /bin/true &&
+       LWJ=$(last_job_path) &&
+       n=$(flux kvs get --json ${LWJ}.ngpus) &&
+       test "$n" = "4"
+'
 
 test_expect_success 'wreckrun: fallback to old rank.N.cores format works' '
        flux wreckrun -N2 -n2 \
@grondo

Contributor

grondo commented Apr 20, 2018

> Thanks @grondo for the review. The pack/unpack issue is my bad -- a copy-and-paste error. I'll get to those suggestions.

No problem, it was unfortunate we didn't get your PR merged before the job module was rewritten.

I also noticed that the ngpus field isn't pulled from the request message in flux_request_unpack() in job_submit_only(). This is why I had suggested we move the unpack calls into a single place in the wreck_job interface. Sorry about that.

@garlick

Member

garlick commented Apr 20, 2018

> I also noticed that the ngpus field isn't pulled from the request message in flux_request_unpack() in job_submit_only(). This is why I had suggested we move the unpack calls into a single place in the wreck_job interface. Sorry about that.

My bad for not following through. As we just discussed, let's just add it here and get this merged; then I'll follow up with a PR to move the pack/unpack into wreck_job as originally proposed, so we don't have this code duplication going forward.
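
For illustration, the proposed consolidation might look something like this sketch (hypothetical function name; not code from this PR) -- a single helper in the wreck_job interface that both job_create_cb() and job_submit_only() call:

#include <flux/core.h>

/* Hypothetical helper: decode job request fields in one place so the
 * two request handlers cannot drift out of sync. */
int wreck_job_unpack_request (const flux_msg_t *msg, struct wreck_job *job)
{
    return flux_request_unpack (msg, NULL, "{s?:i s?:i s?:i s?:i}",
                                "ntasks", &job->ntasks,
                                "nnodes", &job->nnodes,
                                "ncores", &job->ncores,
                                "ngpus", &job->ngpus);
}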

@dongahn

Contributor Author

dongahn commented Apr 20, 2018

> @dongahn, it may be helpful to add these sanity-check tests to t2000-wreck.t along with this PR. If you want, I can push this commit directly on top of your branch here.

I will take care of it. I actually have GPU scheduling tests in sched using the butte hwloc XML + RDL, so this sanity check should do it.

@dongahn

Contributor Author

dongahn commented Apr 20, 2018

> it was unfortunate we didn't get your PR merged before the job module was rewritten.

> My bad for not following through.

I had some issues with the old PR, plus time constraints, that prevented it from going in. Not your fault at all!

@dongahn

Contributor Author

dongahn commented Apr 20, 2018

@twrs: is the -g submit option supposed to request N GPUs per node or a total of N GPUs? From the code, it looks to me like it requests N GPUs per node. Should the long option be named --gpus-per-node?

@dongahn

Contributor Author

dongahn commented Apr 20, 2018

Or maybe N GPUs per task?

dongahn added some commits Apr 19, 2018

submit/job/jsc: propagate gpu request information
Augmented version of @twrs' change.

Add -g to wreck to allow requesting GPUs.
Propagate that request information to the scheduler
through the job module and jsc.

@dongahn dongahn referenced this pull request Apr 20, 2018

Merged

GPU scheduling support #313

@dongahn

Contributor Author

dongahn commented Apr 20, 2018

Hmmm. One CI configuration failed with a timeout. Kicking it again.

@dongahn

Contributor Author

dongahn commented Apr 20, 2018

I also took the liberty of implementing -g as N GPUs per task.

@grondo

grondo approved these changes Apr 20, 2018

@grondo

Contributor

grondo commented Apr 20, 2018

Ok, looks like you addressed all my comments. Thanks!

@grondo grondo merged commit 28de35f into flux-framework:master Apr 20, 2018

4 checks passed

codecov/patch 81.25% of diff hit (target 78.77%)
codecov/project Absolute coverage decreased by -0.05% but relative coverage increased by +2.47% compared to f8aae6c
continuous-integration/travis-ci/pr The Travis CI build passed
coverage/coveralls Coverage decreased (-0.05%) to 79.023%
@grondo

Contributor

grondo commented Apr 20, 2018

Merged now in the hope that it helps you make progress on your flux-sched PR. If we need to change the behavior of -g, --gpus-per-task, we can do that later.

@dongahn

Contributor Author

dongahn commented Apr 20, 2018

Great! Thank you, this is helpful!

@grondo grondo referenced this pull request May 10, 2018

Closed

0.9.0 Release #1479
