Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

libschedutil: extend hello protocol #2231

Merged
merged 3 commits into from Jul 13, 2019

Conversation

@garlick
Copy link
Member

commented Jul 13, 2019

Per discussion in flux-framework/flux-sched#493.

The job-manger - scheduler "hello" handshake supports scheduler state recovery. The initial interface in libschedutil issued a callback containing only R for each job that has resources allocated. Internally the job manager returned an array of job ID's, which libschedutil used to look up corresponding R values in the KVS.

In addition to simply marking resources allocated, flux-sched wants to reconstruct queue entries for running jobs, so it needs the same metadata that was passed in with the alloc request. This is low overhead since the data is part of the active job record that job-manager keeps in memory. This PR extends the hello handshake to send an array of "job objects" instead of an array of job ID's. The job object contains job ID, priority, userid, and submit time.

The hello_f callback prototype is expanded to include the new information in addition to R, and the two users in flux-core (sched-simple and the test dummy scheduler) are updated to use the new prototype, although they don't do anything with the new data.

Problem: flux-sched wants to reconstruct queue entries
for running jobs when restarting, but "hello" response
only includes an array of job ID's.

Return an array of job objects in response to the hello
request, and in each object include the same data that
would have been received in an alloc request.

Update the hello response parser in libschedutil to
parse the new message format.
@@ -18,7 +18,13 @@
* Failure of the callback aborts iteration and causes schedutil_hello()
* to return -1 with errno passed through.
*/
typedef int (hello_f)(flux_t *h, const char *R, void *arg);
typedef int (hello_f)(flux_t *h,

This comment has been minimized.

Copy link
@dongahn

dongahn Jul 13, 2019

Contributor

A nit: maybe revise the comment about saying R and other info being passed?

@dongahn

This comment has been minimized.

Copy link
Contributor

commented Jul 13, 2019

LGTM! Just added one inline comment: take it or not.

Copy link
Contributor

left a comment

LGTM! Thanks @garlick.

Add job id, priority, userid, and submit time to the hello_f
callback, now that this information is supplied in the
hello response message.

Update users in sched-simple.

Update sched-dummy test code.
@garlick

This comment has been minimized.

Copy link
Member Author

commented Jul 13, 2019

I'll go ahead and squash that last commit and force a push.

@garlick garlick force-pushed the garlick:schedutil_hello_extended branch from 0ecca0d to 6164f84 Jul 13, 2019
@grondo
grondo approved these changes Jul 13, 2019
Copy link
Contributor

left a comment

This is great!

In the interest of expediency I don't necessarily suggest it here, but even though sched-dummy doesn't use the extra metadata it might be good to add some sanity checks in its hello_cb just to catch easy errors (since there's no users in flux-core)

@garlick

This comment has been minimized.

Copy link
Member Author

commented Jul 13, 2019

Yeah that was bugging me a bit, let me try to do something there.

@codecov-io

This comment has been minimized.

Copy link

commented Jul 13, 2019

Codecov Report

Merging #2231 into master will decrease coverage by 0.01%.
The diff coverage is 76.92%.

@@            Coverage Diff             @@
##           master    #2231      +/-   ##
==========================================
- Coverage   80.72%   80.71%   -0.02%     
==========================================
  Files         202      202              
  Lines       32295    32297       +2     
==========================================
- Hits        26069    26067       -2     
- Misses       6226     6230       +4
Impacted Files Coverage Δ
src/modules/sched-simple/sched.c 73.26% <100%> (ø) ⬆️
src/modules/job-manager/alloc.c 76.23% <66.66%> (ø) ⬆️
src/common/libschedutil/hello.c 55.17% <77.77%> (-4.09%) ⬇️
src/common/libflux/mrpc.c 87.79% <0%> (-1.19%) ⬇️
src/cmd/flux-module.c 84.19% <0%> (+0.23%) ⬆️
Emit some debug in the hello callback in sched-dummy,
then look for it in dmesg in t2203-job-manager-dummysched.t.
@grondo

This comment has been minimized.

Copy link
Contributor

commented Jul 13, 2019

restarted a builder that failed in valgrind with #2195

@grondo

This comment has been minimized.

Copy link
Contributor

commented Jul 13, 2019

Looks great, thanks!

@grondo grondo merged commit 26f0119 into flux-framework:master Jul 13, 2019
2 checks passed
2 checks passed
Summary 1 potential rule
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@garlick

This comment has been minimized.

Copy link
Member Author

commented Jul 13, 2019

Thanks for that!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants
You can’t perform that action at this time.