Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

qmanager: Add hello/exception callback support #493

Merged
merged 10 commits into from Jul 15, 2019

Conversation

@dongahn
Copy link
Contributor

dongahn commented Jul 12, 2019

This PR fills in hello callback and exception callback as required by job-manager. The hello callback allows qmanager to fill in its running queue as part of load sequence and the exception callback allows for canceling the pending jobs (i.e., jobs with pending alloc requests).

Note that the hello callback is working in progress towards full resilience. More comprehensive solution with respect to resource will require a solution for Issue #470. This isn't planned until later this year.

Resolve #492.

@dongahn dongahn changed the title Hello support qmanager: Add hello/exception callback support Jul 12, 2019
@dongahn dongahn requested review from grondo, SteVwonder and garlick Jul 12, 2019
@codecov-io

This comment has been minimized.

Copy link

codecov-io commented Jul 12, 2019

Codecov Report

Merging #493 into master will increase coverage by 0.52%.
The diff coverage is 80%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #493      +/-   ##
==========================================
+ Coverage   74.94%   75.46%   +0.52%     
==========================================
  Files          60       60              
  Lines        6074     6119      +45     
==========================================
+ Hits         4552     4618      +66     
+ Misses       1522     1501      -21
Impacted Files Coverage Δ
qmanager/policies/base/queue_policy_base.hpp 100% <100%> (ø) ⬆️
qmanager/modules/qmanager.cpp 75.72% <70.37%> (+9.06%) ⬆️
src/common/libschedutil/hello.c 55.17% <77.77%> (+32.95%) ⬆️
qmanager/policies/base/queue_policy_base_impl.hpp 82.35% <86.95%> (+10.18%) ⬆️
src/common/libschedutil/alloc.c 66.07% <0%> (+5.35%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ef69e63...777107f. Read the comment docs.

@garlick

This comment has been minimized.

Copy link
Member

garlick commented Jul 12, 2019

Recovery of userid, t_submit, priority from the job eventlog within the hello phase looks good to me, and so does including the jobid in the free callback. It might be useful to move the eventlog stuff into libschedutil alongside the code that reads in R from the KVS rather than embedding it in qmanager.

19c37e9 is the heart of this PR and doesn't have any commit description, so I would suggest adding a few paragraphs of explanation. Splitting out the changes to libschedutil to their own commit would be helpful in case we backport to flux-core, or to enable standalone testing.

@dongahn

This comment has been minimized.

Copy link
Contributor Author

dongahn commented Jul 12, 2019

It might be useful to move the eventlog stuff into libschedutil alongside the code that reads in R from the KVS rather than embedding it in qmanager.

Excellent suggestion! Other schedulers may need this through libschedutil. Will do.

Splitting out the changes to libschedutil to their own commit would be helpful in case we backport to flux-core, or to enable standalone testing.

I thought about this. But qmanager will no longer build at that commit. If having self contained commit is more important than making each commit buildable, I can do this. It seems this is one of those cases.

@garlick

This comment has been minimized.

Copy link
Member

garlick commented Jul 12, 2019

OK by me then.

Incidentally, my comment about standalone testing applies to the first suggestion (move eventlog code from qmanager to libschedutil), not the second one about splitting the commit. Sorry if that was confusing/silly.

@dongahn

This comment has been minimized.

Copy link
Contributor Author

dongahn commented Jul 12, 2019

Maybe, I can move the event code stuff into libschedutil without the API change -- one commit that solely contains schedutil changes alone.

And then the next commit can contain the bulk of the hello callback support and two-line API changes in libschedutil. This way, at least back porting becomes an inch easier...

@grondo

This comment has been minimized.

Copy link
Contributor

grondo commented Jul 12, 2019

Should we think about exporting libeventlog from libflux-core.so (or its own dso)? It seems more general purpose and less likely to change than libschedutil.

Also, are the changes here going to be easy to resolve with flux-framework/flux-core#2226?

@dongahn

This comment has been minimized.

Copy link
Contributor Author

dongahn commented Jul 12, 2019

Should we think about exporting libeventlog from libflux-core.so (or its own dso)? It seems more general purpose and less likely to change than libschedutil.

I think so -- ultimately. I imagine there would be more external users of libeventlog than schedulers and it does make sense. If this happens, I can latch on to it. (I don't think it is a priority though).

@dongahn

This comment has been minimized.

Copy link
Contributor Author

dongahn commented Jul 12, 2019

Also, are the changes here going to be easy to resolve with flux-framework/flux-core#2226?

We will see. But my changes will be pretty isolated with a minimal API change so hopefully this doesn't give too much trouble to @SteVwonder.

@garlick

This comment has been minimized.

Copy link
Member

garlick commented Jul 13, 2019

Just occurred to me - you can request both R and the eventlog through the job-info module with one RPC (instead of going direct to KVS)...

@grondo

This comment has been minimized.

Copy link
Contributor

grondo commented Jul 13, 2019

But doesn't job-info have to do 2 rpcs, so 3 in total and higher latency?

@garlick

This comment has been minimized.

Copy link
Member

garlick commented Jul 13, 2019

I think job-info fetches keys in parallel so probably the same latency. Yeah, never mind - he's already got code to fetch and parse the eventlog, so I think I'm not being especially helpful here!

@garlick

This comment has been minimized.

Copy link
Member

garlick commented Jul 13, 2019

Let's just fix the job manager to send the entire queue object in the hello response. I'll propose something in flux-core. Stand by...

@dongahn

This comment has been minimized.

Copy link
Contributor Author

dongahn commented Jul 13, 2019

Let's just fix the job manager to send the entire queue object in the hello response. I'll propose something in flux-core. Stand by...

OK. I will hold until I see your changes then. Thanks.

@dongahn

This comment has been minimized.

Copy link
Contributor Author

dongahn commented Jul 15, 2019

OK. I will hold until I see your changes then. Thanks.

@garlick: Thank you for the hello protocol changes. Just pushed the changes to make use of them. I believe I addressed all the comments that came up so far. Let me know if these are okay I will squash them before merge.

@garlick

This comment has been minimized.

Copy link
Member

garlick commented Jul 15, 2019

I think it's OK to squash.

I'll have one more look after you do that but I think the job-manager interface part looks good.

@dongahn dongahn force-pushed the dongahn:hello_support branch from 59d1a56 to 90a861d Jul 15, 2019
Done by @garlick. The commit msg:

job-manager: include job metadata in hello response

Problem: flux-sched wants to reconstruct queue entries
for running jobs when restarting, but "hello" response
only includes an array of job ID's.

Return an array of job objects in response to the hello
request, and in each object include the same data that
would have been received in an alloc request.

Update the hello response parser in libschedutil to
parse the new message format.
@dongahn dongahn force-pushed the dongahn:hello_support branch from 90a861d to 9858a79 Jul 15, 2019
@dongahn

This comment has been minimized.

Copy link
Contributor Author

dongahn commented Jul 15, 2019

Thanks @garlick. Squashed and forced a push.

@garlick

This comment has been minimized.

Copy link
Member

garlick commented Jul 15, 2019

315cce4 (add hello/exception callback) seems to include more change than is described in the commit message, e.g. modifications to the free callback error handling, a reordering of qmanager_new(), and queue policy interface changes. Add to commit message or split out interface changes to their own commit?

LGTM otherwise.

dongahn added 2 commits Jul 15, 2019
Make it easy to construct a job_t object without
having to set its class variables individually.
@dongahn dongahn force-pushed the dongahn:hello_support branch from 9858a79 to 14dacd6 Jul 15, 2019
@dongahn

This comment has been minimized.

Copy link
Contributor Author

dongahn commented Jul 15, 2019

315cce4 (add hello/exception callback) seems to include more change than is described in the commit message, e.g. modifications to the free callback error handling, a reordering of qmanager_new(), and queue policy interface changes. Add to commit message or split out interface changes to their own commit?

@garlick: I'd be happy to do that. I just split the commit into multiples. To minimize later squash, I took liberty forcing a push. From my perspective, this is ready to be merged.

@dongahn

This comment has been minimized.

Copy link
Contributor Author

dongahn commented Jul 15, 2019

Wait... just one more change.

dongahn added 7 commits Jul 15, 2019
lookup: Look up a job whose jobid is id

reconstruct: Append a job into the internal running-job queue
to reconstruct the queue state. Particularly helpful
when the client code has a queue redundancy implemented.
E.g., job-manager manages the main queue state and this
helps qmanager make use of job-manager’s queue info
to recover from failures or restart.
Test the correctness of qmanager as it is unloaded and reloaded.

Test hello callback and exception callback as part of this.
Register schedutil operation callbacks first to
avoid potential synchronization problems. E.g.,
an alloc request can be issued by job-manager
before our alloc request handler is registered.
Exception callback enables a pending job (a job
with a pending alloc request) to be removed from
the scheduler queue when an exception is raised.
Running jobs that received an exception will still
be handled when corresponding free requests are
received.
Check if the target job that received an exception
with severity=0 is in the pending queue and if so
remove it from the system; otherwise, no-op.
Hello callback allows the scheduler to reconstruct
the state for the currently running jobs.
Use the newly added jobid, priority, submit time
as well as the original R to reconstruct
its running job queue. This callback is
invoked for each running job by libschedutil.
@dongahn dongahn force-pushed the dongahn:hello_support branch from 14dacd6 to 777107f Jul 15, 2019
@dongahn

This comment has been minimized.

Copy link
Contributor Author

dongahn commented Jul 15, 2019

Done.

Copy link
Member

garlick left a comment

LGTM. I like the way you split that up.

Did you want me to press the button or did you want someone wither a better understanding of the new architecture to do it?

@dongahn

This comment has been minimized.

Copy link
Contributor Author

dongahn commented Jul 15, 2019

@garlick:

Thank you.

Did you want me to press the button or did you want someone wither a better understanding of the new architecture to do it?

Please go ahead and press the button. If there are any problems with other parts of the code, we can revisit that later. This will help meet our next milestones!

@garlick garlick merged commit 12e58f9 into flux-framework:master Jul 15, 2019
3 checks passed
3 checks passed
codecov/patch 80% of diff hit (target 74.94%)
Details
codecov/project 75.46% (+0.52%) compared to ef69e63
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@dongahn

This comment has been minimized.

Copy link
Contributor Author

dongahn commented Jul 15, 2019

Thanks!!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants
You can’t perform that action at this time.