
add flux job drain #2092

Merged: 10 commits into flux-framework:master from garlick:job_drain on Mar 26, 2019

Conversation

@garlick (Member) commented Mar 21, 2019

Add a "drain" method to the job manager which 1) shuts off new job submissions, then 2) waits for the queue size to reach zero. This supports the use case of an initial program that submits some jobs and exits as soon as they are all transitioned to INACTIVE.

Add a corresponding command, flux job drain [--timeout SECONDS], which uses the drain RPC. No API wrapper as yet; not sure whether one will be needed (easy to add if so).

This was described in #2044. I did make two changes from that description:

  • flux job drain --timeout N was implemented with flux_future_wait_for(), not built into the protocol (see the sketch after this list).
  • Multiple waiters are allowed.
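
Since there is no API wrapper yet, a caller would drive the RPC directly. Below is a minimal sketch of that approach, assuming an empty request payload and FLUX_NODEID_ANY routing (neither is confirmed in this thread); the --timeout handling is layered on the client side with flux_future_wait_for(), as described above:

```c
/* Minimal sketch: drive job-manager.drain directly, layering the
 * timeout on the client side with flux_future_wait_for() rather than
 * building it into the protocol.  An empty request payload is assumed.
 */
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <flux/core.h>

int main (int argc, char **argv)
{
    flux_t *h;
    flux_future_t *f;
    /* optional timeout in seconds; a negative value means wait forever */
    double timeout = argc > 1 ? strtod (argv[1], NULL) : -1.;

    if (!(h = flux_open (NULL, 0))) {
        fprintf (stderr, "flux_open: %s\n", strerror (errno));
        return 1;
    }
    if (!(f = flux_rpc (h, "job-manager.drain", NULL, FLUX_NODEID_ANY, 0))
        || flux_future_wait_for (f, timeout) < 0
        || flux_rpc_get (f, NULL) < 0) {
        fprintf (stderr, "drain: %s\n", strerror (errno));
        return 1;
    }
    flux_future_destroy (f);
    flux_close (h);
    return 0;
}
```

On timeout, flux_future_wait_for() fails with errno set to ETIMEDOUT, so the command can report a drain timeout while the queue stays disabled, matching the ctrl-c behavior discussed below.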
@grondo (Contributor) commented Mar 22, 2019

Very cool!

Should there be a way to cancel a drain?

It does seem like one use case might be to submit a set of work, wait for all jobs to complete, then "undrain" and do it all over again (or perhaps, once the queue has been emptied after a drain, it is open for business again by default).

I also wasn't sure what should happen if you ctrl-c out of a flux job drain; it seems like it should cancel the drain if there are no more waiters, but I don't have a strong opinion on that.

@garlick (Member, Author) commented Mar 22, 2019

I think a flux job undrain might be a good addition.

On ctrl-c or timeout, it feels to me like that's aborting the wait, but one would expect the queue disable to have already occurred? We can always change that behavior later once we get a feel for the tool, if it becomes evident that I am mistaken.

@grondo (Contributor) commented Mar 22, 2019

> On ctrl-c or timeout, it feels to me like that's aborting the wait, but one would expect the queue disable to have already occurred?

Yeah, that makes sense. I think the only reason the behavior comes into question is that flux job drain is really two operations, one of which is canceled with ctrl-c and the other of which is not. It is probably better if ctrl-c and timeout behave the same, so I think your current approach is correct (it probably won't be called interactively anyway).

A Contributor left a review comment:

Looks good!

I had one rambling comment inline, but feel free to ignore it if it is just a case of personal bias.

(Inline review thread on src/modules/job-manager/drain.c, resolved)
@garlick force-pushed the garlick:job_drain branch from 2681883 to f0d2dc7 on Mar 22, 2019
@garlick (Member, Author) commented Mar 22, 2019

Just force-pushed with a bit of refactoring in the submit code, to implement submit_enable() and submit_disable() instead of the boolean pointer passing, as discussed above (sketch below).

Also added flux job undrain.
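
For reference, a rough sketch of the shape that refactoring could take; the struct fields and parameter names here are invented, since the real signatures aren't quoted in this thread:

```c
/* Sketch only: submission state lives in the submit context, and the
 * drain/undrain handlers toggle it through a small API instead of
 * sharing a boolean pointer.  Field and parameter names are invented.
 */
#include <stdbool.h>
#include <stddef.h>

struct submit {
    bool enabled;                 /* accept new submit requests? */
    const char *disable_errmsg;   /* human-readable reason, e.g. draining */
};

void submit_enable (struct submit *submit)
{
    submit->enabled = true;
    submit->disable_errmsg = NULL;
}

void submit_disable (struct submit *submit, const char *errmsg)
{
    submit->enabled = false;
    submit->disable_errmsg = errmsg;  /* returned in the submit response */
}
```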

@grondo (Contributor) commented Mar 22, 2019

I'm getting a hang in t2202-job-manager.t here once every 10 runs or so (on fluke):

expecting success: 
	flux job drain 2>drain2.err &
	flux job undrain &&
	wait &&
	grep re-enabled drain2.err

I think I see a lt-flux job drain process still running, so I'm guessing it is the wait that is hanging. Could the backgrounded flux job drain request be arriving at the job manager after flux job undrain in rare cases?

@garlick (Member, Author) commented Mar 23, 2019

Thanks @grondo. I took your advice and moved that test to a small python script. LMK if the solution looks OK and I'll squash.

@grondo (Contributor) commented Mar 23, 2019

> I took your advice and moved that test to a small python script. LMK if the solution looks OK and I'll squash.

LGTM! Though I'm not the best judge of python code.

I wasn't able to get the test to hang, so the fix seems robust.

@grondo (Contributor) commented Mar 23, 2019

Looks like one code path that is missed in testing is invoking drain_complete_cb via the queue->empty_cb -- all the test cases seem to call drain with an already-empty queue. (This is from viewing coverage results.)
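
To make the uncovered path concrete, here is a self-contained toy version of the pattern under discussion; apart from drain_complete_cb, every name is invented, and this is not flux-core's actual code:

```c
/* Toy model of the uncovered path: a queue fires an empty callback
 * when its size drops to zero, which drain uses to defer its response
 * while jobs are still pending.
 */
#include <stdio.h>

struct queue {
    int size;
    void (*empty_cb) (void *arg);
    void *empty_arg;
};

static void queue_remove_one (struct queue *q)
{
    if (--q->size == 0 && q->empty_cb)
        q->empty_cb (q->empty_arg);   /* the path missing coverage */
}

static void drain_complete_cb (void *arg)
{
    printf ("queue empty: responding to %d drain waiter(s)\n",
            *(int *)arg);
}

int main (void)
{
    int waiters = 1;
    struct queue q = { .size = 2,
                       .empty_cb = drain_complete_cb,
                       .empty_arg = &waiters };

    /* drain requested while two jobs remain: the response is deferred */
    queue_remove_one (&q);   /* one job left, nothing fires */
    queue_remove_one (&q);   /* empty -> drain_complete_cb fires */
    return 0;
}
```

Exercising this in the sharness tests would mean calling drain while the queue is non-empty and then letting the queue drain to zero.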

@garlick (Member, Author) commented Mar 25, 2019

Hopefully this last addition addresses the missing coverage. I had to fix another issue so that the queue in the job-manager sharness test could actually reach zero size: there was missing logic to handle cleanup of jobs that were queued to be scheduled (but in the test never would be, since sched isn't loaded). Then it seemed necessary to add another python script to reliably order a drain request followed by a cancel request.
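
The ordering guarantee that such a script relies on can also be shown with the C API, for consistency with the rest of this page: two RPCs sent on the same handle reach the job manager in send order, so the earlier drain can be checked for the expected error once the undrain succeeds. A sketch, assuming empty payloads for both requests:

```c
/* Sketch: guarantee a drain request is followed by an undrain at the
 * job manager by sending both on one handle.  Empty payloads assumed.
 */
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <flux/core.h>

int main (void)
{
    flux_t *h;
    flux_future_t *f_drain, *f_undrain;

    if (!(h = flux_open (NULL, 0)))
        return 1;
    /* requests on one handle arrive at the job manager in send order */
    f_drain = flux_rpc (h, "job-manager.drain", NULL, FLUX_NODEID_ANY, 0);
    f_undrain = flux_rpc (h, "job-manager.undrain", NULL, FLUX_NODEID_ANY, 0);
    if (!f_drain || !f_undrain)
        return 1;

    if (flux_rpc_get (f_undrain, NULL) < 0)
        fprintf (stderr, "undrain: %s\n", strerror (errno));
    /* the earlier drain should now fail, indicating it was canceled */
    if (flux_rpc_get (f_drain, NULL) < 0)
        printf ("drain failed as expected: %s\n", strerror (errno));

    flux_future_destroy (f_drain);
    flux_future_destroy (f_undrain);
    flux_close (h);
    return 0;
}
```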

@codecov-io commented Mar 25, 2019

Codecov Report

Merging #2092 into master will decrease coverage by 0.08%.
The diff coverage is 74.28%.

@@            Coverage Diff             @@
##           master    #2092      +/-   ##
==========================================
- Coverage    80.3%   80.22%   -0.09%     
==========================================
  Files         196      197       +1     
  Lines       31308    31437     +129     
==========================================
+ Hits        25141    25219      +78     
- Misses       6167     6218      +51
Impacted Files Coverage Δ
src/modules/job-exec/job-exec.c 73.43% <ø> (ø) ⬆️
src/modules/job-manager/event.c 73.3% <100%> (+0.24%) ⬆️
src/modules/job-manager/queue.c 86.58% <100%> (+1.25%) ⬆️
src/modules/job-ingest/job-ingest.c 73.72% <100%> (ø) ⬆️
src/modules/job-manager/alloc.c 75.24% <100%> (+0.75%) ⬆️
src/modules/job-manager/job-manager.c 64% <50%> (-4.89%) ⬇️
src/modules/job-manager/drain.c 66.07% <66.07%> (ø)
src/cmd/flux-job.c 84.23% <68.96%> (-1.31%) ⬇️
src/modules/job-manager/submit.c 73.07% <86.66%> (+8.07%) ⬆️
src/modules/job-ingest/worker.c 65.03% <0%> (-7.7%) ⬇️
... and 12 more
@grondo (Contributor) commented Mar 25, 2019

Once #2104 goes in, is this PR ready for a merge?

@garlick (Member, Author) commented Mar 25, 2019

Yeah, I'll squash it down a bit first though.

@garlick (Member, Author) commented Mar 26, 2019

I have this squashed down and rebased on top of #2104. I'll hold off pushing until #2104 is merged.

garlick added 10 commits Mar 19, 2019
Add a way to register a callback when the size of
the queue drops to zero.
Add job-manager.drain RPC which disables job submission,
and responds when the queue size reaches zero.

There can be multiple drain requests active simultaneously,
allowing multiple waiters.

This provides a robust and simple mechanism for initial
programs that submit a bunch of work, wait for it to complete,
and then exit.

The job-manager.undrain RPC re-enables the queue and
causes any pending drain RPCs to receive an error.

job-manager: drain uses new submit enable/disable
Add the commands
  flux job drain [--timeout DURATION]
  flux job undrain

Drain disables new jobs from being accepted,
and blocks until the queue is empty.

Undrain enables new jobs to be accepted.

Fixes #2044
Problem: job-manager may refuse a submit request if
it is being drained, but its explanatory error message
is not returned to the user.

Change the logic to ensure that the human readable error
is returned in the ingest submit response.

Also disable logging at LOG_ERR level when submit fails,
since the drain error may now be expected in normal operation.
Add coverage to job-manager sharness test for
flux job drain and undrain.

Add two python helpers to reliably create
conditions for:

1) A drain request to block until queue size
   is reduced to zero.
2) A drain request to be "canceled" by an
   undrain request.

These tests, driven by t2202-job-manager.t,
require requests to arrive at the job manager
in a specific order, which is hard to achieve
by running one flux job command in the foreground
and one in the background.

Misc cleanup:
1) remove unused kvs_active() function
2) don't filter CLEANUP jobs from list_jobs()
   now that jobs are removed upon cancellation
   when scheduler isn't loaded
3) include job-info in test message about
   loading/unloading modules.
Add submit_enable() and submit_disable() to be used by
the drain/undrain request handlers to enable and disable
job submission.

Job submission is initially enabled, then can be toggled on
and off with the above functions.

If disabled, the submit request responds with a human-readable
error message.
As noted in the valgrind "job" workload, it's a bit
cleaner to simply call flux job drain at the end
of the script than to limit jobs to serial execution
and wait on the last one's eventlog.
Problem: There is a dangling "flags." in the comment block at
the top of event.c.

Remove it.
Problem: there is a typo, a formatting error, and an old
reference to "release final=1" (now a boolean) in the
comment block at the top of job-exec.c.

Tidy up.
Problem: if a job is queued for scheduling (SCHED state)
awaiting a slot to send an alloc request, and the job then
transitions to CLEANUP state, it is not cleaned up.

Add code to conditionally dequeue the job from the alloc queue
when CLEANUP state is entered.
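
For illustration, the fix in that last commit amounts to something like the following toy sketch; all names here are invented and are not flux-core's actual internals:

```c
/* Toy sketch of the CLEANUP fix: a job entering CLEANUP while still
 * sitting in the alloc queue (e.g. because no scheduler is loaded)
 * must be dequeued, or the queue can never reach zero size and a
 * drain never completes.  All names invented for illustration.
 */
#include <stdbool.h>

enum job_state { STATE_SCHED, STATE_RUN, STATE_CLEANUP, STATE_INACTIVE };

struct job {
    enum job_state state;
    bool in_alloc_queue;   /* queued awaiting an alloc request slot */
};

static void alloc_dequeue (struct job *job)
{
    job->in_alloc_queue = false;   /* stand-in for the real dequeue */
}

static void job_set_state (struct job *job, enum job_state state)
{
    job->state = state;
    /* the fix: conditionally dequeue on entry to CLEANUP */
    if (state == STATE_CLEANUP && job->in_alloc_queue)
        alloc_dequeue (job);
}

int main (void)
{
    struct job job = { .state = STATE_SCHED, .in_alloc_queue = true };
    job_set_state (&job, STATE_CLEANUP);   /* job leaves the alloc queue */
    return job.in_alloc_queue ? 1 : 0;
}
```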
@garlick force-pushed the garlick:job_drain branch from f54ffe1 to 7ac6d6d on Mar 26, 2019
@garlick (Member, Author) commented Mar 26, 2019

This is ready to go in, IMHO, once Travis turns green.

@codecov-io commented Mar 26, 2019

Codecov Report

Merging #2092 into master will decrease coverage by 0.08%.
The diff coverage is 74.28%.

@@            Coverage Diff             @@
##           master    #2092      +/-   ##
==========================================
- Coverage   80.35%   80.26%   -0.09%     
==========================================
  Files         196      197       +1     
  Lines       31320    31437     +117     
==========================================
+ Hits        25166    25233      +67     
- Misses       6154     6204      +50
Impacted Files Coverage Δ
src/modules/job-exec/job-exec.c 73.43% <ø> (ø) ⬆️
src/modules/job-manager/event.c 73.3% <100%> (+0.24%) ⬆️
src/modules/job-manager/queue.c 86.58% <100%> (+1.25%) ⬆️
src/modules/job-ingest/job-ingest.c 73.72% <100%> (ø) ⬆️
src/modules/job-manager/alloc.c 75.24% <100%> (+0.75%) ⬆️
src/modules/job-manager/job-manager.c 64% <50%> (-4.89%) ⬇️
src/modules/job-manager/drain.c 66.07% <66.07%> (ø)
src/cmd/flux-job.c 84.23% <68.96%> (-1.31%) ⬇️
src/modules/job-manager/submit.c 73.07% <86.66%> (+8.07%) ⬆️
src/modules/job-ingest/worker.c 65.03% <0%> (-7.7%) ⬇️
... and 6 more
@grondo merged commit 9628070 into flux-framework:master on Mar 26, 2019

2 checks passed:
  • Mergify — Summary: 1 potential rule
  • continuous-integration/travis-ci/pr: The Travis CI build passed
@grondo (Contributor) commented Mar 26, 2019

LGTM, merged.

@garlick (Member, Author) commented Mar 26, 2019

Thanks!
