
add job cancellation infrastructure to job-manager etc #1976

Merged
merged 7 commits into flux-framework:master from garlick:job_cancel Feb 2, 2019

Conversation

@garlick (Member) commented Feb 1, 2019

I needed to add job cancellation support to be able to test the new scheduler interface. Rather than submit it with the scheduler stuff, it's submitted here as a standalone PR.

The "purge" function is removed

  • flux_job_purge() API function
  • flux job purge ID command
  • job-manager.purge service method

and is replaced with

  • flux_job_cancel() API function
  • An optional FLUX_JOB_PURGE flag (to get the old behavior of purging the KVS)
  • flux job cancel [--purge] ID command
  • job-manager.cancel service method

In addition, cancel logs a cancel event to the job's eventlog (if it's not being purged), and publishes a job-cancel event message. The code is restructured to support the addition of other tasks that must complete before the response is sent to the user, such as freeing resources back to the scheduler or stopping execution.

Some fairly minimal test coverage was added.

Finally, there was a bit of unrelated, minor cleanup in job-manager and job-ingest.

@codecov-io commented Feb 1, 2019

Codecov Report

Merging #1976 into master will decrease coverage by 0.07%.
The diff coverage is 67.88%.

@@            Coverage Diff             @@
##           master    #1976      +/-   ##
==========================================
- Coverage   80.05%   79.97%   -0.08%     
==========================================
  Files         195      195              
  Lines       35005    35068      +63     
==========================================
+ Hits        28023    28046      +23     
- Misses       6982     7022      +40
Impacted Files Coverage Δ
src/modules/job-ingest/validate.c 83.92% <ø> (-1.08%) ⬇️
src/modules/job-manager/job-manager.c 68.85% <100%> (ø) ⬆️
src/common/libjob/job.c 68.18% <100%> (ø) ⬆️
src/modules/job-manager/queue.c 91.04% <100%> (ø) ⬆️
src/modules/job-manager/active.c 73.77% <100%> (+0.43%) ⬆️
src/modules/job-manager/cancel.c 63.86% <63.86%> (ø)
src/cmd/flux-job.c 89.62% <88.88%> (+0.04%) ⬆️
src/common/libflux/handle.c 84.16% <0%> (-2.04%) ⬇️
src/modules/connector-local/local.c 73.77% <0%> (-1.04%) ⬇️
... and 3 more
@garlick garlick force-pushed the garlick:job_cancel branch from 0040170 to 37191a3 Feb 1, 2019
@grondo (Contributor) commented Feb 1, 2019

Any way to test these conditions? I gather not, but thought I'd ask


    if ((flags & FLUX_JOB_PURGE)) {
        if ((job->flags & FLUX_JOB_RESOURCE_REQUESTED)) {
            errno = EPERM;
            errstr = "cannot purge job with pending resource request";
            goto error;
        }
        if ((job->flags & FLUX_JOB_RESOURCE_ALLOCATED)) {
            errno = EPERM;
            errstr = "cannot purge job with allocated resources";
            goto error;
        }
    }


@garlick (Member, Author) commented Feb 1, 2019

Not currently, but I'll make sure that is covered in the next PR for the scheduler API (when those flags can finally be set).

@garlick garlick force-pushed the garlick:job_cancel branch from 37191a3 to 2413db2 Feb 2, 2019
@grondo (Contributor) commented Feb 2, 2019

I could merge this, but logically should #1985 go next?

garlick added 7 commits Jan 26, 2019
Now that the broker cleans up subprocess state when
the exec service is torn down, the "simulated disconnect"
code from the job-ingest module is no longer needed.

Remove it.
Problem: the job queue is sorted, in part, by
submit time, but uses the job id FLUID as an
approximation of submit time, when it could use
the actual submit time.

Change the queue compare callback to compare
job->t_submit instead of job->id.
Replace the specialized flux_job_purge() API function
with a generalized flux_job_cancel().  Add a
FLUX_JOB_PURGE flag which emulates the previous
behavior.

Add a new job flag, FLUX_JOB_CANCELED, which appears
as a 'c' in flux job list output.

In the job-manager, rename purge to cancel.

Rename the flux-job purge subcommand to cancel.

Replace all occurrences of "flux job purge" with
"flux job cancel --purge" in sharness tests.
If a cancel request is sent without the FLUX_JOB_PURGE
flag, remove the job from the queue and log a 'cancel'
event to the job's KVS eventlog.

In addition, in preparation for the system growing
scheduler and exec subsystems:

- Set the FLUX_JOB_CANCELED bit on the job in case the job
  needs to remain in the queue during cleanup.

- Publish 'job-cancel' event message that may be
  generally useful for aborting sched/exec operations
  for a job.
Problem: if job-manager module is reloaded, canceled
jobs (that have not been --purged) return to the queue.

If a job's event log has the 'cancel' event, then
set the FLUX_JOB_CANCELED flag on the job when
reloading job state from the KVS.  Then, at least for now,
if a job with that bit set is retrieved during reload,
don't add it to the queue.
In the t2201-job-cmd.t sharness test, don't use the
--purge option with every flux job cancel,
now that it is not required.
Verify that cancelling a job:
- appends a 'cancel' event to the job's eventlog
- removes the job from the queue
- does not let the job reappear if job-manager is reloaded
@garlick garlick force-pushed the garlick:job_cancel branch from 2413db2 to 3111bfd Feb 2, 2019
@garlick (Member, Author) commented Feb 2, 2019

Restarted one builder (CentOS 7 caliper) that failed with:

expecting success: 
    flux hwloc reload --rank=all $sierra &&
    flux kvs get resource.hwloc.by_rank | $jq -S . > sierra.out &&
    cat <<-EOF | $jq -S . > sierra.expected &&
	{"[0-1]":
          {"NUMANode": 6, "Package": 2, "Core": 44, "PU": 176, "GPU": 4}
        }
	EOF
    test_cmp sierra.expected sierra.out
./sharness.sh: line 344:  7890 Segmentation fault      (core dumped) flux hwloc reload --rank=all $sierra
not ok 11 - hwloc: reload xml with GPU resources

> I could merge this, but logically should #1985 go next?

Either way is fine with me.

@grondo (Contributor) commented Feb 2, 2019

Did you happen to see if Travis is getting core backtraces anymore?

> Either way is fine with me.

Since the flux-hwloc PR was merged, the point is moot.

@grondo grondo merged commit 0f688ce into flux-framework:master Feb 2, 2019

1 check passed: continuous-integration/travis-ci/pr (The Travis CI build passed)
@garlick (Member, Author) commented Feb 2, 2019

Apologies on both counts: I didn't check for a stack trace and I didn't notice your comment here before merging the flux-hwloc PR. :-(

@garlick garlick deleted the garlick:job_cancel branch Feb 4, 2019
3 participants