
add job cancellation infrastructure to job-manager etc #1976

Merged
merged 7 commits into flux-framework:master from garlick:job_cancel Feb 2, 2019

Conversation

@garlick (Member) commented Feb 1, 2019

I needed to add job cancellation support to be able to test the new scheduler interface. Rather than submit it with the scheduler stuff, it's submitted here as a standalone PR.

The "purge" function is removed

  • flux_job_purge() API function
  • flux job purge ID command
  • job-manager.purge service method

and is replaced with

  • flux_job_cancel() API function
  • An optional FLUX_JOB_PURGE flag (to get the old behavior of purging the KVS)
  • flux job cancel [--purge] ID command
  • job-manager.cancel service method

In addition, cancel logs a cancel event to the job's eventlog (if it's not being purged), and publishes a job-cancel event message. The code is restructured to support the addition of other tasks that must complete before the response is sent to the user, such as freeing resources back to the scheduler or stopping execution.

Some fairly minimal test coverage was added.

Finally, there was a bit of unrelated, minor cleanup in job-manager and job-ingest.

@codecov-io commented Feb 1, 2019

Codecov Report

Merging #1976 into master will decrease coverage by 0.07%.
The diff coverage is 67.88%.

@@            Coverage Diff             @@
##           master    #1976      +/-   ##
==========================================
- Coverage   80.05%   79.97%   -0.08%     
==========================================
  Files         195      195              
  Lines       35005    35068      +63     
==========================================
+ Hits        28023    28046      +23     
- Misses       6982     7022      +40
Impacted Files Coverage Δ
src/modules/job-ingest/validate.c 83.92% <ø> (-1.08%) ⬇️
src/modules/job-manager/job-manager.c 68.85% <100%> (ø) ⬆️
src/common/libjob/job.c 68.18% <100%> (ø) ⬆️
src/modules/job-manager/queue.c 91.04% <100%> (ø) ⬆️
src/modules/job-manager/active.c 73.77% <100%> (+0.43%) ⬆️
src/modules/job-manager/cancel.c 63.86% <63.86%> (ø)
src/cmd/flux-job.c 89.62% <88.88%> (+0.04%) ⬆️
src/common/libflux/handle.c 84.16% <0%> (-2.04%) ⬇️
src/modules/connector-local/local.c 73.77% <0%> (-1.04%) ⬇️
... and 3 more
@garlick garlick force-pushed the garlick:job_cancel branch from 0040170 to 37191a3 Feb 1, 2019
@grondo (Contributor) commented Feb 1, 2019

Any way to test these conditions? I gather not, but thought I'd ask


    if ((flags & FLUX_JOB_PURGE)) {
        if ((job->flags & FLUX_JOB_RESOURCE_REQUESTED)) {
            errno = EPERM;
            errstr = "cannot purge job with pending resource request";
            goto error;
        }
        if ((job->flags & FLUX_JOB_RESOURCE_ALLOCATED)) {
            errno = EPERM;
            errstr = "cannot purge job with allocated resources";
            goto error;
        }
    }


@garlick (Member, Author) commented Feb 1, 2019

Not currently, but I'll make sure that is covered in the next PR for the scheduler API (when those flags can finally be set).

@garlick garlick force-pushed the garlick:job_cancel branch from 37191a3 to 2413db2 Feb 2, 2019
@grondo (Contributor) commented Feb 2, 2019

I could merge this, but logically should #1985 go next?

garlick added 7 commits Jan 26, 2019
Now that the broker cleans up subprocess state when
the exec service is torn down, the "simulated disconnect"
code from the job-ingest module is no longer needed.

Remove it.
Problem: the job queue is sorted, in part, by
submit time, but uses the job id FLUID as an
approximation of submit time, when it could use
the actual submit time.

Change the queue compare callback to compare
job->t_submit instead of job->id.
Replace the specialized flux_job_purge() API function
with a generalized flux_job_cancel().  Add a
FLUX_JOB_PURGE flag which emulates the previous
behavior.

Add a new job flag, FLUX_JOB_CANCELED, which appears
as a 'c' in flux job list output.

In the job-manager, rename purge to cancel.

Rename the flux-job purge subcommand to cancel.

Replace all occurrences of "flux job purge" with
"flux job cancel --purge" in sharness tests.
If a cancel request is sent without the FLUX_JOB_PURGE
flag, remove the job from the queue and log a 'cancel'
event to the job's KVS eventlog.

In addition, in preparation for the system growing
scheduler and exec subsystems:

- Set the FLUX_JOB_CANCELED bit on the job in case the job
  needs to remain in the queue during cleanup.

- Publish 'job-cancel' event message that may be
  generally useful for aborting sched/exec operations
  for a job.
Problem: if job-manager module is reloaded, canceled
jobs (that have not been --purged) return to the queue.

If a job's event log has the 'cancel' event, then
set the FLUX_JOB_CANCELED flag on the job when
reloading job state from the KVS.  Then, at least for now,
if a job with that bit set is retrieved during reload,
don't add it to the queue.
In the t2201-job-cmd.t sharness test, don't use the
--purge option with every flux job cancel,
now that it is not required.
Verify that cancelling a job:
- appends a 'cancel' event to the job's eventlog
- removes the job from the queue
- does not let the job reappear if job-manager is reloaded
@garlick garlick force-pushed the garlick:job_cancel branch from 2413db2 to 3111bfd Feb 2, 2019
@garlick (Member, Author) commented Feb 2, 2019

Restarted one builder (CentOS 7 caliper) that failed with:

expecting success: 
    flux hwloc reload --rank=all $sierra &&
    flux kvs get resource.hwloc.by_rank | $jq -S . > sierra.out &&
    cat <<-EOF | $jq -S . > sierra.expected &&
	{"[0-1]":
          {"NUMANode": 6, "Package": 2, "Core": 44, "PU": 176, "GPU": 4}
        }
	EOF
    test_cmp sierra.expected sierra.out
./sharness.sh: line 344:  7890 Segmentation fault      (core dumped) flux hwloc reload --rank=all $sierra
not ok 11 - hwloc: reload xml with GPU resources

> I could merge this, but logically should #1985 go next?

Either way is fine with me.

@grondo (Contributor) commented Feb 2, 2019

Did you happen to see if Travis is getting core backtraces anymore?

> Either way is fine with me.

Since the flux-hwloc PR was merged, the point is moot.

@grondo grondo merged commit 0f688ce into flux-framework:master Feb 2, 2019

1 check passed: continuous-integration/travis-ci/pr (The Travis CI build passed)
@garlick (Member, Author) commented Feb 2, 2019

Apologies on both counts: I didn't check for a stack trace and I didn't notice your comment here before merging the flux-hwloc PR. :-(

@garlick garlick deleted the garlick:job_cancel branch Feb 4, 2019
3 participants