add job cancellation infrastructure to job-manager etc #1976
I needed to add job cancellation support to be able to test the new scheduler interface. Rather than submit it with the scheduler stuff, it's submitted here as a standalone PR.
The "purge" function is removed
and is replaced with
In addition, cancel logs a
Some fairly minimal test coverage was added.
Finally, there was a bit of unrelated, minor cleanup in
@@            Coverage Diff             @@
##           master    #1976      +/-   ##
==========================================
- Coverage   80.05%   79.97%   -0.08%
==========================================
  Files         195      195
  Lines       35005    35068      +63
==========================================
+ Hits        28023    28046      +23
- Misses       6982     7022      +40
Any way to test these conditions? I gather not, but thought I'd ask
Now that the broker cleans up subprocess state when the exec service is torn down, the "simulated disconnect" code from the job-ingest module is no longer needed. Remove it.
Problem: the job queue is sorted, in part, by submit time, but uses the job id FLUID as an approximation of submit time, when it could use the actual submit time. Change the queue compare callback to compare job->t_submit instead of job->id.
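As a rough illustration of the change described in this commit, a compare callback keyed on submit time might look like the sketch below. The `struct job` layout is a hypothetical, pared-down stand-in rather than the job-manager's real structure, and the `priority` field (higher sorts first) is an assumption covering the "in part" above. Only the `t_submit` comparison comes from the commit message.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, pared-down job record; not the job-manager's real struct. */
struct job {
    uint64_t id;      /* job id (FLUID), no longer used for ordering */
    int priority;     /* assumption: higher priority sorts first */
    double t_submit;  /* actual submit time, seconds since the epoch */
};

/* qsort-style compare callback.
 * Orders by priority (descending), then by t_submit (ascending).
 * The job id is deliberately not consulted.
 */
static int job_compare (const void *x, const void *y)
{
    const struct job *a = x;
    const struct job *b = y;

    assert (a != NULL && b != NULL);
    if (a->priority != b->priority)
        return a->priority > b->priority ? -1 : 1;
    if (a->t_submit < b->t_submit)
        return -1;
    if (a->t_submit > b->t_submit)
        return 1;
    return 0;
}
```

With this ordering, a job with a numerically larger id but an earlier `t_submit` sorts ahead of a later submission at the same priority.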
Replace the specialized flux_job_purge() API function with a generalized flux_job_cancel(). Add a FLUX_JOB_PURGE flag which emulates the previous behavior. Add a new job flag, FLUX_JOB_CANCELED, which appears as a 'c' in flux job list output. In the job-manager, rename purge to cancel. Rename the flux-job purge subcommand to cancel. Replace all occurrences of "flux job purge" with "flux job cancel --purge" in sharness tests.
If a cancel request is sent without the FLUX_JOB_PURGE flag, then remove the job from the queue and log a 'cancel' event to the job's KVS eventlog. In addition, in preparation for the system growing a scheduler and exec subsystem:
- Set the FLUX_JOB_CANCELED bit on the job in case the job needs to remain in the queue during cleanup.
- Publish a 'job-cancel' event message that may be generally useful for aborting sched/exec operations for a job.
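The cancel path described in this commit can be modeled with a small, self-contained sketch. Everything here is illustrative: the `EX_JOB_*` constants are placeholders standing in for FLUX_JOB_PURGE and FLUX_JOB_CANCELED (the real values are not shown in this PR), the structs are hypothetical, and the KVS eventlog and event publication are reduced to counters. The purge branch assumes that the previous purge behavior removed the job's KVS data without logging anything, which the commit message does not spell out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Placeholder bit values, not the real FLUX_JOB_* definitions. */
enum {
    EX_JOB_PURGE = 1,     /* request flag: stands in for FLUX_JOB_PURGE */
    EX_JOB_CANCELED = 2,  /* job flag: stands in for FLUX_JOB_CANCELED */
};

/* Hypothetical job state, reduced to what the cancel path touches. */
struct job {
    int flags;
    bool queued;        /* true while the job is in the queue */
    bool in_kvs;        /* true while the job's KVS data exists */
    int cancel_logged;  /* number of 'cancel' events in the eventlog */
};

/* Hypothetical manager state: counts published 'job-cancel' events. */
struct manager {
    int job_cancel_published;
};

static void job_cancel (struct manager *mgr, struct job *job, int flags)
{
    assert (mgr != NULL && job != NULL);

    job->queued = false;            /* both paths dequeue the job */
    if ((flags & EX_JOB_PURGE)) {
        job->in_kvs = false;        /* assumed purge semantics */
        return;
    }
    job->flags |= EX_JOB_CANCELED;  /* mark in case the job must linger */
    job->cancel_logged++;           /* models the 'cancel' eventlog entry */
    mgr->job_cancel_published++;    /* models the 'job-cancel' event */
}
```

In this model a plain cancel leaves the job's KVS record intact with a 'cancel' entry, which is what lets a later reload recognize the job as canceled.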
Problem: if job-manager module is reloaded, canceled jobs (that have not been --purged) return to the queue. If a job's event log has the 'cancel' event, then set the FLUX_JOB_CANCELED flag on the job when reloading job state from the KVS. Then, at least for now, if a job with that bit set is retrieved during reload, don't add it to the queue.
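A minimal sketch of the reload check follows. It assumes, purely for illustration, an eventlog with one event per line in the form `timestamp name [context]`, which may not match the actual KVS eventlog encoding. `eventlog_has_event()`, `job_restore()`, and `EX_JOB_CANCELED` are hypothetical names, not job-manager functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

enum { EX_JOB_CANCELED = 2 };  /* placeholder for FLUX_JOB_CANCELED */

struct job {
    int flags;
};

/* Return true if any line of 'log' names the event 'name'.
 * Assumed line format: "<timestamp> <name> [context]\n".
 * Lines of 256 bytes or more are skipped in this sketch.
 */
static bool eventlog_has_event (const char *log, const char *name)
{
    assert (log != NULL && name != NULL);
    while (*log != '\0') {
        size_t len = strcspn (log, "\n");
        char line[256];
        char event[64];
        double timestamp;

        if (len < sizeof (line)) {
            memcpy (line, log, len);
            line[len] = '\0';
            if (sscanf (line, "%lf %63s", &timestamp, event) == 2
                && strcmp (event, name) == 0)
                return true;
        }
        log += len;
        if (*log == '\n')
            log++;
    }
    return false;
}

/* Restore flags from the eventlog.
 * Returns true if the job should be re-added to the queue.
 */
static bool job_restore (struct job *job, const char *eventlog)
{
    assert (job != NULL);
    if (eventlog_has_event (eventlog, "cancel"))
        job->flags |= EX_JOB_CANCELED;
    return !(job->flags & EX_JOB_CANCELED);
}
```

Only the event name field is matched, so the word "cancel" appearing in another event's context does not mark the job as canceled.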
In the t2201-job-cmd.t sharness test, don't use the --purge option with every flux job cancel, now that it is not required.
Restarted one builder (Centos 7 caliper) that failed with
Either way is fine with me.