-
Notifications
You must be signed in to change notification settings - Fork 49
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add job cancellation infrastructure to job-manager etc #1976
Conversation
Codecov Report
@@ Coverage Diff @@
## master #1976 +/- ##
==========================================
- Coverage 80.05% 79.97% -0.08%
==========================================
Files 195 195
Lines 35005 35068 +63
==========================================
+ Hits 28023 28046 +23
- Misses 6982 7022 +40
|
Any way to test these conditions? I gather not, but thought I'd ask
|
Not currently but will make sure that is covered in next PR for the scheduler API (when finally those flags can be set). |
I could merge this, but logically should #1985 go next? |
Now that the broker cleans up subprocess state when the exec service is torn down, the "simulated disconnect" code from the job-ingest module is no longer needed. Remove it.
Problem: the job queue is sorted, in part, by submit time, but uses the job id FLUID as an approximation of submit time, when it could use the actual submit time. Change the queue compare callback to compare job->t_submit instead of job->id.
Replace the specialized flux_job_purge() API function with a generalized flux_job_cancel(). Add a FLUX_JOB_PURGE flag which emulates the previous behavior. Add a new job flag, FLUX_JOB_CANCELED, which appears as a 'c' in flux job list output. In the job-manager, rneame purge to cancel. Rename the flux-job purge subcommand to cancel. Replace all occurrences of "flux job purge" with "flux job cancel --purge" in sharness tests.
Problem: if job-manager module is reloaded, canceled jobs (that have not been --purged) return to the queue. If a job's event log has the 'cancel' event, then set the FLUX_JOB_CANCELED flag on the job when reloading job state from the KVS. Then, at least for now, if a job with that bit set is retrieved during reload, don't add it to the queue.
If cancel request is sent without the FLUX_JOB_PURGE flag, then remove it from the queue and log 'cancel' event to the job's KVS eventlog. In addition, in preparation for the systems growing a scheduler and exec subsystem: - Set FLUX_JOB_CANCEL bit on the job in case job needs to remain in queue during cleanup. - Publish 'job-cancel' event message that may be generally useful for aborting sched/exec operations for a job.
In the t2201-job-cmd.t sharness test, Don't use --purge option with every flux job cancel, now that it is not required.
Verify that cancelling a job - appends 'cancel' event to event log - disappears from the queue - doesn't reappear if job-manager is reloaded
Restarted one builder (Centos 7 caliper) that failed with
Either way is fine with me. |
Did you happen to see if Travis is getting core backtraces anymore?
Since the flux-hwloc PR was merged the point is moot. |
Apologies on both counts: I didn't check for a stack trace and I didn't notice your comment here before merging the flux-hwloc PR. :-( |
I needed to add job cancellation support to be able to test the new scheduler interface. Rather than submit it with the scheduler stuff, it's submitted here as a standalone PR.
The "purge" function is removed
flux_job_purge()
API functionflux job purge ID
commandjob-manager.purge
service methodand is replaced with
flux_job_cancel()
API functionFLUX_JOB_PURGE
flag (to get the old behavior of purging KVS)flux job cancel [--purge] ID
commandjob-manager.cancel
service methodIn addition, cancel logs a
cancel
event to the job's eventlog (if it's not being purged), and publishes ajob-cancel
event message. The code is restructured to support the addition of other tasks that must complete before the response is sent to the user, such as freeing resources back to the scheduler or stopping execution.Some fairly minimal test coverage was added.
Finally, there was a bit of unrelated, minor cleanup in
job-manager
andjob-ingest
.