job-manager: add job state transition announcements #2109
As discussed in #2094, this PR adds a simple notification mechanism for bulk job state transitions based on published event messages. This is just the job-manager end. The events could be consumed by subscribing to
Eventually we may want to put a nicer interface in front of this, but I think that could be a separate PR and will require some more design (and maybe some experience using the raw event interface first?)
Each job state transition is represented as a [jobid,state] tuple, and is batched up in array form for publication after the event that triggered the state transition has landed in the KVS. Each published message can include state transitions for multiple jobs, and multiple transitions per job. Here's a little snippet of
I went with this form rather than a possibly more compact one (such as state => id array in an object) so that the order of all the transitions is preserved, in case that ends up being important.
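To illustrate the ordering argument, here is a minimal Python sketch comparing the ordered tuple array with the more compact state => id array form. The job IDs, the state names in the sample batch, and the helper names `group_by_state` and `states_for_job` are made up for illustration and are not taken from the job-manager code.

```python
from collections import defaultdict

# Hypothetical batch of [jobid, state] transitions, in the order they occurred.
transitions = [
    [1001, "DEPEND"],
    [1002, "DEPEND"],
    [1001, "SCHED"],
    [1002, "SCHED"],
]

def group_by_state(batch):
    """Compact alternative: map each state to the job IDs that entered it."""
    grouped = defaultdict(list)
    for jobid, state in batch:
        grouped[state].append(jobid)
    return dict(grouped)

def states_for_job(batch, jobid):
    """Recover one job's transitions, in order, from the ordered form."""
    return [state for jid, state in batch if jid == jobid]

grouped = group_by_state(transitions)
per_job = states_for_job(transitions, 1001)
```

The grouped form keeps which jobs entered each state, but it no longer records how transitions of different states were interleaved within the batch (for example, whether job 1002 entered DEPEND before or after job 1001 entered SCHED). The ordered array keeps that information.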
Along the way I added the DEPEND state (which immediately triggers a
I still need to add a test that subscribes to these events and ensures all the state transitions are published for a group of jobs, but wanted to get this posted for early review.
event_job_post() and event_job_post_pack() both include a callback mechanism, so that the caller can be informed upon completion of the batched KVS commit of eventlog entries. This was originally intended for both error handling and as a way to usefully tie state notification to KVS updates. The errors are now handled directly in the callback. We will generate state notification events directly as well, so it is best for simplicity to remove this mechanism that is otherwise not used.
Since the states are identified as strings anyway, did you consider using the RFC 21 state names instead of single character representations? In the end I suppose it doesn't matter because likely only code and scripts will be examining these messages, but some of the single character choices do conflict with existing resource manager "single character" states, so just wondering if saving a few characters is worth it.
Ok, looks like this now:
I think I will drop that last commit that tries to cover unloading the job manager with KVS commits/event pubs in flight, as it's racy with respect to the timing of the KVS/rank 0 broker. For two codecov runs, I got coverage the first time and not the second time. If this goes in then people will waste time trying to find coverage drops that are just due to this race. I'll force a push and then I think this is probably ready to go in @chu11.
@@            Coverage Diff             @@
##           master    #2109      +/-   ##
==========================================
+ Coverage   80.29%   80.31%   +0.02%
==========================================
  Files         197      197
  Lines       31483    31546      +63
==========================================
+ Hits        25278    25337      +59
- Misses       6205     6209       +4
Add a JSON array to the KVS batch machinery in event.c that tracks [jobid, state] tuples. When the post function handles an event that triggers a job state change, add the job and its new state to this array. Once the triggering event(s) are committed to the KVS, publish a job-state event message, notifying interested parties that a set of jobs have transitioned to a new state. The fact that this comes after the KVS commit has completed maintains the invariant from RFC 21 that notification of certain state changes indicates that info, such as exception events or exec finish events, is fetchable from the KVS. The timed batch machinery, already there to mitigate the load on the KVS from bursts of job events, now helps reduce the number of published event messages. Each message will contain more job IDs, and the latency will increase a bit from event occurrence to notification.
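The commit-then-publish ordering described above can be modeled roughly as follows. This is a simplified Python sketch, not the code in event.c: the `EventBatch` class, the `commit` and `publish` callbacks, and the `transitions` payload key are all hypothetical, and the commit is synchronous here whereas the real KVS commit is asynchronous.

```python
class EventBatch:
    """Toy model of a batch: pending eventlog entries plus state transitions."""

    def __init__(self, commit, publish):
        self.commit = commit        # hypothetical: writes entries to the KVS
        self.publish = publish      # hypothetical: publishes one event message
        self.entries = []           # (jobid, event) pairs awaiting commit
        self.transitions = []       # [jobid, state] pairs awaiting publication

    def post(self, jobid, event, new_state=None):
        """Queue an eventlog entry; record a transition if the event causes one."""
        self.entries.append((jobid, event))
        if new_state is not None:
            self.transitions.append([jobid, new_state])

    def flush(self):
        """Run when the batch timer fires: commit first, then announce."""
        if not self.entries:
            return
        entries, transitions = self.entries, self.transitions
        self.entries, self.transitions = [], []
        self.commit(entries)
        if transitions:
            self.publish({"transitions": transitions})


# Usage: record the order in which the callbacks run.
log = []
batch = EventBatch(
    commit=lambda entries: log.append(("commit", len(entries))),
    publish=lambda msg: log.append(("publish", msg)),
)
batch.post(1001, "submit", new_state="DEPEND")
batch.post(1001, "depend", new_state="SCHED")
batch.post(1002, "submit", new_state="DEPEND")
batch.flush()
```

Three events yield one commit and one published message, and the message is only sent after the commit callback has run, which is the property the RFC 21 invariant relies on.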
Add functions for converting between strings and job states. Use flux_job_statetostr() instead of a local function in 'flux job list'. Add coverage for new functions to libjob unit test.
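As a rough illustration of the string/state conversion, here is a Python analog. It is not the libjob C API: the function names `statetostr` and `strtostate`, the numeric values, and the first-letter short forms are assumptions made for the sketch. The state names are assumed to be the RFC 21 ones.

```python
from enum import Enum

class JobState(Enum):
    # Names assumed from RFC 21; numeric values are illustrative only.
    NEW = 1
    DEPEND = 2
    SCHED = 3
    RUN = 4
    CLEANUP = 5
    INACTIVE = 6

def statetostr(state, single_char=False):
    """Return the state name, or (by assumption) its first letter."""
    return state.name[0] if single_char else state.name

def strtostate(s):
    """Parse a full state name or a single-character form; raise on unknown input."""
    for state in JobState:
        if s == state.name or s == state.name[0]:
            return state
    raise ValueError(f"unknown job state: {s!r}")
```

A round trip such as `strtostate(statetostr(JobState.DEPEND))` returns `JobState.DEPEND`, and an unrecognized string raises `ValueError`.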
Transition NEW->DEPEND state on the submit event. Then emit a depend event in DEPEND state that transitions to SCHED state.
Update job unit test.
Update job-info-security sharness test:
- add a 'flux job wait-event depend' before reading the eventlog to rewrite it.
- use a method of editing eventlog in place that should be a bit more robust towards preserving formatting:
Refactor event_job_post() so that event_batch_pub_state() can be called separately. Then call it for the 'submit' event to get the first state transition (NEW->DEPEND) announced.
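The first two transitions can be summarized as a small lookup table. This Python sketch covers only the NEW->DEPEND and DEPEND->SCHED steps named in these commits; the `TRANSITIONS` table and the `apply_event` and `replay` helpers are hypothetical and do not mirror the structure of event.c.

```python
# (current state, event name) -> new state, for the transitions described above.
TRANSITIONS = {
    ("NEW", "submit"): "DEPEND",
    ("DEPEND", "depend"): "SCHED",
}

def apply_event(state, event):
    """Return the new state, or the unchanged state if no transition applies."""
    return TRANSITIONS.get((state, event), state)

def replay(events, state="NEW"):
    """Replay events for one job and collect the states that would be announced."""
    announced = []
    for event in events:
        new_state = apply_event(state, event)
        if new_state != state:
            announced.append(new_state)
            state = new_state
    return announced
```

Replaying `["submit", "depend"]` from NEW announces DEPEND and then SCHED, so a subscriber sees the NEW->DEPEND transition rather than having it skipped.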
Add a python script that subscribes to job state change event notifications, then submits a specified number of jobs, and verifies that notifications are received indicating that each job progressed through an expected set of states, in an expected order. Notifications from jobs not submitted by the script are ignored. Drive the script from a sharness test, on rank 0, and simultaneously across all ranks.
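The verification step of such a script might look roughly like the sketch below. It leaves out subscribing and job submission, so it does not use the Flux Python bindings. The `check_transitions` helper, the sample messages, and the expected state list are illustrative assumptions rather than the contents of the actual test script.

```python
def check_transitions(messages, jobids, expected):
    """Return True if every job in jobids saw exactly `expected`, in order.

    messages: iterable of batches, each a list of [jobid, state] pairs.
    Transitions for job IDs outside `jobids` are ignored.
    """
    seen = {jobid: [] for jobid in jobids}
    for batch in messages:
        for jobid, state in batch:
            if jobid in seen:
                seen[jobid].append(state)
    return all(states == expected for states in seen.values())


# Hypothetical received batches; job 7 was not submitted by the script.
messages = [
    [[1, "DEPEND"], [7, "DEPEND"], [2, "DEPEND"]],
    [[1, "SCHED"], [2, "SCHED"], [7, "SCHED"]],
]
ok = check_transitions(messages, jobids=[1, 2], expected=["DEPEND", "SCHED"])
out_of_order = check_transitions(
    [[[1, "SCHED"], [1, "DEPEND"]]], jobids=[1], expected=["DEPEND", "SCHED"]
)
```

Because the published array preserves order, the check can compare each job's observed sequence directly against the expected list.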