job-manager: improve internal eventlog support #2025
This PR is definitely a WIP, but I wanted to get it out there for some early review if that's OK.
The job eventlog is a sort of fine grained synchronization facility for single jobs, and also a record of what has happened so far to a job sitting in the queue. In addition, the job manager implements a "restart" capability where active jobs are pulled in from the KVS and their job manager state recreated by replaying the job's eventlog.
The first thing this PR does is ensure that full in-memory job state can be recreated from the eventlog, as the number of events increases. The "restart" code is refactored so that replay occurs in a new constructor for a
The second part of this PR adds a proxy service in the job manager to allow guests to access their jobs' eventlogs by job id. There's a matching job.h api function and a
The API function:
/* Lookup eventlog for job, using job-manager as proxy to the KVS.
 * Response contains raw RFC 18 KVS eventlog string, which may be decoded
 * with flux_kvs_eventlog_decode().
 */
flux_future_t *flux_job_lookup_eventlog (flux_t *h, flux_jobid_t id, int flags);
Obviously a watch flag is the big missing piece here. I paused here partly because I felt we might want to do better with the way events are returned by that API call. It currently retrieves an eventlog snapshot in RFC 18 format, what you get from a
Problem: key=value attributes are not removed from the optional 'note' field of the raise RPC request. Depending on parser details, this risks allowing a user to override the exception's userid field. Reject raise requests with '=' in the note field. Verify this change in the job-manager sharness test.
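To illustrate the check described in this commit, here is a minimal sketch. The helper name `exception_context_encode()` and the context layout `type=T severity=N userid=U [note]` are assumptions made for the example, not code from the PR.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not from the PR): build an exception event context
 * of the assumed form "type=T severity=N userid=U [note]".
 * Returns 0 on success, -1 if the note contains '=' or buf is too small.
 */
int exception_context_encode (char *buf, size_t size, const char *type,
                              int severity, unsigned int userid,
                              const char *note)
{
    int n;

    assert (buf != NULL && type != NULL);

    /* A note such as "userid=0" could look like a second userid
     * attribute to a key=value parser, so refuse it outright.
     */
    if (note && strchr (note, '='))
        return -1;

    n = snprintf (buf, size, "type=%s severity=%d userid=%u%s%s",
                  type, severity, userid,
                  note ? " " : "", note ? note : "");
    if (n < 0 || (size_t)n >= size)
        return -1;
    return 0;
}
```

Rejecting the request is simpler than escaping or stripping the attribute, and it keeps the note free-form for every other character.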
Problem: userid and priority are written to separate keys in the KVS job schema. job-manager has to read these to recreate job state when it restarts. Add userid and priority to job eventlog "submit" event, so that the job manager can get info when it replays the eventlog and won't need to access the other keys.
Problem: job-manager unit tests are growing new dependencies on queue.o, job.o, and util.o. The current method of listing each for each test does not scale. Add those objects to the $(test_ldadd) macro.
Problem: code is growing little bits of one-off event context parsing code, which if we keep going that way will result in code duplication. Add some more generic util functions for parsing key=int, key=string attributes, and free form text following attributes. Add unit tests.
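For reviewers who want a concrete picture of what such helpers might look like, here is a rough sketch. The names `ctx_get_int()`, `ctx_get_str()`, and `ctx_get_note()` are hypothetical and chosen for this example; the sketch assumes a context is a space-delimited list of key=value tokens optionally followed by free-form text.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/* Return a pointer just past "key=" if a token starts with it, else NULL. */
static const char *find_value (const char *ctx, const char *key)
{
    size_t klen = strlen (key);
    const char *p = ctx;

    while (*p) {
        while (*p == ' ')
            p++;
        if (klen > 0 && !strncmp (p, key, klen) && p[klen] == '=')
            return p + klen + 1;
        while (*p && *p != ' ')     /* skip to the end of this token */
            p++;
    }
    return NULL;
}

/* Parse key=int. Returns 0 on success, -1 if missing or not an integer. */
int ctx_get_int (const char *ctx, const char *key, int *val)
{
    const char *s;
    char *endptr;
    long n;

    assert (ctx != NULL && key != NULL && val != NULL);
    if (!(s = find_value (ctx, key)))
        return -1;
    errno = 0;
    n = strtol (s, &endptr, 10);
    if (errno != 0 || endptr == s || (*endptr != ' ' && *endptr != '\0'))
        return -1;
    if (n < INT_MIN || n > INT_MAX)
        return -1;
    *val = (int)n;
    return 0;
}

/* Copy the value of key=string into buf. Returns 0 on success, -1 on error. */
int ctx_get_str (const char *ctx, const char *key, char *buf, size_t size)
{
    const char *s;
    size_t len;

    assert (ctx != NULL && key != NULL && buf != NULL);
    if (!(s = find_value (ctx, key)))
        return -1;
    len = strcspn (s, " ");
    if (len >= size)
        return -1;
    memcpy (buf, s, len);
    buf[len] = '\0';
    return 0;
}

/* Return the free-form text that starts at the first token without '=',
 * or NULL if the context consists only of key=value attributes.
 */
const char *ctx_get_note (const char *ctx)
{
    const char *p = ctx;

    assert (ctx != NULL);
    while (*p) {
        size_t len;
        while (*p == ' ')
            p++;
        len = strcspn (p, " ");
        if (len > 0 && !memchr (p, '=', len))
            return p;
        p += len;
    }
    return NULL;
}
```

With a context of `type=cancel severity=0 userid=1000 user request`, `ctx_get_int()` on `userid` yields 1000 and `ctx_get_note()` points at `user request`. Note that this style of parser is only safe if the free-form text can never contain '='.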
Problem: the code for recreating the queue from job data in the KVS is quickly becoming unreadable. Add job_create_from_eventlog(), which creates a struct job by replaying a snapshot of the job's KVS eventlog. RFC 21 documents that the job state can be recreated by replaying the eventlog. This function implies that this property extends to the whole in-memory job-manager state for a job. This depends on the "submit" event including the userid and initial priority, just added in an earlier commit to job-ingest. If those are missing, creation fails.
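The replay idea can be sketched roughly as follows. This is not the PR's job_create_from_eventlog(); `struct job_sketch` and `job_sketch_from_eventlog()` are stand-ins invented for the example, and it assumes each eventlog entry is a line of the form `timestamp name [context]` with a `submit` context of `userid=U priority=P` and a `priority` context of `priority=P`.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Simplified stand-in for the job manager's struct job (assumption). */
struct job_sketch {
    unsigned int userid;
    int priority;
    int submitted;      /* set once a "submit" event has been replayed */
};

/* Apply one event line "timestamp name [context]" to the job.
 * Returns 0 on success, -1 on a malformed or out-of-order event.
 */
static int replay_event (struct job_sketch *job, const char *line)
{
    double timestamp;
    char name[64];
    int off = 0;
    const char *ctx;

    if (sscanf (line, "%lf %63s%n", &timestamp, name, &off) < 2)
        return -1;
    ctx = line + off;
    while (*ctx == ' ')
        ctx++;

    if (!strcmp (name, "submit")) {
        if (job->submitted)
            return -1;
        /* Creation fails if userid or priority is missing. */
        if (sscanf (ctx, "userid=%u priority=%d",
                    &job->userid, &job->priority) != 2)
            return -1;
        job->submitted = 1;
    }
    else if (!job->submitted)
        return -1;              /* "submit" must come first */
    else if (!strcmp (name, "priority")) {
        if (sscanf (ctx, "priority=%d", &job->priority) != 1)
            return -1;
    }
    /* Other event names are ignored in this sketch. */
    return 0;
}

/* Rebuild job state by replaying every line of an eventlog snapshot. */
int job_sketch_from_eventlog (struct job_sketch *job, const char *eventlog)
{
    const char *p = eventlog;
    char line[256];

    assert (job != NULL && eventlog != NULL);
    memset (job, 0, sizeof (*job));
    while (*p) {
        size_t len = strcspn (p, "\n");
        if (len >= sizeof (line))
            return -1;
        memcpy (line, p, len);
        line[len] = '\0';
        if (len > 0 && replay_event (job, line) < 0)
            return -1;
        p += len;
        if (*p == '\n')
            p++;
    }
    return job->submitted ? 0 : -1;
}
```

Because every field is derived from events in order, a later `priority` event naturally overrides the value recorded at submit time, which is the property that makes a separate KVS key for priority unnecessary.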
Change the restart logic to use the newly added job_create_from_eventlog() constructor to recreate in-memory job state (solely) from the job's eventlog in the KVS. A side benefit of this cleanup is dropping two synchronous KVS lookups per job to fetch priority and userid, which should speed up restart. It also allows the one-off context parsing code in restart.c to be dropped.
Problem: util KVS helper functions only need the job id to construct a key, not the whole struct job. This limits the context in which these functions are usable to active jobs only. Change the function footprints and update tests and users.
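As a sketch of a key helper that depends only on the job id: the function name `job_key_sketch()` and the dotted-hex `job.active.XXXX.XXXX.XXXX.XXXX.key` layout below are assumptions for illustration and may not match the actual job schema.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: build a KVS key for a job from its id alone.
 * Assumes the id is rendered as four dot-separated 16-bit hex fields
 * under "job.active" or "job.inactive". If key is NULL, the job's
 * directory name is produced instead.
 * Returns 0 on success, -1 if buf is too small.
 */
int job_key_sketch (char *buf, size_t size, bool active,
                    uint64_t id, const char *key)
{
    int n;

    assert (buf != NULL);
    n = snprintf (buf, size, "job.%s.%04x.%04x.%04x.%04x%s%s",
                  active ? "active" : "inactive",
                  (unsigned int)((id >> 48) & 0xffff),
                  (unsigned int)((id >> 32) & 0xffff),
                  (unsigned int)((id >> 16) & 0xffff),
                  (unsigned int)(id & 0xffff),
                  key ? "." : "", key ? key : "");
    if (n < 0 || (size_t)n >= size)
        return -1;
    return 0;
}
```

Taking the id rather than a whole struct job means the same function can be called for jobs that no longer have in-memory state, such as inactive ones.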
Add a helper function for doing a KVS lookup on a key within the active or inactive directory of a job.
Just forced a push which drops what was described above as the second part of this PR, e.g.
The "first" part is good cleanup though, even if we end up replacing the job-manager util functions that parse the event contexts with functions in the public API, so here it is by itself.
@@            Coverage Diff             @@
##           master    #2025      +/-   ##
==========================================
- Coverage   80.58%   80.54%   -0.04%
==========================================
  Files         180      181       +1
  Lines       28914    29093     +179
==========================================
+ Hits        23300    23434     +134
- Misses       5614     5659      +45
Can't find any issues with this PR, though 1bdce58 makes me wonder if we need to adjust eventlog to make designing entry formats with extra data safer, or more robust against potential issues like this?
However, for now the approach taken seems fine, so I think this PR is ready to merge?