eliminate duplicate KVS restart in job-list and job-manager #5837
Conversation
Ah, looks like some tests that expect job-list to be up to date right after a restart are failing in CI. Maybe it would be better if any requests received while the journal backlog is being processed were deferred until it is done. I'll have a look at that.
Force-pushed from 419310e to a873cdd
Well, the same job-list tests are failing in alpine as before, so I must've made a bad assumption about what is going on. However, I still think it's probably a good change to defer any requests received during initialization until after it completes, so I'd vote to keep those commits.
If it is any help, I was testing on this branch and it doesn't appear …
My theory this morning is that the backlog is processed but R and jobspec are still being fetched. I may be in over my head now: I added jobspec and R to the journal stream (since we have the redacted versions), but tracking down how to make job-list not fetch those if available in the journal is slow going.
Force-pushed from a873cdd to 4f85507
OK, for fun I pushed the commit that adds R and jobspec to the journal stream. @chu11 is there any chance you could grab this branch, review what I've done so far, and make some suggestions on how to make use of R and jobspec in the journal stream? The first problem to solve is to make sure that when the … A follow-on question is whether we can use the enhanced journal content to avoid looking anything up in the KVS. I think we might need to add something to the journal for R and jobspec updates to get there, at least.
Didn't do a full perusal of the PR, but if my skimming of the commits is correct, the inactive jobs are being processed through the journal code just as though they were "live" jobs. Your suspicion is a good one: R may not yet be loaded or some events have not yet been processed. Unfortunately, the tests that fail don't output what state the job is currently in; that'd be a good clue.
Of course, if we can always read jobspec/R from the journal instead of looking them up, some of that whole transition logic may be able to disappear.
Some of these appear to be cut & paste errors in the tests, where it should be "verify expiration preserved across restart" (as one example). Anyway, they aren't being preserved for one reason or another, and they all appear to be R-related or eventlog entries later in the journal, so it does suggest that some stuff isn't loaded yet.
Alternate idea I thought of this morning, under the assumption that … i.e. just put some …
Problem: the message handler for job-list.stats-get handles a response failure by sending an error response, which will also likely fail. Just log the error and proceed.
Problem: the job eventlog is needed for future changes to allow inactive jobs to be obtained via the journal. When creating a job from the KVS, simply keep the eventlog that is passed into job_create_from_eventlog(). When creating a job from scratch, create an empty one, then append events to the eventlog each time one is posted via event_job_post_entry().
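The in-memory eventlog idea above can be sketched as follows. This is a minimal Python model with invented names, not the actual C implementation: on restart the eventlog handed to job_create_from_eventlog() is kept, and for new jobs each event posted via event_job_post_entry() is also appended to the in-memory copy.

```python
class Job:
    """Toy model of the job manager keeping each job's eventlog in memory."""

    def __init__(self, eventlog=None):
        # Restart case: keep the eventlog passed in from the KVS.
        # New-job case: start with an empty eventlog.
        self.eventlog = list(eventlog) if eventlog else []

    def post_event(self, entry):
        # Analogue of event_job_post_entry(): every posted event is
        # also appended to the in-memory eventlog.
        self.eventlog.append(entry)

restarted = Job(eventlog=[{"name": "submit"}, {"name": "depend"}])
fresh = Job()
fresh.post_event({"name": "submit"})
print(len(restarted.eventlog), len(fresh.eventlog))  # 2 1
```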
Problem: test code cannot look at the eventlog kept in memory by the job manager. Provide access via the job-manager.getattr RPC (key="eventlog").
Force-pushed from 4f85507 to 17a4ee6
Thanks for the tips @chu11! I just pushed an attempt to eliminate the async lookups altogether and just process R and jobspec via the journal. It wasn't that complicated after all.
Hmm, looks like the sched valgrind test has detected some malfeasance:
Force-pushed from 17a4ee6 to e683a77
I think I found that one. Now I'm wondering if …
Wow, great work! I tested this out and also took a quick pass through the PR, with only trivial comments.
LGTM!
src/modules/job-manager/journal.c
Outdated
double t = flux_reactor_now (r) - t_start;

fsd_format_duration (fsd, sizeof (fsd), t);
The timing can probably be dropped since a "begin sending journal backlog" message is logged, and its timestamp can easily be compared with more precise resolution than fsd_format_duration(). E.g. in my testing, even with 10K jobs this printed (0s):
[ +0.364148] job-manager[0]: begin sending journal backlog: 10240 jobs
[ +0.818333] job-manager[0]: finished sending journal backlog (0s)
@@ -217,10 +217,6 @@ int mod_main (flux_t *h, int argc, char **argv)
        flux_log_error (h, "initialization error");
        goto done;
    }
-   if (job_state_init_from_kvs (ctx->jsctx) < 0) {
commit message typo: s/sentinal/sentinel/
const flux_msg_t *msg;

while ((msg = flux_msglist_pop (ctx->deferred_requests))) {
    if (flux_requeue (ctx->h, msg, FLUX_RQ_TAIL) < 0)
Clever!
Thanks! @chu11 did you want to give this a look before it gets merged? I do have another commit that gets rid of …
Problem: consumers of the journal have to scan the KVS, a costly operation, to obtain the full set of job IDs and their job state, then reconcile possibly overlapping events received in the journal. This protocol can be simplified now that the job manager stores inactive jobs and their eventlogs.

Change the journal response message payload from:

{"events":[
  {"id":I "eventlog_seq":i "entry":o}
  {"id":I "eventlog_seq":i "entry":o}
  {"id":I "eventlog_seq":i "entry":o}
  ...
]}

to:

{"id":I "events":[o o o ...]}

where 'o' is an RFC 18 eventlog entry. Upon receipt of a request by a journal consumer, the job manager sends a response message for each active job. If the "full" flag is set in the request, inactive jobs are sent also. Once the events for existing jobs have been sent, the job manager sends a sentinel response:

{"id":FLUX_JOBID_ANY, "events":[]}

The purpose of the sentinel is to allow a journal consumer to delimit the stream of historical events from new events, in case it wishes to ingest all historical events as part of its initialization. Following the sentinel, the job manager sends a journal response message for each new event.

The per-job event sequence number and associated logic have been dropped since the potential for overlapping events between KVS and journal has been eliminated. The 'job-manager.journal-size-limit' config key is dropped since journal streams are now generated directly rather than pushed through a shared list.
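The protocol change above can be illustrated with made-up payloads (field values are invented; FLUX_JOBID_ANY is shown here as -1 purely for illustration):

```python
# Old format: one "events" array mixing records from all jobs, each
# tagged with an id and a per-job sequence number.
old = {"events": [
    {"id": 1234, "eventlog_seq": 0, "entry": {"name": "submit"}},
    {"id": 1234, "eventlog_seq": 1, "entry": {"name": "depend"}},
]}

# New format: one response per job, events grouped under the job id,
# each event a bare RFC 18 eventlog entry.
new = {"id": 1234, "events": [{"name": "submit"}, {"name": "depend"}]}

# Sentinel delimiting historical events from new ones.
FLUX_JOBID_ANY = -1  # placeholder value for this sketch
sentinel = {"id": FLUX_JOBID_ANY, "events": []}

def is_sentinel(resp):
    # A consumer can switch from "ingest backlog" to "process live
    # events" mode when it sees the sentinel.
    return resp["id"] == FLUX_JOBID_ANY and not resp["events"]

print(is_sentinel(sentinel), is_sentinel(new))  # True False
```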
Problem: job-list sometimes takes a long time to initialize, duplicating "KVS restart" work already performed by the job manager.

Drop the code for walking the KVS and processing each job's eventlog. This was performed synchronously, delaying 'flux module load' and therefore instance startup. Instead, set full=true in the job-manager.journal RPC and process the journal asynchronously. The handy "pause" functionality which was added for testing is employed to allow the initial onslaught of journal response messages to be consumed (stored on a message list) quickly without processing. Once the journal sentinel is received, the backlog is processed all at once (without giving control to the reactor during processing), then the module is "unpaused" and begins processing job events normally.

job-list already employs an open-loop "eventually consistent" model, so the fact that queries could be received in parallel with receiving the journal backlog should pose no correctness problems per se.

Since there is now only one source of job events, the problem of synchronizing two event sources that was solved with a sequence number no longer exists. Drop the sequence number. Now there is only one code path for processing events in job-list: the one for processing "live" events.

Since the order of arrival of jobs in the initial journal backlog is non-deterministic (based on the order of job manager hash traversal rather than temporal order), job-list cannot rely on that order. Use zlistx_insert() instead of zlistx_add_end() to add jobs to the running and inactive lists. The former inserts in the defined sort order while the latter ignores the defined sort order.
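The sorted-insert point above can be sketched in a few lines. This is a hedged Python analogue (job names and times invented): the backlog arrives in hash order, so appending (the zlistx_add_end() analogue) preserves that scrambled order, while inserting by sort key (the zlistx_insert() analogue) keeps the list in temporal order.

```python
import bisect

# Backlog in non-deterministic hash-traversal order: (submit_time, name).
backlog = [(3.0, "job-c"), (1.0, "job-a"), (2.0, "job-b")]

appended = list(backlog)   # zlistx_add_end(): keeps arrival order
inserted = []              # zlistx_insert(): respects the sort order
for job in backlog:
    bisect.insort(inserted, job)

print([name for _, name in appended])  # ['job-c', 'job-a', 'job-b']
print([name for _, name in inserted])  # ['job-a', 'job-b', 'job-c']
```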
Problem: the events_journal_stream test program still uses the old journal protocol. Update protocol.
Problem: t2210-job-manager-events-journal.t includes a test for the now defunct event sequence numbers. Drop test.
Problem: t2210-job-manager-events-journal includes tests for job-manager.journal-size-limit, but that config key is no longer supported. Drop tests.
Problem: t2210-job-manager-events-journal watches the journal but won't receive (initially) inactive jobs without specifying full:true. Add full:true to the job manager journal request.
Problem: two tests in t2260-job-list modify eventlogs in the KVS then restart job-list, but that is no longer sufficient now that job-list does not restart directly from the kvs. Change the tests to reload the job manager also.
Problem: issues/t4331-job-manager-purged-events.sh ensures that purged jobs are not listed in a new journal stream, but that is no longer possible since journal streams are generated directly from the current active and inactive job list for each consumer, not from a shared list of events. Drop test.
Problem: the journal-size-limit is no longer supported but it is still documented in the man page. Drop it from the man page.
Problem: requests received during the initial processing of the journal backlog are processed without any job data. Instead of using the existing 'pause' flag to defer initial journal processing, add an 'initialized' flag. The initialized flag can be used by request handlers to defer processing until after the backlog has been ingested. Deferring requests while pause is true would break tests that rely on events being paused but not requests.
Problem: the job_state code will need to call a function that operates on the main context to process deferred requests, but it doesn't have access to it. Instead of passing in 'isctx', pass 'ctx', then change jsctx->isctx uses to jsctx->ctx->isctx.
Problem: job-list requests should not be processed until the module is fully initialized. While jsctx->initialized is false, requests are placed on a queue. When jsctx->initialized is set true after the initial journal backlog processing, messages on the queue are requeued in the flux_t handle for reactive processing.
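The defer-until-initialized pattern described above can be modeled as follows. This is a simplified Python sketch with invented names; in the actual module, messages are held on a flux_msglist and fed back with flux_requeue(..., FLUX_RQ_TAIL) for reactive processing.

```python
from collections import deque

class Module:
    """Toy model: queue requests until initialization completes."""

    def __init__(self):
        self.initialized = False
        self.deferred = deque()   # flux_msglist analogue
        self.handled = []

    def handle_request(self, msg):
        if not self.initialized:
            # Backlog still being ingested: park the request.
            self.deferred.append(msg)
            return
        self.handled.append(msg)

    def finish_backlog(self):
        # Journal backlog processed: flip the flag, then requeue the
        # parked messages through the normal handler (flux_requeue
        # analogue).
        self.initialized = True
        while self.deferred:
            self.handle_request(self.deferred.popleft())

m = Module()
m.handle_request("list-jobs")   # arrives early, gets deferred
m.finish_backlog()              # deferred request is now handled
m.handle_request("stats-get")   # handled immediately
print(m.handled)                # ['list-jobs', 'stats-get']
```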
Problem: job-list, the primary consumer of the job manager journal, has to look up R and jobspec in the KVS, but the job manager already has redacted versions of these in memory, which should be suitable. When generating the initial backlog, send them along if set in the job structure. When processing live events, send jobspec with the validate event and R with the alloc event.
Problem: ENODATA is not treated specially (as per RFC 6) when the journal RPC gets an error response. Detect that case, log it, and stop the reactor without error. Also, don't decode the eventlog initially, as it is not used yet. Let any (unlikely) protocol errors be handled when the message is processed.
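The ENODATA convention above (an RFC 6 streaming RPC ends with an ENODATA error response, which is normal termination rather than failure) can be sketched like this. The function name and return values are invented for illustration:

```python
import errno

def journal_continuation(errnum):
    # ENODATA marks the clean end of a streaming RPC: stop without error.
    if errnum == errno.ENODATA:
        return "stop-reactor"
    # Any other error response is a real failure.
    if errnum != 0:
        return "error"
    # Success response: process the journal event normally.
    return "process-event"

print(journal_continuation(errno.ENODATA),
      journal_continuation(0),
      journal_continuation(errno.EPROTO))
```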
Force-pushed from e683a77 to a8ab819
Well, I went ahead and pushed that extra commit and the fixups from @grondo's review. Can always lop it off if you like!
Wow, that was nice cleanup! FWIW, last commit LGTM.
Oops, just realized I prefixed a bunch of commits with …
Problem: when R and jobspec are obtained from the journal, they do not need to be parsed from encoded json. Add some new functions that operate directly on job->jobspec and job->R, which should be set from the journal request continuation: job_parse_jobspec_cached() job_parse_R_cached()
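The cached-parse idea above can be sketched in Python (names modeled loosely on job_parse_jobspec_cached(); the structure is invented for illustration): when the journal continuation has already stored the encoded object on the job, parse that copy instead of issuing a lookup.

```python
import json

class Job:
    def __init__(self, jobspec_json=None):
        # Set from the journal request continuation when available.
        self.jobspec_json = jobspec_json
        self.jobspec = None

def parse_jobspec_cached(job):
    # Analogue of job_parse_jobspec_cached(): operate directly on the
    # cached copy rather than fetching from job-info/KVS.
    if job.jobspec_json is None:
        raise ValueError("jobspec not provided by journal")
    job.jobspec = json.loads(job.jobspec_json)
    return job.jobspec

job = Job('{"tasks": [], "version": 1}')
print(parse_jobspec_cached(job)["version"])  # 1
```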
Problem: now that the journal provides copies of jobspec and R (redacted), it is no longer necessary to request those artifacts from job-info. Short-circuit the asynchronous fetching of jobspec and R. Process them immediately in the DEPEND and RUN states, respectively.
Problem: job-list no longer looks up R and jobspec asynchronously so there is no need to keep a list of active futures. Drop the list.
Problem: job-list now obtains the eventlog, R, and jobspec exclusively from the job manager journal, but tests try to elicit various errors by corrupting those objects in the KVS and restarting job-list. There is not a way to get corrupted objects past the job-manager, so just drop those tests.
Problem: code still exists for deferring job updates until job-info lookups of R or jobspec have completed, but now that those objects are included in the journal this cannot happen. Drop unnecessary code.
Force-pushed from a8ab819 to abdb9b6
Codecov Report
Additional details and impacted files

@@            Coverage Diff             @@
##           master    #5837      +/-   ##
==========================================
- Coverage   83.30%   83.29%   -0.02%
==========================================
  Files         510      510
  Lines       82905    82594     -311
==========================================
- Hits        69065    68794     -271
+ Misses      13840    13800      -40
Sorry I missed that in review!
NP! Well, maybe we ought to just merge this since it sounds like @chu11 might be offline for a bit, and we can ask him to look things over when he gets back. Given the amount of testing that relies on job-list working properly, I'm fairly confident that we're OK. Setting MWP.
I took a skim through the changes from this PR and things look good to me. Nothing obviously stands out. I love how much code was removed!
Problem: job-list takes a long time to start when there are many jobs in the KVS.
Both job-list and job-manager have duplicate code for walking the KVS and recreating lists of active and inactive jobs. Furthermore, job-list requires workarounds to be sure that events from two sources (KVS and job manager journal) are not processed out of order for any given job.
This eliminates the KVS walk in job-list and instead changes the job manager journal so it can feed all the jobs (active and inactive) to job-list on startup. Furthermore, job-list processes the journal after the reactor starts so module loading should be quick.
Note that as submitted, this PR will break git bisect. I thought it was more important to split things up for review but we can squash some commits if need be.
It would be very good to get @chu11's review on this as job-list is somewhat unfamiliar territory for me. I guess the important thing to know is that the entire code path for sourcing events from the KVS is removed and everything goes through the one for "live" events. The job manager sends all of the eventlogs for existing jobs first, then begins processing new events, so order should be preserved. It does mean that even inactive jobs move through the job states and are transferred from list to list, whereas before I think they were identified as inactive at the beginning and placed directly in the inactive list.
(This is old work revived from March 2022)