
eliminate duplicate KVS restart in job-list and job-manager #5837

Merged
merged 22 commits into flux-framework:master on Mar 30, 2024

Conversation

garlick
Member

@garlick garlick commented Mar 27, 2024

Problem: job-list takes a long time to start when there are many jobs in the KVS.

Both job-list and job-manager have duplicate code for walking the KVS and recreating lists of active and inactive jobs. Furthermore, job-list requires workarounds to be sure that events from two sources (KVS and job manager journal) are not processed out of order for any given job.

This eliminates the KVS walk in job-list and instead changes the job manager journal so it can feed all the jobs (active and inactive) to job-list on startup. Furthermore, job-list processes the journal after the reactor starts so module loading should be quick.

Note that as submitted, this PR will break git bisect. I thought it was more important to split things up for review but we can squash some commits if need be.

It would be very good to get @chu11's review on this as job-list is somewhat unfamiliar territory for me. I guess the important thing to know is that the entire code path for sourcing events from the KVS is removed and everything goes through the one for "live" events. The job manager sends all of the eventlogs for existing jobs first, then begins processing new events, so order should be preserved. It does mean that even inactive jobs move through the job states and are transferred from list to list, whereas before I think they were identified as inactive at the beginning and placed directly in the inactive list.

(This is old work revived from March 2022.)

@garlick
Member Author

garlick commented Mar 27, 2024

Ah looks like some tests that expect job-list to be up to date right after a restart are failing in CI. Maybe it would be better if any requests received while the journal backlog is being processed are deferred until it is done. I'll have a look at that.

@garlick
Member Author

garlick commented Mar 27, 2024

Well the same job-list tests are failing in alpine as before, so I must've made a bad assumption about what is going on. However, I still think it's probably a good change to defer any requests received during initialization until after it completes, so I'd vote to keep those commits.

@grondo
Contributor

grondo commented Mar 28, 2024

If it is any help, I was testing on this branch and it doesn't appear flux jobs requests are paused until job-list is ready. The simple test case just runs 4096 jobs (probably could be fewer), then runs flux module reload job-list && flux jobs -nac0 | wc -l, which seems to indicate the results are returned before the backlog is fully processed. (I could be misunderstanding, so apologies if so!)

f(s=4,d=0,builddir) grondo@corona82:~/git/flux-core.git$ flux jobs -nac0 | wc -l
4096
f(s=4,d=0,builddir) grondo@corona82:~/git/flux-core.git$ flux module reload job-list && sleep 0.25  && flux jobs -nac0 | wc -l
2064
f(s=4,d=0,builddir) grondo@corona82:~/git/flux-core.git$ flux module reload job-list && sleep 0.25  && flux jobs -nac0 | wc -l
0

@garlick
Member Author

garlick commented Mar 28, 2024

My theory this morning is that the backlog is processed but R and jobspec are still being fetched.

I may be in over my head now - I added jobspec and R to the journal stream (since we have the redacted versions), but tracking down how to make job-list not fetch those if available in the journal is slow going.

@garlick
Member Author

garlick commented Mar 28, 2024

OK, for fun I pushed the commit that adds R and jobspec to the journal stream. @chu11 is there any chance you could grab this branch, review what I've done so far, and make some suggestions on how to make use of R and jobspec in the journal stream?

The first problem to solve is to make sure that when the jsctx->initialized flag is set, R and jobspec have been fully ingested.

A follow on question is whether we can use the enhanced journal content to avoid looking anything up in the KVS. I think we might need to add something to the journal for R and jobspec updates to get there, at least.

@chu11
Member

chu11 commented Mar 29, 2024

My theory this morning is that the backlog is processed but R and jobspec are still being fetched.

The first problem to solve is to make sure that when the jsctx->initialized flag is set, R and jobspec have been fully ingested.

Didn't do a full perusal of the PR but if my skimming of the commits is correct, the inactive jobs are being processed through the journal code just as though they were "live" jobs.

Your suspicion is a good one: R may not yet be loaded, or some events may not yet have been processed. Unfortunately, the tests that fail don't output what state the job is currently in; that'd be a good clue.

make some suggestions on how to make use of R and jobspec in the journal stream?

You can see this in process_state_transition_update(). You'll see the call chain for state_depend_lookup() and state_run_lookup(), where the lookups are done; after they complete, the job state is transitioned. So some of that code could be "collapsed" if you can get R/jobspec right away.

Of course, if we can always read jobspec/R from the journal instead of looking it up, some of that whole transition logic may be able to disappear (i.e. reading a resource-event no longer needs to wait for R to be retrieved). But we can probably deal with that another day.

Error: not ok 161 - verify nnodes preserved across restart
Error: not ok 165 - verify ranks/nodelist preserved across restart
Error: not ok 169 - verify task count preserved across restart
Error: not ok 176 - verify task count preserved across restart
Error: not ok 179 - verify task count preserved across restart
Error: not ok 182 - verify task count preserved across restart
Error: ERROR: t2260-job-list.t - exited with status 1

Some of these appear to be cut & paste errors in the tests, where it should be "verify expiration preserved across restart" (as one example). But anyway, they aren't being preserved for one reason or another, and they all appear to be R-related or eventlog entries later in the journal, so it does suggest that some stuff isn't loaded yet.

@chu11
Member

chu11 commented Mar 29, 2024

An alternate idea I thought of this morning: under the assumption that job-list is still processing things, we could simply put some waits into the tests for those particular jobs to become inactive. The fact that job-list is still loading things may not necessarily be something we care about?

i.e. just put some

flux job list-ids --wait-state=inactive ids ... in there for the ids we care about.

Problem: the message handler for job-list.stats-get handles a
response failure by sending an error response, which will also
likely fail.

Just log the error and proceed.
Problem: the job eventlog is needed for future changes to allow
inactive jobs to be obtained via the journal.

When creating a job from the KVS, simply keep the eventlog that is
passed into job_create_from_eventlog().  When creating a job from
scratch, create an empty one, then append events to the eventlog
each time one is posted via event_job_post_entry().
Problem: test code cannot look at the eventlog kept in memory
by the job manager.

Provide access via the job-manager.getattr RPC (key="eventlog").
@garlick
Member Author

garlick commented Mar 29, 2024

Thanks for the tips @chu11! I just pushed an attempt to eliminate the async lookups altogether and just process R and jobspec via the journal. It wasn't that complicated after all.

@garlick
Member Author

garlick commented Mar 29, 2024

Hmm, looks like the sched valgrind test has detected some malfeasance:

 ==19321== Thread 14:
  ==19321== Invalid read of size 1
  ==19321==    at 0x12117809: process_updates (job_state.c:547)
  ==19321==    by 0x12117C2E: job_transition_state (job_state.c:616)
  ==19321==    by 0x121180CA: journal_revert_job (job_state.c:637)
  ==19321==    by 0x121180CA: journal_process_event (job_state.c:1174)
  ==19321==    by 0x121180CA: journal_process_events (job_state.c:1204)
  ==19321==    by 0x12118D27: job_events_journal_continuation (job_state.c:1253)
  ==19321==    by 0x48A2C9A: ev_invoke_pending (ev.c:3770)
  ==19321==    by 0x48A6566: ev_run (ev.c:4190)
  ==19321==    by 0x48A6566: ev_run (ev.c:4021)
  ==19321==    by 0x4879307: flux_reactor_run (reactor.c:124)
  ==19321==    by 0x12117095: mod_main (job-list.c:245)
  ==19321==    by 0x11ACD5: module_thread (module.c:208)
  ==19321==    by 0x4B1E043: start_thread (pthread_create.c:442)
  ==19321==    by 0x4B9D85F: clone (clone.S:100)
  ==19321==  Address 0x1b87cb61 is 33 bytes inside a block of size 40 free'd
  ==19321==    at 0x484317B: free (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
  ==19321==    by 0x1211722C: job_update_destroy.part.0 (job_state.c:300)
  ==19321==    by 0x1212C554: fzlist_remove (zlist.c:317)
  ==19321==    by 0x121176CA: process_updates (job_state.c:548)
  ==19321==    by 0x12117808: process_state_transition_update (job_state.c:431)
  ==19321==    by 0x12117808: process_updates (job_state.c:540)
  ==19321==    by 0x12117C2E: job_transition_state (job_state.c:616)
  ==19321==    by 0x121180CA: journal_revert_job (job_state.c:637)
  ==19321==    by 0x121180CA: journal_process_event (job_state.c:1174)
  ==19321==    by 0x121180CA: journal_process_events (job_state.c:1204)
  ==19321==    by 0x12118D27: job_events_journal_continuation (job_state.c:1253)
  ==19321==    by 0x48A2C9A: ev_invoke_pending (ev.c:3770)
  ==19321==    by 0x48A6566: ev_run (ev.c:4190)
  ==19321==    by 0x48A6566: ev_run (ev.c:4021)
  ==19321==    by 0x4879307: flux_reactor_run (reactor.c:124)
  ==19321==    by 0x12117095: mod_main (job-list.c:245)
  ==19321==    by 0x11ACD5: module_thread (module.c:208)
  ==19321==    by 0x4B1E043: start_thread (pthread_create.c:442)
  ==19321==    by 0x4B9D85F: clone (clone.S:100)
  ==19321==  Block was alloc'd at
  ==19321==    at 0x48455EF: calloc (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
  ==19321==    by 0x12117BCF: job_update_create (job_state.c:309)
  ==19321==    by 0x12117BCF: add_state_transition (job_state.c:340)
  ==19321==    by 0x12117BCF: job_transition_state (job_state.c:607)
  ==19321==    by 0x121180CA: journal_revert_job (job_state.c:637)
  ==19321==    by 0x121180CA: journal_process_event (job_state.c:1174)
  ==19321==    by 0x121180CA: journal_process_events (job_state.c:1204)
  ==19321==    by 0x12118D27: job_events_journal_continuation (job_state.c:1253)
  ==19321==    by 0x48A2C9A: ev_invoke_pending (ev.c:3770)
  ==19321==    by 0x48A6566: ev_run (ev.c:4190)
  ==19321==    by 0x48A6566: ev_run (ev.c:4021)
  ==19321==    by 0x4879307: flux_reactor_run (reactor.c:124)
  ==19321==    by 0x12117095: mod_main (job-list.c:245)

@garlick
Member Author

garlick commented Mar 29, 2024

I think I found that one. Now I'm wondering if job->updates is needed at all anymore.

@grondo grondo left a comment
Contributor

Wow, great work! I tested this out and also took a quick pass through the PR, with only trivial comments.

LGTM!

Comment on lines 178 to 180
double t = flux_reactor_now (r) - t_start;

fsd_format_duration (fsd, sizeof (fsd), t);
Contributor

The timing can probably be dropped since a "begin sending journal backlog" message is logged and its timestamp can easily be compared with more precise resolution than fsd_format_duration(). e.g. in my testing even with 10K jobs this printed (0s)

[  +0.364148] job-manager[0]: begin sending journal backlog: 10240 jobs
[  +0.818333] job-manager[0]: finished sending journal backlog (0s)

@@ -217,10 +217,6 @@ int mod_main (flux_t *h, int argc, char **argv)
flux_log_error (h, "initialization error");
goto done;
}
if (job_state_init_from_kvs (ctx->jsctx) < 0) {
Contributor

commit message typo: s/sentinal/sentinel/

const flux_msg_t *msg;

while ((msg = flux_msglist_pop (ctx->deferred_requests))) {
if (flux_requeue (ctx->h, msg, FLUX_RQ_TAIL) < 0)
Contributor

Clever!

@garlick
Member Author

garlick commented Mar 30, 2024

Thanks! @chu11 did you want to give this a look before it gets merged?

I do have another commit that gets rid of job->updates and related machinery for deferring updates. Should I tack that on here or submit as a cleanup PR?

Problem: consumers of the journal have to scan the KVS, a costly
operation, to obtain the full set of job IDs and their job state,
then reconcile possibly overlapping events received in the journal.
This protocol can be simplified now that the job manager stores
inactive jobs and their eventlogs.

Change the journal response message payload from:
  {"events":[
    {"id":I "eventlog_seq":i "entry":o}
    {"id":I "eventlog_seq":i "entry":o}
    {"id":I "eventlog_seq":i "entry":o}
    ...
  ]}
to
  {"id":I "events":[
    o
    o
    o
    ...
  ]}
where 'o' is a RFC 18 eventlog entry.

Upon receipt of a request from a journal consumer, the job manager sends a
response message for each active job.  If the "full" flag is set in the
request, inactive jobs are also sent.  Once the events for existing jobs
have been sent, the job manager sends a sentinel response:
  {"id":FLUX_JOBID_ANY, "events":[]}
The purpose of the sentinel is to allow a journal consumer to delimit
the stream of historical events from new events, in case it wishes to
ingest all historical events as part of its initialization.

Following the sentinel, the job manager sends a journal response message for
each new event.

The per-job event sequence number and associated logic have been dropped,
since the potential for overlapping events between the KVS and journal
has now been eliminated.

The 'job-manager.journal-size-limit' config key is dropped since journal
streams are now generated directly rather than pushed through a shared list.
Problem: job-list sometimes takes a long time to initialize,
duplicating "KVS restart" work already performed by the job manager.

Drop the code for walking the KVS and processing each job's eventlog.
This was performed synchronously, delaying 'flux module load' and therefore
instance startup.  Instead, set full=true in the job-manager.journal RPC
and process the journal asynchronously.

The handy "pause" functionality which was added for testing is employed
to allow the initial onslaught of journal response messages to be consumed
(stored on a message list) quickly without processing.  Once the journal
sentinel is received, the backlog is processed all at once (without giving
control to the reactor during processing), then the module is "unpaused"
and begins processing job events normally.  job-list already employs an
open loop "eventually consistent" model so the fact that queries could be
received in parallel with receiving the journal backlog should pose no
correctness problems per se.

Since there is now only one source of job events, the problem of
synchronizing two event sources that was solved with a sequence number
is no longer a problem.  Drop the sequence number.

Now there is only one code path for processing events in job-list: the
one for processing "live" events.  Since the order of arrival of jobs
in the initial journal backlog is non-deterministic (based on order
of job manager hash traversal rather than temporal order), job-list cannot
rely on that order.  Use zlistx_insert() instead of zlistx_add_end() to
add jobs to the running and inactive lists.  The former inserts in the
defined sort order while the latter ignores the defined sort order.
Problem: the events_journal_stream test program still uses the
old journal protocol.

Update protocol.
Problem: t2210-job-manager-events-journal.t includes a test for the
now defunct event sequence numbers.

Drop test.
Problem: t2210-job-manager-events-journal includes tests for
job-manager.journal-size-limit, but that config key is no longer
supported.

Drop tests.
Problem: t2210-job-manager-events-journal watches the journal
but won't receive (initially) inactive jobs without specifying
full:true.

Add full:true to the job manager journal request.
Problem: two tests in t2260-job-list modify eventlogs in the KVS
then restart job-list, but that is no longer sufficient now that
job-list does not restart directly from the kvs.

Change the tests to reload the job manager also.
Problem: issues/t4331-job-manager-purged-events.sh ensures that
purged jobs are not listed in a new journal stream, but that is no
longer possible since journal streams are generated directly from
the current active and inactive job list for each consumer, not
from a shared list of events.

Drop test.
Problem: the journal-size-limit is no longer supported but it is
still documented in the man page.

Drop it from the man page.
Problem: requests received during the initial processing
of the journal backlog are processed without any job data.

Instead of using the existing 'pause' flag to defer initial journal
processing, add an 'initialized' flag.  The initialized flag can be used
by request handlers to defer processing until after the backlog has been
ingested.  Deferring requests while pause is true would break tests that
rely on events being paused but not requests.
Problem: the job_state code will need to call a function
that operates on the main context to process deferred requests,
but it doesn't have access to it.

Instead of passing in 'isctx', pass 'ctx', then change jsctx->isctx
uses to jsctx->ctx->isctx.
Problem: job-list requests should not be processed until the
module is fully initialized.

While jsctx->initialized is false, requests are placed on a queue.
When jsctx->initialized is set true after the initial journal backlog
processing, messages on the queue are requeued in the flux_t handle
for reactive processing.
Problem: job-list, the primary consumer of the job manager
journal, has to look up R and jobspec in the KVS, but the job
manager already has redacted versions of these in memory,
which should be suitable.

When generating the initial backlog, send them along if set in
the job structure.

When processing live events, send jobspec with the validate
event and R with the alloc event.
Problem: ENODATA is not treated specially (as per RFC 6) when
the journal RPC gets an error response.

Detect that case, log it, and stop the reactor without error.
Also, don't decode the eventlog initially, as it is not used yet.
Let any (unlikely) protocol errors be handled when the message is
processed.
@garlick
Member Author

garlick commented Mar 30, 2024

Well, I went ahead and pushed that extra commit and the fixups from @grondo's review. Can always lop it off if you like!

@grondo
Contributor

grondo commented Mar 30, 2024

Wow, that was nice cleanup! FWIW, last commit LGTM.

@garlick
Member Author

garlick commented Mar 30, 2024

Oops, just realized I prefixed a bunch of commits with job-info: instead of job-list:. Fixing.

Problem: when R and jobspec are obtained from the journal,
they do not need to be parsed from encoded json.

Add some new functions that operate directly on job->jobspec
and job->R, which should be set from the journal request
continuation:

job_parse_jobspec_cached()
job_parse_R_cached()
Problem: now that the journal provides copies of jobspec and R
(redacted), it is no longer necessary to request those artifacts
from job-info.

Short-circuit the asynchronous fetching of jobspec and R.
Process them immediately in the DEPEND and RUN states, respectively.
Problem: job-list no longer looks up R and jobspec asynchronously
so there is no need to keep a list of active futures.

Drop the list.
Problem: job-list now obtains the eventlog, R, and jobspec
exclusively from the job manager journal, but tests try to
elicit various errors by corrupting those objects in the KVS
and restarting job-list.

There is no way to get corrupted objects past the job manager,
so just drop those tests.
Problem: code still exists for deferring job updates until
job-info lookups of R or jobspec have completed, but now that
those objects are included in the journal this cannot happen.

Drop unnecessary code.
codecov bot commented Mar 30, 2024

Codecov Report

Merging #5837 (abdb9b6) into master (69711aa) will decrease coverage by 0.02%.
The diff coverage is 72.88%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5837      +/-   ##
==========================================
- Coverage   83.30%   83.29%   -0.02%     
==========================================
  Files         510      510              
  Lines       82905    82594     -311     
==========================================
- Hits        69065    68794     -271     
+ Misses      13840    13800      -40     
Files Coverage Δ
src/modules/job-manager/event.c 78.52% <33.33%> (-0.53%) ⬇️
src/modules/job-manager/getattr.c 63.76% <33.33%> (-1.39%) ⬇️
src/modules/job-manager/job.c 89.09% <75.00%> (-0.82%) ⬇️
src/modules/job-list/job_data.c 93.67% <58.33%> (-1.61%) ⬇️
src/modules/job-list/list.c 70.52% <58.82%> (-2.63%) ⬇️
src/modules/job-list/job-list.c 72.54% <46.66%> (-1.92%) ⬇️
src/modules/job-list/job_state.c 75.30% <81.33%> (-0.59%) ⬇️
src/modules/job-manager/journal.c 80.34% <77.27%> (+1.10%) ⬆️

... and 13 files with indirect coverage changes

@grondo
Contributor

grondo commented Mar 30, 2024

Sorry I missed that in review!

@garlick
Member Author

garlick commented Mar 30, 2024

NP! Well, maybe we ought to just merge this since it sounds like @chu11 might be offline for a bit, and we can ask him to look things over when he gets back. Given the amount of testing that relies on job-list working properly, I'm fairly confident that we're OK.

Setting MWP.

@mergify mergify bot merged commit 57ac53a into flux-framework:master Mar 30, 2024
34 of 35 checks passed
@garlick garlick deleted the inactive_job_journal branch March 30, 2024 18:03
@chu11
Member

chu11 commented Apr 1, 2024

I took a skim through the changes from this PR and things look good to me. Nothing obviously stands out. I love how much code was removed!
