job-manager: make some replay errors non-fatal #5150

garlick · 2023-05-04T17:07:46Z

Problem: if a few jobs get messed up in the KVS due to an improper shutdown, recovery is a tedious process involving starting flux in --recovery mode, fixing one job, and starting again.

When a job cannot be replayed from the KVS and the reason is that the directory is incomplete, log the failure at LOG_ERR level but let replay continue so that the flux restart ultimately succeeds.

If a job has more serious problems like incorrect content in the eventlog, treat that as a fatal error as before. This avoids breaking the 'valid' tests that check backwards compatibility with older kvs dumps, which might use an older eventlog format.

Update t2219-job-manage-restart.t to expect warnings rather than failure when such jobs are encountered during replay.

Fixes #5147
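The replay policy described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code in restart.c: the `load_result` enum, `struct job_dir`, and `replay_jobs()` are hypothetical names, and `fprintf` stands in for logging at LOG_ERR level. The real job manager reads each job from the KVS.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical outcome of loading one job from the KVS. */
enum load_result {
    LOAD_OK,           /* job restored */
    LOAD_INCOMPLETE,   /* job directory is missing entries: skip the job */
    LOAD_BAD_EVENTLOG, /* eventlog content is invalid: abort the restart */
};

/* Hypothetical stand-in for one job directory found during replay. */
struct job_dir {
    unsigned long id;
    enum load_result result;
};

/* Replay all jobs.  Returns the number of jobs restored, or -1 on a
 * fatal error.  If 'skipped' is non-NULL and replay completes, it is
 * set to the number of jobs that were logged and skipped.
 */
int replay_jobs (const struct job_dir *dirs, size_t count, int *skipped)
{
    int restored = 0;
    int nskip = 0;

    assert (dirs != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        switch (dirs[i].result) {
            case LOAD_OK:
                restored++;
                break;
            case LOAD_INCOMPLETE:
                /* Non-fatal: report the job and keep going. */
                fprintf (stderr, "replay: job %lu: incomplete, skipping\n",
                         dirs[i].id);
                nskip++;
                break;
            case LOAD_BAD_EVENTLOG:
                /* Fatal, as before this change. */
                fprintf (stderr, "replay: job %lu: invalid eventlog\n",
                         dirs[i].id);
                return -1;
        }
    }
    if (skipped)
        *skipped = nskip;
    return restored;
}
```

With one incomplete job among three, this sketch restores the other two and reports one skipped job, while any job with an invalid eventlog still aborts the whole replay.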

chu11

LGTM, just one commit message typo

chu11 · 2023-05-04T17:14:16Z

src/modules/job-list/job_state.c

@@ -673,7 +673,7 @@ static int depthfirst_map_one (struct job_state_ctx *jsctx,
    flux_future_t *f3 = NULL;


commit message "treads" -> "treats"

chu11 · 2023-05-04T19:38:20Z

something just occurred to me. Other than a studious admin looking through the logs, how would anyone know there are jobs that are bad in the KVS? Will they just linger there forever? Although I didn't look into it thoroughly yet, I'm not sure that flux job purge would do anything b/c the jobs aren't known in the job-manager? i.e. loaded into its data structures.

Not necessarily in this issue, but perhaps we need some uhh "alert"-ish way to know? perhaps a "flux job kvs-status" or something?

Edit: random thought, the job ids could be saved off into a "bad list", and flux job purge --special-bad-ids could handle it?

garlick · 2023-05-04T21:43:44Z

> I'm not sure that flux job purge would do anything b/c the jobs aren't known in the job-manager? i.e. loaded into its data structures.

Great point!

> Edit: random thought, the job ids could be saved off into a "bad list", and flux job purge --special-bad-ids could handle it?

Heh like a lost+found? I like it.
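For discussion, the "bad list" idea might look something like the sketch below. Everything here is hypothetical: `struct bad_list` and its helpers do not exist in flux-core, and how such a list would be persisted or exposed to a purge command is left open. The sketch only shows remembering the IDs of jobs that replay skipped so that something can act on them later.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical "lost+found" list of job IDs that could not be replayed. */
struct bad_list {
    unsigned long *ids;
    size_t count;
    size_t capacity;
};

/* Append one job ID, growing the array as needed.
 * Returns 0 on success, -1 if memory could not be allocated.
 */
int bad_list_append (struct bad_list *bl, unsigned long id)
{
    assert (bl != NULL);
    if (bl->count == bl->capacity) {
        size_t ncap = bl->capacity ? bl->capacity * 2 : 8;
        unsigned long *tmp = realloc (bl->ids, ncap * sizeof (*tmp));
        if (!tmp)
            return -1;
        bl->ids = tmp;
        bl->capacity = ncap;
    }
    bl->ids[bl->count++] = id;
    return 0;
}

/* Return 1 if the ID was recorded as bad, 0 otherwise. */
int bad_list_contains (const struct bad_list *bl, unsigned long id)
{
    assert (bl != NULL);
    for (size_t i = 0; i < bl->count; i++) {
        if (bl->ids[i] == id)
            return 1;
    }
    return 0;
}

/* Release the list's storage and reset it to empty. */
void bad_list_clear (struct bad_list *bl)
{
    assert (bl != NULL);
    free (bl->ids);
    bl->ids = NULL;
    bl->count = 0;
    bl->capacity = 0;
}
```

A replay loop would call `bad_list_append()` wherever it currently only logs, and a later cleanup step could walk `ids` to remove the corresponding KVS directories.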

garlick · 2023-05-04T21:45:33Z

I'll go ahead and set MWP (merge-when-passing) on this one, and we can think about your idea separately. Thanks for the review!

garlick · 2023-05-05T14:00:12Z

One of the runners was complaining of an uninitialized variable (I don't agree but...) so fixed that and forced a push.

garlick · 2023-05-05T15:14:36Z

Anybody know what caused this Mergify failure?

> The branch protection setting `Require branches to be up to date before merging` is not compatible with `update_method=rebase` if `update_bot_account` isn't set.

grondo · 2023-05-05T15:34:33Z

I have no idea, I have not seen that one before...

garlick · 2023-05-05T16:12:39Z

@Mergifyio refresh

mergify · 2023-05-05T16:12:41Z

refresh

✅ Pull request refreshed

chu11 · 2023-05-05T22:39:08Z

@Mergifyio rebase

Problem: if depthfirst_map_one() fails to look up R, futures are leaked. Ensure that particular failure unwinds allocations like the others.

Problem: the job manager now treats jobs that cannot be loaded from the KVS as a non-fatal error, but job-list treats them as fatal still. Relax error handling so that replay continues if a job cannot be loaded. Most likely the job will have already been logged by the job manager so introduce no new logging here.

mergify · 2023-05-05T22:39:31Z

rebase

✅ Branch has been successfully rebased

codecov · 2023-05-05T23:25:22Z

Codecov Report

Merging #5150 (fc5b7ab) into master (ebd4459) will decrease coverage by 0.03%.
The diff coverage is 88.23%.

@@            Coverage Diff             @@
##           master    #5150      +/-   ##
==========================================
- Coverage   83.14%   83.11%   -0.03%     
==========================================
  Files         453      453              
  Lines       77777    77788      +11     
==========================================
- Hits        64669    64657      -12     
- Misses      13108    13131      +23

| Impacted Files | Coverage Δ | |
|---|---|---|
| src/modules/job-list/job_state.c | `74.22% <66.66%> (+0.38%)` | ⬆️ |
| src/modules/job-manager/restart.c | `79.81% <92.85%> (+0.60%)` | ⬆️ |

... and 14 files with indirect coverage changes

chu11 · 2023-05-05T23:29:16Z

with mergify down, going to hit the button

chu11 approved these changes May 4, 2023

garlick force-pushed the issue#5147 branch from d403aea to 702474e Compare May 4, 2023 17:51

garlick added the merge-when-passing label May 4, 2023

garlick force-pushed the issue#5147 branch from 702474e to 1ab9e41 Compare May 5, 2023 13:59

garlick mentioned this pull request May 5, 2023

rfc24: add repeat key to data event flux-framework/rfc#378

Merged

garlick added 3 commits May 5, 2023 22:39

job-list: fix memory leak on error path

2e23a37

Problem: if depthfirst_map_one() fails to look up R, futures are leaked. Ensure that particular failure unwinds allocations like the others.
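The kind of fix this commit message describes can be illustrated with the common C pattern of a single exit label that releases everything. The sketch below is an assumption-laden stand-in, not the job-list code: `struct future`, `future_create()`, `future_destroy()`, and `map_one()` are hypothetical, the `fail_at` parameter exists only to simulate a failed lookup, and treating the third lookup as the R lookup is a guess about ordering.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a future; live_futures counts the
 * allocations that have not yet been destroyed. */
struct future {
    int unused;
};
int live_futures = 0;

struct future *future_create (void)
{
    struct future *f = calloc (1, sizeof (*f));
    if (f)
        live_futures++;
    return f;
}

/* Like free(), destroying a NULL future is a no-op. */
void future_destroy (struct future *f)
{
    if (f) {
        assert (live_futures > 0);
        live_futures--;
        free (f);
    }
}

/* Perform three lookups for one job.  'fail_at' (1, 2, or 3) simulates
 * a failure of that lookup; 0 means all lookups succeed.
 * Returns 0 on success, -1 on failure.
 */
int map_one (int fail_at)
{
    struct future *f1 = NULL;
    struct future *f2 = NULL;
    struct future *f3 = NULL;
    int rc = -1;

    if (fail_at == 1 || !(f1 = future_create ()))
        goto done;
    if (fail_at == 2 || !(f2 = future_create ()))
        goto done;
    /* The last lookup must use the same exit path as the others.
     * Returning directly here would leak f1 and f2. */
    if (fail_at == 3 || !(f3 = future_create ()))
        goto done;
    rc = 0;
done:
    future_destroy (f3);
    future_destroy (f2);
    future_destroy (f1);
    return rc;
}
```

Because every failure jumps to `done`, `live_futures` returns to zero after each call whether or not a lookup failed.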

chu11 force-pushed the issue#5147 branch from 1ab9e41 to fc5b7ab Compare May 5, 2023 22:39

chu11 merged commit b0baf4e into flux-framework:master May 5, 2023
29 of 30 checks passed

garlick deleted the issue#5147 branch March 1, 2024 14:34
