Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

job-manager: make some replay errors non-fatal #5150

Merged
merged 3 commits into from May 5, 2023

Conversation

garlick
Copy link
Member

@garlick garlick commented May 4, 2023

Problem: if a few jobs get messed up in the KVS due to an improper shutdown, recovery is a tedious process involving starting flux in --recovery mode, fixing one job, and starting again.

When a job cannot be replayed from the KVS and the reason is that the directory is incomplete, log the failure at LOG_ERR level but let replay continue and ultimately the flux restart be successful.

If a job has more serious problems like incorrect content in the eventlog, treat that as a fatal error as before. This avoids breaking the 'valid' tests that check backwards compatibility with older kvs dumps, which might use an older eventlog format.

Update t2219-job-manage-restart.t to expect warnings rather than failure when such jobs are encountered during replay.

Fixes #5147

Copy link
Member

@chu11 chu11 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just one commit message typo

@@ -673,7 +673,7 @@ static int depthfirst_map_one (struct job_state_ctx *jsctx,
flux_future_t *f3 = NULL;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

commit message "treads" -> "treats"

@chu11
Copy link
Member

chu11 commented May 4, 2023

something just occurred to me. Other than a studious admin looking through the logs, how would anyone know there are jobs that are bad in the KVS? Will they just linger there forever? Although I didn't look into it thoroughly yet, I'm not sure that flux job purge would do anything b/c the jobs aren't known in the job-manager? i.e. loaded into its data structures.

Not necessarily in this issue, but perhaps we need some uhh "alert"-ish way to know? perhaps a "flux job kvs-status" or something?

Edit: random thought, the job ids could be saved off into a "bad list", and flux job purge --special-bad-ids could handle it?

@garlick
Copy link
Member Author

garlick commented May 4, 2023

I'm not sure that flux job purge would do anything b/c the jobs aren't known in the job-manager? i.e. loaded into its data structures.

Great point!

Edit: random thought, the job ids could be saved off into a "bad list", and flux job purge --special-bad-ids could handle it?

Heh like a lost+found? I like it.

@garlick
Copy link
Member Author

garlick commented May 4, 2023

I'll go ahead and set MWP on this one and we can think about your idea separately. Thanks for the review!

@garlick
Copy link
Member Author

garlick commented May 5, 2023

One of the runners was complaining of an uninitialized variable (I don't agree but...) so fixed that and forced a push.

@garlick
Copy link
Member Author

garlick commented May 5, 2023

Anybody know what caused this mergeify failure?

The branch protection setting Require branches to be up to date before merging is not compatible with update_method=rebase if update_bot_account isn't set.

@grondo
Copy link
Contributor

grondo commented May 5, 2023

I have no idea, I have not seen that one before...

@garlick
Copy link
Member Author

garlick commented May 5, 2023

@Mergifyio refresh

@mergify
Copy link
Contributor

mergify bot commented May 5, 2023

refresh

✅ Pull request refreshed

@chu11
Copy link
Member

chu11 commented May 5, 2023

@Mergifyio rebase

Problem: if depthfirst_map_one() fails to look up R,
futures are leaked.

Ensure that particular failure unwinds allocations like
the others.
Problem: the job manager now treats jobs that cannot be loaded
from the KVS as a non-fatal error, but job-list treats them
as fatal still.

Relax error handling so that replay continues if a job cannot be
loaded.  Most likely the job will have already been logged by
the job manager so introduce no new logging here.
Problem: if a few jobs get messed up in the KVS due to an
improper shutdown, recovery is a tedious process involving
starting flux in --recovery mode, fixing one job, and starting
again.

When a job cannot be replayed from the KVS and the reason is
that the directory is incomplete, log the failure at LOG_ERR
level but let replay continue and ultimately the flux restart
be successful.

If a job has more serious problems like incorrect content in
the eventlog, treat that as a fatal error as before.  This
avoids breaking the 'valid' tests that check backwards
compatibility with older kvs dumps, which might use an older
eventlog format.

Update t2219-job-manage-restart.t to expect warnings rather
than failure when such jobs are encountered during replay.

Fixes flux-framework#5147
@mergify
Copy link
Contributor

mergify bot commented May 5, 2023

rebase

✅ Branch has been successfully rebased

@codecov
Copy link

codecov bot commented May 5, 2023

Codecov Report

Merging #5150 (fc5b7ab) into master (ebd4459) will decrease coverage by 0.03%.
The diff coverage is 88.23%.

@@            Coverage Diff             @@
##           master    #5150      +/-   ##
==========================================
- Coverage   83.14%   83.11%   -0.03%     
==========================================
  Files         453      453              
  Lines       77777    77788      +11     
==========================================
- Hits        64669    64657      -12     
- Misses      13108    13131      +23     
Impacted Files Coverage Δ
src/modules/job-list/job_state.c 74.22% <66.66%> (+0.38%) ⬆️
src/modules/job-manager/restart.c 79.81% <92.85%> (+0.60%) ⬆️

... and 14 files with indirect coverage changes

@chu11
Copy link
Member

chu11 commented May 5, 2023

with mergify down, going to hit the button

@chu11 chu11 merged commit b0baf4e into flux-framework:master May 5, 2023
29 of 30 checks passed
@garlick garlick deleted the issue#5147 branch March 1, 2024 14:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

broker: recovery should be not be aborted on messed up job directory
3 participants