job-manager: make some replay errors non-fatal #5150
Conversation
LGTM, just one commit message typo
@@ -673,7 +673,7 @@ static int depthfirst_map_one (struct job_state_ctx *jsctx,
flux_future_t *f3 = NULL;
commit message "treads" -> "treats"
Something just occurred to me: other than a studious admin looking through the logs, how would anyone know there are bad jobs in the KVS? Will they just linger there forever? Although I didn't look into it thoroughly yet, I'm not sure. Not necessarily in this issue, but perhaps we need some, uhh, "alert"-ish way to know? Perhaps a `flux job kvs-status` or something? Edit: random thought, the job ids could be saved off into a "bad list", and
Great point!
Heh, like a lost+found? I like it.
I'll go ahead and set MWP on this one and we can think about your idea separately. Thanks for the review!
One of the runners was complaining about an uninitialized variable (I don't agree, but...) so I fixed that and force-pushed.
Anybody know what caused this Mergify failure?
I have no idea, I have not seen that one before...
@Mergifyio refresh
✅ Pull request refreshed
@Mergifyio rebase |
Problem: if depthfirst_map_one() fails to look up R, futures are leaked. Ensure that particular failure unwinds allocations like the others.
Problem: the job manager now treats a job that cannot be loaded from the KVS as a non-fatal error, but job-list still treats it as fatal. Relax error handling so that replay continues if a job cannot be loaded. Most likely the failure will have already been logged by the job manager, so no new logging is introduced here.
Problem: if a few jobs get messed up in the KVS due to an improper shutdown, recovery is a tedious process involving starting flux in --recovery mode, fixing one job, and starting again.

When a job cannot be replayed from the KVS and the reason is that the directory is incomplete, log the failure at LOG_ERR level but let replay continue and ultimately the flux restart be successful. If a job has more serious problems like incorrect content in the eventlog, treat that as a fatal error as before. This avoids breaking the 'valid' tests that check backwards compatibility with older kvs dumps, which might use an older eventlog format.

Update t2219-job-manage-restart.t to expect warnings rather than failure when such jobs are encountered during replay.

Fixes flux-framework#5147
✅ Branch has been successfully rebased
Codecov Report
@@ Coverage Diff @@
## master #5150 +/- ##
==========================================
- Coverage 83.14% 83.11% -0.03%
==========================================
Files 453 453
Lines 77777 77788 +11
==========================================
- Hits 64669 64657 -12
- Misses 13108 13131 +23
With Mergify down, going to hit the button.