broker: recovery should not be aborted on messed up job directory #5147

garlick · 2023-05-04T01:56:35Z

Problem: a job with a missing eventlog causes flux to refuse to start with:

flux[273338]: job-manager.err[0]: restart failed: lookup job.0005.b398.c700.0400.eventlog: No such file o>

We should probably try to continue starting and ignore jobs that are incomplete.


garlick · 2023-05-04T13:58:02Z

This is a pretty easy fix, so I'll work up a quick PR.

Problem: if a few jobs get messed up in the KVS due to an improper shutdown, recovery is a tedious process involving starting flux in --recovery mode, fixing one job, and starting again.

When a job cannot be replayed from the KVS and the reason is that the directory is incomplete, log the failure at LOG_ERR level but let replay continue so that the flux restart ultimately succeeds.

If a job has more serious problems like incorrect content in the eventlog, treat that as a fatal error as before. This avoids breaking the 'valid' tests that check backwards compatibility with older kvs dumps, which might use an older eventlog format.

Update t2219-job-manage-restart.t to expect warnings rather than failure when such jobs are encountered during replay.

Fixes flux-framework#5147
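For readers who want the gist of the behavior described above, here is a minimal Python sketch of the idea. It is not the job-manager code (which is C) and does not use any flux API. The dict standing in for the KVS, the JSON-lines eventlog format, and every name in it (`load_job`, `replay_jobs`, `IncompleteJobError`, `CorruptJobError`) are assumptions made purely for illustration.

```python
import json
import logging

logger = logging.getLogger("replay-sketch")


class IncompleteJobError(Exception):
    """Hypothetical: the job directory lacks a required key (e.g. eventlog)."""


class CorruptJobError(Exception):
    """Hypothetical: the eventlog exists but its content cannot be used."""


def load_job(kvs, jobid):
    """Load one job's eventlog from a dict that stands in for the KVS."""
    key = f"job.{jobid}.eventlog"
    if key not in kvs:
        raise IncompleteJobError(f"lookup {key}: key is missing")
    events = []
    for line in kvs[key].splitlines():
        # Assumed format for this sketch: one JSON object per line
        # carrying at least a "name" field.
        try:
            event = json.loads(line)
        except json.JSONDecodeError as exc:
            raise CorruptJobError(f"{key}: {exc}") from exc
        if not isinstance(event, dict) or "name" not in event:
            raise CorruptJobError(f"{key}: entry has no event name")
        events.append(event)
    return events


def replay_jobs(kvs, jobids):
    """Replay every job; skip incomplete ones, fail on corrupt ones."""
    jobs = {}
    skipped = []
    for jobid in jobids:
        try:
            jobs[jobid] = load_job(kvs, jobid)
        except IncompleteJobError as exc:
            # Non-fatal: log at error level and keep replaying.
            logger.error("replay: skipping job %s: %s", jobid, exc)
            skipped.append(jobid)
        # CorruptJobError is deliberately not caught here, so bad
        # eventlog content still aborts the restart as before.
    return jobs, skipped
```

With this sketch, calling `replay_jobs` on a store where one job has no eventlog returns the remaining jobs plus the list of skipped IDs, while an unparseable eventlog still raises `CorruptJobError`.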

garlick self-assigned this May 4, 2023

garlick mentioned this issue May 4, 2023

job-manager: make some replay errors non-fatal #5150

Merged

chu11 closed this as completed in #5150 May 5, 2023
