broker: recovery should not be aborted on messed up job directory #5147
Comments
This is a pretty easy fix, so I'll work up a quick PR.
garlick added a commit to garlick/flux-core that referenced this issue on May 4, 2023
Problem: if a few jobs get messed up in the KVS due to an improper shutdown, recovery is a tedious process involving starting flux in --recovery mode, fixing one job, and starting again.

When a job cannot be replayed from the KVS and the reason is that the directory is incomplete, log the failure at LOG_ERR level but let replay continue so that the flux restart ultimately succeeds. If a job has more serious problems, like incorrect content in the eventlog, treat that as a fatal error as before. This avoids breaking the 'valid' tests that check backwards compatibility with older kvs dumps, which might use an older eventlog format.

Update t2219-job-manage-restart.t to expect warnings rather than failure when such jobs are encountered during replay.

Fixes flux-framework#5147
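The policy described in the commit message can be illustrated with a small, self-contained sketch. This is not the flux-core code (which is written in C); the KVS is modeled as a plain dict, the eventlog as a list of event names, and `JobReplayError`, `replay_job`, and `replay_all` are hypothetical names invented for the illustration. It also assumes that an incomplete job directory can be told apart from bad eventlog content by an errno-style code (`ENOENT` versus `EINVAL`).

```python
import errno
import logging

log = logging.getLogger("job-replay-sketch")


class JobReplayError(Exception):
    """Hypothetical error raised when a single job cannot be replayed."""

    def __init__(self, jobid, errnum, message):
        super().__init__(f"job {jobid}: {message}")
        self.jobid = jobid
        self.errnum = errnum


def replay_job(jobid, entry):
    """Replay one job from a toy KVS entry (a dict standing in for the job directory)."""
    if "eventlog" not in entry:
        # Incomplete directory, e.g. left behind by an improper shutdown.
        raise JobReplayError(jobid, errno.ENOENT, "eventlog is missing")
    events = entry["eventlog"]
    if not events or events[0] != "submit":
        # Assumption for this sketch: a valid eventlog starts with "submit".
        raise JobReplayError(jobid, errno.EINVAL, "eventlog content is invalid")
    return {"id": jobid, "events": list(events)}


def replay_all(kvs):
    """Replay every job; skip incomplete ones, fail on anything more serious."""
    jobs, skipped = [], []
    for jobid, entry in sorted(kvs.items()):
        try:
            jobs.append(replay_job(jobid, entry))
        except JobReplayError as exc:
            if exc.errnum != errno.ENOENT:
                raise  # serious problem: still fatal, as before
            log.error("replay: skipping %s", exc)
            skipped.append(jobid)
    return jobs, skipped
```

With this shape, a restart loses only the jobs whose directories are incomplete, and each one is reported at error level, while a malformed eventlog still stops the restart so it can be investigated.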
garlick added a commit to garlick/flux-core that referenced this issue on May 5, 2023
chu11 pushed a commit to garlick/flux-core that referenced this issue on May 5, 2023
Problem: a job with a missing eventlog causes flux to refuse to start with
We should probably try to continue starting and ignore jobs that are incomplete.