t2812-flux-job-last.t sporadic hangs in CI #5815
Comments
Another occurrence reported in #5829. Also hung in the CI for #5817. This one seems to be getting pretty common. Last time I started looking at it, it didn't make any sense how a job in INACTIVE could get |
I made the following change to report the job state before and after:

diff --git a/src/modules/job-manager/restart.c b/src/modules/job-manager/restart.c
index 978409a0f..af3935dd3 100644
--- a/src/modules/job-manager/restart.c
+++ b/src/modules/job-manager/restart.c
@@ -259,6 +259,7 @@ done:
 static int restart_map_cb (struct job *job, void *arg, flux_error_t *error)
 {
     struct job_manager *ctx = arg;
+    flux_job_state_t state = job->state;
 
     if (zhashx_insert (ctx->active_jobs, &job->id, job) < 0) {
         errprintf (error,
@@ -272,7 +273,8 @@ static int restart_map_cb (struct job *job, void *arg, flux_error_t *error)
     wait_notify_active (ctx->wait, job);
     if (event_job_action (ctx->event, job) < 0) {
         flux_log_error (ctx->h,
-                        "replay warning: %s action failed on job %s",
+                        "replay warning: %s->%s action failed on job %s",
+                        flux_job_statetostr (state, "L"),
                         flux_job_statetostr (job->state, "L"),
                         idf58 (job->id));
     }
I was unable to reproduce the issue locally, but did reproduce it in CI with this change:
I'm still not exactly sure how we're hitting this specific error (somehow |
Problem: t2812-flux-job-last.t tests flux-dump/restore under the assumption all jobs are inactive at the time the KVS is dumped. However, if a job is still active, e.g. in CLEANUP, then this can result in a failure at restart with an EROFS error for the active job.

Since Flux does not yet have good support for restart with active jobs, and the test assumes all jobs are inactive at the time `flux dump` is called, ensure the assumption is true by calling `flux queue idle` before `flux dump`.

Fixes flux-framework#5815
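
For readers who want to see the ordering the commit message describes, here is a minimal shell sketch. It is not the actual change to t2812-flux-job-last.t: `dump_when_idle` is a hypothetical helper name and `dump.tgz` is an illustrative output file. The only real commands it relies on are `flux queue idle` and `flux dump`.

```shell
# Sketch only: dump_when_idle is a hypothetical helper, not part of flux-core
# and not the code from the actual fix.
dump_when_idle() {
    dumpfile=$1
    # Wait for the queue to go idle first, so that no job is still
    # active (e.g. in CLEANUP) when the KVS is dumped.
    flux queue idle || return 1
    # Only dump once the wait above has succeeded.
    flux dump "$dumpfile"
}
```

Inside a running instance this might be called as `dump_when_idle dump.tgz`. If the idle wait fails, the helper returns without dumping, so a restart is never attempted from a dump that may contain an active job.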
I've seen this one occur a couple times in CI.
t2812-flux-job-last.t gets stuck in CI here: