t2812-flux-job-last.t sporadic hangs in CI #5815

Closed
grondo opened this issue Mar 21, 2024 · 2 comments · Fixed by #5835
Comments

grondo commented Mar 21, 2024

I've seen this one occur a couple times in CI. t2812-flux-job-last.t gets stuck in CI here:

  ok 7 - flux job last N lists the last N jobs
  flux job last N lists the last N jobs

 expecting success: 
  	flux job last "[:]" >lastdump.exp &&
  	flux dump dump.tgz &&
  	flux start -o,-Scontent.restore=dump.tgz \
  		flux job last "[:]" >lastdump.out &&
  	test_cmp lastdump.exp lastdump.out
  
  
  flux-dump: archived 1 keys
  flux-dump: archived 2 keys
  flux-dump: archived 3 keys
  flux-dump: archived 4 keys
  flux-dump: archived 5 keys
  flux-dump: archived 6 keys
  flux-dump: archived 7 keys
  flux-dump: archived 8 keys
  flux-dump: archived 9 keys
  flux-dump: archived 57 keys
  Mar 21 17:35:35.080540 job-manager.err[0]: replay warning: INACTIVE action failed on job fHYCzL7: Read-only file system
  Mar 21 17:35:35.399411 job-manager.err[0]: sched.alloc-response: id=fHZgycT already allocated
grondo commented Mar 25, 2024

Another occurrence reported in #5829.

Also hung in CI for #5817. This one seems to be getting pretty common.

Last time I started looking at it, it didn't make any sense how a job in the INACTIVE state could get EROFS from calling event_job_action(). However, I did notice that the warning is printed using job->state after event_job_action() is called. The state before the action should probably be captured and used in the error message. Perhaps that would give some insight into the source of the error.

grondo commented Mar 27, 2024

I made the following change to report the job state before and after event_job_action() in restart_map_cb():

diff --git a/src/modules/job-manager/restart.c b/src/modules/job-manager/restart.c
index 978409a0f..af3935dd3 100644
--- a/src/modules/job-manager/restart.c
+++ b/src/modules/job-manager/restart.c
@@ -259,6 +259,7 @@ done:
 static int restart_map_cb (struct job *job, void *arg, flux_error_t *error)
 {
     struct job_manager *ctx = arg;
+    flux_job_state_t state = job->state;
 
     if (zhashx_insert (ctx->active_jobs, &job->id, job) < 0) {
         errprintf (error,
@@ -272,7 +273,8 @@ static int restart_map_cb (struct job *job, void *arg, flux_error_t *error)
         wait_notify_active (ctx->wait, job);
     if (event_job_action (ctx->event, job) < 0) {
         flux_log_error (ctx->h,
-                        "replay warning: %s action failed on job %s",
+                        "replay warning: %s->%s action failed on job %s",
+                        flux_job_statetostr (state, "L"),
                         flux_job_statetostr (job->state, "L"),
                         idf58 (job->id));
     }

I was unable to reproduce the issue locally, but did reproduce it in CI with this change:

  Mar 27 15:40:14.155704 job-manager.err[0]: replay warning: CLEANUP->INACTIVE action failed on job fHvSoZJ: Read-only file system
  Mar 27 15:40:14.491309 job-manager.err[0]: sched.alloc-response: id=fHztmQK already allocated

I'm still not exactly sure how we're hitting this specific error (somehow `job->eventlog_readonly` is set before the job is INACTIVE). However, this test assumes that all jobs are inactive before `flux dump` is called, and nothing in the test actually guarantees that. A `flux queue idle` before the dump will probably solve this particular issue.
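For reference, a sketch of how that fix might look in the test body quoted above (a hypothetical hunk; the actual change landed in #5835) would insert `flux queue idle` before `flux dump`:

```diff
 	flux job last "[:]" >lastdump.exp &&
+	flux queue idle &&
 	flux dump dump.tgz &&
 	flux start -o,-Scontent.restore=dump.tgz \
 		flux job last "[:]" >lastdump.out &&
 	test_cmp lastdump.exp lastdump.out
```

As I understand it, `flux queue idle` blocks until no jobs hold resources (e.g. jobs in RUN or CLEANUP), which would make the test's all-jobs-inactive assumption hold before the KVS is dumped.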

grondo added a commit to grondo/flux-core that referenced this issue Mar 27, 2024
Problem: t2812-flux-job-last.t tests flux-dump/restore under
the assumption all jobs are inactive at the time the KVS is
dumped. However, if a job is still active, e.g. in CLEANUP, then
this can result in a failure at restart with an EROFS error for the
active job.

Since Flux does not yet have good support for restart with active jobs,
and the test assumes all jobs are inactive at the time `flux dump`
is called, ensure the assumption is true by calling `flux queue idle`
before `flux dump`.

Fixes flux-framework#5815
mergify bot closed this as completed in #5835 on Mar 27, 2024