t2812-flux-job-last.t sporadic hangs in CI #5815

Closed
grondo opened this issue Mar 21, 2024 · 2 comments · Fixed by #5835
Comments

grondo commented Mar 21, 2024

I've seen this one occur a couple times in CI. t2812-flux-job-last.t gets stuck in CI here:

  ok 7 - flux job last N lists the last N jobs
  flux job last N lists the last N jobs

 expecting success: 
  	flux job last "[:]" >lastdump.exp &&
  	flux dump dump.tgz &&
  	flux start -o,-Scontent.restore=dump.tgz \
  		flux job last "[:]" >lastdump.out &&
  	test_cmp lastdump.exp lastdump.out
  
  
  flux-dump: archived 1 keys
  flux-dump: archived 2 keys
  flux-dump: archived 3 keys
  flux-dump: archived 4 keys
  flux-dump: archived 5 keys
  flux-dump: archived 6 keys
  flux-dump: archived 7 keys
  flux-dump: archived 8 keys
  flux-dump: archived 9 keys
  flux-dump: archived 57 keys
  Mar 21 17:35:35.080540 job-manager.err[0]: replay warning: INACTIVE action failed on job fHYCzL7: Read-only file system
  Mar 21 17:35:35.399411 job-manager.err[0]: sched.alloc-response: id=fHZgycT already allocated
grondo commented Mar 25, 2024

Another occurrence reported in #5829.

Also hung in CI for #5817. This one seems to be getting pretty common.

Last time I started looking at it, it didn't make any sense how a job in the INACTIVE state could get EROFS from calling event_job_action(). However, I did notice that the warning is printed using job->state after event_job_action() is called. The state before the action should probably be captured and used in the error message. Perhaps that would give some insight into the source of the error.

grondo commented Mar 27, 2024

I made the following change to report the job state before and after event_job_action() in restart_map_cb():

diff --git a/src/modules/job-manager/restart.c b/src/modules/job-manager/restart.c
index 978409a0f..af3935dd3 100644
--- a/src/modules/job-manager/restart.c
+++ b/src/modules/job-manager/restart.c
@@ -259,6 +259,7 @@ done:
 static int restart_map_cb (struct job *job, void *arg, flux_error_t *error)
 {
     struct job_manager *ctx = arg;
+    flux_job_state_t state = job->state;
 
     if (zhashx_insert (ctx->active_jobs, &job->id, job) < 0) {
         errprintf (error,
@@ -272,7 +273,8 @@ static int restart_map_cb (struct job *job, void *arg, flux_error_t *error)
         wait_notify_active (ctx->wait, job);
     if (event_job_action (ctx->event, job) < 0) {
         flux_log_error (ctx->h,
-                        "replay warning: %s action failed on job %s",
+                        "replay warning: %s->%s action failed on job %s",
+                        flux_job_statetostr (state, "L"),
                         flux_job_statetostr (job->state, "L"),
                         idf58 (job->id));
     }

I was unable to reproduce the issue locally, but did reproduce it in CI with this change:

  Mar 27 15:40:14.155704 job-manager.err[0]: replay warning: CLEANUP->INACTIVE action failed on job fHvSoZJ: Read-only file system
  Mar 27 15:40:14.491309 job-manager.err[0]: sched.alloc-response: id=fHztmQK already allocated

I'm still not exactly sure how we're hitting this specific error (somehow `job->eventlog_readonly` is set before the job is INACTIVE). However, this test assumes that all jobs are inactive before `flux dump` is called, and nothing in the test actually guarantees that. A `flux queue idle` before the dump will probably solve this particular issue.
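For reference, a sketch of how that fix might look in the test body quoted above (a hypothetical hunk; the actual change landed in #5835) would insert `flux queue idle` before `flux dump`:

```diff
 	flux job last "[:]" >lastdump.exp &&
+	flux queue idle &&
 	flux dump dump.tgz &&
 	flux start -o,-Scontent.restore=dump.tgz \
 		flux job last "[:]" >lastdump.out &&
 	test_cmp lastdump.exp lastdump.out
```

As I understand it, `flux queue idle` blocks until no jobs hold resources (e.g. jobs in RUN or CLEANUP), which would make the test's all-jobs-inactive assumption hold before the KVS is dumped.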

grondo added a commit to grondo/flux-core that referenced this issue Mar 27, 2024
Problem: t2812-flux-job-last.t tests flux-dump/restore under
the assumption all jobs are inactive at the time the KVS is
dumped. However, if a job is still active, e.g. in CLEANUP, then
this can result in a failure at restart with an EROFS error for the
active job.

Since Flux does not yet have good support for restart with active jobs,
and the test assumes all jobs are inactive at the time `flux dump`
is called, ensure the assumption is true by calling `flux queue idle`
before `flux dump`.

Fixes flux-framework#5815
mergify bot closed this as completed in #5835 on Mar 27, 2024