job-exec: job takes longer than necessary to terminate after a node failure #5811

grondo · 2024-03-21T02:16:02Z

When the job-exec module handles a node failure on a critical rank, it raises a job exception but does not also notify the rank 0 job shell that a shell has gone away. This causes the rank 0 job shell to wait indefinitely in the output plugin for EOF from the missing shell.

This was partially fixed by #5780, but only for non-critical ranks.

Problem: When a critical job shell is lost, the job execution service raises a fatal job exception, but does not also notify the leader shell that another shell has been lost. This can lead to the leader shell waiting unnecessarily for EOF from the lost shell, which delays job exit.

Always notify the leader (shell rank 0) of lost shells. Tweak the call signature of lost_shell() to take a 'critical' parameter that indicates if the lost shell was critical or not to make the intent more obvious. Only raise a non-fatal (informational) job exception for non-critical shells, and raise a fatal job exception for critical shells.

Fixes flux-framework#5811
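The behavior described in the commit message above can be modeled roughly as follows. This is only an illustrative Python sketch of the decision logic, not the actual job-exec code: the `JobModel` class, its `notify_leader()` and `raise_exception()` methods, and the argument order of `lost_shell()` are all assumptions made for the example. Only the function name `lost_shell()` and the idea of a 'critical' parameter come from the text.

```python
from dataclasses import dataclass, field


@dataclass
class JobModel:
    """Hypothetical stand-in for per-job state in the execution service."""

    # Shell ranks reported as lost to the leader (shell rank 0).
    notifications: list = field(default_factory=list)
    # (fatal, note) tuples for each job exception raised.
    exceptions: list = field(default_factory=list)

    def notify_leader(self, shell_rank: int) -> None:
        # Assumed helper: records that the leader was told about a lost shell.
        self.notifications.append(shell_rank)

    def raise_exception(self, fatal: bool, note: str) -> None:
        # Assumed helper: records a job exception and whether it is fatal.
        self.exceptions.append((fatal, note))


def lost_shell(job: JobModel, critical: bool, shell_rank: int) -> None:
    """Handle a lost shell; the signature here is an assumption."""
    # Always notify the leader so it stops waiting for EOF from this shell.
    job.notify_leader(shell_rank)
    # Fatal exception for critical shells, non-fatal (informational) otherwise.
    job.raise_exception(
        fatal=critical,
        note=f"lost contact with job shell rank {shell_rank}",
    )


job = JobModel()
lost_shell(job, critical=False, shell_rank=3)
lost_shell(job, critical=True, shell_rank=1)
print(job.notifications)
print(job.exceptions)
```

In this model the leader is notified in both cases, and the 'critical' flag affects only the fatality of the exception, which is the intent the commit message describes.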

grondo mentioned this issue Mar 21, 2024

job-exec: improve cleanup after lost shell events #5813

Merged

mergify bot closed this as completed in #5813 Mar 21, 2024
