job-exec: job takes longer than necessary to terminate after a node failure #5811
Comments
grondo added a commit to grondo/flux-core that referenced this issue on Mar 21, 2024:

Problem: When a critical job shell is lost, the job execution service raises a fatal job exception, but does not also notify the leader shell that another shell has been lost. This can lead to the leader shell waiting unnecessarily for EOF from the lost shell, which delays job exit.

Always notify the leader (shell rank 0) of lost shells. Tweak the call signature of lost_shell() to take a 'critical' parameter that indicates if the lost shell was critical or not to make the intent more obvious. Only raise a non-fatal (informational) job exception for non-critical shells, and raise a fatal job exception for critical shells.

Fixes flux-framework#5811
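The control flow described in the commit message can be sketched roughly as follows. This is an illustrative Python model only, not the actual flux-core code (which is C): the `Job` class, its `notify_leader()` and `raise_exception()` methods, the string severity labels, and the order of the two steps are all assumptions made for the example. Only the name `lost_shell()` and its `critical` parameter come from the commit message.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    """Hypothetical stand-in for the per-job state kept by job-exec."""
    exceptions: list = field(default_factory=list)      # (severity, note) pairs
    leader_notices: list = field(default_factory=list)  # shell ranks reported lost

    def raise_exception(self, severity, note):
        # Assumed helper: record a job exception instead of posting a real one.
        self.exceptions.append((severity, note))

    def notify_leader(self, shell_rank):
        # Assumed helper: record that shell rank 0 was told about a lost shell.
        self.leader_notices.append(shell_rank)


def lost_shell(job, critical, shell_rank):
    """Handle a lost job shell (sketch of the behavior the commit describes)."""
    # Always notify the leader shell, so it stops waiting for EOF
    # from the shell that went away.
    job.notify_leader(shell_rank)

    # The severity of the job exception depends on whether the shell was critical.
    if critical:
        job.raise_exception("fatal", f"lost contact with critical shell rank {shell_rank}")
    else:
        job.raise_exception("non-fatal", f"lost contact with shell rank {shell_rank}")
```

In this model the leader is notified in both branches, which is the difference from the behavior reported in the issue, where only the non-critical case notified the rank 0 shell.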
When the job-exec module handles a node failure on a critical rank, it raises a job exception but does not also notify the rank 0 job shell that a shell has gone away. This causes the rank 0 job shell to wait indefinitely in the output plugin for EOF from the missing shell.
This was partially fixed by #5780, but only for non-critical ranks.