job-exec: job takes longer than necessary to terminate after a node failure #5811

Closed
grondo opened this issue Mar 21, 2024 · 0 comments · Fixed by #5813

grondo commented Mar 21, 2024

When the job-exec module handles a node failure on a critical rank, it raises a job exception but does not notify the rank 0 job shell that a shell has gone away. As a result, the rank 0 job shell waits indefinitely in the output plugin for EOF from the missing shell.

This was partially fixed by #5780, but only for non-critical ranks.
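
To make the failure mode concrete, here is a minimal sketch of leader-side EOF bookkeeping. It is not flux-core code: `struct output_tracker` and the `tracker_*` functions are hypothetical names invented for illustration, and the real output plugin may track completion quite differently. The point is only that a shell on a failed node never sends EOF, so the leader can finish only if it is told the shell is lost.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_SHELLS 64

/* Hypothetical leader-side bookkeeping (assumption, not flux-core types). */
struct output_tracker {
    int nshells;
    bool done[MAX_SHELLS];  /* true once EOF was seen or the shell was reported lost */
};

static void tracker_init (struct output_tracker *t, int nshells)
{
    assert (t != NULL && nshells > 0 && nshells <= MAX_SHELLS);
    t->nshells = nshells;
    for (int i = 0; i < nshells; i++)
        t->done[i] = false;
}

/* Normal path: shell `rank` closed its output stream. */
static void tracker_eof (struct output_tracker *t, int rank)
{
    assert (t != NULL && rank >= 0 && rank < t->nshells);
    t->done[rank] = true;
}

/* Failure path: the leader is told that shell `rank` is gone, so no EOF
 * will ever arrive from it and it should stop waiting for one. */
static void tracker_shell_lost (struct output_tracker *t, int rank)
{
    assert (t != NULL && rank >= 0 && rank < t->nshells);
    t->done[rank] = true;
}

/* The leader may stop waiting only when every shell is accounted for. */
static bool tracker_complete (const struct output_tracker *t)
{
    assert (t != NULL);
    for (int i = 0; i < t->nshells; i++)
        if (!t->done[i])
            return false;
    return true;
}
```

In this model, if shell 2 of 3 dies and `tracker_shell_lost()` is never called for it, `tracker_complete()` stays false no matter how long the leader waits, which matches the hang described above.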

grondo added a commit to grondo/flux-core that referenced this issue Mar 21, 2024
Problem: When a critical job shell is lost, the job execution service
raises a fatal job exception, but does not also notify the leader
shell that another shell has been lost. This can lead to the leader
shell waiting unnecessarily for EOF from the lost shell, which delays
job exit.

Always notify the leader (shell rank 0) of lost shells. Tweak the
call signature of lost_shell() to take a 'critical' parameter that
indicates if the lost shell was critical or not to make the intent
more obvious. Only raise a non-fatal (informational) job exception
for non-critical shells, and raise a fatal job exception for critical
shells.

Fixes flux-framework#5811
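
The policy described in this commit message can be sketched roughly as follows. This is an illustrative model only: `struct lost_shell_actions` and `handle_lost_shell()` are hypothetical names, not the actual `lost_shell()` signature in flux-core, and the struct simply records which actions the commit message says should be taken.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical record of the actions taken for one lost shell
 * (assumption for illustration, not a flux-core type). */
struct lost_shell_actions {
    bool leader_notified;   /* shell rank 0 was told a shell is gone */
    bool exception_raised;  /* a job exception was raised */
    bool exception_fatal;   /* true: fatal, false: informational only */
};

/* Sketch of the post-fix policy from the commit message:
 *  - always notify the leader shell, critical or not
 *  - critical shell lost:     raise a fatal job exception
 *  - non-critical shell lost: raise a non-fatal (informational) one
 */
static void handle_lost_shell (bool critical, struct lost_shell_actions *out)
{
    assert (out != NULL);
    out->leader_notified = true;
    out->exception_raised = true;
    out->exception_fatal = critical;
}
```

Passing `critical` explicitly keeps the notification unconditional and confines the critical/non-critical distinction to the severity of the exception, which appears to be the intent of the signature change.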
@mergify mergify bot closed this as completed in #5813 Mar 21, 2024