fix: Nodes with pods deleted out-of-band should be Errored, not Failed #2855
@wktmeow Would it be possible for you to test this? You can build your own images from this branch. Take a look at https://github.com/argoproj/argo/blob/master/docs/running-locally.md
Absolutely, I'll try to get to it today. Thank you for the quick fix!
@wktmeow Gentle bump :)
It works! Thank you for the quick fix, sorry for the delay in testing.
_, onExitPod := pod.Labels[common.LabelKeyOnExit]
if woc.wf.Spec.Shutdown == wfv1.ShutdownStrategyTerminate || (woc.wf.Spec.Shutdown == wfv1.ShutdownStrategyStop && !onExitPod) {
    newDeadline = &time.Time{}
}
This particular code fixes #2914
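For readers following along, the condition in the hunk above can be sketched in isolation. This is a minimal, self-contained illustration, not the controller's actual code: `ShutdownStrategy`, its constants, `onExitLabel`, and `shutdownDeadline` are hypothetical stand-ins for `wfv1.ShutdownStrategy`, `common.LabelKeyOnExit`, and the surrounding operator logic.

```go
package main

import (
	"fmt"
	"time"
)

// ShutdownStrategy is a local stand-in for wfv1.ShutdownStrategy (assumption).
type ShutdownStrategy string

const (
	ShutdownNone      ShutdownStrategy = ""
	ShutdownStop      ShutdownStrategy = "Stop"
	ShutdownTerminate ShutdownStrategy = "Terminate"
)

// onExitLabel is a hypothetical label key standing in for common.LabelKeyOnExit.
const onExitLabel = "example.io/on-exit"

// shutdownDeadline returns a zero-valued deadline when the pod should be
// stopped, and nil when it should be left alone.
// Terminate stops every pod; Stop spares pods that carry the onExit label.
func shutdownDeadline(strategy ShutdownStrategy, podLabels map[string]string) *time.Time {
	_, onExitPod := podLabels[onExitLabel]
	if strategy == ShutdownTerminate || (strategy == ShutdownStop && !onExitPod) {
		return &time.Time{}
	}
	return nil
}

func main() {
	regular := map[string]string{}
	onExit := map[string]string{onExitLabel: "true"}

	// Print whether each combination results in the pod being stopped.
	for _, s := range []ShutdownStrategy{ShutdownNone, ShutdownStop, ShutdownTerminate} {
		fmt.Printf("strategy=%q regular=%t onExit=%t\n",
			s,
			shutdownDeadline(s, regular) != nil,
			shutdownDeadline(s, onExit) != nil)
	}
}
```

The only case where the two strategies differ is a pod labelled as an onExit pod: under Stop it keeps running, under Terminate it is stopped like any other pod.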
LGTM
As discussed, we cannot delete Running pods. We rely on the executor to send SIGTERM to the main process so that the artifacts of the main container can be collected.
This is a conditional approval to remove the deletion of running pods, but an approval of other changes.
- newPhase = wfv1.NodeFailed
- message = "pod termination"
+ newPhase = wfv1.NodeError
+ message = "pod deleted during operation"
I believe this needs to be reverted as well, since you will be reverting the logic for deleting running pods.
This change is unrelated. It makes sure that Nodes whose pods are deleted out-of-band get marked as Error instead of Failed.
It's simply a node label change
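To make the effect of this hunk concrete, here is a rough sketch of the phase change for a node whose pod has disappeared. It is an illustration under stated assumptions only: `NodePhase`, its constants, and `reconcileMissingPod` are hypothetical local stand-ins for the `wfv1` node phases and the controller's reconciliation code, and the rule that already-completed nodes are left untouched is an assumption of this sketch, not something stated in the PR.

```go
package main

import "fmt"

// NodePhase is a local stand-in for the wfv1 node phase type (assumption).
type NodePhase string

const (
	NodeRunning   NodePhase = "Running"
	NodeSucceeded NodePhase = "Succeeded"
	NodeFailed    NodePhase = "Failed"
	NodeError     NodePhase = "Error"
)

// completed reports whether the node has already reached a terminal phase.
func (p NodePhase) completed() bool {
	return p == NodeSucceeded || p == NodeFailed || p == NodeError
}

// reconcileMissingPod returns the phase and message for a node given whether
// its pod was found. A non-completed node whose pod is gone is marked Error
// (not Failed), with the message used in the diff above.
func reconcileMissingPod(current NodePhase, podFound bool) (NodePhase, string) {
	if podFound || current.completed() {
		return current, ""
	}
	return NodeError, "pod deleted during operation"
}

func main() {
	phase, msg := reconcileMissingPod(NodeRunning, false)
	fmt.Printf("%s: %s\n", phase, msg)
}
```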
Kudos, SonarCloud Quality Gate passed! 0 Bugs
This PR contains 3 fixes:
1. (Will no longer be fixed here, see #2855 (review).) When stopping Running pods due to Shutdown or ActiveDeadlineSeconds, we should first attempt to delete the pods using the K8s API so that they go through a graceful shutdown, just as is done for Pending pods. If this API-based stop does not work, we can always fall back to the usual method of execution control applied directly to the pod.
2. We also fix the method of execution control to account for the difference between Stop and Terminate. Fixes "argo stop can result in errored workflow" #2914.
3. We also fix an issue where a Node whose Pods are deleted outside of Argo (i.e., not by logic such as Shutdown) gets marked as Failed when it should really be "Errored". Fixes "Preempted nodes are treated as Fails but should be Errors" #2881.