fix: workflow stuck in running state when using activeDeadlineSeconds on template level. Fixes: #12329 #12761
Conversation
force-pushed from 98c4206 to 0133ffe
… on template level. Fixes: argoproj#12329 Signed-off-by: shuangkun <tsk2013uestc@163.com>
force-pushed from 0133ffe to cd5a7f0
Signed-off-by: shuangkun <tsk2013uestc@163.com>
I've started looking at this, so I assigned it to myself, but if anyone feels they know the code well here and prefers to reassign it to themselves, I'm good. :) |
Can you help me understand the sequence of events that led to the Issue? So, the containers were in "Pending" state and then the Pod's activeDeadlineSeconds was reached. I see at that point from the documentation that Kubernetes "will actively try to mark it failed and kill associated containers". Do we understand what's preventing Kubernetes from marking it Terminated at that point? |
Kubernetes has marked it as Failed, but when Argo recorded the pod it still reset the node from Failed back to Pending, because the wait container's Terminated state is nil. |
I think it's because the wait container never got to run, so it can never reach the Terminated state. |
Thanks for the links. Okay, I see from the first one that it links to this official documentation which says "A container in the Terminated state began execution and then either ran to completion or failed for some reason." So, you're right that if it didn't begin execution, it seems it should just stay Pending I guess. |
Part of me is wondering what Kubernetes will do in this case - should it just leave it in the Pending state? In any case, though, maybe your change is reasonable - if the purpose of that code is to make sure we save off any outputs from the wait container, but we hit the activeDeadlineSeconds (or terminated for some other reason), then maybe we don't want to allow the container to get into the Running state in the first place I suppose. |
Although, I guess this code doesn't prevent it from getting into a Running state, does it? It just presumes that it won't get into a Running state, right? (and if it did, presumably we won't wait to save the outputs) |
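To make the condition discussed in the last few comments concrete, here is a minimal sketch (not the controller's actual code) of detecting a pod that Kubernetes has already failed while its wait sidecar was still initializing. The container name "wait" and the helper name waitContainerNeverRan are assumptions for illustration; only the upstream corev1 types are used.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// waitContainerNeverRan reports whether the pod has already failed while its
// "wait" sidecar is still waiting, i.e. the sidecar never began execution and
// therefore can never reach the Terminated state.
func waitContainerNeverRan(pod *corev1.Pod) bool {
	if pod.Status.Phase != corev1.PodFailed {
		return false
	}
	for _, cs := range pod.Status.ContainerStatuses {
		if cs.Name != "wait" {
			continue
		}
		// Terminated stays nil because the container never started; its
		// state is still Waiting (e.g. reason: PodInitializing).
		return cs.State.Terminated == nil && cs.State.Waiting != nil
	}
	return false
}

func main() {
	// A pod shaped like the status pasted later in this thread.
	pod := &corev1.Pod{
		Status: corev1.PodStatus{
			Phase: corev1.PodFailed,
			ContainerStatuses: []corev1.ContainerStatus{{
				Name: "wait",
				State: corev1.ContainerState{
					Waiting: &corev1.ContainerStateWaiting{Reason: "PodInitializing"},
				},
			}},
		},
	}
	fmt.Println(waitContainerNeverRan(pod)) // true
}
```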
Can you help point me to where in the code after this point that this |
If the pod has Failed, Kubernetes won't update it again.

  - image: quay.io/argoproj/argoexec:latest
    imageID: ""
    lastState: {}
    name: wait
    ready: false
    restartCount: 0
    started: false
    state:
      waiting:
        reason: PodInitializing
  hostIP: 172.18.0.2
  initContainerStatuses:
  - containerID: containerd://66eb4112a8061ff6da5f48b0c15b9c5c5ccf70f8a4ffa079d258b4315b4bbf74
    image: quay.io/argoproj/argoexec:latest
    imageID: quay.io/argoproj/argoexec@sha256:61c8e55d00437565a2823c802cece0b3323aed758e870a5055768337a87d3546
    lastState: {}
    name: init
    ready: true
    restartCount: 0
    state:
      terminated:
        containerID: containerd://66eb4112a8061ff6da5f48b0c15b9c5c5ccf70f8a4ffa079d258b4315b4bbf74
        exitCode: 0
        finishedAt: "2024-03-13T05:34:25Z"
        reason: Completed
        startedAt: "2024-03-13T05:34:21Z"
  message: Pod was active on the node longer than the specified deadline
  phase: Failed

So if Argo doesn't update the node phase, the node always shows Pending in the workflow, and the workflow is stuck:

  boundaryID: memoized-bug-4nh77
  displayName: fanout(3:4)
  finishedAt: null
  hostNodeName: kind-control-plane
  id: memoized-bug-4nh77-4195166634
  inputs:
    parameters:
    - name: item
      value: "4"
  message: Pod was active on the node longer than the specified deadline
  name: memoized-bug-4nh77[0].fanout(3:4)
  phase: Pending
  progress: 0/1
  startedAt: "2024-03-13T05:34:20Z"
  templateName: echo
  templateScope: local/memoized-bug-4nh77
  type: Pod
Signed-off-by: shuangkun <tsk2013uestc@163.com>
Yes, it does not prevent pod Running. |
force-pushed from fd5334d to 86c2f4e
Okay, so the code being changed isn't particular to failed pods, but to any pod for which … If I try to look at those cases myself, I believe:
If I look at the original bug, I see the InitContainer succeeded, and the wait and main containers are in Pending. So, how did we get to new.Phase.Completed? |
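Whatever type new.Phase.Completed() is defined on in the controller, the idea behind a Completed()-style check is a phase that has reached a terminal value. As a rough, assumed illustration over the Kubernetes pod phase (not the project's actual helper):

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// podPhaseCompleted is an assumed stand-in for a Completed()-style predicate:
// the phase is terminal, whether the pod succeeded or failed.
func podPhaseCompleted(phase corev1.PodPhase) bool {
	return phase == corev1.PodSucceeded || phase == corev1.PodFailed
}

func main() {
	// The pod status pasted earlier in this thread reports phase: Failed,
	// so it counts as completed under this reading.
	fmt.Println(podPhaseCompleted(corev1.PodFailed)) // true
}
```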
There seems to be something wrong with this description.
The Pod from yesterday is still in this state now. K8s version is v1.25.3, Argo is latest. Phase is Failed.

  status:
    conditions:
    - lastProbeTime: null
      lastTransitionTime: "2024-03-13T05:34:26Z"
      status: "True"
      type: Initialized
    - lastProbeTime: null
      lastTransitionTime: "2024-03-13T05:34:20Z"
      reason: PodFailed
      status: "False"
      type: Ready
    - lastProbeTime: null
      lastTransitionTime: "2024-03-13T05:34:20Z"
      reason: PodFailed
      status: "False"
      type: ContainersReady
    - lastProbeTime: null
      lastTransitionTime: "2024-03-13T05:34:20Z"
      status: "True"
      type: PodScheduled
    containerStatuses:
    - image: alpine:latest
      imageID: ""
      lastState: {}
      name: main
      ready: false
      restartCount: 0
      started: false
      state:
        waiting:
          reason: PodInitializing
    - image: quay.io/argoproj/argoexec:latest
      imageID: ""
      lastState: {}
      name: wait
      ready: false
      restartCount: 0
      started: false
      state:
        waiting:
          reason: PodInitializing
    hostIP: 172.18.0.2
    initContainerStatuses:
    - containerID: containerd://66eb4112a8061ff6da5f48b0c15b9c5c5ccf70f8a4ffa079d258b4315b4bbf74
      image: quay.io/argoproj/argoexec:latest
      imageID: quay.io/argoproj/argoexec@sha256:61c8e55d00437565a2823c802cece0b3323aed758e870a5055768337a87d3546
      lastState: {}
      name: init
      ready: true
      restartCount: 0
      state:
        terminated:
          containerID: containerd://66eb4112a8061ff6da5f48b0c15b9c5c5ccf70f8a4ffa079d258b4315b4bbf74
          exitCode: 0
          finishedAt: "2024-03-13T05:34:25Z"
          reason: Completed
          startedAt: "2024-03-13T05:34:21Z"
    message: Pod was active on the node longer than the specified deadline
    phase: Failed
    podIP: 10.244.0.123
    podIPs:
    - ip: 10.244.0.123
    qosClass: Burstable
    reason: DeadlineExceeded
    startTime: "2024-03-13T05:34:20Z"
force-pushed from 86c2f4e to 61611e3
Interesting. Thanks for sending that. Okay, so the Pod is clearly in a failed state. So, going through my cases above of Phase==Complete:
We won't hold off any longer if for some reason the wait container is in Pending. I can't really think of a case where this would be a problem. I will probably approve this - may just want to take another look later with fresh eyes. |
Co-authored-by: Julie Vogelman <julievogelman0@gmail.com> Signed-off-by: shuangkun tian <72060326+shuangkun@users.noreply.github.com>
Thanks! |
Backported cleanly to |
… on template level. Fixes: argoproj#12329 (argoproj#12761) Signed-off-by: shuangkun <tsk2013uestc@163.com> Signed-off-by: shuangkun tian <72060326+shuangkun@users.noreply.github.com> Co-authored-by: Julie Vogelman <julievogelman0@gmail.com> (cherry picked from commit 16cfef9)
Fixes: #12329
The root cause: when the pod is set to Failed from Pending, the wait container's Terminated state is nil, so the node is reset to Pending and the workflow gets stuck.
Motivation
Let the workflow fail when activeDeadlineSeconds is exceeded.
Modifications
Set the node to Failed when the pod transitions from Pending to Failed (letting the workflow fail).
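A hedged sketch of the behavior described above, not the actual diff: when the pod is already Failed (for example because activeDeadlineSeconds was exceeded while the containers were still initializing), the node should follow the pod to Failed instead of being reset to Pending to wait for a wait-container termination that can never come. The NodePhase values and the nodePhaseFor helper are simplified stand-ins for the project's real types.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// NodePhase is a simplified stand-in for the workflow node phase.
type NodePhase string

const (
	NodePending   NodePhase = "Pending"
	NodeRunning   NodePhase = "Running"
	NodeSucceeded NodePhase = "Succeeded"
	NodeFailed    NodePhase = "Failed"
)

// nodePhaseFor maps an observed pod phase to the phase the workflow node
// should take. The branch relevant to this PR is PodFailed: a failed pod
// fails the node even if the wait container never started, so the workflow
// can finish instead of hanging.
func nodePhaseFor(pod *corev1.Pod) NodePhase {
	switch pod.Status.Phase {
	case corev1.PodSucceeded:
		return NodeSucceeded
	case corev1.PodFailed:
		// Don't hold the node in Pending waiting for wait-container outputs:
		// the pod already carries "Pod was active on the node longer than
		// the specified deadline" and its containers will never run.
		return NodeFailed
	case corev1.PodRunning:
		return NodeRunning
	default:
		return NodePending
	}
}

func main() {
	pod := &corev1.Pod{Status: corev1.PodStatus{Phase: corev1.PodFailed}}
	fmt.Println(nodePhaseFor(pod)) // Failed
}
```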
Verification
Verified with local testing, plus new unit tests and an e2e test.
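As a sketch of the kind of unit test meant here (exercising the hypothetical nodePhaseFor helper from the sketch above, not the project's real controller tests), one case builds a pod shaped like the status pasted in this thread and asserts the node fails:

```go
package main

import (
	"testing"

	corev1 "k8s.io/api/core/v1"
)

// TestDeadlineExceededPodFailsNode mirrors the pasted pod status: phase Failed
// with the wait container still stuck in PodInitializing.
func TestDeadlineExceededPodFailsNode(t *testing.T) {
	pod := &corev1.Pod{
		Status: corev1.PodStatus{
			Phase:   corev1.PodFailed,
			Message: "Pod was active on the node longer than the specified deadline",
			ContainerStatuses: []corev1.ContainerStatus{{
				Name: "wait",
				State: corev1.ContainerState{
					Waiting: &corev1.ContainerStateWaiting{Reason: "PodInitializing"},
				},
			}},
		},
	}
	if got := nodePhaseFor(pod); got != NodeFailed {
		t.Errorf("expected NodeFailed, got %s", got)
	}

	running := &corev1.Pod{Status: corev1.PodStatus{Phase: corev1.PodRunning}}
	if got := nodePhaseFor(running); got != NodeRunning {
		t.Errorf("expected NodeRunning, got %s", got)
	}
}
```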