
fix: workflow stuck in running state when using activeDeadlineSeconds on template level. Fixes: #12329 #12761

Merged

Conversation

@shuangkun (Member) commented Mar 7, 2024

Fixes: #12329
Root cause: when the pod transitions from Pending to Failed, the wait container's Terminated state is nil, so the node is reset to Pending and the workflow gets stuck.

Motivation

Let the workflow fail when activeDeadlineSeconds is exceeded.

Modifications

Mark the node Failed when the pod transitions from Pending to Failed, so the workflow fails instead of getting stuck.
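
A minimal Go sketch of the intended behavior (not the actual diff; assessFailedPod, NodePhase, and the hard-coded "wait" name stand in for the controller's real node-assessment code, wfv1.NodePhase, and common.WaitContainerName):

```go
package controller

import (
	apiv1 "k8s.io/api/core/v1"
)

// NodePhase stands in for Argo's wfv1.NodePhase in this sketch.
type NodePhase string

const (
	NodePending NodePhase = "Pending"
	NodeFailed  NodePhase = "Failed"
)

// assessFailedPod: once Kubernetes reports the pod as Failed (for example
// because activeDeadlineSeconds was exceeded), keep the node in its previous
// phase only while the wait container is actually running and may still be
// saving outputs. A wait container that never started (Terminated is nil, but
// it is not Running either) can never terminate, so the node is marked Failed
// instead of being reset to Pending.
func assessFailedPod(pod *apiv1.Pod, oldPhase NodePhase) NodePhase {
	if pod.Status.Phase != apiv1.PodFailed {
		return oldPhase
	}
	for _, cs := range pod.Status.ContainerStatuses {
		if cs.Name != "wait" { // "wait" is common.WaitContainerName in Argo
			continue
		}
		if cs.State.Terminated == nil && cs.State.Running != nil {
			// The wait container is mid-flight: hold off so outputs are captured.
			return oldPhase
		}
	}
	return NodeFailed
}
```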

Verification

Tested locally and added unit tests and an e2e test.
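
As a rough illustration only, a hypothetical unit-test-style check of the stuck scenario, written against the assessFailedPod sketch above rather than the tests actually added in this PR:

```go
package controller

import (
	"testing"

	apiv1 "k8s.io/api/core/v1"
)

// The pod exceeded activeDeadlineSeconds and is Failed while the wait
// container is still stuck in PodInitializing; the node must become Failed
// instead of being reset to Pending.
func TestDeadlineExceededWithPendingWaitContainer(t *testing.T) {
	pod := &apiv1.Pod{Status: apiv1.PodStatus{
		Phase:  apiv1.PodFailed,
		Reason: "DeadlineExceeded",
		ContainerStatuses: []apiv1.ContainerStatus{{
			Name:  "wait",
			State: apiv1.ContainerState{Waiting: &apiv1.ContainerStateWaiting{Reason: "PodInitializing"}},
		}},
	}}
	if got := assessFailedPod(pod, NodePending); got != NodeFailed {
		t.Fatalf("expected node phase Failed, got %s", got)
	}
}
```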

@shuangkun shuangkun marked this pull request as draft March 7, 2024 15:22
@shuangkun shuangkun added the area/controller Controller issues, panics label Mar 7, 2024
@shuangkun shuangkun force-pushed the fix/StuckWhenExceedActiveDeadlineSeconds branch from 98c4206 to 0133ffe Compare March 8, 2024 03:52
@shuangkun shuangkun force-pushed the fix/StuckWhenExceedActiveDeadlineSeconds branch from 0133ffe to cd5a7f0 Compare March 8, 2024 06:43
@shuangkun shuangkun marked this pull request as ready for review March 8, 2024 07:25
@shuangkun shuangkun added the prioritized-review For members of the Sustainability Effort label Mar 11, 2024
@shuangkun shuangkun closed this Mar 11, 2024
@shuangkun shuangkun reopened this Mar 11, 2024
@juliev0 juliev0 self-assigned this Mar 12, 2024
@juliev0 (Contributor) commented Mar 12, 2024

I've started looking at this, so I assigned it to myself, but if anyone feels they know the code well here and prefers to reassign it to themselves, I'm good. :)

@juliev0 (Contributor) commented Mar 12, 2024

Can you help me understand the sequence of events that led to the Issue? So, the containers were in "Pending" state and then the Pod's activeDeadlineSeconds was reached. I see at that point from the documentation that Kubernetes "will actively try to mark it failed and kill associated containers". Do we understand what's preventing Kubernetes from marking it Terminated at that point?

@shuangkun (Member Author)
Kubernetes has marked it as Failed, but when Argo recorded the phase it reset the node from Failed back to Pending because the wait container's terminated state is nil.

@shuangkun (Member Author)
I think it's because the wait container never started running, so it can never reach the Terminated state.
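
For background, this is generic Kubernetes API behavior rather than anything Argo-specific: ContainerState holds three mutually exclusive pointers, and a container that never left PodInitializing only ever has Waiting set, so a check for a non-nil Terminated can never succeed once the pod has already failed. A small illustration:

```go
package main

import (
	"fmt"

	apiv1 "k8s.io/api/core/v1"
)

func main() {
	// State of a wait container that never started: only Waiting is populated.
	// Running and Terminated stay nil forever, because the pod is already
	// Failed and the kubelet will not start the container anymore.
	state := apiv1.ContainerState{
		Waiting: &apiv1.ContainerStateWaiting{Reason: "PodInitializing"},
	}
	fmt.Println(state.Waiting != nil)    // true
	fmt.Println(state.Running != nil)    // false
	fmt.Println(state.Terminated != nil) // false: the pre-fix check waits on this forever
}
```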

@shuangkun (Member Author)

kubernetes/kubernetes#108733

kubernetes/kubernetes#108366
@juliev0 (Contributor) commented Mar 13, 2024

> kubernetes/kubernetes#108733
>
> kubernetes/kubernetes#108366

Thanks for the links. Okay, I see from the first one that it links to this official documentation which says "A container in the Terminated state began execution and then either ran to completion or failed for some reason." So, you're right that if it didn't begin execution, it seems it should just stay Pending I guess.

@juliev0 (Contributor) commented Mar 13, 2024

Part of me is wondering what Kubernetes will do in this case - should it just leave it in the Pending state?

In any case, though, maybe your change is reasonable - if the purpose of that code is to make sure we save off any outputs from the wait container, but we hit the activeDeadlineSeconds (or terminated for some other reason), then maybe we don't want to allow the container to get into the Running state in the first place I suppose.

@juliev0 (Contributor) commented Mar 13, 2024

> Part of me is wondering what Kubernetes will do in this case - should it just leave it in the Pending state?
>
> In any case, though, maybe your change is reasonable - if the purpose of that code is to make sure we save off any outputs from the wait container, but we hit the activeDeadlineSeconds (or terminated for some other reason), then maybe we don't want to allow the container to get into the Running state in the first place I suppose.

Although, I guess this code doesn't prevent it from getting into a Running state, does it? It just presumes that it won't get into a Running state, right? (and if it did, presumably we won't wait to save the outputs)

@juliev0 (Contributor) commented Mar 13, 2024

Can you help point me to where in the code after this point this node.Phase can enable vs. prevent workflow shutdown?

@shuangkun (Member Author) commented Mar 13, 2024

> Can you help point me to where in the code after this point this node.Phase can enable vs. prevent workflow shutdown?

If the pod has Failed, Kubernetes won't update it any further. The container status always looks like this:

  - image: quay.io/argoproj/argoexec:latest
    imageID: ""
    lastState: {}
    name: wait
    ready: false
    restartCount: 0
    started: false
    state:
      waiting:
        reason: PodInitializing
  hostIP: 172.18.0.2
  initContainerStatuses:
  - containerID: containerd://66eb4112a8061ff6da5f48b0c15b9c5c5ccf70f8a4ffa079d258b4315b4bbf74
    image: quay.io/argoproj/argoexec:latest
    imageID: quay.io/argoproj/argoexec@sha256:61c8e55d00437565a2823c802cece0b3323aed758e870a5055768337a87d3546
    lastState: {}
    name: init
    ready: true
    restartCount: 0
    state:
      terminated:
        containerID: containerd://66eb4112a8061ff6da5f48b0c15b9c5c5ccf70f8a4ffa079d258b4315b4bbf74
        exitCode: 0
        finishedAt: "2024-03-13T05:34:25Z"
        reason: Completed
        startedAt: "2024-03-13T05:34:21Z"
  message: Pod was active on the node longer than the specified deadline
  phase: Failed

So if Argo doesn't update the node phase, the node always shows Pending in the workflow status, and the workflow is stuck:

      boundaryID: memoized-bug-4nh77
      displayName: fanout(3:4)
      finishedAt: null
      hostNodeName: kind-control-plane
      id: memoized-bug-4nh77-4195166634
      inputs:
        parameters:
        - name: item
          value: "4"
      message: Pod was active on the node longer than the specified deadline
      name: memoized-bug-4nh77[0].fanout(3:4)
      phase: Pending
      progress: 0/1
      startedAt: "2024-03-13T05:34:20Z"
      templateName: echo
      templateScope: local/memoized-bug-4nh77
      type: Pod

@shuangkun (Member Author) commented Mar 13, 2024

> Although, I guess this code doesn't prevent it from getting into a Running state, does it? It just presumes that it won't get into a Running state, right? (and if it did, presumably we won't wait to save the outputs)

Yes, it does not prevent the pod from running.

@shuangkun shuangkun force-pushed the fix/StuckWhenExceedActiveDeadlineSeconds branch from fd5334d to 86c2f4e Compare March 13, 2024 06:15
@shuangkun shuangkun requested a review from juliev0 March 13, 2024 06:52
@juliev0 (Contributor) commented Mar 14, 2024

Okay, so the code being changed isn't particular to failed pods, but for any pod for which new.Phase.Completed()==true. So, we should make sure this will be okay for any of those cases. Have you verified that?

If I try to look at those myself, I believe:

  • this is the case of all containers having Succeeded, marking the Pod as Succeeded
  • this is Phase.Failed which Kubernetes says means "All containers in the Pod have terminated, and at least one container has terminated in failure. That is, the container either exited with non-zero status or was terminated by the system."
  • this one wouldn't actually happen
  • this one seems like some json unmarshalling error, which I don't think would happen
  • this one is the InitContainer failing

If I look at the original bug, I see the InitContainer succeeded, and the wait and main containers are in Pending. So, how did we get to new.Phase.Completed?

@shuangkun (Member Author) commented Mar 14, 2024

> Okay, so the code being changed isn't particular to failed pods, but for any pod for which new.Phase.Completed()==true. So, we should make sure this will be okay for any of those cases. Have you verified that?
>
> If I try to look at those myself, I believe:
>
> • this is the case of all containers having Succeeded, marking the Pod as Succeeded
> • this is Phase.Failed which Kubernetes says means "All containers in the Pod have terminated, and at least one container has terminated in failure. That is, the container either exited with non-zero status or was terminated by the system."

There seems to be something wrong with this description.

> • this one wouldn't actually happen
> • this one seems like some json unmarshalling error, which I don't think would happen
> • this one is the InitContainer failing
>
> If I look at the original bug, I see the InitContainer succeeded, and the wait and main containers are in Pending. So, how did we get to new.Phase.Completed?

The Pod from yesterday is still in this state now. The K8s version is v1.25.3 and Argo is latest. Its phase is Failed:

status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2024-03-13T05:34:26Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2024-03-13T05:34:20Z"
    reason: PodFailed
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2024-03-13T05:34:20Z"
    reason: PodFailed
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2024-03-13T05:34:20Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - image: alpine:latest
    imageID: ""
    lastState: {}
    name: main
    ready: false
    restartCount: 0
    started: false
    state:
      waiting:
        reason: PodInitializing
  - image: quay.io/argoproj/argoexec:latest
    imageID: ""
    lastState: {}
    name: wait
    ready: false
    restartCount: 0
    started: false
    state:
      waiting:
        reason: PodInitializing
  hostIP: 172.18.0.2
  initContainerStatuses:
  - containerID: containerd://66eb4112a8061ff6da5f48b0c15b9c5c5ccf70f8a4ffa079d258b4315b4bbf74
    image: quay.io/argoproj/argoexec:latest
    imageID: quay.io/argoproj/argoexec@sha256:61c8e55d00437565a2823c802cece0b3323aed758e870a5055768337a87d3546
    lastState: {}
    name: init
    ready: true
    restartCount: 0
    state:
      terminated:
        containerID: containerd://66eb4112a8061ff6da5f48b0c15b9c5c5ccf70f8a4ffa079d258b4315b4bbf74
        exitCode: 0
        finishedAt: "2024-03-13T05:34:25Z"
        reason: Completed
        startedAt: "2024-03-13T05:34:21Z"
  message: Pod was active on the node longer than the specified deadline
  phase: Failed
  podIP: 10.244.0.123
  podIPs:
  - ip: 10.244.0.123
  qosClass: Burstable
  reason: DeadlineExceeded
  startTime: "2024-03-13T05:34:20Z"

@shuangkun shuangkun force-pushed the fix/StuckWhenExceedActiveDeadlineSeconds branch from 86c2f4e to 61611e3 Compare March 14, 2024 04:31
@juliev0 (Contributor) commented Mar 14, 2024

> There seems to be something wrong with this description.

Interesting. Thanks for sending that. Okay, so the Pod is clearly in a failed state.

So, going through my cases above of Phase==Complete:

  • If the Pod is Failed, we'll hold off if the wait container is in Running.
  • If the Pod is Succeeded, presumably all of the containers succeeded, so we won't ever hold off.
  • If the InitContainer failed (and perhaps Pod is not necessarily considered Failed itself), we'll hold off if the wait container is for some reason Running.

We won't hold off any longer if for some reason the wait container is in Pending. I can't really think of a case where this would be a problem. I will probably approve this - may just want to take another look later with fresh eyes.
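
To make that concrete, here is a minimal before/after contrast of the hold-off condition as summarized above (hypothetical helper names, not the literal code in the PR):

```go
package main

import (
	"fmt"

	apiv1 "k8s.io/api/core/v1"
)

// holdOffBefore reflects the pre-fix behaviour described earlier: any wait
// container without a Terminated state kept the node in its old phase.
func holdOffBefore(wait apiv1.ContainerState) bool {
	return wait.Terminated == nil
}

// holdOffAfter reflects the behaviour summarized above: only a wait container
// that actually started and is still running holds the node back so it can
// finish saving outputs; one stuck in Pending does not.
func holdOffAfter(wait apiv1.ContainerState) bool {
	return wait.Terminated == nil && wait.Running != nil
}

func main() {
	stuck := apiv1.ContainerState{Waiting: &apiv1.ContainerStateWaiting{Reason: "PodInitializing"}}
	fmt.Println(holdOffBefore(stuck)) // true: node reset to Pending, workflow stuck
	fmt.Println(holdOffAfter(stuck))  // false: node is allowed to fail
}
```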

@shuangkun shuangkun requested a review from juliev0 March 14, 2024 06:23
@juliev0 juliev0 merged commit 16cfef9 into argoproj:main Mar 15, 2024
27 checks passed
@shuangkun (Member Author)
Thanks!

@agilgur5 agilgur5 added this to the v3.5.x patches milestone Apr 19, 2024
agilgur5 pushed a commit that referenced this pull request Apr 19, 2024
… on template level. Fixes: #12329 (#12761)

Signed-off-by: shuangkun <tsk2013uestc@163.com>
Signed-off-by: shuangkun tian <72060326+shuangkun@users.noreply.github.com>
Co-authored-by: Julie Vogelman <julievogelman0@gmail.com>
(cherry picked from commit 16cfef9)
@agilgur5 (Contributor)
Backported cleanly to release-3.5 as 19a7ede

isubasinghe pushed a commit to isubasinghe/argo-workflows that referenced this pull request May 6, 2024
… on template level. Fixes: argoproj#12329 (argoproj#12761)

Signed-off-by: shuangkun <tsk2013uestc@163.com>
Signed-off-by: shuangkun tian <72060326+shuangkun@users.noreply.github.com>
Co-authored-by: Julie Vogelman <julievogelman0@gmail.com>
(cherry picked from commit 16cfef9)
isubasinghe pushed a commit to isubasinghe/argo-workflows that referenced this pull request May 7, 2024
… on template level. Fixes: argoproj#12329 (argoproj#12761)

Signed-off-by: shuangkun <tsk2013uestc@163.com>
Signed-off-by: shuangkun tian <72060326+shuangkun@users.noreply.github.com>
Co-authored-by: Julie Vogelman <julievogelman0@gmail.com>
(cherry picked from commit 16cfef9)