ContainerSet does not stop container task when Pod is removed #12210

Closed
2 of 3 tasks
mochja opened this issue Nov 16, 2023 · 3 comments · Fixed by #12756
Assignees
shuangkun
Labels
area/templates/container-set, P1 (high priority), solution/workaround, type/bug

Comments

@mochja

mochja commented Nov 16, 2023

Pre-requisites

  • I have double-checked my configuration
  • I can confirm the issue exists when I tested with :latest
  • I'd like to contribute the fix myself (see contributing guide)

What happened/what you expected to happen?

I experience stuck workflows when using a containerSet and the pod gets deleted while the workflow is running. The workflow stays in the Running state and never finishes, and it is also not possible to stop it.

To reproduce, run the example workflow below and delete the pod running it.
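
A minimal reproduction sketch, assuming the spec below is saved as lovely-rhino.yaml (the filename is hypothetical) and the controller watches the argo-workflows namespace, as in the logs further down:

argo submit -n argo-workflows lovely-rhino.yaml
# while the containerSet is still sleeping, delete the workflow's pod
kubectl -n argo-workflows delete pod -l workflows.argoproj.io/workflow=lovely-rhino

After the pod is gone, the pod node is marked Error with "pod deleted" (see the controller logs), but the containerSet child node stays Running, so the workflow never finishes.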

Version

v3.5.1

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: lovely-rhino
spec:
  templates:
    - name: init
      dag:
        tasks:
          - name: A
            template: run
            arguments: {}
    - name: run
      containerSet:
        containers:
          - name: main
            image: alpine:latest
            command:
              - /bin/sh
            args:
              - '-c'
              - sleep 9000
            resources: {}
  entrypoint: init
  arguments: {}
  ttlStrategy:
    secondsAfterCompletion: 300
  podGC:
    strategy: OnPodCompletion

Logs from the workflow controller

time="2023-11-16T00:24:51.907Z" level=info msg="Processing workflow" namespace=argo-workflows workflow=lovely-rhino-6452p
time="2023-11-16T00:24:51.955Z" level=info msg="Updated phase  -> Running" namespace=argo-workflows workflow=lovely-rhino-6452p
time="2023-11-16T00:24:51.955Z" level=warning msg="Node was nil, will be initialized as type Skipped" namespace=argo-workflows workflow=lovely-rhino-6452p
time="2023-11-16T00:24:51.955Z" level=info msg="was unable to obtain node for , letting display name to be nodeName" namespace=argo-workflows workflow=lovely-rhino-6452p
time="2023-11-16T00:24:51.955Z" level=info msg="DAG node lovely-rhino-6452p initialized Running" namespace=argo-workflows workflow=lovely-rhino-6452p
time="2023-11-16T00:24:51.955Z" level=warning msg="was unable to obtain the node for lovely-rhino-6452p-4192128298, taskName A"
time="2023-11-16T00:24:51.955Z" level=warning msg="was unable to obtain the node for lovely-rhino-6452p-4192128298, taskName A"
time="2023-11-16T00:24:51.955Z" level=info msg="All of node lovely-rhino-6452p.A dependencies [] completed" namespace=argo-workflows workflow=lovely-rhino-6452p
time="2023-11-16T00:24:51.955Z" level=warning msg="Node was nil, will be initialized as type Skipped" namespace=argo-workflows workflow=lovely-rhino-6452p
time="2023-11-16T00:24:51.955Z" level=info msg="Pod node lovely-rhino-6452p-4192128298 initialized Pending" namespace=argo-workflows workflow=lovely-rhino-6452p
time="2023-11-16T00:24:52.002Z" level=info msg="Created pod: lovely-rhino-6452p.A (lovely-rhino-6452p-run-4192128298)" namespace=argo-workflows workflow=lovely-rhino-6452p
time="2023-11-16T00:24:52.003Z" level=info msg="Container node lovely-rhino-6452p-1411179001 initialized Pending" namespace=argo-workflows workflow=lovely-rhino-6452p
time="2023-11-16T00:24:52.003Z" level=info msg="TaskSet Reconciliation" namespace=argo-workflows workflow=lovely-rhino-6452p
time="2023-11-16T00:24:52.003Z" level=info msg=reconcileAgentPod namespace=argo-workflows workflow=lovely-rhino-6452p
time="2023-11-16T00:24:52.019Z" level=info msg="Workflow update successful" namespace=argo-workflows phase=Running resourceVersion=594776568 workflow=lovely-rhino-6452p
time="2023-11-16T00:25:02.005Z" level=info msg="Processing workflow" namespace=argo-workflows workflow=lovely-rhino-6452p
time="2023-11-16T00:25:02.005Z" level=info msg="Task-result reconciliation" namespace=argo-workflows numObjs=0 workflow=lovely-rhino-6452p
time="2023-11-16T00:25:02.005Z" level=info msg="node lovely-rhino-6452p-1411179001 phase Pending -> Running" namespace=argo-workflows workflow=lovely-rhino-6452p
time="2023-11-16T00:25:02.005Z" level=info msg="node changed" namespace=argo-workflows new.message= new.phase=Running new.progress=0/1 nodeID=lovely-rhino-6452p-4192128298 old.message= old.phase=Pending old.progress=0/1 workflow=lovely-rhino-6452p
time="2023-11-16T00:25:02.005Z" level=info msg="TaskSet Reconciliation" namespace=argo-workflows workflow=lovely-rhino-6452p
time="2023-11-16T00:25:02.005Z" level=info msg=reconcileAgentPod namespace=argo-workflows workflow=lovely-rhino-6452p
time="2023-11-16T00:25:02.024Z" level=info msg="Workflow update successful" namespace=argo-workflows phase=Running resourceVersion=594776713 workflow=lovely-rhino-6452p
time="2023-11-16T00:25:12.025Z" level=info msg="Processing workflow" namespace=argo-workflows workflow=lovely-rhino-6452p
time="2023-11-16T00:25:12.025Z" level=info msg="Task-result reconciliation" namespace=argo-workflows numObjs=0 workflow=lovely-rhino-6452p
time="2023-11-16T00:25:12.025Z" level=info msg="node unchanged" namespace=argo-workflows nodeID=lovely-rhino-6452p-4192128298 workflow=lovely-rhino-6452p
time="2023-11-16T00:25:12.025Z" level=info msg="TaskSet Reconciliation" namespace=argo-workflows workflow=lovely-rhino-6452p
time="2023-11-16T00:25:12.025Z" level=info msg=reconcileAgentPod namespace=argo-workflows workflow=lovely-rhino-6452p
time="2023-11-16T00:26:14.316Z" level=info msg="Processing workflow" namespace=argo-workflows workflow=lovely-rhino-6452p
time="2023-11-16T00:26:14.317Z" level=info msg="Task-result reconciliation" namespace=argo-workflows numObjs=0 workflow=lovely-rhino-6452p
time="2023-11-16T00:26:14.317Z" level=info msg="Workflow pod is missing" namespace=argo-workflows nodeName=lovely-rhino-6452p.A nodePhase=Running recentlyStarted=false workflow=lovely-rhino-6452p
time="2023-11-16T00:26:14.317Z" level=info msg="node lovely-rhino-6452p-4192128298 phase Running -> Error" namespace=argo-workflows workflow=lovely-rhino-6452p
time="2023-11-16T00:26:14.317Z" level=info msg="node lovely-rhino-6452p-4192128298 message: pod deleted" namespace=argo-workflows workflow=lovely-rhino-6452p
time="2023-11-16T00:26:14.317Z" level=info msg="node lovely-rhino-6452p-4192128298 finished: 2023-11-16 00:26:14.317181368 +0000 UTC" namespace=argo-workflows workflow=lovely-rhino-6452p
time="2023-11-16T00:26:14.317Z" level=info msg="TaskSet Reconciliation" namespace=argo-workflows workflow=lovely-rhino-6452p
time="2023-11-16T00:26:14.317Z" level=info msg=reconcileAgentPod namespace=argo-workflows workflow=lovely-rhino-6452p
time="2023-11-16T00:26:14.337Z" level=info msg="Workflow update successful" namespace=argo-workflows phase=Running resourceVersion=594777738 workflow=lovely-rhino-6452p
time="2023-11-16T00:26:24.338Z" level=info msg="Processing workflow" namespace=argo-workflows workflow=lovely-rhino-6452p
time="2023-11-16T00:26:24.338Z" level=info msg="Task-result reconciliation" namespace=argo-workflows numObjs=0 workflow=lovely-rhino-6452p
time="2023-11-16T00:26:24.339Z" level=info msg="TaskSet Reconciliation" namespace=argo-workflows workflow=lovely-rhino-6452p
time="2023-11-16T00:26:24.339Z" level=info msg=reconcileAgentPod namespace=argo-workflows workflow=lovely-rhino-6452p

Logs from your workflow's wait container

N/A
@jonrich92

We experience the same issue. This is problematic since the triggering of other workflows depends on this workflow.
For us the deletion is not artificial: the pod is sometimes deleted by external means because our Istio does not come up on a new node fast enough.

@shuangkun shuangkun self-assigned this Mar 6, 2024
@shuangkun
Member

Reproduced it and will find the root cause to fix it.

@shuangkun shuangkun added the P1 (high priority) label and removed the P3 (low priority) label Mar 7, 2024
shuangkun added a commit to shuangkun/argo-workflows that referenced this issue Mar 7, 2024
shuangkun added a commit to shuangkun/argo-workflows that referenced this issue Mar 19, 2024
shuangkun added a commit to shuangkun/argo-workflows that referenced this issue Mar 22, 2024
@agilgur5
Member

agilgur5 commented Mar 24, 2024

Just for clarification, per #12756 (comment), the Pod and its containers are indeed deleted; it was just that the Workflow did not properly set the status of the child containers.

So if you get into this state somehow, you can probably just delete your Workflow without leaving leftover resources (unless you have other logic that didn't run or clean up). You could also manually rewrite the state to match what #12756 does.
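
A minimal sketch of that cleanup, using the workflow name and namespace from the logs above (substitute your own):

# the pod and its containers are already gone, so deleting the stuck Workflow
# should not leave anything running behind
kubectl -n argo-workflows delete workflow lovely-rhino-6452p
# or, with the Argo CLI:
argo delete -n argo-workflows lovely-rhino-6452p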

@agilgur5 agilgur5 added the solution/workaround label Mar 24, 2024
agilgur5 pushed a commit that referenced this issue Mar 25, 2024
@agilgur5 agilgur5 added this to the v3.5.x patches milestone Apr 19, 2024
agilgur5 pushed a commit that referenced this issue Apr 19, 2024
…2210 (#12756)

Signed-off-by: shuangkun <tsk2013uestc@163.com>
(cherry picked from commit cfe2bb7)