Retrying a stopped workflow reports "Ancestor node not found" #12156

Open · 3 tasks done
hustclf opened this issue Nov 6, 2023 · 6 comments · May be fixed by #12164
Labels: area/controller, area/retry-manual, area/templates/dag, type/bug

Comments

hustclf commented Nov 6, 2023

Pre-requisites

  • I have double-checked my configuration
  • I can confirm the issue exists when I tested with :latest
  • I'd like to contribute the fix myself (see contributing guide)

What happened/what you expected to happen?

What happened
I created a DAG workflow with four tasks (A, B, C, D). When tasks A and B were successfully completed, I stopped the workflow during the execution of task C. After retrying the workflow, I noticed that the workflow's phase immediately changed to 'Error' and displayed an error message stating that the ancestor node 'C' was not found.

  1. Run the example workflow and stop it while task C is running.
     (screenshot)
  2. Retry the workflow.
     (screenshot)

What you expected to happen
I expected to be able to successfully retry a stopped workflow.
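For illustration, the failure mode looks consistent with the following simplified model (this is a hypothetical sketch, not the actual controller code): the retry operation removes the statuses of non-succeeded nodes (here C), but the DAG still contains task D whose `depends` expression references C, so resolving D's ancestor fails.

```python
# Hypothetical, simplified model of the failure. `resolve_dependency` and the
# status dictionaries below are illustrative stand-ins, not Argo internals.

def resolve_dependency(node_statuses, depends_on):
    """Look up the status of an ancestor task; fail if it is missing."""
    ancestor = node_statuses.get(depends_on)
    if ancestor is None:
        raise RuntimeError(f"Ancestor node {depends_on} not found")
    return ancestor["phase"]

# Before retry: A and B succeeded; C was stopped (Failed).
statuses = {
    "A": {"phase": "Succeeded"},
    "B": {"phase": "Succeeded"},
    "C": {"phase": "Failed"},
}

# Retry discards every node that did not succeed...
statuses = {k: v for k, v in statuses.items() if v["phase"] == "Succeeded"}

# ...so evaluating D's `depends: C.Succeeded || C.Failed` now fails:
try:
    resolve_dependency(statuses, "C")
except RuntimeError as e:
    print(e)  # Ancestor node C not found
```

Under this model, the fix would be either to keep a placeholder status for stopped ancestors or to re-run them on retry, which is what a successful retry of tasks A and B already does in the logs below.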

Version

v3.4.13

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: retry-stopped-dag
spec:
  entrypoint: diamond
  templates:
  - name: echo
    inputs:
      parameters:
      - name: message
    container:
      image: alpine:3.7
      command: [echo, "{{inputs.parameters.message}}"]
  - name: sleep
    inputs:
      parameters:
      - name: message
    container:
      image: alpine:3.7
      command: [sleep, 36000s]
  - name: diamond
    dag:
      tasks:
      - name: A
        template: echo
        arguments:
          parameters: [{name: message, value: A}]
      - name: B
        depends: A.Succeeded
        template: echo
        arguments:
          parameters: [{name: message, value: B}]
      - name: C
        depends: B.Succeeded
        template: sleep
        arguments:
          parameters: [{name: message, value: C}]
      - name: D
        depends: C.Succeeded || C.Failed
        template: echo
        arguments:
          parameters: [{name: message, value: D}]

Logs from the workflow controller

kubectl logs -n argo deploy/workflow-controller | grep ${workflow}

Run the workflow and stop it:

time="2023-11-06T17:08:09.524Z" level=info msg="cleaning up pod" action=labelPodCompleted key=argo/retry-stopped-dag-sleep-189432235/labelPodCompleted
time="2023-11-06T17:08:09.625Z" level=info msg="signaled container" container=main error="<nil>" namespace=argo pod=retry-stopped-dag-sleep-189432235 stderr= stdout="killing 1 with terminated\n"
time="2023-11-06T17:08:09.626Z" level=info msg="https://0.0.0.0:50661/api/v1/namespaces/argo/pods/retry-stopped-dag-sleep-189432235/exec?command=%2Fvar%2Frun%2Fargo%2Fargoexec&command=kill&command=15&command=1&container=wait&stderr=true&stdout=true&tty=false"
time="2023-11-06T17:08:09.812Z" level=info msg="signaled container" container=wait error="<nil>" namespace=argo pod=retry-stopped-dag-sleep-189432235 stderr= stdout="killing 1 with terminated\n"
time="2023-11-06T17:08:12.813Z" level=info msg="cleaning up pod" action=killContainers key=argo/retry-stopped-dag-sleep-189432235/killContainers
time="2023-11-06T17:08:59.776Z" level=info msg="Processing workflow" namespace=argo workflow=retry-stopped-dag
time="2023-11-06T17:08:59.787Z" level=info msg="Updated phase  -> Running" namespace=argo workflow=retry-stopped-dag
time="2023-11-06T17:08:59.787Z" level=warning msg="Node was nil, will be initialized as type Skipped" namespace=argo workflow=retry-stopped-dag
time="2023-11-06T17:08:59.787Z" level=info msg="was unable to obtain node for , letting display name to be nodeName" namespace=argo workflow=retry-stopped-dag
time="2023-11-06T17:08:59.787Z" level=info msg="DAG node retry-stopped-dag initialized Running" namespace=argo workflow=retry-stopped-dag
time="2023-11-06T17:08:59.787Z" level=warning msg="was unable to obtain the node for retry-stopped-dag-273320330, taskName D"
time="2023-11-06T17:08:59.787Z" level=warning msg="was unable to obtain the node for retry-stopped-dag-189432235, taskName C"
time="2023-11-06T17:08:59.787Z" level=warning msg="was unable to obtain the node for retry-stopped-dag-172654616, taskName B"
time="2023-11-06T17:08:59.787Z" level=warning msg="was unable to obtain the node for retry-stopped-dag-222987473, taskName A"
time="2023-11-06T17:08:59.787Z" level=warning msg="was unable to obtain the node for retry-stopped-dag-222987473, taskName A"
time="2023-11-06T17:08:59.787Z" level=info msg="All of node retry-stopped-dag.A dependencies [] completed" namespace=argo workflow=retry-stopped-dag
time="2023-11-06T17:08:59.787Z" level=warning msg="Node was nil, will be initialized as type Skipped" namespace=argo workflow=retry-stopped-dag
time="2023-11-06T17:08:59.788Z" level=info msg="Pod node retry-stopped-dag-222987473 initialized Pending" namespace=argo workflow=retry-stopped-dag
time="2023-11-06T17:08:59.809Z" level=info msg="Created pod: retry-stopped-dag.A (retry-stopped-dag-echo-222987473)" namespace=argo workflow=retry-stopped-dag
time="2023-11-06T17:08:59.809Z" level=warning msg="was unable to obtain the node for retry-stopped-dag-172654616, taskName B"
time="2023-11-06T17:08:59.809Z" level=warning msg="was unable to obtain the node for retry-stopped-dag-189432235, taskName C"
time="2023-11-06T17:08:59.809Z" level=warning msg="was unable to obtain the node for retry-stopped-dag-172654616, taskName B"
time="2023-11-06T17:08:59.809Z" level=warning msg="was unable to obtain the node for retry-stopped-dag-273320330, taskName D"
time="2023-11-06T17:08:59.809Z" level=warning msg="was unable to obtain the node for retry-stopped-dag-189432235, taskName C"
time="2023-11-06T17:08:59.809Z" level=warning msg="was unable to obtain the node for retry-stopped-dag-273320330, taskName D"
time="2023-11-06T17:08:59.809Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=retry-stopped-dag
time="2023-11-06T17:08:59.810Z" level=info msg=reconcileAgentPod namespace=argo workflow=retry-stopped-dag
time="2023-11-06T17:08:59.826Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=45660 workflow=retry-stopped-dag
time="2023-11-06T17:09:00.776Z" level=info msg="Processing workflow" namespace=argo workflow=retry-stopped-dag
time="2023-11-06T17:09:00.777Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=0 workflow=retry-stopped-dag
time="2023-11-06T17:09:00.778Z" level=info msg="node changed" namespace=argo new.message=PodInitializing new.phase=Pending new.progress=0/1 nodeID=retry-stopped-dag-222987473 old.message= old.phase=Pending old.progress=0/1 workflow=retry-stopped-dag
time="2023-11-06T17:09:00.778Z" level=warning msg="was unable to obtain the node for retry-stopped-dag-273320330, taskName D"
time="2023-11-06T17:09:00.778Z" level=warning msg="was unable to obtain the node for retry-stopped-dag-189432235, taskName C"
time="2023-11-06T17:09:00.778Z" level=warning msg="was unable to obtain the node for retry-stopped-dag-172654616, taskName B"
time="2023-11-06T17:09:00.778Z" level=warning msg="was unable to obtain the node for retry-stopped-dag-172654616, taskName B"
time="2023-11-06T17:09:00.778Z" level=warning msg="was unable to obtain the node for retry-stopped-dag-189432235, taskName C"
time="2023-11-06T17:09:00.778Z" level=warning msg="was unable to obtain the node for retry-stopped-dag-172654616, taskName B"
time="2023-11-06T17:09:00.778Z" level=warning msg="was unable to obtain the node for retry-stopped-dag-273320330, taskName D"
time="2023-11-06T17:09:00.778Z" level=warning msg="was unable to obtain the node for retry-stopped-dag-189432235, taskName C"
time="2023-11-06T17:09:00.778Z" level=warning msg="was unable to obtain the node for retry-stopped-dag-273320330, taskName D"
time="2023-11-06T17:09:00.778Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=retry-stopped-dag
time="2023-11-06T17:09:00.778Z" level=info msg=reconcileAgentPod namespace=argo workflow=retry-stopped-dag
time="2023-11-06T17:09:00.787Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=45668 workflow=retry-stopped-dag
time="2023-11-06T17:09:01.901Z" level=info msg="Processing workflow" namespace=argo workflow=retry-stopped-dag
time="2023-11-06T17:09:01.902Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=0 workflow=retry-stopped-dag
time="2023-11-06T17:09:01.902Z" level=info msg="node unchanged" namespace=argo nodeID=retry-stopped-dag-222987473 workflow=retry-stopped-dag
time="2023-11-06T17:09:01.903Z" level=warning msg="was unable to obtain the node for retry-stopped-dag-273320330, taskName D"
time="2023-11-06T17:09:01.903Z" level=warning msg="was unable to obtain the node for retry-stopped-dag-189432235, taskName C"
time="2023-11-06T17:09:01.903Z" level=warning msg="was unable to obtain the node for retry-stopped-dag-172654616, taskName B"
time="2023-11-06T17:09:01.903Z" level=warning msg="was unable to obtain the node for retry-stopped-dag-172654616, taskName B"
time="2023-11-06T17:09:01.903Z" level=warning msg="was unable to obtain the node for retry-stopped-dag-189432235, taskName C"
time="2023-11-06T17:09:01.903Z" level=warning msg="was unable to obtain the node for retry-stopped-dag-172654616, taskName B"
time="2023-11-06T17:09:01.903Z" level=warning msg="was unable to obtain the node for retry-stopped-dag-273320330, taskName D"
time="2023-11-06T17:09:01.903Z" level=warning msg="was unable to obtain the node for retry-stopped-dag-189432235, taskName C"
time="2023-11-06T17:09:01.903Z" level=warning msg="was unable to obtain the node for retry-stopped-dag-273320330, taskName D"
time="2023-11-06T17:09:01.903Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=retry-stopped-dag
time="2023-11-06T17:09:01.903Z" level=info msg=reconcileAgentPod namespace=argo workflow=retry-stopped-dag
time="2023-11-06T17:09:02.908Z" level=info msg="Processing workflow" namespace=argo workflow=retry-stopped-dag
time="2023-11-06T17:09:02.910Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=0 workflow=retry-stopped-dag
time="2023-11-06T17:09:02.918Z" level=info msg="node unchanged" namespace=argo nodeID=retry-stopped-dag-222987473 workflow=retry-stopped-dag
time="2023-11-06T17:09:02.919Z" level=warning msg="was unable to obtain the node for retry-stopped-dag-273320330, taskName D"
time="2023-11-06T17:09:02.919Z" level=warning msg="was unable to obtain the node for retry-stopped-dag-189432235, taskName C"
time="2023-11-06T17:09:02.919Z" level=warning msg="was unable to obtain the node for retry-stopped-dag-172654616, taskName B"
time="2023-11-06T17:09:02.920Z" level=warning msg="was unable to obtain the node for retry-stopped-dag-172654616, taskName B"
time="2023-11-06T17:09:02.920Z" level=warning msg="was unable to obtain the node for retry-stopped-dag-189432235, taskName C"
time="2023-11-06T17:09:02.920Z" level=warning msg="was unable to obtain the node for retry-stopped-dag-172654616, taskName B"
time="2023-11-06T17:09:02.920Z" level=warning msg="was unable to obtain the node for retry-stopped-dag-273320330, taskName D"
time="2023-11-06T17:09:02.920Z" level=warning msg="was unable to obtain the node for retry-stopped-dag-189432235, taskName C"
time="2023-11-06T17:09:02.920Z" level=warning msg="was unable to obtain the node for retry-stopped-dag-273320330, taskName D"
time="2023-11-06T17:09:02.920Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=retry-stopped-dag
time="2023-11-06T17:09:02.920Z" level=info msg=reconcileAgentPod namespace=argo workflow=retry-stopped-dag
time="2023-11-06T17:09:03.950Z" level=info msg="Processing workflow" namespace=argo workflow=retry-stopped-dag
time="2023-11-06T17:09:03.951Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=1 workflow=retry-stopped-dag
time="2023-11-06T17:09:03.951Z" level=info msg="task-result changed" namespace=argo nodeID=retry-stopped-dag-222987473 workflow=retry-stopped-dag
time="2023-11-06T17:09:03.951Z" level=info msg="node changed" namespace=argo new.message= new.phase=Running new.progress=0/1 nodeID=retry-stopped-dag-222987473 old.message=PodInitializing old.phase=Pending old.progress=0/1 workflow=retry-stopped-dag
time="2023-11-06T17:09:03.951Z" level=warning msg="was unable to obtain the node for retry-stopped-dag-273320330, taskName D"
time="2023-11-06T17:09:03.951Z" level=warning msg="was unable to obtain the node for retry-stopped-dag-189432235, taskName C"
time="2023-11-06T17:09:03.951Z" level=warning msg="was unable to obtain the node for retry-stopped-dag-172654616, taskName B"
time="2023-11-06T17:09:03.952Z" level=warning msg="was unable to obtain the node for retry-stopped-dag-172654616, taskName B"
time="2023-11-06T17:09:03.952Z" level=warning msg="was unable to obtain the node for retry-stopped-dag-189432235, taskName C"
time="2023-11-06T17:09:03.952Z" level=warning msg="was unable to obtain the node for retry-stopped-dag-172654616, taskName B"
time="2023-11-06T17:09:03.952Z" level=warning msg="was unable to obtain the node for retry-stopped-dag-273320330, taskName D"
time="2023-11-06T17:09:03.952Z" level=warning msg="was unable to obtain the node for retry-stopped-dag-189432235, taskName C"
time="2023-11-06T17:09:03.952Z" level=warning msg="was unable to obtain the node for retry-stopped-dag-273320330, taskName D"
time="2023-11-06T17:09:03.952Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=retry-stopped-dag
time="2023-11-06T17:09:03.952Z" level=info msg=reconcileAgentPod namespace=argo workflow=retry-stopped-dag
time="2023-11-06T17:09:03.956Z" level=info msg="cleaning up pod" action=terminateContainers key=argo/retry-stopped-dag-echo-222987473/terminateContainers
time="2023-11-06T17:09:03.968Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=45681 workflow=retry-stopped-dag
time="2023-11-06T17:09:06.061Z" level=info msg="Processing workflow" namespace=argo workflow=retry-stopped-dag
time="2023-11-06T17:09:06.062Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=1 workflow=retry-stopped-dag
time="2023-11-06T17:09:06.062Z" level=info msg="task-result changed" namespace=argo nodeID=retry-stopped-dag-222987473 workflow=retry-stopped-dag
time="2023-11-06T17:09:06.063Z" level=info msg="node changed" namespace=argo new.message= new.phase=Succeeded new.progress=0/1 nodeID=retry-stopped-dag-222987473 old.message= old.phase=Running old.progress=0/1 workflow=retry-stopped-dag
time="2023-11-06T17:09:06.063Z" level=warning msg="was unable to obtain the node for retry-stopped-dag-273320330, taskName D"
time="2023-11-06T17:09:06.063Z" level=warning msg="was unable to obtain the node for retry-stopped-dag-189432235, taskName C"
time="2023-11-06T17:09:06.063Z" level=warning msg="was unable to obtain the node for retry-stopped-dag-172654616, taskName B"
time="2023-11-06T17:09:06.064Z" level=warning msg="was unable to obtain the node for retry-stopped-dag-172654616, taskName B"
time="2023-11-06T17:09:06.064Z" level=warning msg="was unable to obtain the node for retry-stopped-dag-172654616, taskName B"
time="2023-11-06T17:09:06.064Z" level=info msg="All of node retry-stopped-dag.B dependencies [A] completed" namespace=argo workflow=retry-stopped-dag
time="2023-11-06T17:09:06.064Z" level=warning msg="Node was nil, will be initialized as type Skipped" namespace=argo workflow=retry-stopped-dag
time="2023-11-06T17:09:06.064Z" level=info msg="Pod node retry-stopped-dag-172654616 initialized Pending" namespace=argo workflow=retry-stopped-dag
time="2023-11-06T17:09:06.084Z" level=info msg="Created pod: retry-stopped-dag.B (retry-stopped-dag-echo-172654616)" namespace=argo workflow=retry-stopped-dag
time="2023-11-06T17:09:06.084Z" level=warning msg="was unable to obtain the node for retry-stopped-dag-189432235, taskName C"
time="2023-11-06T17:09:06.084Z" level=warning msg="was unable to obtain the node for retry-stopped-dag-273320330, taskName D"
time="2023-11-06T17:09:06.084Z" level=warning msg="was unable to obtain the node for retry-stopped-dag-189432235, taskName C"
time="2023-11-06T17:09:06.084Z" level=warning msg="was unable to obtain the node for retry-stopped-dag-273320330, taskName D"
time="2023-11-06T17:09:06.084Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=retry-stopped-dag
time="2023-11-06T17:09:06.084Z" level=info msg=reconcileAgentPod namespace=argo workflow=retry-stopped-dag
time="2023-11-06T17:09:06.101Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=45689 workflow=retry-stopped-dag
time="2023-11-06T17:09:06.108Z" level=info msg="cleaning up pod" action=labelPodCompleted key=argo/retry-stopped-dag-echo-222987473/labelPodCompleted
time="2023-11-06T17:09:06.957Z" level=info msg="cleaning up pod" action=killContainers key=argo/retry-stopped-dag-echo-222987473/killContainers
time="2023-11-06T17:09:07.086Z" level=info msg="Processing workflow" namespace=argo workflow=retry-stopped-dag
time="2023-11-06T17:09:07.086Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=1 workflow=retry-stopped-dag
time="2023-11-06T17:09:07.086Z" level=info msg="task-result changed" namespace=argo nodeID=retry-stopped-dag-222987473 workflow=retry-stopped-dag
time="2023-11-06T17:09:07.086Z" level=info msg="node changed" namespace=argo new.message=PodInitializing new.phase=Pending new.progress=0/1 nodeID=retry-stopped-dag-172654616 old.message= old.phase=Pending old.progress=0/1 workflow=retry-stopped-dag
time="2023-11-06T17:09:07.087Z" level=warning msg="was unable to obtain the node for retry-stopped-dag-273320330, taskName D"
time="2023-11-06T17:09:07.087Z" level=warning msg="was unable to obtain the node for retry-stopped-dag-189432235, taskName C"
time="2023-11-06T17:09:07.087Z" level=warning msg="was unable to obtain the node for retry-stopped-dag-189432235, taskName C"
time="2023-11-06T17:09:07.087Z" level=warning msg="was unable to obtain the node for retry-stopped-dag-273320330, taskName D"
time="2023-11-06T17:09:07.087Z" level=warning msg="was unable to obtain the node for retry-stopped-dag-189432235, taskName C"
time="2023-11-06T17:09:07.087Z" level=warning msg="was unable to obtain the node for retry-stopped-dag-273320330, taskName D"
time="2023-11-06T17:09:07.087Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=retry-stopped-dag
time="2023-11-06T17:09:07.087Z" level=info msg=reconcileAgentPod namespace=argo workflow=retry-stopped-dag
time="2023-11-06T17:09:07.096Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=45699 workflow=retry-stopped-dag
time="2023-11-06T17:09:09.007Z" level=info msg="Processing workflow" namespace=argo workflow=retry-stopped-dag
time="2023-11-06T17:09:09.008Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=1 workflow=retry-stopped-dag
time="2023-11-06T17:09:09.008Z" level=info msg="task-result changed" namespace=argo nodeID=retry-stopped-dag-222987473 workflow=retry-stopped-dag
time="2023-11-06T17:09:09.008Z" level=info msg="node unchanged" namespace=argo nodeID=retry-stopped-dag-172654616 workflow=retry-stopped-dag
time="2023-11-06T17:09:09.008Z" level=warning msg="was unable to obtain the node for retry-stopped-dag-273320330, taskName D"
time="2023-11-06T17:09:09.008Z" level=warning msg="was unable to obtain the node for retry-stopped-dag-189432235, taskName C"
time="2023-11-06T17:09:09.009Z" level=warning msg="was unable to obtain the node for retry-stopped-dag-189432235, taskName C"
time="2023-11-06T17:09:09.009Z" level=warning msg="was unable to obtain the node for retry-stopped-dag-273320330, taskName D"
time="2023-11-06T17:09:09.009Z" level=warning msg="was unable to obtain the node for retry-stopped-dag-189432235, taskName C"
time="2023-11-06T17:09:09.009Z" level=warning msg="was unable to obtain the node for retry-stopped-dag-273320330, taskName D"
time="2023-11-06T17:09:09.009Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=retry-stopped-dag
time="2023-11-06T17:09:09.009Z" level=info msg=reconcileAgentPod namespace=argo workflow=retry-stopped-dag
time="2023-11-06T17:09:09.027Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=45699 workflow=retry-stopped-dag
time="2023-11-06T17:09:10.029Z" level=info msg="Processing workflow" namespace=argo workflow=retry-stopped-dag
time="2023-11-06T17:09:10.030Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=2 workflow=retry-stopped-dag
time="2023-11-06T17:09:10.030Z" level=info msg="task-result changed" namespace=argo nodeID=retry-stopped-dag-222987473 workflow=retry-stopped-dag
time="2023-11-06T17:09:10.030Z" level=info msg="task-result changed" namespace=argo nodeID=retry-stopped-dag-172654616 workflow=retry-stopped-dag
time="2023-11-06T17:09:10.030Z" level=info msg="node changed" namespace=argo new.message= new.phase=Running new.progress=0/1 nodeID=retry-stopped-dag-172654616 old.message=PodInitializing old.phase=Pending old.progress=0/1 workflow=retry-stopped-dag
time="2023-11-06T17:09:10.030Z" level=warning msg="was unable to obtain the node for retry-stopped-dag-273320330, taskName D"
time="2023-11-06T17:09:10.030Z" level=warning msg="was unable to obtain the node for retry-stopped-dag-189432235, taskName C"
time="2023-11-06T17:09:10.031Z" level=warning msg="was unable to obtain the node for retry-stopped-dag-189432235, taskName C"
time="2023-11-06T17:09:10.031Z" level=warning msg="was unable to obtain the node for retry-stopped-dag-273320330, taskName D"
time="2023-11-06T17:09:10.031Z" level=warning msg="was unable to obtain the node for retry-stopped-dag-189432235, taskName C"
time="2023-11-06T17:09:10.031Z" level=warning msg="was unable to obtain the node for retry-stopped-dag-273320330, taskName D"
time="2023-11-06T17:09:10.031Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=retry-stopped-dag
time="2023-11-06T17:09:10.031Z" level=info msg=reconcileAgentPod namespace=argo workflow=retry-stopped-dag
time="2023-11-06T17:09:10.035Z" level=info msg="cleaning up pod" action=terminateContainers key=argo/retry-stopped-dag-echo-172654616/terminateContainers
time="2023-11-06T17:09:10.043Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=45711 workflow=retry-stopped-dag
time="2023-11-06T17:09:12.110Z" level=info msg="Processing workflow" namespace=argo workflow=retry-stopped-dag
time="2023-11-06T17:09:12.110Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=2 workflow=retry-stopped-dag
time="2023-11-06T17:09:12.110Z" level=info msg="task-result changed" namespace=argo nodeID=retry-stopped-dag-172654616 workflow=retry-stopped-dag
time="2023-11-06T17:09:12.110Z" level=info msg="task-result changed" namespace=argo nodeID=retry-stopped-dag-222987473 workflow=retry-stopped-dag
time="2023-11-06T17:09:12.110Z" level=info msg="node changed" namespace=argo new.message= new.phase=Succeeded new.progress=0/1 nodeID=retry-stopped-dag-172654616 old.message= old.phase=Running old.progress=0/1 workflow=retry-stopped-dag
time="2023-11-06T17:09:12.111Z" level=warning msg="was unable to obtain the node for retry-stopped-dag-273320330, taskName D"
time="2023-11-06T17:09:12.111Z" level=warning msg="was unable to obtain the node for retry-stopped-dag-189432235, taskName C"
time="2023-11-06T17:09:12.111Z" level=warning msg="was unable to obtain the node for retry-stopped-dag-189432235, taskName C"
time="2023-11-06T17:09:12.111Z" level=warning msg="was unable to obtain the node for retry-stopped-dag-189432235, taskName C"
time="2023-11-06T17:09:12.111Z" level=info msg="All of node retry-stopped-dag.C dependencies [B] completed" namespace=argo workflow=retry-stopped-dag
time="2023-11-06T17:09:12.111Z" level=warning msg="Node was nil, will be initialized as type Skipped" namespace=argo workflow=retry-stopped-dag
time="2023-11-06T17:09:12.111Z" level=info msg="Pod node retry-stopped-dag-189432235 initialized Pending" namespace=argo workflow=retry-stopped-dag
time="2023-11-06T17:09:12.118Z" level=info msg="Created pod: retry-stopped-dag.C (retry-stopped-dag-sleep-189432235)" namespace=argo workflow=retry-stopped-dag
time="2023-11-06T17:09:12.118Z" level=warning msg="was unable to obtain the node for retry-stopped-dag-273320330, taskName D"
time="2023-11-06T17:09:12.118Z" level=warning msg="was unable to obtain the node for retry-stopped-dag-273320330, taskName D"
time="2023-11-06T17:09:12.118Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=retry-stopped-dag
time="2023-11-06T17:09:12.118Z" level=info msg=reconcileAgentPod namespace=argo workflow=retry-stopped-dag
time="2023-11-06T17:09:12.129Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=45718 workflow=retry-stopped-dag
time="2023-11-06T17:09:12.134Z" level=info msg="cleaning up pod" action=labelPodCompleted key=argo/retry-stopped-dag-echo-172654616/labelPodCompleted
time="2023-11-06T17:09:13.036Z" level=info msg="cleaning up pod" action=killContainers key=argo/retry-stopped-dag-echo-172654616/killContainers
time="2023-11-06T17:09:13.119Z" level=info msg="Processing workflow" namespace=argo workflow=retry-stopped-dag
time="2023-11-06T17:09:13.120Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=2 workflow=retry-stopped-dag
time="2023-11-06T17:09:13.120Z" level=info msg="task-result changed" namespace=argo nodeID=retry-stopped-dag-222987473 workflow=retry-stopped-dag
time="2023-11-06T17:09:13.120Z" level=info msg="task-result changed" namespace=argo nodeID=retry-stopped-dag-172654616 workflow=retry-stopped-dag
time="2023-11-06T17:09:13.120Z" level=info msg="node changed" namespace=argo new.message=PodInitializing new.phase=Pending new.progress=0/1 nodeID=retry-stopped-dag-189432235 old.message= old.phase=Pending old.progress=0/1 workflow=retry-stopped-dag
time="2023-11-06T17:09:13.120Z" level=warning msg="was unable to obtain the node for retry-stopped-dag-273320330, taskName D"
time="2023-11-06T17:09:13.121Z" level=warning msg="was unable to obtain the node for retry-stopped-dag-273320330, taskName D"
time="2023-11-06T17:09:13.121Z" level=warning msg="was unable to obtain the node for retry-stopped-dag-273320330, taskName D"
time="2023-11-06T17:09:13.121Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=retry-stopped-dag
time="2023-11-06T17:09:13.121Z" level=info msg=reconcileAgentPod namespace=argo workflow=retry-stopped-dag
time="2023-11-06T17:09:13.132Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=45727 workflow=retry-stopped-dag
time="2023-11-06T17:09:15.047Z" level=info msg="Processing workflow" namespace=argo workflow=retry-stopped-dag
time="2023-11-06T17:09:15.047Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=2 workflow=retry-stopped-dag
time="2023-11-06T17:09:15.047Z" level=info msg="task-result changed" namespace=argo nodeID=retry-stopped-dag-222987473 workflow=retry-stopped-dag
time="2023-11-06T17:09:15.047Z" level=info msg="task-result changed" namespace=argo nodeID=retry-stopped-dag-172654616 workflow=retry-stopped-dag
time="2023-11-06T17:09:15.047Z" level=info msg="node unchanged" namespace=argo nodeID=retry-stopped-dag-189432235 workflow=retry-stopped-dag
time="2023-11-06T17:09:15.048Z" level=warning msg="was unable to obtain the node for retry-stopped-dag-273320330, taskName D"
time="2023-11-06T17:09:15.048Z" level=warning msg="was unable to obtain the node for retry-stopped-dag-273320330, taskName D"
time="2023-11-06T17:09:15.048Z" level=warning msg="was unable to obtain the node for retry-stopped-dag-273320330, taskName D"
time="2023-11-06T17:09:15.048Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=retry-stopped-dag
time="2023-11-06T17:09:15.048Z" level=info msg=reconcileAgentPod namespace=argo workflow=retry-stopped-dag
time="2023-11-06T17:09:15.061Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=45727 workflow=retry-stopped-dag
time="2023-11-06T17:09:16.052Z" level=info msg="Processing workflow" namespace=argo workflow=retry-stopped-dag
time="2023-11-06T17:09:16.053Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=2 workflow=retry-stopped-dag
time="2023-11-06T17:09:16.053Z" level=info msg="task-result changed" namespace=argo nodeID=retry-stopped-dag-222987473 workflow=retry-stopped-dag
time="2023-11-06T17:09:16.053Z" level=info msg="task-result changed" namespace=argo nodeID=retry-stopped-dag-172654616 workflow=retry-stopped-dag
time="2023-11-06T17:09:16.053Z" level=info msg="node changed" namespace=argo new.message= new.phase=Running new.progress=0/1 nodeID=retry-stopped-dag-189432235 old.message=PodInitializing old.phase=Pending old.progress=0/1 workflow=retry-stopped-dag
time="2023-11-06T17:09:16.053Z" level=warning msg="was unable to obtain the node for retry-stopped-dag-273320330, taskName D"
time="2023-11-06T17:09:16.053Z" level=warning msg="was unable to obtain the node for retry-stopped-dag-273320330, taskName D"
time="2023-11-06T17:09:16.053Z" level=warning msg="was unable to obtain the node for retry-stopped-dag-273320330, taskName D"
time="2023-11-06T17:09:16.054Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=retry-stopped-dag
time="2023-11-06T17:09:16.054Z" level=info msg=reconcileAgentPod namespace=argo workflow=retry-stopped-dag
time="2023-11-06T17:09:16.063Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=45737 workflow=retry-stopped-dag
time="2023-11-06T17:09:42.492Z" level=info msg="Processing workflow" namespace=argo workflow=retry-stopped-dag
time="2023-11-06T17:09:42.493Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=2 workflow=retry-stopped-dag
time="2023-11-06T17:09:42.493Z" level=info msg="task-result changed" namespace=argo nodeID=retry-stopped-dag-172654616 workflow=retry-stopped-dag
time="2023-11-06T17:09:42.493Z" level=info msg="task-result changed" namespace=argo nodeID=retry-stopped-dag-222987473 workflow=retry-stopped-dag
time="2023-11-06T17:09:42.493Z" level=info msg="node unchanged" namespace=argo nodeID=retry-stopped-dag-189432235 workflow=retry-stopped-dag
time="2023-11-06T17:09:42.493Z" level=info msg="Terminating pod as part of workflow shutdown" namespace=argo podName=retry-stopped-dag-sleep-189432235 shutdownStrategy=Stop workflow=retry-stopped-dag
time="2023-11-06T17:09:42.493Z" level=info msg="node retry-stopped-dag-189432235 phase Running -> Failed" namespace=argo workflow=retry-stopped-dag
time="2023-11-06T17:09:42.493Z" level=info msg="node retry-stopped-dag-189432235 message: workflow shutdown with strategy:  Stop" namespace=argo workflow=retry-stopped-dag
time="2023-11-06T17:09:42.493Z" level=info msg="node retry-stopped-dag-189432235 finished: 2023-11-06 09:09:42.493555 +0000 UTC" namespace=argo workflow=retry-stopped-dag
time="2023-11-06T17:09:42.493Z" level=warning msg="was unable to obtain the node for retry-stopped-dag-273320330, taskName D"
time="2023-11-06T17:09:42.493Z" level=warning msg="was unable to obtain the node for retry-stopped-dag-273320330, taskName D"
time="2023-11-06T17:09:42.493Z" level=warning msg="was unable to obtain the node for retry-stopped-dag-273320330, taskName D"
time="2023-11-06T17:09:42.493Z" level=info msg="All of node retry-stopped-dag.D dependencies [C] completed" namespace=argo workflow=retry-stopped-dag
time="2023-11-06T17:09:42.493Z" level=warning msg="Node was nil, will be initialized as type Skipped" namespace=argo workflow=retry-stopped-dag
time="2023-11-06T17:09:42.493Z" level=info msg="Pod node retry-stopped-dag-273320330 initialized Pending" namespace=argo workflow=retry-stopped-dag
time="2023-11-06T17:09:42.493Z" level=info msg="node retry-stopped-dag-273320330 phase Pending -> Skipped" namespace=argo workflow=retry-stopped-dag
time="2023-11-06T17:09:42.493Z" level=info msg="node retry-stopped-dag-273320330 message: workflow shutdown with strategy: Stop" namespace=argo workflow=retry-stopped-dag
time="2023-11-06T17:09:42.494Z" level=info msg="node retry-stopped-dag-273320330 finished: 2023-11-06 09:09:42.494 +0000 UTC" namespace=argo workflow=retry-stopped-dag
time="2023-11-06T17:09:42.494Z" level=info msg="Outbound nodes of retry-stopped-dag set to [retry-stopped-dag-273320330]" namespace=argo workflow=retry-stopped-dag
time="2023-11-06T17:09:42.494Z" level=info msg="node retry-stopped-dag phase Running -> Failed" namespace=argo workflow=retry-stopped-dag
time="2023-11-06T17:09:42.494Z" level=info msg="node retry-stopped-dag finished: 2023-11-06 09:09:42.494043 +0000 UTC" namespace=argo workflow=retry-stopped-dag
time="2023-11-06T17:09:42.494Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=retry-stopped-dag
time="2023-11-06T17:09:42.494Z" level=info msg=reconcileAgentPod namespace=argo workflow=retry-stopped-dag
time="2023-11-06T17:09:42.494Z" level=info msg="Updated phase Running -> Failed" namespace=argo workflow=retry-stopped-dag
time="2023-11-06T17:09:42.494Z" level=info msg="Updated message  -> Stopped with strategy 'Stop'" namespace=argo workflow=retry-stopped-dag
time="2023-11-06T17:09:42.494Z" level=info msg="Marking workflow completed" namespace=argo workflow=retry-stopped-dag
time="2023-11-06T17:09:42.494Z" level=info msg="Marking workflow as pending archiving" namespace=argo workflow=retry-stopped-dag
time="2023-11-06T17:09:42.498Z" level=info msg="cleaning up pod" action=terminateContainers key=argo/retry-stopped-dag-sleep-189432235/terminateContainers
time="2023-11-06T17:09:42.498Z" level=info msg="https://0.0.0.0:50661/api/v1/namespaces/argo/pods/retry-stopped-dag-sleep-189432235/exec?command=%2Fvar%2Frun%2Fargo%2Fargoexec&command=kill&command=15&command=1&container=main&stderr=true&stdout=true&tty=false"
time="2023-11-06T17:09:42.499Z" level=info msg="cleaning up pod" action=deletePod key=argo/retry-stopped-dag-1340600742-agent/deletePod
time="2023-11-06T17:09:42.502Z" level=info msg="Workflow update successful" namespace=argo phase=Failed resourceVersion=45755 workflow=retry-stopped-dag
time="2023-11-06T17:09:42.520Z" level=info msg="archiving workflow" namespace=argo uid=2a8b0dd2-55ef-46b7-8b02-3491ccb05372 workflow=retry-stopped-dag
time="2023-11-06T17:09:42.525Z" level=info msg="cleaning up pod" action=labelPodCompleted key=argo/retry-stopped-dag-sleep-189432235/labelPodCompleted
time="2023-11-06T17:09:42.621Z" level=info msg="signaled container" container=main error="<nil>" namespace=argo pod=retry-stopped-dag-sleep-189432235 stderr= stdout="killing 1 with terminated\n"
time="2023-11-06T17:09:42.621Z" level=info msg="https://0.0.0.0:50661/api/v1/namespaces/argo/pods/retry-stopped-dag-sleep-189432235/exec?command=%2Fvar%2Frun%
```

### Logs from your workflow's wait container

```text
kubectl logs -n argo -c wait -l workflows.argoproj.io/workflow=${workflow},workflow.argoproj.io/phase!=Succeeded

time="2023-11-06T09:09:03.603Z" level=info msg="No output artifacts"
time="2023-11-06T09:09:03.604Z" level=info msg="S3 Save path: /tmp/argo/outputs/logs/main.log, key: retry-stopped-dag/retry-stopped-dag-echo-222987473/main.log"
time="2023-11-06T09:09:03.604Z" level=info msg="Creating minio client using static credentials" endpoint="minio:9000"
time="2023-11-06T09:09:03.604Z" level=info msg="Saving file to s3" bucket=my-bucket endpoint="minio:9000" key=retry-stopped-dag/retry-stopped-dag-echo-222987473/main.log path=/tmp/argo/outputs/logs/main.log
time="2023-11-06T09:09:03.618Z" level=info msg="Save artifact" artifactName=main-logs duration=13.804172ms error="<nil>" key=retry-stopped-dag/retry-stopped-dag-echo-222987473/main.log
time="2023-11-06T09:09:03.618Z" level=info msg="not deleting local artifact" localArtPath=/tmp/argo/outputs/logs/main.log
time="2023-11-06T09:09:03.618Z" level=info msg="Successfully saved file: /tmp/argo/outputs/logs/main.log"
time="2023-11-06T09:09:03.646Z" level=info msg="Alloc=7178 TotalAlloc=13050 Sys=27773 NumGC=4 Goroutines=10"
time="2023-11-06T09:09:03.647Z" level=info msg="stopping progress monitor (context done)" error="context canceled"
time="2023-11-06T09:09:03.647Z" level=info msg="Deadline monitor stopped"
time="2023-11-06T09:09:09.295Z" level=info msg="No Script output reference in workflow. Capturing script output ignored"
time="2023-11-06T09:09:09.295Z" level=info msg="No output parameters"
time="2023-11-06T09:09:09.295Z" level=info msg="No output artifacts"
time="2023-11-06T09:09:09.296Z" level=info msg="S3 Save path: /tmp/argo/outputs/logs/main.log, key: retry-stopped-dag/retry-stopped-dag-echo-172654616/main.log"
time="2023-11-06T09:09:09.296Z" level=info msg="Creating minio client using static credentials" endpoint="minio:9000"
time="2023-11-06T09:09:09.296Z" level=info msg="Saving file to s3" bucket=my-bucket endpoint="minio:9000" key=retry-stopped-dag/retry-stopped-dag-echo-172654616/main.log path=/tmp/argo/outputs/logs/main.log
time="2023-11-06T09:09:09.306Z" level=info msg="Save artifact" artifactName=main-logs duration=10.649704ms error="<nil>" key=retry-stopped-dag/retry-stopped-dag-echo-172654616/main.log
time="2023-11-06T09:09:09.306Z" level=info msg="not deleting local artifact" localArtPath=/tmp/argo/outputs/logs/main.log
time="2023-11-06T09:09:09.306Z" level=info msg="Successfully saved file: /tmp/argo/outputs/logs/main.log"
time="2023-11-06T09:09:09.318Z" level=info msg="Alloc=7480 TotalAlloc=13049 Sys=27773 NumGC=4 Goroutines=10"
```
@hustclf
Contributor Author

hustclf commented Nov 6, 2023

After adding some debug logging to the argo-server, I found that after FormulateRetryWorkflow (workflow/util/util.go:825) executed, tasks A, B, and D (the latter in the Skipped state) were all retained.


This caused the workflow-controller to hit an error while executing buildLocalScopeFromTask (workflow/controller/dag.go:629): task D could not find its ancestor node, so the entire workflow ended up in the Error phase.

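To illustrate the failure mode, here is a minimal, self-contained sketch (not the actual Argo code; `lookupAncestor` and the `Node` struct are simplified stand-ins for the lookup inside buildLocalScopeFromTask): after the retry pruned node C but kept the Skipped node D, resolving D's ancestor fails.

```go
package main

import (
	"errors"
	"fmt"
)

// Node is a simplified stand-in for wfv1.NodeStatus.
type Node struct{ ID, Name string }

// lookupAncestor mimics the ancestor resolution in buildLocalScopeFromTask:
// every dependency of a task must already exist in the workflow's node map.
func lookupAncestor(nodes map[string]Node, ancestorName string) (Node, error) {
	for _, n := range nodes {
		if n.Name == ancestorName {
			return n, nil
		}
	}
	return Node{}, errors.New("ancestor node not found: " + ancestorName)
}

func main() {
	// State after FormulateRetryWorkflow: C was deleted, D (Skipped) remained.
	nodes := map[string]Node{
		"retry-stopped-dag-172654616": {ID: "retry-stopped-dag-172654616", Name: "retry-stopped-dag.A"},
		"retry-stopped-dag-273320330": {ID: "retry-stopped-dag-273320330", Name: "retry-stopped-dag.D"},
	}
	_, err := lookupAncestor(nodes, "retry-stopped-dag.C")
	fmt.Println(err) // ancestor node not found: retry-stopped-dag.C
}
```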

@hustclf
Contributor Author

hustclf commented Nov 6, 2023

Should we ignore the nodes that are marked as Skipped when executing FormulateRetryWorkflow?
After making the following changes, the retry started working properly.

```diff
diff --git a/workflow/util/util.go b/workflow/util/util.go
index 3083c4f63..8d29536eb 100644
--- a/workflow/util/util.go
+++ b/workflow/util/util.go
@@ -951,7 +951,7 @@ func FormulateRetryWorkflow(ctx context.Context, wf *wfv1.Workflow, restartSucce
                                        }
                                }
                        } else {
-                               if !containsNode(resetParentGroupNodes, node.ID) {
+                               if !containsNode(resetParentGroupNodes, node.ID) && node.Phase == wfv1.NodeSucceeded {
                                        log.Debugf("Node %s remains as is", node.Name)
                                        newWF.Status.Nodes.Set(node.ID, node)
                                }
```

@agilgur5 agilgur5 added the area/retry-manual Manual workflow "Retry" Action (API/CLI/UI). See retryStrategy for template-level retries label Nov 6, 2023
@agilgur5
Member

agilgur5 commented Nov 7, 2023

Thanks for the investigation here!

> Should we ignore the nodes that are marked as Skipped when executing FormulateRetryWorkflow?

Not necessarily. If a Skipped node is a child of a succeeded node, then we wouldn't need to retry it. If it's a child of a Failed or Errored node, then we would retry it. (This also depends on whether you used --restart-successful or --node-field-selector in the CLI.)
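This retention rule can be sketched as a small predicate. This is a hypothetical helper (`keepSkippedNode` is not in the Argo codebase), assuming a simplified node map where each node lists its children: a Skipped node is kept only when no parent Failed or Errored.

```go
package main

import "fmt"

// NodePhase mirrors Argo's wfv1.NodePhase string type.
type NodePhase string

const (
	NodeSucceeded NodePhase = "Succeeded"
	NodeFailed    NodePhase = "Failed"
	NodeError     NodePhase = "Error"
	NodeSkipped   NodePhase = "Skipped"
)

// Node is a minimal stand-in for wfv1.NodeStatus.
type Node struct {
	ID       string
	Phase    NodePhase
	Children []string
}

// keepSkippedNode expresses the rule from the comment above: retain a
// Skipped node only if no parent Failed or Errored; otherwise drop it so
// the retry can re-evaluate it.
func keepSkippedNode(skippedID string, nodes map[string]Node) bool {
	for _, n := range nodes {
		for _, child := range n.Children {
			if child != skippedID {
				continue
			}
			if n.Phase == NodeFailed || n.Phase == NodeError {
				return false
			}
		}
	}
	return true
}

func main() {
	// The situation from this issue: C was stopped (Failed), D was Skipped.
	nodes := map[string]Node{
		"C": {ID: "C", Phase: NodeFailed, Children: []string{"D"}},
		"D": {ID: "D", Phase: NodeSkipped},
	}
	fmt.Println(keepSkippedNode("D", nodes)) // false: parent C failed, so D must be dropped
}
```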

@agilgur5 agilgur5 added area/controller Controller issues, panics area/templates/dag labels Nov 7, 2023
@hustclf
Contributor Author

hustclf commented Nov 8, 2023

> Thanks for the investigation here!
>
> > Should we ignore the nodes that are marked as Skipped when executing FormulateRetryWorkflow?
>
> Not necessarily. If a Skipped node is a child of a succeeded node, then we wouldn't need to retry it. If it's a child of a Failed or Errored node, then we would retry it. (also depends if you used --restart-successful or --node-field-selector in the CLI as well)

Thank you for your response.

In this example, I retried the workflow from the UI without using the --restart-successful or --node-field-selector options. As a result, task C was marked as Failed and deleted in FormulateRetryWorkflow, while task D stayed marked as Skipped and was retained.

This caused the controller to mark the workflow as being in the Error phase. It seems we need to make some changes to the workflow-controller to ensure that the retry function works as expected?

@JasonChen86899

Hi, your change isn't quite correct; it only adapts the code to this particular scenario.
As @agilgur5 said, a Skipped child of a succeeded node does not need to be retried, but your change would make such Skipped nodes retry again.
The real problem is the way FormulateRetryWorkflow traverses the nodes. If we traverse the nodes with BFS, this bug can be fixed.

@JasonChen86899

> Hi, your change is not true, you only change the code to adapt this sense. Just like anton @agilgur5 said, skipped node of success will not need to retry. You change will will make skipped node retry again. The wrong logic is the way this method FormulateRetryWorkflow traverses the nodes. we can traverse node in BFS, then this bug will be fixed

@hustclf I thought about it again, and my previous statement about using BFS may be ambiguous. To be more precise: BFS should first traverse every node and record each node's dependencies (parent nodes), and only once that pass is complete should a second traversal decide whether each node should be retained.
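The first pass of that two-pass idea can be sketched as follows. This is an illustrative sketch only (`parentsByBFS` and the `Node` struct are not the actual Argo implementation), assuming a diamond DAG where A fans out to B and C, which join at D: a breadth-first walk from the root builds a parent map up front, so a second pass can decide retention per node without depending on map iteration order.

```go
package main

import "fmt"

// Node is a minimal stand-in for wfv1.NodeStatus; Children holds child IDs.
type Node struct {
	ID       string
	Children []string
}

// parentsByBFS walks the node graph breadth-first from the root and records,
// for every node, the IDs of its parents. With this map built before any
// retention decision, each node's parents can be inspected directly instead
// of relying on traversal order.
func parentsByBFS(rootID string, nodes map[string]Node) map[string][]string {
	parents := map[string][]string{}
	queue := []string{rootID}
	seen := map[string]bool{rootID: true}
	for len(queue) > 0 {
		id := queue[0]
		queue = queue[1:]
		for _, child := range nodes[id].Children {
			parents[child] = append(parents[child], id)
			if !seen[child] {
				seen[child] = true
				queue = append(queue, child)
			}
		}
	}
	return parents
}

func main() {
	// Illustrative diamond: A -> {B, C} -> D.
	nodes := map[string]Node{
		"A": {ID: "A", Children: []string{"B", "C"}},
		"B": {ID: "B", Children: []string{"D"}},
		"C": {ID: "C", Children: []string{"D"}},
		"D": {ID: "D"},
	}
	fmt.Println(parentsByBFS("A", nodes)["D"]) // [B C]
}
```

A second pass over this map could then apply the rule discussed above: keep a Skipped node only if none of its recorded parents Failed or Errored.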

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/controller Controller issues, panics area/retry-manual Manual workflow "Retry" Action (API/CLI/UI). See retryStrategy for template-level retries area/templates/dag type/bug
Projects
None yet
3 participants