Unable to use retryStrategy and hooks in unison on intermediate steps/tasks #12120

Closed
2 of 3 tasks
MarcusMoe opened this issue Nov 1, 2023 · 10 comments · Fixed by #12192
Assignees
Labels
area/hooks area/retryStrategy Template-level retryStrategy P3 Low priority solution/workaround There's a workaround, might not be great, but exists type/bug

Comments

@MarcusMoe

Pre-requisites

  • I have double-checked my configuration
  • I can confirm the issue exists when I tested with :latest
  • I'd like to contribute the fix myself (see contributing guide)

What happened/what you expected to happen?

I encountered a bug where the workflow is unable to complete using hooks and retryStrategy. When some-task fails or succeeds, I am using an exit-handler to send a status update to Github. The exit-handler has a retryStrategy to ensure that the status update is sent. While the status update is happening, the finish task depending on some-task continues correctly. Whether some-task fails or succeeds, a hook with the exit-handler is launched and the workflow gets stuck. The hook's retry template gets stuck in "Running" state, even though the pod has completed its task. This results in the whole workflow getting stuck in "Running" state as well.

Removing the finish task from the workflow fixes the issue, so it seems to occur only when hooks are launched from intermediate tasks/steps. Removing retryStrategy also removes the issue, as the pod is then the only thing launched and it completes successfully. So far it seems to affect both DAGs and Steps.

I would like the hook with the exit-template to complete, allowing the workflow to continue executing and eventually exit with whatever state it has reached (Failed, Succeeded or Error).

Failed some-task with a failure hook:
Screenshot 2023-11-01 at 13 25 31

Successful some-task with a success hook:
Screenshot 2023-11-01 at 13 37 06

Version

v3.4.11

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: github-issue
spec:
  entrypoint: main
  templates:
    - name: main
      dag:
        tasks:
          - name: some-task
            template: some-task
            hooks:
              failure:
                template: exit-handler
                expression: tasks["some-task"].status == "Failed"
          - name: finish
            template: finish
            dependencies:
             - some-task

    # Some task template
    - name: some-task
      container:
        image: alpine:latest
        command: [sh, -c]
        args:
          - |
            echo "Doing great things";
            echo "Failing...";
            exit 1;

    # Exit handler template
    - name: exit-handler
      retryStrategy:
        limit: 5
        retryPolicy: "Always"
        backoff:
          duration: "1m"
          factor: 2
          maxDuration: "5m"
      container:
        image: alpine:latest
        command: [sh, -c]
        args:
          - |
            echo "Sending updates to github";
            sleep 5;
            echo "Done!";

    # Finish template
    - name: finish
      container:
        image: alpine:latest
        command: [sh, -c]
        args:
          - |
            echo "Finished!";
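
To make the backoff settings of the exit-handler template above concrete, here is a minimal Go sketch of the resulting retry schedule. It is a simplified model, not the controller's implementation: it assumes the wait before retry n is duration * factor^(n-1), that maxDuration caps the cumulative wait, and that pod run time can be ignored. The function names backoffDelays and exitHandlerSchedule are made up for this illustration.

```go
package main

import (
	"fmt"
	"time"
)

// backoffDelays returns the wait before each retry under a simplified model:
// the first wait is `duration`, each later wait is multiplied by `factor`,
// and retries stop at `limit` or once the cumulative wait would exceed
// `maxDuration`. Pod run time is ignored (assumption).
func backoffDelays(duration time.Duration, factor, limit int, maxDuration time.Duration) []time.Duration {
	var delays []time.Duration
	var total time.Duration
	delay := duration
	for i := 0; i < limit; i++ {
		if total+delay > maxDuration {
			break
		}
		delays = append(delays, delay)
		total += delay
		delay *= time.Duration(factor)
	}
	return delays
}

// exitHandlerSchedule applies the values from the exit-handler template:
// limit 5, duration 1m, factor 2, maxDuration 5m.
func exitHandlerSchedule() []time.Duration {
	return backoffDelays(time.Minute, 2, 5, 5*time.Minute)
}

func main() {
	for i, d := range exitHandlerSchedule() {
		fmt.Printf("retry %d after %s\n", i+1, d)
	}
}
```

Under these assumptions only two retries fit inside the 5m window, even though limit is 5. In the reproduction the exit-handler pod succeeds on its first attempt, so no retry should be needed at all.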

Logs from the workflow controller

time="2023-11-01T11:15:45.648Z" level=info msg="Processing workflow" namespace=argo-workflows workflow=github-issue
time="2023-11-01T11:15:45.720Z" level=info msg="Updated phase  -> Running" namespace=argo-workflows workflow=github-issue
time="2023-11-01T11:15:45.720Z" level=warning msg="Node was nil, will be initialized as type Skipped" namespace=argo-workflows workflow=github-issue
time="2023-11-01T11:15:45.720Z" level=info msg="was unable to obtain node for , letting display name to be nodeName" namespace=argo-workflows workflow=github-issue
time="2023-11-01T11:15:45.720Z" level=info msg="DAG node github-issue initialized Running" namespace=argo-workflows workflow=github-issue
time="2023-11-01T11:15:45.721Z" level=warning msg="was unable to obtain the node for github-issue-2120228413, taskName finish"
time="2023-11-01T11:15:45.721Z" level=warning msg="was unable to obtain the node for github-issue-2857574236, taskName some-task"
time="2023-11-01T11:15:45.721Z" level=warning msg="was unable to obtain the node for github-issue-2857574236, taskName some-task"
time="2023-11-01T11:15:45.721Z" level=info msg="All of node github-issue.some-task dependencies [] completed" namespace=argo-workflows workflow=github-issue
time="2023-11-01T11:15:45.721Z" level=warning msg="Node was nil, will be initialized as type Skipped" namespace=argo-workflows workflow=github-issue
time="2023-11-01T11:15:45.721Z" level=info msg="Pod node github-issue-2857574236 initialized Pending" namespace=argo-workflows workflow=github-issue
time="2023-11-01T11:15:45.815Z" level=info msg="Created pod: github-issue.some-task (github-issue-some-task-2857574236)" namespace=argo-workflows workflow=github-issue
time="2023-11-01T11:15:45.815Z" level=warning msg="was unable to obtain the node for github-issue-2120228413, taskName finish"
time="2023-11-01T11:15:45.815Z" level=warning msg="was unable to obtain the node for github-issue-2120228413, taskName finish"
time="2023-11-01T11:15:45.815Z" level=info msg="TaskSet Reconciliation" namespace=argo-workflows workflow=github-issue
time="2023-11-01T11:15:45.815Z" level=info msg=reconcileAgentPod namespace=argo-workflows workflow=github-issue
time="2023-11-01T11:15:45.837Z" level=info msg="Workflow update successful" namespace=argo-workflows phase=Running resourceVersion=463111604 workflow=github-issue
time="2023-11-01T11:15:55.649Z" level=info msg="Processing workflow" namespace=argo-workflows workflow=github-issue
time="2023-11-01T11:15:55.649Z" level=info msg="Task-result reconciliation" namespace=argo-workflows numObjs=1 workflow=github-issue
time="2023-11-01T11:15:55.649Z" level=info msg="task-result changed" namespace=argo-workflows nodeID=github-issue-2857574236 workflow=github-issue
time="2023-11-01T11:15:55.650Z" level=info msg="Pod failed: Error (exit code 1)" displayName=some-task namespace=argo-workflows pod=github-issue-some-task-2857574236 templateName=some-task workflow=github-issue
time="2023-11-01T11:15:55.650Z" level=info msg="node changed" namespace=argo-workflows new.message="Error (exit code 1)" new.phase=Failed new.progress=0/1 nodeID=github-issue-2857574236 old.message= old.phase=Pending old.progress=0/1 workflow=github-issue
time="2023-11-01T11:15:55.650Z" level=warning msg="was unable to obtain the node for github-issue-2120228413, taskName finish"
time="2023-11-01T11:15:55.650Z" level=info msg="Running hooks" hookName=failure lifeCycleHook=failure namespace=argo-workflows node=github-issue.some-task.hooks.failure workflow=github-issue
time="2023-11-01T11:15:55.650Z" level=warning msg="Node was nil, will be initialized as type Skipped" namespace=argo-workflows workflow=github-issue
time="2023-11-01T11:15:55.650Z" level=info msg="Retry node github-issue-1122108762 initialized Running" namespace=argo-workflows workflow=github-issue
time="2023-11-01T11:15:55.650Z" level=info msg="Pod node github-issue-3942332033 initialized Pending" namespace=argo-workflows workflow=github-issue
time="2023-11-01T11:15:55.683Z" level=info msg="Created pod: github-issue.some-task.hooks.failure(0) (github-issue-exit-handler-3942332033)" namespace=argo-workflows workflow=github-issue
time="2023-11-01T11:15:55.684Z" level=warning msg="was unable to obtain the node for github-issue-2120228413, taskName finish"
time="2023-11-01T11:15:55.684Z" level=info msg="Skipped node github-issue-2120228413 initialized Omitted (message: omitted: depends condition not met)" namespace=argo-workflows workflow=github-issue
time="2023-11-01T11:15:55.684Z" level=info msg="TaskSet Reconciliation" namespace=argo-workflows workflow=github-issue
time="2023-11-01T11:15:55.684Z" level=info msg=reconcileAgentPod namespace=argo-workflows workflow=github-issue
time="2023-11-01T11:15:55.698Z" level=info msg="Workflow update successful" namespace=argo-workflows phase=Running resourceVersion=463111739 workflow=github-issue
time="2023-11-01T11:16:05.688Z" level=info msg="Processing workflow" namespace=argo-workflows workflow=github-issue
time="2023-11-01T11:16:05.688Z" level=info msg="Task-result reconciliation" namespace=argo-workflows numObjs=2 workflow=github-issue
time="2023-11-01T11:16:05.688Z" level=info msg="task-result changed" namespace=argo-workflows nodeID=github-issue-2857574236 workflow=github-issue
time="2023-11-01T11:16:05.688Z" level=info msg="task-result changed" namespace=argo-workflows nodeID=github-issue-3942332033 workflow=github-issue
time="2023-11-01T11:16:05.689Z" level=info msg="node changed" namespace=argo-workflows new.message= new.phase=Running new.progress=0/1 nodeID=github-issue-3942332033 old.message= old.phase=Pending old.progress=0/1 workflow=github-issue
time="2023-11-01T11:16:05.689Z" level=info msg="Pod failed: Error (exit code 1)" displayName=some-task namespace=argo-workflows pod=github-issue-some-task-2857574236 templateName=some-task workflow=github-issue
time="2023-11-01T11:16:05.689Z" level=info msg="node unchanged" namespace=argo-workflows nodeID=github-issue-2857574236 workflow=github-issue
time="2023-11-01T11:16:05.689Z" level=info msg="TaskSet Reconciliation" namespace=argo-workflows workflow=github-issue
time="2023-11-01T11:16:05.689Z" level=info msg=reconcileAgentPod namespace=argo-workflows workflow=github-issue
time="2023-11-01T11:16:05.694Z" level=info msg="cleaning up pod" action=terminateContainers key=argo-workflows/github-issue-exit-handler-3942332033/terminateContainers
time="2023-11-01T11:16:05.703Z" level=info msg="Workflow update successful" namespace=argo-workflows phase=Running resourceVersion=463111868 workflow=github-issue
time="2023-11-01T11:16:17.120Z" level=info msg="Processing workflow" namespace=argo-workflows workflow=github-issue
time="2023-11-01T11:16:17.120Z" level=info msg="Task-result reconciliation" namespace=argo-workflows numObjs=2 workflow=github-issue
time="2023-11-01T11:16:17.120Z" level=info msg="task-result changed" namespace=argo-workflows nodeID=github-issue-2857574236 workflow=github-issue
time="2023-11-01T11:16:17.120Z" level=info msg="task-result changed" namespace=argo-workflows nodeID=github-issue-3942332033 workflow=github-issue
time="2023-11-01T11:16:17.120Z" level=info msg="node changed" namespace=argo-workflows new.message= new.phase=Succeeded new.progress=0/1 nodeID=github-issue-3942332033 old.message= old.phase=Running old.progress=0/1 workflow=github-issue
time="2023-11-01T11:16:17.121Z" level=info msg="Pod failed: Error (exit code 1)" displayName=some-task namespace=argo-workflows pod=github-issue-some-task-2857574236 templateName=some-task workflow=github-issue
time="2023-11-01T11:16:17.121Z" level=info msg="node unchanged" namespace=argo-workflows nodeID=github-issue-2857574236 workflow=github-issue
time="2023-11-01T11:16:17.121Z" level=info msg="TaskSet Reconciliation" namespace=argo-workflows workflow=github-issue
time="2023-11-01T11:16:17.121Z" level=info msg=reconcileAgentPod namespace=argo-workflows workflow=github-issue
time="2023-11-01T11:16:17.139Z" level=info msg="Workflow update successful" namespace=argo-workflows phase=Running resourceVersion=463111991 workflow=github-issue
time="2023-11-01T11:16:27.140Z" level=info msg="Processing workflow" namespace=argo-workflows workflow=github-issue
time="2023-11-01T11:16:27.141Z" level=info msg="Task-result reconciliation" namespace=argo-workflows numObjs=2 workflow=github-issue
time="2023-11-01T11:16:27.141Z" level=info msg="task-result changed" namespace=argo-workflows nodeID=github-issue-2857574236 workflow=github-issue
time="2023-11-01T11:16:27.141Z" level=info msg="task-result changed" namespace=argo-workflows nodeID=github-issue-3942332033 workflow=github-issue
time="2023-11-01T11:16:27.141Z" level=info msg="node unchanged" namespace=argo-workflows nodeID=github-issue-3942332033 workflow=github-issue
time="2023-11-01T11:16:27.141Z" level=info msg="Pod failed: Error (exit code 1)" displayName=some-task namespace=argo-workflows pod=github-issue-some-task-2857574236 templateName=some-task workflow=github-issue
time="2023-11-01T11:16:27.141Z" level=info msg="node unchanged" namespace=argo-workflows nodeID=github-issue-2857574236 workflow=github-issue
time="2023-11-01T11:16:27.141Z" level=info msg="TaskSet Reconciliation" namespace=argo-workflows workflow=github-issue
time="2023-11-01T11:16:27.141Z" level=info msg=reconcileAgentPod namespace=argo-workflows workflow=github-issue
time="2023-11-01T11:16:27.153Z" level=info msg="Workflow update successful" namespace=argo-workflows phase=Running resourceVersion=463111991 workflow=github-issue
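
The controller log above can be condensed into per-node phase histories. The following Go sketch does that for four lines copied from the log; the regular expression and the helper name phaseHistory are assumptions written for this illustration, not part of Argo Workflows.

```go
package main

import (
	"fmt"
	"regexp"
)

// nodeChanged matches the controller's "node changed" entries and captures
// the new phase, the node ID and the old phase, in that order.
var nodeChanged = regexp.MustCompile(`msg="node changed".* new\.phase=(\w+) .*nodeID=(\S+) .*old\.phase=(\w+)`)

// sampleLog holds four lines copied verbatim from the controller log.
var sampleLog = []string{
	`time="2023-11-01T11:15:55.650Z" level=info msg="node changed" namespace=argo-workflows new.message="Error (exit code 1)" new.phase=Failed new.progress=0/1 nodeID=github-issue-2857574236 old.message= old.phase=Pending old.progress=0/1 workflow=github-issue`,
	`time="2023-11-01T11:15:55.650Z" level=info msg="Retry node github-issue-1122108762 initialized Running" namespace=argo-workflows workflow=github-issue`,
	`time="2023-11-01T11:16:05.689Z" level=info msg="node changed" namespace=argo-workflows new.message= new.phase=Running new.progress=0/1 nodeID=github-issue-3942332033 old.message= old.phase=Pending old.progress=0/1 workflow=github-issue`,
	`time="2023-11-01T11:16:17.120Z" level=info msg="node changed" namespace=argo-workflows new.message= new.phase=Succeeded new.progress=0/1 nodeID=github-issue-3942332033 old.message= old.phase=Running old.progress=0/1 workflow=github-issue`,
}

// phaseHistory returns, per node ID, the sequence of phases reported by
// "node changed" entries. Lines of any other kind are skipped.
func phaseHistory(lines []string) map[string][]string {
	history := map[string][]string{}
	for _, line := range lines {
		m := nodeChanged.FindStringSubmatch(line)
		if m == nil {
			continue
		}
		newPhase, id, oldPhase := m[1], m[2], m[3]
		if len(history[id]) == 0 {
			history[id] = append(history[id], oldPhase)
		}
		history[id] = append(history[id], newPhase)
	}
	return history
}

func main() {
	h := phaseHistory(sampleLog)
	for _, id := range []string{"github-issue-2857574236", "github-issue-3942332033", "github-issue-1122108762"} {
		fmt.Println(id, h[id])
	}
}
```

The hook's pod node (github-issue-3942332033) moves from Pending through Running to Succeeded, while the Retry node (github-issue-1122108762) is initialized as Running and has no later "node changed" entry in this excerpt, which matches the stuck "Running" state described in the report.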

Logs from your workflow's wait container

time="2023-11-01T11:16:04.190Z" level=info msg="No Script output reference in workflow. Capturing script output ignored"
time="2023-11-01T11:16:04.190Z" level=info msg="No output parameters"
time="2023-11-01T11:16:04.190Z" level=info msg="No output artifacts"
time="2023-11-01T11:16:04.190Z" level=info msg="S3 Save path: /tmp/argo/outputs/logs/main.log, key: github-issue/github-issue-exit-handler-3942332033/main.log"
time="2023-11-01T11:16:04.195Z" level=info msg="Creating minio client using AWS SDK credentials"
time="2023-11-01T11:16:04.273Z" level=info msg="Saving file to s3" bucket=test-bucket endpoint=s3.amazonaws.com key=github-issue/github-issue-exit-handler-3942332033/main.log path=/tmp/argo/outputs/logs/main.log
time="2023-11-01T11:16:04.354Z" level=info msg="Save artifact" artifactName=main-logs duration=163.750098ms error="<nil>" key=github-issue/github-issue-exit-handler-3942332033/main.log
time="2023-11-01T11:16:04.354Z" level=info msg="not deleting local artifact" localArtPath=/tmp/argo/outputs/logs/main.log
time="2023-11-01T11:16:04.354Z" level=info msg="Successfully saved file: /tmp/argo/outputs/logs/main.log"
time="2023-11-01T11:16:04.368Z" level=info msg="Alloc=11023 TotalAlloc=16877 Sys=23653 NumGC=4 Goroutines=12"
time="2023-11-01T11:15:49.147Z" level=info msg="No output parameters"
time="2023-11-01T11:15:49.147Z" level=info msg="No output artifacts"
time="2023-11-01T11:15:49.148Z" level=info msg="S3 Save path: /tmp/argo/outputs/logs/main.log, key: github-issue/github-issue-some-task-2857574236/main.log"
time="2023-11-01T11:15:49.154Z" level=info msg="Creating minio client using AWS SDK credentials"
time="2023-11-01T11:15:49.205Z" level=info msg="Saving file to s3" bucket=test-bucket endpoint=s3.amazonaws.com key=github-issue/github-issue-some-task-2857574236/main.log path=/tmp/argo/outputs/logs/main.log
time="2023-11-01T11:15:49.257Z" level=info msg="Save artifact" artifactName=main-logs duration=109.425749ms error="<nil>" key=github-issue/github-issue-some-task-2857574236/main.log
time="2023-11-01T11:15:49.257Z" level=info msg="not deleting local artifact" localArtPath=/tmp/argo/outputs/logs/main.log
time="2023-11-01T11:15:49.257Z" level=info msg="Successfully saved file: /tmp/argo/outputs/logs/main.log"
time="2023-11-01T11:15:49.276Z" level=info msg="Alloc=10592 TotalAlloc=16893 Sys=23653 NumGC=4 Goroutines=12"
time="2023-11-01T11:15:49.276Z" level=info msg="stopping progress monitor (context done)" error="context canceled"
@agilgur5 agilgur5 added area/hooks area/retry-manual Manual workflow "Retry" Action (API/CLI/UI). See retryStrategy for template-level retries P3 Low priority labels Nov 2, 2023
@agilgur5
Member

agilgur5 commented Nov 2, 2023

Similar to #12109, this may very well be duplicative of #11589, which was only fixed recently in #11839 and so should be available in the next patch release.

@MarcusMoe
Author

> Similar to #12109, this may very well be duplicative of #11589, which was only fixed recently in #11839 and so should be available in the next patch release.

This issue is still a problem in v3.5.1.

@agilgur5
Member

agilgur5 commented Nov 7, 2023

> This issue is still a problem in v3.5.1.

@toyamagu-2021 could you take a look at this and see if you can diagnose why retryStrategy + hooks still seems to fail in certain cases?

@toyamagu-2021
Member

toyamagu-2021 commented Nov 11, 2023

I think this bug comes from the following function (it only considers the onExit hook because it was written 2 years ago :) )

func getRetryNodeChildrenIds(node *wfv1.NodeStatus, nodes wfv1.Nodes) []string {
	// A fulfilled Retry node will always reflect the status of its last child node, so its individual attempts don't interest us.
	// To resume the traversal, we look at the children of the last child node and of any on exit nodes.
	var childrenIds []string
	for i := -1; i >= -len(node.Children); i-- {
		node := getChildNodeIndex(node, nodes, i)
		if node == nil {
			continue
		}
		if strings.HasSuffix(node.Name, ".onExit") {
			childrenIds = append(childrenIds, node.ID)
		} else if len(node.Children) > 0 {
			childrenIds = append(childrenIds, node.Children...)
		}
	}
	return childrenIds
}
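
To illustrate the point about the function above, this small Go sketch applies the same ".onExit" suffix check to node names. The hook names are taken from the controller log in this issue; the onExit-style name and the broader isHookNode check are hypothetical and only for illustration. This is not the change made in the eventual fix.

```go
package main

import (
	"fmt"
	"strings"
)

// isOnExitNode mirrors the suffix check used in getRetryNodeChildrenIds.
func isOnExitNode(name string) bool {
	return strings.HasSuffix(name, ".onExit")
}

// isHookNode is a hypothetical broader check (assumption, not Argo code)
// that would also recognize names containing a ".hooks." segment.
func isHookNode(name string) bool {
	return strings.Contains(name, ".hooks.")
}

func main() {
	names := []string{
		"github-issue.some-task.onExit",           // assumed onExit-style name
		"github-issue.some-task.hooks.failure",    // hook node name from the controller log
		"github-issue.some-task.hooks.failure(0)", // first attempt, from the controller log
	}
	for _, n := range names {
		fmt.Printf("%-42s onExit=%t hook=%t\n", n, isOnExitNode(n), isHookNode(n))
	}
}
```

Names such as github-issue.some-task.hooks.failure do not end in ".onExit", so a check of this form treats them like ordinary child nodes.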

I can fix this, but @MarcusMoe you might want to use onExit hook for a workaround.
https://argoproj.github.io/argo-workflows/walk-through/exit-handlers/

image
image

@toyamagu-2021
Member

toyamagu-2021 commented Nov 11, 2023

Or we can use steps as a workaround, if your use case allows steps instead of a DAG.
(This issue is a DAG-specific problem.)

image

@toyamagu-2021 toyamagu-2021 added the solution/workaround There's a workaround, might not be great, but exists label Nov 11, 2023
@toyamagu-2021
Member

toyamagu-2021 commented Nov 12, 2023

I need to investigate more carefully, but the root cause is that a DAG task does not wait for the TemplateLevelLifeCycleHook.
(A DAG task does wait for an exit handler to complete, so that case works fine.)

In this issue's case, we can use:

            hooks:
              exit: # NOTE: YOU CAN NOT CHANGE THIS STRING
                template: exit-handler
                arguments: {}
                expression: tasks["some-task"].status == "Failed"

P.S.: I noticed the following logic. I will add TemplateLevelLifeCycleHook to this.

if depNode.Type == wfv1.NodeTypeTaskGroup {
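
The root cause described in this comment can be pictured with a toy Go model. Everything here (the node type, fulfilled, taskComplete) is hypothetical and heavily simplified; it is not the controller's code and not the change that was later submitted. It only contrasts a completion check that ignores hook nodes with one that waits for them.

```go
package main

import "fmt"

// node is a hypothetical, simplified stand-in for a workflow node status.
type node struct {
	name  string
	phase string  // e.g. "Pending", "Running", "Succeeded", "Failed"
	hooks []*node // nodes started by lifecycle hooks attached to this task
}

// fulfilled reports whether a phase is terminal in this simplified model.
func fulfilled(phase string) bool {
	switch phase {
	case "Succeeded", "Failed", "Error", "Skipped", "Omitted":
		return true
	}
	return false
}

// taskComplete reports whether a task can be treated as finished.
// With waitForHooks false it looks only at the task's own phase;
// with waitForHooks true every hook node must be fulfilled as well.
func taskComplete(n *node, waitForHooks bool) bool {
	if !fulfilled(n.phase) {
		return false
	}
	if !waitForHooks {
		return true
	}
	for _, h := range n.hooks {
		if !taskComplete(h, true) {
			return false
		}
	}
	return true
}

func main() {
	hook := &node{name: "some-task.hooks.failure", phase: "Running"}
	task := &node{name: "some-task", phase: "Failed", hooks: []*node{hook}}

	fmt.Println("ignoring hooks:", taskComplete(task, false))
	fmt.Println("waiting for hooks:", taskComplete(task, true))

	hook.phase = "Succeeded"
	fmt.Println("after hook finishes:", taskComplete(task, true))
}
```

In this model a dependent task such as finish would be evaluated while the hook is still running unless the check also waits for hook nodes.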

@toyamagu-2021 toyamagu-2021 self-assigned this Nov 12, 2023
@MarcusMoe
Author

@toyamagu-2021 Thank you for all your suggestions! Unfortunately, in my case this would be part of a larger workflow where some-task is an intermediate task that needs to initiate a hook, with the correct parameters, based on whether it succeeded or failed. As onExit does not allow parameters when calling it, I have not found a good way to use it without adding logic. Your suggestion of an exit hook with the expression tasks["some-task"].status == "Failed" would solve one side, but I would need another one with a success expression, and as far as I know, I can't have two hooks with exit as their name. Using Steps instead of a DAG could be an option, although it would require some work and might mess up the current dependency logic.

@toyamagu-2021
Member

toyamagu-2021 commented Nov 13, 2023

@MarcusMoe
Thanks for clarifying. I'll do my best to resolve the issue, but the following exit handler might be helpful for you.

  - name: exit-handler
    steps:
      - - name: succeed
          template: celebrate
          when: 'tasks["some-task"].status == Succeeded'
        - name: failed
          template: cry
          when: 'tasks["some-task"].status == Failed'

This might work for both Succeeded and Failed (Sorry if I missed anything).

@toyamagu-2021
Member

toyamagu-2021 commented Nov 13, 2023

@MarcusMoe
Hi, thanks for reporting the issue.
I submitted PR #12192. Could you check whether the images attached to the PR show what you wanted?
(Other test cases are also welcome; I'll check them.)

@MarcusMoe
Author

@toyamagu-2021
This looks good! Thank you for fixing it so fast.

@agilgur5 agilgur5 added area/retryStrategy Template-level retryStrategy and removed area/retry-manual Manual workflow "Retry" Action (API/CLI/UI). See retryStrategy for template-level retries labels Apr 25, 2024