Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: retry only proper node (#11589) #11839

Merged
merged 15 commits into from
Oct 15, 2023
Merged

Conversation

toyamagu-2021
Copy link
Member

@toyamagu-2021 toyamagu-2021 commented Sep 18, 2023

Signed-off-by: toyamagu-2021 toyamagu2021@gmail.com

Fixes #11589

Motivation

  • Retry only retried nodes.

Modifications

  • When retring, exclude hooked nodes.

Verification

  • retryStrategy works correctly
    • image
  • This PR also works fine for WorkflowTemplateLevelHook
    • image
  • When running, the hook node is DAG
    • image
  • When running the hook node is Step
    • image
  • Works file for a most complicated case
    • image
manifest
spec:
  templates:
    - name: argosay
      inputs: {}
      outputs: {}
      metadata: {}
      container:
        name: ''
        image: argoproj/argosay:v2
        command:
          - /bin/sh
          - '-c'
        args:
          - /bin/sleep 2; exit 100; /argosay
        resources: {}
      retryStrategy:
        limit: 1
    - name: hook
      inputs: {}
      outputs: {}
      metadata: {}
      steps:
        - - name: A
            template: argosay
            arguments: {}
            hooks:
              failed:
                template: argosay2
                arguments: {}
                expression: steps.A.status == "Failed"
              running:
                template: argosay2
                arguments: {}
                expression: steps.A.status == "Running"
              succeed:
                template: argosay2
                arguments: {}
                expression: steps.A.status == "Succeeded"
        - - name: B
            template: argosay
            arguments: {}
    - name: argosay2
      inputs: {}
      outputs: {}
      metadata: {}
      container:
        name: ''
        image: argoproj/argosay:v2
        command:
          - /bin/sh
          - '-c'
        args:
          - exit 0
        resources: {}
  entrypoint: argosay
  arguments: {}
  hooks:
    running:
      template: hook
      arguments: {}
      expression: workflow.status == "Running"

@toyamagu-2021 toyamagu-2021 changed the title fix: exclude hooked note when retry (#11589) fix: retry only proper node (#11589) Sep 24, 2023
@toyamagu-2021 toyamagu-2021 marked this pull request as ready for review September 24, 2023 08:39
Copy link
Contributor

@juliev0 juliev0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for fixing.

So, in your example from your original issue, are you saying the hook node is actually a child of the argosay node? That seems kind of surprising, doesn't it? Are both nodes of type "Pod"?

childrenIds = append(childrenIds, n.ID)
}
}
return childrenIds
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add comments to this function? I'm trying to understand why you're not just looking for all child nodes of "Type"="Retry"

This comment was marked as outdated.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, thank you! I forgot the parent was "Retry" and the children were "Pod"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, starting to understand the logic now looking at what you posted. Thanks for doing that.

So, I can assume that the spec for this is a running hook which has expression: workflow.status == "Running", right? Does it seem a little odd that the hook node is a child of the Retry node? Do you figure that it would be the child of whatever the first running Node is?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also can't really remember what is a child of what. If there are 3 Steps A-> B-> C, do you know if B would be a child of A because it follows it, or are A, B, and C siblings (which is what I would assume)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am kind of thinking we shouldn't rely on somebody not changing the name in the future

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. So, this PR resolves both cases? And is this particular method - getChildNodeIdsRetried() - the logic that runs in both cases?

Yes. This PR resolves both cases.

It does make me wonder if perhaps the NodeStatus could potentially have some new key/value pair in it like isHook - what do you think?

Ideally, completely agree with you.
I'm going to write a code that changes NodeStatus go struct!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thank you! Once you do I think you'll need to run make codegen to auto-generate the CRD and other files from that struct

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, I've added NodeStatus.Hooked and some node initialization. Could you give me feedback :)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a thorough comment to the top of this function to explain what it's doing, what the various cases are, etc?

Copy link
Contributor

@juliev0 juliev0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it possible to post in this PR an example WorkflowStatus corresponding to a WorkflowSpec: mainly just want to see the "tree" of Nodes to understand the relationship between the nodes, where both Retry and Hook are involved. Thank you!

Copy link
Member Author

@toyamagu-2021 toyamagu-2021 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@juliev0
Thanks! Is it clear enough?

  • Node type of retried nodes
    • type: Pod
  • parent node of hooked node
    • type: Rety

Full nodeStatus is following:

nodeStatus
status:
  phase: Failed
  startedAt: '2023-09-30T13:49:20Z'
  finishedAt: '2023-09-30T13:50:05Z'
  progress: 2/6
  message: No more retries left
  nodes:
    lovely-bear:
      id: lovely-bear
      name: lovely-bear
      displayName: lovely-bear
      type: Retry
      templateName: argosay
      templateScope: local/lovely-bear
      phase: Failed
      message: No more retries left
      startedAt: '2023-09-30T13:49:20Z'
      finishedAt: '2023-09-30T13:49:56Z'
      progress: 2/6
      resourcesDuration:
        cpu: 42
        memory: 30
      outputs:
        artifacts:
          - name: main-logs
            s3:
              key: lovely-bear/lovely-bear-argosay-74551197/main.log
        exitCode: '1'
      children:
        - lovely-bear-477066958
        - lovely-bear-794339884
        - lovely-bear-4167892875
        - lovely-bear-410250672
        - lovely-bear-74551197
        - lovely-bear-2163413524
    lovely-bear-2163413524:
      id: lovely-bear-2163413524
      name: lovely-bear.hooks.failed
      displayName: lovely-bear.hooks.failed
      type: Pod
      templateName: hook
      templateScope: local/lovely-bear
      phase: Succeeded
      startedAt: '2023-09-30T13:49:56Z'
      finishedAt: '2023-09-30T13:50:01Z'
      progress: 1/1
      resourcesDuration:
        cpu: 7
        memory: 5
      outputs:
        artifacts:
          - name: main-logs
            s3:
              key: lovely-bear/lovely-bear-hook-2163413524/main.log
        exitCode: '0'
      hostNodeName: k3d-k3s-default-server-0
    lovely-bear-410250672:
      id: lovely-bear-410250672
      name: lovely-bear(2)
      displayName: lovely-bear(2)
      type: Pod
      templateName: argosay
      templateScope: local/lovely-bear
      phase: Failed
      message: Error (exit code 1)
      startedAt: '2023-09-30T13:49:38Z'
      finishedAt: '2023-09-30T13:49:43Z'
      progress: 0/1
      resourcesDuration:
        cpu: 7
        memory: 5
      outputs:
        artifacts:
          - name: main-logs
            s3:
              key: lovely-bear/lovely-bear-argosay-410250672/main.log
        exitCode: '1'
      hostNodeName: k3d-k3s-default-server-0
    lovely-bear-4167892875:
      id: lovely-bear-4167892875
      name: lovely-bear(1)
      displayName: lovely-bear(1)
      type: Pod
      templateName: argosay
      templateScope: local/lovely-bear
      phase: Failed
      message: Error (exit code 1)
      startedAt: '2023-09-30T13:49:29Z'
      finishedAt: '2023-09-30T13:49:34Z'
      progress: 0/1
      resourcesDuration:
        cpu: 7
        memory: 5
      outputs:
        artifacts:
          - name: main-logs
            s3:
              key: lovely-bear/lovely-bear-argosay-4167892875/main.log
        exitCode: '1'
      hostNodeName: k3d-k3s-default-server-0
    lovely-bear-477066958:
      id: lovely-bear-477066958
      name: lovely-bear(0)
      displayName: lovely-bear(0)
      type: Pod
      templateName: argosay
      templateScope: local/lovely-bear
      phase: Failed
      message: Error (exit code 1)
      startedAt: '2023-09-30T13:49:20Z'
      finishedAt: '2023-09-30T13:49:25Z'
      progress: 0/1
      resourcesDuration:
        cpu: 7
        memory: 5
      outputs:
        artifacts:
          - name: main-logs
            s3:
              key: lovely-bear/lovely-bear-argosay-477066958/main.log
        exitCode: '1'
      hostNodeName: k3d-k3s-default-server-0
    lovely-bear-74551197:
      id: lovely-bear-74551197
      name: lovely-bear(3)
      displayName: lovely-bear(3)
      type: Pod
      templateName: argosay
      templateScope: local/lovely-bear
      phase: Failed
      message: Error (exit code 1)
      startedAt: '2023-09-30T13:49:47Z'
      finishedAt: '2023-09-30T13:49:52Z'
      progress: 0/1
      resourcesDuration:
        cpu: 7
        memory: 5
      outputs:
        artifacts:
          - name: main-logs
            s3:
              key: lovely-bear/lovely-bear-argosay-74551197/main.log
        exitCode: '1'
      hostNodeName: k3d-k3s-default-server-0
    lovely-bear-794339884:
      id: lovely-bear-794339884
      name: lovely-bear.hooks.running
      displayName: lovely-bear.hooks.running
      type: Pod
      templateName: hook
      templateScope: local/lovely-bear
      phase: Succeeded
      startedAt: '2023-09-30T13:49:20Z'
      finishedAt: '2023-09-30T13:49:25Z'
      progress: 1/1
      resourcesDuration:
        cpu: 7
        memory: 5
      outputs:
        artifacts:
          - name: main-logs
            s3:
              key: lovely-bear/lovely-bear-hook-794339884/main.log
        exitCode: '0'
      hostNodeName: k3d-k3s-default-server-0
  conditions:
    - type: PodRunning
      status: 'False'
    - type: Completed
      status: 'True'
  resourcesDuration:
    cpu: 42
    memory: 30
  artifactRepositoryRef:
    configMap: artifact-repositories
    key: default-v1
    namespace: argo
    artifactRepository:
      archiveLogs: true
      s3:
        endpoint: minio:9000
        bucket: my-bucket
        insecure: true
        accessKeySecret:
          name: my-minio-cred
          key: accesskey
        secretKeySecret:
          name: my-minio-cred
          key: secretkey
  artifactGCStatus:
    notSpecified: true

workflow/controller/operator.go Outdated Show resolved Hide resolved
childrenIds = append(childrenIds, n.ID)
}
}
return childrenIds

This comment was marked as outdated.

@toyamagu-2021 toyamagu-2021 marked this pull request as ready for review October 7, 2023 13:59
@juliev0
Copy link
Contributor

juliev0 commented Oct 7, 2023

Thanks for updating all of the logic to add in the new hooked variable.
Thanks for showing the pictures to show what you verified. It seems like you investigated a lot of possible hook specs.

I'm wondering if I can request that you add even more to the description above as far as what you did? I'm short on time, so the more thorough you are there, the more it will help me to understand. And then we can get this merged.

@toyamagu-2021 toyamagu-2021 marked this pull request as draft October 9, 2023 13:51
Signed-off-by: toyamagu2021@gmail.com <toyamagu2021@gmail.com>
Signed-off-by: toyamagu-2021 <toyamagu2021@gmail.com>
Signed-off-by: toyamagu-2021 <toyamagu2021@gmail.com>
Signed-off-by: toyamagu-2021 <toyamagu2021@gmail.com>
Signed-off-by: toyamagu-2021 <toyamagu2021@gmail.com>
Signed-off-by: toyamagu-2021 <toyamagu2021@gmail.com>
Signed-off-by: toyamagu-2021 <toyamagu2021@gmail.com>
Signed-off-by: toyamagu-2021 <toyamagu2021@gmail.com>
Signed-off-by: toyamagu-2021 <toyamagu2021@gmail.com>
Signed-off-by: toyamagu-2021 <toyamagu2021@gmail.com>
Signed-off-by: toyamagu-2021 <toyamagu2021@gmail.com>
Signed-off-by: toyamagu-2021 <toyamagu2021@gmail.com>
Signed-off-by: toyamagu-2021 <toyamagu2021@gmail.com>
Signed-off-by: toyamagu-2021 <toyamagu2021@gmail.com>
@toyamagu-2021 toyamagu-2021 marked this pull request as ready for review October 14, 2023 07:41
Signed-off-by: toyamagu-2021 <toyamagu2021@gmail.com>
@juliev0 juliev0 merged commit a04d055 into argoproj:master Oct 15, 2023
25 checks passed
@agilgur5 agilgur5 added the area/retry-manual Manual workflow "Retry" Action (API/CLI/UI). See retryStrategy for template-level retries label Oct 15, 2023
terrytangyuan pushed a commit that referenced this pull request Nov 3, 2023
Signed-off-by: toyamagu2021@gmail.com <toyamagu2021@gmail.com>
Signed-off-by: toyamagu-2021 <toyamagu2021@gmail.com>
@agilgur5 agilgur5 added the area/spec Changes to the workflow specification. label Nov 10, 2023
@toyamagu-2021 toyamagu-2021 deleted the fix-11589 branch November 11, 2023 12:39
@agilgur5 agilgur5 added area/retryStrategy Template-level retryStrategy and removed area/retry-manual Manual workflow "Retry" Action (API/CLI/UI). See retryStrategy for template-level retries labels Apr 25, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/hooks area/retryStrategy Template-level retryStrategy area/spec Changes to the workflow specification.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

retryStrategy does not work correctly when we use hook
4 participants