
fix: make sure taskresult completed when mark node succeed when it has outputs #12537

Merged: 28 commits into argoproj:main, Jan 28, 2024

Conversation

@shuangkun (Member) commented Jan 17, 2024

When my cluster is running a large number of workflows, I see errors like this:

="Mark error node" error="failed to evaluate expression: cannot fetch steps-init-artifact from <nil> (1:6)\n | steps['init-artifact'].outputs.parameters['workflow_artifact_key']\n | .....^" namespace=argo nodeName="workflow-bhr9k[3].energy(0:0)[1].energy-steps(0:0)[3].comp-binding-energy-steps(0:0)[15]" workflow=workflow-bhr9k

When the number of workflows is not large, there is no such error.

My workflow has many templates like this, where each step refers to the outputs of the previous step. For example, hello2a refers to hello1 via the parameter steps['hello1'].outputs.parameters['workflow_artifact_key']:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: steps-
spec:
  entrypoint: hello-hello-hello
  arguments:
    parameters:
    - name: message1
      value: hello world
    - name: message2
      value: foobar
  # This spec contains two templates: hello-hello-hello and whalesay
  templates:
  - name: hello-hello-hello
    # Instead of just running a container
    # This template has a sequence of steps
    steps:
    - - name: hello1            # hello1 is run before the following steps
        continueOn: {}
        template: whalesay
        arguments:
          parameters:
          - name: message
            value: "hello1"
          - name: workflow_artifact_key
            value: "{{ workflow.parameters.message2}}"
    - - name: hello2a           # double dash => run after previous step
        template: whalesay
        arguments:
          parameters:
          - name: message
            value: "{{=steps['hello1'].outputs.parameters['workflow_artifact_key']}}"

  # This is the same template as from the previous example
  - name: whalesay
    metadata:
      annotations:
        k8s.aliyun.com/eci-spot-strategy: "SpotAsPriceGo"
    inputs:
      parameters:
      - name: message
    outputs:
      parameters:
      - name: workflow_artifact_key
        value: '{{workflow.name}}'
    script:
      image: python:alpine3.6
      command: [python]
      source: |
        import random
        i = random.randint(1, 100)
        print(i)

When I searched the logs, I found that the previous step's (hello1's) "node changed" to Succeeded came earlier than its "task-result changed". This is what causes hello2a's expression evaluation error. So I want to make sure the taskresult is completed before marking a node succeeded when it has outputs.
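To illustrate the idea, here is a minimal, self-contained sketch of the gating logic in Go (toy types and names only; the real controller works on wfv1.NodeStatus, WorkflowTaskResults, and woc.operate(), which are not reproduced here): a node whose template declares outputs is kept Running until its task result is marked complete, and is re-assessed on the next pass.

package main

import "fmt"

// Toy stand-ins for the controller's node phases.
type nodePhase string

const (
	nodeRunning   nodePhase = "Running"
	nodeSucceeded nodePhase = "Succeeded"
)

// workflowStatus mimics the completion map the controller keeps:
// node ID -> whether the node's WorkflowTaskResult has been reconciled.
type workflowStatus struct {
	taskResultsCompletionStatus map[string]bool
}

func (s *workflowStatus) isTaskResultIncomplete(nodeID string) bool {
	completed, found := s.taskResultsCompletionStatus[nodeID]
	return found && !completed
}

// assessNodePhase applies the gate this PR adds: a pod that succeeded is
// not marked Succeeded while its template has outputs whose task result
// is still incomplete; the node stays Running and will be re-assessed.
func assessNodePhase(hasOutputs bool, status *workflowStatus, nodeID string, podPhase nodePhase) nodePhase {
	if podPhase == nodeSucceeded && hasOutputs && status.isTaskResultIncomplete(nodeID) {
		fmt.Printf("nodeID=%s: task result not yet completed, keeping node running\n", nodeID)
		return nodeRunning
	}
	return podPhase
}

func main() {
	status := &workflowStatus{taskResultsCompletionStatus: map[string]bool{"hello1": false}}
	fmt.Println(assessNodePhase(true, status, "hello1", nodeSucceeded)) // Running
	status.taskResultsCompletionStatus["hello1"] = true
	fmt.Println(assessNodePhase(true, status, "hello1", nodeSucceeded)) // Succeeded
}

With a gate like this, hello2a's steps['hello1'] expression is only evaluated after hello1's task result has been reconciled, so the <nil> lookup in the log above can no longer occur.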

@shuangkun shuangkun marked this pull request as draft January 17, 2024 15:55
@shuangkun shuangkun marked this pull request as ready for review January 17, 2024 16:59
@agilgur5 agilgur5 added the area/controller label Jan 17, 2024
@shuangkun shuangkun marked this pull request as draft January 18, 2024 00:46
@shuangkun shuangkun marked this pull request as ready for review January 21, 2024 07:39
@shuangkun shuangkun changed the title to "fix: make sure taskresult completed when mark node succeed when it has outputs" Jan 21, 2024
@juliev0 juliev0 self-assigned this Jan 21, 2024
@juliev0 (Contributor) commented Jan 22, 2024

So, it sounds like previously the execution of a Workflow was allowed to continue even if the previous Step's Outputs weren't reconciled? Are you essentially preventing the next Step from running yet in that case?

@shuangkun (Member, Author):
> So, it sounds like previously the execution of a Workflow was allowed to continue even if the previous Step's Outputs weren't reconciled? Are you essentially preventing the next Step from running yet in that case?

Yes, I want to prevent the next step from running.

@shuangkun (Member, Author):
Yes, I think this was the case before. Under normal circumstances the outputs are processed before the pod status, because the taskresult resource is indeed created earlier. But at large scale, under high load, the two events can arrive at the API server in a different order.

@juliev0 (Contributor) commented Jan 23, 2024

Is it possible to see if this worked on some older versions of code? I'm curious if something broke this. It seems like core functionality.

@juliev0 (Contributor) commented Jan 23, 2024

> Yes, I think this was the case before. Under normal circumstances the outputs are processed before the pod status, because the taskresult resource is indeed created earlier. But at large scale, under high load, the two events can arrive at the API server in a different order.

I see. So, maybe this is a good enough answer to my request that you test on an older version - perhaps this case is just an unusual one? I am kind of curious if other people have logged similar bugs.

@shuangkun (Member, Author):
> Is it possible to see if this worked on some older versions of code? I'm curious if something broke this. It seems like core functionality.

I think this is related to the introduction of the taskresult resource in 3.4. It may be hard to support on older versions, because they lack a record of whether the taskresult was processed. (Originally I needed to add this record, but found that it was already included in the latest version.)

@shuangkun (Member, Author):
> I see. So, maybe this is a good enough answer to my request that you test on an older version - perhaps this case is just an unusual one? I am kind of curious if other people have logged similar bugs.

Yes, I tested it on version 3.4.12 for a few weeks and it looks good. There is no more "failed to evaluate expression" error like before.

@juliev0 (Contributor) commented Jan 23, 2024

@Garett-MacGowan do you want to look at this too?

@Garett-MacGowan (Contributor):
> @Garett-MacGowan do you want to look at this too?

I just skimmed it. I can take a proper look after I 😴. In general, if we're proceeding to the next steps before outputs are reconciled, it seems important that we add the wait behavior. As you said, it seems like core functionality, so I'm surprised it's not already accounted for.

I'm wondering if this can be tested.

}
// Check whether the node has output and whether its taskresult is in an incompleted state.
if tmpl.HasOutputs() && woc.wf.Status.IsTaskResultInCompleted(node.ID) && woc.wf.Status.IsTaskResultInCompleted(pod.Name) {
	woc.log.WithFields(log.Fields{"nodeID": newState.ID}).WithError(err).Error("Taskresult of the node not yet completed")
Contributor:

Same comment along the lines of what @juliev0 was saying: I don't think this is an error. We just need to flag needReconcileTaskResult. Maybe just log it normally if you want the log?

Contributor:

right, probably a Debug line

Contributor:

By the way, do we need to call it for both woc.wf.Status.IsTaskResultInCompleted(node.ID) && woc.wf.Status.IsTaskResultInCompleted(pod.Name)?

Contributor:

Oh yeah, I was thinking this but had to step away and forgot to ask. I think it should just be tmpl.HasOutputs() && woc.wf.Status.IsTaskResultInCompleted(node.ID)

Contributor:

Maybe confusion from the comment here

Contributor:

seeing that comment about the comment made me think of it :)

@shuangkun (Member, Author) commented Jan 24, 2024:

> By the way, do we need to call it for both woc.wf.Status.IsTaskResultInCompleted(node.ID) && woc.wf.Status.IsTaskResultInCompleted(pod.Name)?

Yes, I thought about this at first, but there is a problem: when the outputs come via pod annotations versus via a taskresult, the key values are different. Maybe we can unify them to the pod name or the node ID. I think node ID is better; what do you think?

Maybe I can add a func named pod.GetNodeId(). For reference, the current legacy pod-patch path keys the result by pod name:

	if x, ok := pod.Annotations[common.AnnotationKeyReportOutputsCompleted]; ok {
		woc.log.Warn("workflow uses legacy/insecure pod patch, see https://argo-workflows.readthedocs.io/en/latest/workflow-rbac/")
		resultName := pod.GetName()
		if x == "true" {
			woc.wf.Status.MarkTaskResultComplete(resultName)
		} else {
			woc.wf.Status.MarkTaskResultIncomplete(resultName)
		}
	}
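A sketch of what the suggested pod.GetNodeId() helper could look like (hypothetical: this PR does not add it, and the workflows.argoproj.io/node-id annotation key and the pod-name fallback are assumptions about how workflow pods are labeled, not code from this PR):

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// getNodeID resolves a pod back to its workflow node ID so task results
// (keyed by node ID) and legacy pod annotations (keyed by pod name) can
// be unified under one key. Hypothetical sketch only.
func getNodeID(pod *corev1.Pod) string {
	if id, ok := pod.Annotations["workflows.argoproj.io/node-id"]; ok {
		return id
	}
	// Fallback for pods without the annotation (an assumption).
	return pod.Name
}

func main() {
	pod := &corev1.Pod{ObjectMeta: metav1.ObjectMeta{
		Name:        "workflow-bhr9k-whalesay-1234567890",
		Annotations: map[string]string{"workflows.argoproj.io/node-id": "workflow-bhr9k-1234567890"},
	}}
	fmt.Println(getNodeID(pod)) // workflow-bhr9k-1234567890
}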

Contributor:

Yes, just unify to Node ID.

@juliev0 juliev0 added the prioritized-review label Jan 23, 2024
shuangkun and others added 6 commits January 28, 2024
	woc.log.WithField("workflow", woc.wf.ObjectMeta.Name).Info("pod reconciliation didn't complete, will retry")
	woc.requeue()
	return
}
Contributor:

Sorry, I just realized that we probably need to move the if err != nil clause above the if !podReconciliationCompleted {, since we can return err, false
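A simplified, self-contained sketch of why that ordering matters when a call returns (completed bool, err error); reconcile and operate below are stand-ins, not the controller's actual functions:

package main

import (
	"errors"
	"fmt"
)

// reconcile stands in for the controller's podReconciliation call: it can
// report "not finished yet" (false, nil) or a hard failure (false, err).
func reconcile(fail bool) (bool, error) {
	if fail {
		return false, errors.New("informer out of sync")
	}
	return false, nil
}

func operate(fail bool) {
	completed, err := reconcile(fail)
	// Error check first: a hard failure must not be misread as
	// "not complete yet" and silently requeued forever.
	if err != nil {
		fmt.Println("marking workflow errored:", err)
		return
	}
	if !completed {
		fmt.Println("pod reconciliation didn't complete, will retry")
		return
	}
	fmt.Println("reconciliation complete")
}

func main() {
	operate(true)  // marking workflow errored: informer out of sync
	operate(false) // pod reconciliation didn't complete, will retry
}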

Contributor:

hopefully after that we should be good, thank you for the iterations!

@shuangkun (Member, Author) commented Jan 28, 2024:

Yes, I changed it. Thanks!

@juliev0 juliev0 merged commit 8f2746a into argoproj:main Jan 28, 2024
27 checks passed
isubasinghe pushed commits referencing this pull request to isubasinghe/argo-workflows on Feb 4, Feb 27, and Feb 28, 2024.