
fix: fix missing artifacts for stopped workflows. Fixes #12401 #12402

Merged
16 commits merged into argoproj:main on Jan 3, 2024

Conversation

Garett-MacGowan
Contributor

Fixes #12401

Motivation

Stopped workflows would not have artifacts available in the UI, which can be particularly annoying when debugging a workflow.

Modifications

Ensure the workflow is archived only once all task results have finished processing.


Verification

Added an Argo Server test to ensure that the artifact is available. Manually verified that artifacts are now present in the UI.

…tputs in status is now considered.

@Garett-MacGowan
Contributor Author

@juliev0 Here's the fix for what I was hypothesizing in #12331

@Garett-MacGowan Garett-MacGowan marked this pull request as ready for review December 23, 2023 23:30
@juliev0
Contributor

juliev0 commented Dec 24, 2023

Thanks for being on top of any new regressions that may have been caused by your PR! I glanced at it the other day, but then it sounded like you were saying that it was a UI issue. From going through that Issue now, it looks like this is not a direct fix to that issue, right? In that case, do you want to remove "Fixes" from your description and instead say something like "Related to"? Otherwise, this will automatically close that Issue.

@juliev0 juliev0 self-assigned this Dec 24, 2023
@Garett-MacGowan
Contributor Author

Garett-MacGowan commented Dec 24, 2023

@juliev0

These are two separate issues! #12331 must have been caused by another PR. I fixed that issue in #12397. This PR fixes #12401, which occurs when you stop a workflow. My previous PR (#11947) didn't guarantee that artifacts would get archived for stopped workflows, although it fixed the garbage collection.

@juliev0
Contributor

juliev0 commented Dec 24, 2023

Oh, okay!

@@ -802,7 +802,7 @@ func (woc *wfOperationCtx) persistUpdates(ctx context.Context) {

 func (woc *wfOperationCtx) checkTaskResultsCompleted() bool {
 	taskResultsCompleted := woc.wf.Status.GetTaskResultsCompleted()
-	if len(taskResultsCompleted) == 0 {
+	if len(taskResultsCompleted) == 0 && woc.wf.Status.Outputs != nil {
Contributor

It does seem like we want to make sure there are actually Outputs.

But now it makes me wonder, when do the Outputs get set? Is it possible to arrive here when Outputs==nil because they haven't been set yet?

Contributor Author

@Garett-MacGowan Garett-MacGowan Dec 26, 2023

@juliev0 Happy Holidays!

Outputs are first initialized and populated (at a node level) in taskResultReconciliation(). Outputs are subsequently initialized and populated to the global scope wf.Status.Outputs in woc.podReconciliation() via woc.addOutputsToGlobalScope(newState.Outputs), assuming Parameters or Artifacts are defined for any node. Therefore, Outputs == nil when either woc.taskResultReconciliation() hasn't been run yet, or Parameters or Artifacts aren't defined for any node in the workflow.

As long as checkTaskResultsCompleted() has been run after taskResultReconciliation(), && woc.wf.Status.Outputs != nil is meaningful.

checkTaskResultsCompleted() is used in two places:

  1. Before woc.garbageCollectArtifacts(ctx) (after taskResultReconciliation())
	// Do artifact GC if all task results are completed.
	if woc.checkTaskResultsCompleted() {
		if err := woc.garbageCollectArtifacts(ctx); err != nil {
			woc.log.WithError(err).Error("failed to GC artifacts")
			return
		}
	} else {
		woc.log.Debug("Skipping artifact GC")
	}
  2. Before woc.deleteTaskResults(ctx) as part of persistUpdates(ctx) (called in conjunction with woc.wf.Status.Phase.Completed(), which ensures that the workflow has either succeeded, failed, or errored, and therefore after checkTaskResultsCompleted() has had a chance to run).
	// Make sure the TaskResults are incorporated into WorkflowStatus before we delete them.
	if woc.wf.Status.Phase.Completed() && woc.checkTaskResultsCompleted() {
		if err := woc.deleteTaskResults(ctx); err != nil {
			woc.log.WithError(err).Warn("failed to delete task-results")
		}
	}

Contributor

Happy holidays!

Trying to follow. So, based on your statement about how "Outputs are subsequently initialized and populated to the global scope wf.Status.Outputs in woc.podReconciliation()", doesn't that mean that Outputs could also be nil in the case that taskResultReconciliation() had been run but podReconciliation() had not?

Contributor Author

@Garett-MacGowan Garett-MacGowan Dec 27, 2023

Oops, I meant to say:

  2. Before woc.deleteTaskResults(ctx) as part of persistUpdates(ctx) (called in conjunction with woc.wf.Status.Phase.Completed(), which ensures that the workflow has either succeeded, failed, or errored, and therefore after woc.podReconciliation() has had a chance to run).

I missed the obvious (case 3), which I just introduced, inside markWorkflowPhase, which is called at the end of the operator loop under various conditions.

	switch phase {
	case wfv1.WorkflowSucceeded, wfv1.WorkflowFailed, wfv1.WorkflowError:
		// wait for all daemon nodes to get terminated before marking workflow completed & make sure task results are complete as well.
		if phase.Completed() && !woc.hasDaemonNodes() && woc.checkTaskResultsCompleted() {
			woc.log.Info("Marking workflow completed")

Good catch, yes, you are right. The case you identified can happen for woc.garbageCollectArtifacts(ctx) (case 1), which gets called right after woc.taskResultReconciliation(), but before woc.podReconciliation(). I think a race condition could occur where the taskResultInformer hasn't received an update yet, and therefore garbage collection is allowed to happen because wf.Status.Outputs == nil.

Since woc.wf.Status.Outputs != nil is meant to check that outputs are in global scope (for archiving purposes), maybe it's best that we pull that logic up to if phase.Completed() && !woc.hasDaemonNodes() && woc.checkTaskResultsCompleted() {.

For example:

switch phase {
	case wfv1.WorkflowSucceeded, wfv1.WorkflowFailed, wfv1.WorkflowError:
		taskResultsCompleted := woc.checkTaskResultsCompleted()
		// wait for all daemon nodes to get terminated before marking workflow completed & make sure task results are complete as well.
		if phase.Completed() && !woc.hasDaemonNodes() && (taskResultsCompleted || (!taskResultsCompleted && woc.wf.Status.Outputs == nil)) {
			woc.log.Info("Marking workflow completed")

Contributor Author

The above doesn't work. I'm thinking through it and will propose another solution.

Contributor Author

@Garett-MacGowan Garett-MacGowan Dec 27, 2023

@juliev0 I suggest the following changes. Notice the new considerStatusOutputs bool parameter for checkTaskResultsCompleted:

func (woc *wfOperationCtx) checkTaskResultsCompleted(considerStatusOutputs bool) bool {
	taskResultsCompleted := woc.wf.Status.GetTaskResultsCompleted()

	if considerStatusOutputs && woc.wf.Status.Outputs != nil && len(taskResultsCompleted) == 0 {
		return false
	}
	if !considerStatusOutputs && len(taskResultsCompleted) == 0 {
		return false
	}

	for _, completed := range taskResultsCompleted {
		if !completed {
			return false
		}
	}
	return true
}

Case 1)

	// Do artifact GC if all task results are completed.
	if woc.checkTaskResultsCompleted(false) {
		if err := woc.garbageCollectArtifacts(ctx); err != nil {
			woc.log.WithError(err).Error("failed to GC artifacts")
			return
		}
	} else {
		woc.log.Debug("Skipping artifact GC")
	}

Case 2)

	// Make sure the TaskResults are incorporated into WorkflowStatus before we delete them.
	if woc.wf.Status.Phase.Completed() && woc.checkTaskResultsCompleted(true) {
		if err := woc.deleteTaskResults(ctx); err != nil {
			woc.log.WithError(err).Warn("failed to delete task-results")
		}
	}

Case 3)

	switch phase {
	case wfv1.WorkflowSucceeded, wfv1.WorkflowFailed, wfv1.WorkflowError:
		// wait for all daemon nodes to get terminated before marking workflow completed & make sure task results are complete as well.
		if phase.Completed() && !woc.hasDaemonNodes() && woc.checkTaskResultsCompleted(true) {
			woc.log.Info("Marking workflow completed")

Contributor

@juliev0 juliev0 Dec 27, 2023

Sorry, going back....let's start from "what was the original intention of this new check?" To handle which use case?

Contributor Author

@Garett-MacGowan Garett-MacGowan Dec 27, 2023

@juliev0 If we don't do this check, the operator will hang when marking the workflow as completed if the workflow has no nodes with outputs, because woc.checkTaskResultsCompleted() returns false if len(taskResultsCompleted) == 0.
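
For reference, this is roughly the pre-change function as reconstructed from the diff hunks quoted in this review (a sketch, not copied from source):

func (woc *wfOperationCtx) checkTaskResultsCompleted() bool {
	// With no task result entries recorded at all, this returns false, so a
	// workflow whose nodes produce no outputs would never pass this gate.
	taskResultsCompleted := woc.wf.Status.GetTaskResultsCompleted()
	if len(taskResultsCompleted) == 0 {
		return false
	}
	for _, completed := range taskResultsCompleted {
		if !completed {
			return false
		}
	}
	return true
}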

Contributor Author

@Garett-MacGowan Garett-MacGowan Dec 27, 2023

len(taskResultsCompleted) == 0 can happen if there are no outputs for any nodes, so we can't always say that task results have not completed if len(taskResultsCompleted) == 0. If wf.Status.Outputs is defined and len(taskResultsCompleted) == 0, then we can say that task results have not completed.

If wf.Status.Outputs is defined (outputs or parameters exist in at least one node) and taskResultsCompleted is an empty map, then we know that the wait containers aren't done executing.

The new usage of woc.checkTaskResultsCompleted(true) in case wfv1.WorkflowSucceeded, wfv1.WorkflowFailed, wfv1.WorkflowError: is to prevent the workflow from archiving before the status is updated with the outputs.

Contributor Author

@juliev0 I just updated my last two comments since I did a really poor job in answering the first time.

@@ -802,7 +802,7 @@ func (woc *wfOperationCtx) persistUpdates(ctx context.Context) {

 func (woc *wfOperationCtx) checkTaskResultsCompleted() bool {
 	taskResultsCompleted := woc.wf.Status.GetTaskResultsCompleted()
-	if len(taskResultsCompleted) == 0 {
+	if len(taskResultsCompleted) == 0 && woc.wf.Status.Outputs != nil {
 		return false
 	}
Contributor

Sorry, do you mind if we first go back to the original code and talk about the case you were trying to take care of with if len(taskResultsCompleted) == 0 {?

Contributor

BTW, maybe we should rename taskResultsCompleted to something like taskResultCompletionStatus, since it sounds like we're talking about the number completed here, but we're not?

Contributor

As far as I can tell, we're looking at the case in which no taskresults have ever even been initialized as incomplete, right?

Contributor Author

Yeah, we could probably do with a rename.

Yes, we're looking at the case where no task results have been initialized as incomplete yet.

Contributor

Is this the case we're concerned about, or is it more generally speaking the case in which a task result would be initialized in the future and hasn't been yet? As far as I can tell, it's once a Pod exists that a new task result will get added? So, wouldn't it be the same case if say this map has one entry and it's completed, but another task result simply hasn't been initialized yet?

Contributor Author

💩 Yes, you're right. We should be handling that. I'm going to think on this a bit more & try to come up with something better.

Contributor Author

@Garett-MacGowan Garett-MacGowan Dec 28, 2023

I've flipped the way I'm looking at this and reformulated the check as checkTaskResultsInProgress. When a task result hasn't completed yet, it is added to a TaskResultsInProgress map. Once it is completed, it is removed from the map.

func (woc *wfOperationCtx) checkTaskResultsInProgress() bool {
	taskResultsInProgress := woc.wf.Status.GetTaskResultsInProgress()
	if len(taskResultsInProgress) != 0 {
		return true
	}
	return false
}

In all three cases, !woc.checkTaskResultsInProgress() is called. All tests seem to be passing & the code is much easier to understand. I'm re-running the test script from my last PR to ensure the garbage collection race condition doesn't exist. #11947 (comment).

Assuming all the tests pass locally, I'll push and you can review @juliev0
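
To make the bookkeeping concrete, here is a minimal standalone sketch of the in-progress map described above, keyed by task result name (the type and helper names are illustrative only, not the actual controller code):

// Illustrative only: the real field, type, and helper names may differ.
type taskResultTracker struct {
	inProgress map[string]bool
}

// markInProgress records a task result that has been created but not yet completed.
func (t *taskResultTracker) markInProgress(name string) {
	if t.inProgress == nil {
		t.inProgress = make(map[string]bool)
	}
	t.inProgress[name] = true
}

// markCompleted removes a task result once its wait container has finished reporting.
func (t *taskResultTracker) markCompleted(name string) {
	delete(t.inProgress, name)
}

// anyInProgress mirrors checkTaskResultsInProgress(): true while anything is outstanding.
func (t *taskResultTracker) anyInProgress() bool {
	return len(t.inProgress) != 0
}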

Contributor Author

@Garett-MacGowan Garett-MacGowan Dec 28, 2023

@juliev0 The case of "no task results have been initialized as incomplete yet" doesn't actually seem to be an issue based on my testing (with the new approach).

… 12401.

@@ -240,7 +240,7 @@ func (woc *wfOperationCtx) operate(ctx context.Context) {
 	woc.taskResultReconciliation()

 	// Do artifact GC if all task results are completed.
-	if woc.checkTaskResultsCompleted() {
+	if !woc.checkTaskResultsInProgress() {
Contributor

we don't need to check if Phase.Completed() for this case?

Contributor Author

The artifact GC strategy protects us in this case. We could conceivably add this check though; it would make things a bit more efficient.

Contributor

plus, if we have common logic between this and the other calls, we might be able to have a common function, which would be nice

Contributor

@juliev0 juliev0 Dec 28, 2023

I'm actually kind of thinking about a common function which checks for the Workflow being completed, plus compares the number of pods that were run to the number of task results that got incorporated into the WorkflowStatus.Outputs, which could be called Workflow.ReconciliationComplete() or something like that. Does that make any sense?

This method could be called at any place in the code, and it should work.
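
A rough sketch of that idea (the helpers countPodsRun and countIncorporatedTaskResults are hypothetical placeholders for whatever counting logic would be used, not existing APIs):

// Sketch only: both helpers below are hypothetical placeholders.
func (woc *wfOperationCtx) reconciliationComplete() bool {
	return woc.wf.Status.Phase.Completed() &&
		woc.countIncorporatedTaskResults() == woc.countPodsRun()
}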

Contributor

@juliev0 juliev0 Dec 28, 2023

I know in this case you were only concerned about which taskResults were saved at all, as opposed to which ones had actually gone through podReconciliation() (and been incorporated into WorkflowStatus.Outputs as well), so maybe the common function wouldn't be used here if it doesn't make sense.

Contributor

@juliev0 juliev0 Dec 28, 2023

(Just edited my comment two comments back because it wasn't clear what I was talking about.)

-		// wait for all daemon nodes to get terminated before marking workflow completed
-		if markCompleted && !woc.hasDaemonNodes() {
+		// wait for all daemon nodes to get terminated before marking workflow completed & make sure task results are complete as well.
+		if phase.Completed() && !woc.hasDaemonNodes() && !woc.checkTaskResultsInProgress() {
Contributor

@juliev0 juliev0 Dec 28, 2023

We don't need podReconciliation() to have updated Workflow Status for this one?

@@ -788,7 +788,7 @@ func (woc *wfOperationCtx) persistUpdates(ctx context.Context) {
 	}

 	// Make sure the TaskResults are incorporated into WorkflowStatus before we delete them.
-	if woc.wf.Status.Phase.Completed() && woc.checkTaskResultsCompleted() {
+	if woc.wf.Status.Phase.Completed() && !woc.checkTaskResultsInProgress() {
Contributor

Do we also need to make sure that the podReconciliation has occurred to incorporate the taskResults into the Workflow's Status, or is that not needed based on where this is being called from?

@juliev0
Contributor

juliev0 commented Dec 28, 2023

Something I alluded to in my other comment, but I'm kind of wondering if there could be a risk in only checking whether there's a task result in progress, since that requires that it got added to the map in the first place. It may be okay if it's only called from a particular place in the code, I suppose.

But I was thinking: isn't the simplest thing to check if the Workflow is complete, plus confirm that all TaskResults have been fully processed (therefore equal to the number of Pods run), which could be a function called from anywhere?

Just let me know if you disagree! I haven't looked that carefully.

@Garett-MacGowan
Contributor Author

Garett-MacGowan commented Dec 29, 2023

What you are saying is probably the safest solution. Here's what I'm thinking now:

As part of podReconciliation(), outputs are added to the global scope, and then the node phase is updated based on the pod phase. If we observe that the node phase is completed, then we know that podReconciliation() has been run for the last time for that node, and any outputs for the node at that time should be reflected in the global scope. One problem to consider, however, is what happens when outputs are sent to the taskResultInformer between when taskResultReconciliation() and podReconciliation() are called. Outputs sent to the taskResultInformer during this window won't be reflected in the node state, and therefore won't be reflected in the global scope. At this point, I'm considering the possibility of adding woc.addOutputsToGlobalScope() to woc.taskResultReconciliation().

Assuming the above issue is resolved, it should be okay to use something like:

func (woc *wfOperationCtx) checkReconciliationComplete() bool {
	numCompletedNodes := len(woc.wf.Status.Nodes.Filter(func(x wfv1.NodeStatus) bool {
		return x.Phase.Completed()
	}))
	return woc.wf.Status.Phase.Completed() && numCompletedNodes == len(woc.wf.Status.GetTaskResultsCompleted())
}

Thoughts?

@juliev0
Contributor

juliev0 commented Dec 29, 2023

Thanks for all the flexibility on the approach! I’m on vacation now, so may not get a chance to look at your response until after 1/1 unfortunately. I’ll get back to you then.

@Garett-MacGowan
Contributor Author

Garett-MacGowan commented Dec 29, 2023

Have a good vacation! I will take a crack at what I've suggested above.

…j#12401

@Garett-MacGowan
Contributor Author

@juliev0 I had to make some tweaks to the implementation I suggested above, but I think I've got something more along the lines of what you were thinking.

@Garett-MacGowan
Contributor Author

Working through the test failures. They have to do with termination as far as I can tell.

@Garett-MacGowan
Contributor Author

Garett-MacGowan commented Dec 31, 2023

Unfortunately the number of pods run is not a reliable indicator of wait container execution. If the workflow is terminated as soon as it is created, the wait container does not start up and FinalizeOutput is never called.

I'm going to push an alternative solution.

…robustness of reconciliation completion check. Add check for workflow completion to reconciliation completion check. Add call to addOutputsToGlobalScope to ensure that reconciled outputs are included in archived workflows. Fixes: argoproj#12401

…ent. Fixes argoproj#12401.

@Garett-MacGowan
Contributor Author

@juliev0 I feel like the tests fail 1 in 3 times because the argoexec image download keeps timing out. It's super annoying to have to keep submitting empty commits to re-run the CI. Is there anything in the works to address this?

@juliev0 juliev0 merged commit c63c2bc into argoproj:main Jan 3, 2024
27 checks passed
@juliev0
Contributor

juliev0 commented Jan 3, 2024

@juliev0 I feel like the tests fail 1 in 3 times because the argoexec image download keeps timing out. It's super annoying to have to keep submitting empty commits to re-run the CI. Is there anything in the works to address this?

Sorry, I know the feeling. There was a discussion of potentially implementing automatic retries for failures in the CI but somebody decided against it since it would hide actual failures and they preferred to address failures. As for the argoexec image download timing out, I'm not sure why that's happening so frequently. Is it just today or have you seen it in the past too?

@juliev0
Contributor

juliev0 commented Jan 3, 2024

@Garett-MacGowan Thanks for keeping an eye on the new incoming bugs. Now that I've merged this, it would be great if you could continue to keep an eye on anything that could be caused by it, just in case, such as issues related to output parameters/artifacts, workflows not being cleaned up, etc.

@Garett-MacGowan
Contributor Author

Garett-MacGowan commented Jan 3, 2024

Thanks for merging! Will do.

@Garett-MacGowan
Contributor Author

Garett-MacGowan commented Jan 3, 2024

Sorry, I know the feeling. There was a discussion of potentially implementing automatic retries for failures in the CI but somebody decided against it since it would hide actual failures and they preferred to address failures. As for the argoexec image download timing out, I'm not sure why that's happening so frequently. Is it just today or have you seen it in the past too?

This one has been happening for at least the last few days.

I can understand not wanting to automatically retry test failures, but setup failures (like image downloads) fall into a different category. Perhaps we can increase the timeout for that? Maybe it's happening when there is a particularly high load on the image server?

@juliev0
Contributor

juliev0 commented Jan 3, 2024

Very true. Definitely feel free to create a PR for that if you're motivated! (I for one am only spending a limited amount of time working on this repo these days)

hittingray pushed a commit to atlassian-forks/argo-workflows that referenced this pull request Jan 4, 2024
@agilgur5 agilgur5 added the area/artifacts S3/GCP/OSS/Git/HDFS etc label Jan 4, 2024
sarabala1979 pushed a commit that referenced this pull request Jan 9, 2024
@zhangtbj
Contributor

zhangtbj commented Jan 19, 2024

Hi all,

We are facing a regression after upgrading to release v3.5.3. It appears to be related to this PR.

We have a global artifact which is used and exported in two different steps WITH different values:

  • e2e-test-inject-runner-context
  • e2e-test-output-testupdate

In the previous release, we got the global artifact from the last step (e2e-test-output-testupdate).
But in the new release v3.5.3, the global artifact gets overwritten, so the artifact is picked up seemingly at random from one of these two steps, which makes our process fail.

I found there is a NEW change here:
https://github.com/argoproj/argo-workflows/blob/v3.5.3/workflow/controller/taskresult.go#L89-L90

  // Add outputs to global scope here to ensure that they are reflected in archive.
  woc.addOutputsToGlobalScope(newNode.Outputs)

This overwrites the global artifact after every task and appears to be the reason our process fails. We can see the overwriting in this log:

time="2024-01-19T02:03:53.554Z" level=debug msg="task result name:\ne2e-test" namespace=action-dev2 workflow=e2e-test
time="2024-01-19T02:03:53.554Z" level=debug msg="Marking task result complete e2e-test" namespace=action-dev2 workflow=e2e-test
time="2024-01-19T02:03:53.554Z" level=info msg="overwriting workflow.outputs.artifacts.testUpdate: {testUpdate  <nil>  {<nil> &S3Artifact{S3Bucket:S3Bucket{Endpoint:,Bucket:,Region:,Insecure:nil,AccessKeySecret:nil,SecretKeySecret:nil,RoleARN:,UseSDKCreds:false,CreateBucketIfNotPresent:nil,EncryptionOptions:nil,CASecret:nil,},Key:e2e-test/e2e-test-inject-input-type-instances/testUpdate.tgz,} nil nil nil nil nil nil nil nil}  nil false  false  nil false}" namespace=action-dev2 workflow=e2e-test-1-4065749210214828732

I'm not sure whether I've found the right code, and I don't know how to solve this.

Can you please help check if my investigation is correct?

I also posted a comment in the argo-workflow channel:
https://cloud-native.slack.com/archives/C01QW9QSSSK/p1705635020012259

Thanks!
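
To illustrate the suspected last-write-wins behavior on a globalName key, here is a standalone sketch (an analogy only, not the controller's actual code; the artifact keys are illustrative):

package main

import "fmt"

// Global artifacts behave like a map keyed by globalName; whichever task
// result is processed last overwrites the previous value.
func setGlobalArtifact(global map[string]string, globalName, key string) {
	if old, ok := global[globalName]; ok && old != key {
		fmt.Printf("overwriting workflow.outputs.artifacts.%s: %s -> %s\n", globalName, old, key)
	}
	global[globalName] = key
}

func main() {
	global := map[string]string{}
	// If task results are reconciled in an unexpected order, the earlier
	// step's artifact can end up as the final global value.
	setGlobalArtifact(global, "testUpdate", "e2e-test/e2e-test-output-testupdate/testUpdate.tgz")
	setGlobalArtifact(global, "testUpdate", "e2e-test/e2e-test-inject-input-type-instances/testUpdate.tgz")
	fmt.Println("final:", global["testUpdate"])
}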

@zhangtbj
Contributor

Maybe also cc @Garett-MacGowan for help. Thanks!

@Garett-MacGowan
Contributor Author

@zhangtbj Do you have a workflow example that reproduces this? I can see if I can fix it for the next release.

@Garett-MacGowan
Contributor Author

@zhangtbj

the global artifact will be overwritten. so the artifact will be picked up randomly from these two steps which make our process failed.

This makes it sound like you are running these steps in parallel. If that's the case, then I think you're getting expected behavior. Either way, send along a workflow that reproduces the issue and I will have a better idea of what you're trying to do.

@zhangtbj
Contributor

zhangtbj commented Jan 19, 2024

Hi @Garett-MacGowan ,

Thanks for your reply. We are not running these steps in parallel, but it is also not a simple workflow. Please check:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generation: 21
  generateName: test-artifact-12402-
  namespace: action-dev2
spec:
  activeDeadlineSeconds: 28800
  arguments: {}
  entrypoint: test-root
  serviceAccountName: test
  templates:
  - inputs: {}
    metadata: {}
    name: test-root
    outputs: {}
    steps:
    - - arguments: {}
        name: inject-input-artifact
        template: inject-input-artifact
    - - arguments:
          artifacts:
          - from: '{{workflow.outputs.artifacts.testInput}}'
            name: testInput
          - from: '{{workflow.outputs.artifacts.testUpdate}}'
            name: testUpdate
        name: start-entrypoint
        template: main
    - - arguments:
          artifacts:
          - from: '{{workflow.outputs.artifacts.testUpload}}'
            name: testUpload
        name: upload-output-type-instances-step
        template: upload-output-type-instances
    - - arguments:
          artifacts:
          - from: '{{workflow.outputs.artifacts.testUpdate}}'
            name: testUpdate
        name: upload-update-type-instances-step
        template: upload-update-type-instances

  - inputs: {}
    metadata: {}
    name: main
    outputs: {}
    steps:
    - - arguments:
          artifacts:
          - from: '{{workflow.outputs.artifacts.testInput}}'
            name: input
        name: cp
        template: cp
    - - arguments:
          artifacts:
          - from: '{{workflow.outputs.artifacts.testUpdate}}'
            name: input-parameters
        name: render-update-ti
        template: main-render-update-ti-template
    - - arguments:
          artifacts:
          - from: '{{steps.cp.outputs.artifacts.upload}}'
            name: testUpload
        name: output-testupload
        template: output-testupload
    - - arguments:
          artifacts:
          - from: '{{steps.render-update-ti.outputs.artifacts.render}}'
            name: testUpdate
        name: output-testupdate
        template: output-testupdate

  - container:
      image: alpine:3.7
      command: [sh, -c]
      args: ["sleep 1; echo -n test input > /testInput.yaml; echo -n test update > /testUpdate.yaml"]
    name: inject-input-artifact
    outputs:
      artifacts:
      - globalName: testInput
        name: testInput
        path: /testInput.yaml
      - globalName: testUpdate
        name: testUpdate
        path: /testUpdate.yaml

  - container:
      command: [sh, -c]
      args: ["sleep 1; cp /input /upload"]
      image: alpine:3.7
      name: ""
      resources: {}
    inputs:
      artifacts:
      - name: input
        path: /input
    metadata: {}
    name: cp
    outputs:
      artifacts:
      - name: upload
        path: /upload

  - container:
      image: alpine:3.7
      command: [sh, -c]
      args: ["sleep 1; echo -n change update!!! > /render.yml"]
    metadata: {}
    name: main-render-update-ti-template
    outputs:
      artifacts:
      - name: render
        path: /render.yml

  - container:
      args:
      - sleep 1
      command:
      - sh
      - -c
      image: alpine:3.18.4
      name: ""
      resources: {}
    inputs:
      artifacts:
      - name: testUpload
        path: /typeinstance
    metadata: {}
    name: output-testupload
    outputs:
      artifacts:
      - globalName: testUpload
        name: testUpload
        path: /typeinstance

  - container:
      args:
      - sleep 1
      command:
      - sh
      - -c
      image: alpine:3.18.4
      name: ""
      resources: {}
    inputs:
      artifacts:
      - name: testUpdate
        path: /typeinstance
    metadata: {}
    name: output-testupdate
    outputs:
      artifacts:
      - globalName: testUpdate
        name: testUpdate
        path: /typeinstance



  - container:
      command: [sh, -c]
      args: ["sleep 1; cat /upload/typeInstances/testUpload; cat /upload/typeInstances/testUpload > /upload/result.yaml"]
      image: alpine:3.18.4
      name: ""
      resources: {}
    inputs:
      artifacts:
      - name: testUpload
        path: /upload/typeInstances/testUpload
    metadata: {}
    name: upload-output-type-instances
    outputs:
      artifacts:
      - globalName: uploadresult
        name: uploadresult
        path: /upload/result.yaml

  - container:
      command: [sh, -c]
      args: ["sleep 1; cat /upload/typeInstances/testUpdate"]
      image: alpine:3.18.4
      name: ""
      resources: {}
    inputs:
      artifacts:
      - name: testUpdate
        path: /upload/typeInstances/testUpdate
    metadata: {}
    name: upload-update-type-instances
    outputs: {}

If you run this workflow several times, you will find that the final global output artifact changes to:
(screenshot of the workflow's global output artifacts)

testUpdate refers to test-artifact-12402-2cntl/test-artifact-12402-2cntl-inject-input-artifact-46224399/testUpdate.tgz

But actually, it should use the artifact from the later step, like:
test-artifact-12402-2cntl/test-artifact-12402-2cntl-output-testupdate-4251288622/testUpdate.tgz

Please let me know if anything is not clear. Thanks!

@zhangtbj
Contributor

zhangtbj commented Jan 19, 2024

I am not sure if this is a special case that is not covered well.

  • start-entrypoint step is a template with some inner steps, and we update the testUpdate global output artifact in these inner steps.
  • The problem happens in the next step, upload-output-type-instances-step, just after start-entrypoint. I am not sure if it loads artifacts differently or with a delay (but it worked correctly before).

Before step upload-output-type-instances-step, the status is:
(screenshot of the workflow status before this step)

After step upload-output-type-instances-step, sometimes the status is changed to:
(screenshot of the workflow status after this step)

@Garett-MacGowan
Contributor Author

@zhangtbj I have a hypothesis about what is wrong. I'm working through how I can properly test it. Can you open an issue with the information above? I will submit a fix linked to that issue. Please mention this issue in the new issue. Thanks!

@zhangtbj
Contributor

Hi @Garett-MacGowan ,

Sure thing. I opened a new issue, #12554, and linked this one in it.

Thanks!

Labels: area/artifacts

Successfully merging this pull request may close these issues: Stopped archived workflow missing artifacts in UI