Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: don't necessarily include all artifacts from templates in node outputs #13066

Merged
merged 15 commits into from
Jun 17, 2024

Conversation

juliev0
Copy link
Contributor

@juliev0 juliev0 commented May 19, 2024

The fact that it's being listed in the NodeStatus is causing ArtifactGC to attempt to delete it (and fail doing so).

Fixes: #12845

Motivation

If an Artifact file didn't get written, it shouldn't be in the outputs for the TaskResult (and thus the Outputs for the Node) just because it was listed as an Artifact in the Workflow.

Modifications

Only if the Artifact file is successfully written does the wait container include it as one of the Output Artifacts in its WorkflowTaskResult. The Workflow Controller uses that WorkflowTaskResult information to populate the NodeStatus of the Workflow. This is the information that ArtifactGC uses to determine what needs to be deleted.

Verification

The person who submitted the Issue tested the image I made, and it repeatedly worked to solve his issue of ArtifactGC failing. (Note it also prevented his other artifacts from getting deleted, because the failed deletion preempted the other deletions from being attempted (as @agilgur5 pointed out, this could be another improvement, to continue the deletion process independent of failure)).

I did start to create an e2e test for this issue. But it seems that with minio, I can't repeat the same failure using the "main" branch. It seems that attempting to delete an artifact which doesn't exist doesn't result in a failure for some reason.

…utputs

Signed-off-by: Julie Vogelman <julievogelman0@gmail.com>
}
we.Template.Outputs.Artifacts[i] = art
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of updating this directly, create a new list of the artifacts that we actually saved and return that

Copy link
Contributor

@Garett-MacGowan Garett-MacGowan Jun 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. So before we were using the we.Template.Outputs.Artifacts to pass to ReportOutputs but now we are creating a fresh wfv1.Artifacts object that will only include the successfully saved artifacts. This makes sense.

Copy link
Member

@agilgur5 agilgur5 Jun 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm I do see a small issue here, but I think this already existed beforehand:
If the wait container is interrupted during this loop, some artifacts may save to S3 (etc) but not be reported

But since adding artifacts to Outputs.Artifacts wouldn't report them until the very end anyway, I think this doesn't change any existing behavior

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the wait container is interrupted during this loop, some artifacts may save to S3 (etc) but not be reported

How do you mean it wouldn't be reported? Since the background context is passed into both SaveArtifacts() and ReportOutputs() below, I believe all of that logic is still supposed to be executed. Is there something different you're referring to besides the context being cancelled?

Copy link
Member

@agilgur5 agilgur5 Jun 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the loop is long or the container is otherwise non-gracefully terminated, the request context won't matter.

Also realized that saving artifacts can be parallelized here to reduce that chance and improve throughput (related: #12442). Similarly pre-existing logic though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, this is basically if this Pod receives a SIGKILL? I don't think there's anything that can be done in that case, is there? (at least from the Workflow Pod side)

Copy link
Member

@agilgur5 agilgur5 Jun 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More or less yes, but there are things we can do to mitigate that or detect that scenario, such as parallel uploads and writing pieces of the report as each artifact completes uploading (although k8s does not have streaming requests afaik, so that would put more strain on the api server when there are many artifacts).

I also recently remembered we had #12413 merged for 3.6 that does help prevent that too.

(To be clear, this is pre-existing logic / an incidental finding, so not going to block review on that, but there are perhaps actions we should take around this case. Perhaps better discussed in #12993 though; although that's a specific case)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds good

Signed-off-by: Julie Vogelman <julievogelman0@gmail.com>
Signed-off-by: Julie Vogelman <julievogelman0@gmail.com>
@agilgur5 agilgur5 added the area/artifacts S3/GCP/OSS/Git/HDFS etc label May 29, 2024
@juliev0 juliev0 marked this pull request as ready for review June 1, 2024 23:55
Copy link
Contributor

@Garett-MacGowan Garett-MacGowan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

}
we.Template.Outputs.Artifacts[i] = art
Copy link
Contributor

@Garett-MacGowan Garett-MacGowan Jun 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. So before we were using the we.Template.Outputs.Artifacts to pass to ReportOutputs but now we are creating a fresh wfv1.Artifacts object that will only include the successfully saved artifacts. This makes sense.

@agilgur5 agilgur5 self-assigned this Jun 8, 2024
…, including when no artifact is written

Signed-off-by: Julie Vogelman <julievogelman0@gmail.com>
@agilgur5
Copy link
Member

agilgur5 commented Jun 9, 2024

(as @agilgur5 pointed out, this could be another improvement, to continue the deletion process independent of failure)

I was thinking about this and it will probably be fixed by parallel artifact GC (#11768), as that inherently has to process each artifact independently.

Vaguely related to this, I was wondering if creating more WorkflowTaskResults, e.g. child results, might help with some of these races, i.e. via create-only immutable records of progress. Particularly it could resolve a security issue with malicious Workflows as mentioned in #12783 (comment) as the Executor would then only need create permissions, and not patch.

Signed-off-by: Julie Vogelman <julievogelman0@gmail.com>
Signed-off-by: Julie Vogelman <julievogelman0@gmail.com>
@agilgur5 agilgur5 added this to the v3.5.x patches milestone Jun 10, 2024
@agilgur5 agilgur5 linked an issue Jun 10, 2024 that may be closed by this pull request
4 tasks
Copy link
Member

@agilgur5 agilgur5 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly LGTM. I left a few style comments on the tests, one suggestion for the test case, and a question on the saving + reporting. Still learning my way around the artifacts part of the codebase

test/e2e/artifacts_test.go Outdated Show resolved Hide resolved
test/e2e/artifacts_test.go Outdated Show resolved Hide resolved
}
we.Template.Outputs.Artifacts[i] = art
Copy link
Member

@agilgur5 agilgur5 Jun 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm I do see a small issue here, but I think this already existed beforehand:
If the wait container is interrupted during this loop, some artifacts may save to S3 (etc) but not be reported

But since adding artifacts to Outputs.Artifacts wouldn't report them until the very end anyway, I think this doesn't change any existing behavior

@agilgur5 agilgur5 added the area/gc Garbage collection, such as TTLs, retentionPolicy, delays, and more label Jun 10, 2024
juliev0 and others added 4 commits June 11, 2024 05:29
Co-authored-by: Anton Gilgur <4970083+agilgur5@users.noreply.github.com>
Signed-off-by: Julie Vogelman <julievogelman0@gmail.com>
Co-authored-by: Anton Gilgur <4970083+agilgur5@users.noreply.github.com>
Signed-off-by: Julie Vogelman <julievogelman0@gmail.com>
Co-authored-by: Anton Gilgur <4970083+agilgur5@users.noreply.github.com>
Signed-off-by: Julie Vogelman <julievogelman0@gmail.com>
Co-authored-by: Anton Gilgur <4970083+agilgur5@users.noreply.github.com>
Signed-off-by: Julie Vogelman <julievogelman0@gmail.com>
Signed-off-by: Julie Vogelman <julievogelman0@gmail.com>
Signed-off-by: Julie Vogelman <julievogelman0@gmail.com>
Signed-off-by: Julie Vogelman <julievogelman0@gmail.com>
@juliev0
Copy link
Contributor Author

juliev0 commented Jun 12, 2024

Thanks for fixing the lint issue @agilgur5 !

Copy link
Member

@agilgur5 agilgur5 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some very tiny comments on the tests below, otherwise LGTM. Marking as approved so you can merge once you fix or answer those

test/e2e/artifacts_test.go Outdated Show resolved Hide resolved
test/e2e/artifacts_test.go Outdated Show resolved Hide resolved
fmt.Printf("artifact key: %q\n", artifactKey)
if err != nil {
panic(err)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be below the continue conditional? Since it's not used by it, so can be skipped in that case. Unless you want to print each key?

juliev0 and others added 2 commits June 17, 2024 12:54
Co-authored-by: Anton Gilgur <4970083+agilgur5@users.noreply.github.com>
Signed-off-by: Julie Vogelman <julievogelman0@gmail.com>
Co-authored-by: Anton Gilgur <4970083+agilgur5@users.noreply.github.com>
Signed-off-by: Julie Vogelman <julievogelman0@gmail.com>
@juliev0 juliev0 merged commit 2ca4841 into argoproj:main Jun 17, 2024
28 checks passed
@juliev0 juliev0 deleted the workflows-save-artifacts branch June 17, 2024 04:22
yulin-li pushed a commit to yulin-li/argo-workflows that referenced this pull request Jun 17, 2024
…utputs (argoproj#13066)

Signed-off-by: Julie Vogelman <julievogelman0@gmail.com>
Co-authored-by: Anton Gilgur <4970083+agilgur5@users.noreply.github.com>
Signed-off-by: Yulin Li <yulili@microsoft.com>
agilgur5 pushed a commit that referenced this pull request Jun 17, 2024
…utputs (#13066)

Signed-off-by: Julie Vogelman <julievogelman0@gmail.com>
Co-authored-by: Anton Gilgur <4970083+agilgur5@users.noreply.github.com>
(cherry picked from commit 2ca4841)
@agilgur5 agilgur5 mentioned this pull request Jun 24, 2024
4 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/artifacts S3/GCP/OSS/Git/HDFS etc area/gc Garbage collection, such as TTLs, retentionPolicy, delays, and more
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ArtifactGC fails when workflows are retried
3 participants