fix: don't necessarily include all artifacts from templates in node outputs #13066

Open
wants to merge 12 commits into
base: main

juliev0
Contributor

@juliev0 juliev0 commented May 19, 2024

An artifact that was never written is still listed in the NodeStatus, which causes ArtifactGC to attempt to delete it (and fail doing so).

Fixes: #12845

Motivation

If an Artifact file didn't get written, it shouldn't be in the outputs for the TaskResult (and thus the Outputs for the Node) just because it was listed as an Artifact in the Workflow.

Modifications

Only if the Artifact file is successfully written does the wait container include it as one of the Output Artifacts in its WorkflowTaskResult. The Workflow Controller uses that WorkflowTaskResult information to populate the NodeStatus of the Workflow. This is the information that ArtifactGC uses to determine what needs to be deleted.
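
To illustrate the change described above, here is a minimal sketch of the "report only what was actually saved" pattern. It is not the executor's real code: `Artifact`, `errNotWritten`, and `collectSaved` are hypothetical stand-ins, and skipping a failed artifact and continuing (rather than aborting) is an assumption made only for the illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// Artifact is a hypothetical stand-in for the real artifact type.
type Artifact struct {
	Name string
	Path string
}

// errNotWritten is a hypothetical error for a file that was never produced.
var errNotWritten = errors.New("artifact file not written")

// collectSaved tries to save every declared artifact and returns a fresh
// slice containing only the ones that were saved, plus the names that failed.
// The declared slice itself is never modified.
func collectSaved(declared []Artifact, save func(Artifact) error) (saved []Artifact, failed []string) {
	saved = make([]Artifact, 0, len(declared))
	for _, art := range declared {
		if err := save(art); err != nil {
			// Assumption: a failed artifact is skipped, not reported as an output.
			failed = append(failed, art.Name)
			continue
		}
		saved = append(saved, art)
	}
	return saved, failed
}

func main() {
	declared := []Artifact{
		{Name: "report", Path: "/tmp/report.txt"},
		{Name: "missing", Path: "/tmp/missing.txt"},
	}
	save := func(a Artifact) error {
		if a.Name == "missing" {
			return errNotWritten
		}
		return nil
	}
	saved, failed := collectSaved(declared, save)
	fmt.Println("saved:", saved)
	fmt.Println("failed:", failed)
}
```

Only the returned `saved` slice would then be passed on for reporting, so anything that consumes the reported outputs never sees an artifact that does not exist in storage.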

Verification

The person who submitted the issue tested the image I built, and it consistently resolved his ArtifactGC failure. Note that the bug had also prevented his other artifacts from getting deleted, because the failed deletion preempted the other deletions from being attempted (as @agilgur5 pointed out, this could be another improvement, to continue the deletion process independent of failure).
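
As a rough sketch of the "continue the deletion process independent of failure" idea mentioned above (not part of this PR, and not ArtifactGC's actual code), a deletion loop can collect errors instead of returning on the first one. `deleteAll`, `errNotFound`, and the `del` callback are hypothetical names; `errors.Join` needs Go 1.20 or later.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound is a hypothetical error returned by the storage backend.
var errNotFound = errors.New("object not found")

// deleteAll attempts every key even when an earlier deletion fails and
// returns all failures joined into one error (nil when every deletion succeeds).
func deleteAll(keys []string, del func(string) error) error {
	var errs []error
	for _, key := range keys {
		if err := del(key); err != nil {
			errs = append(errs, fmt.Errorf("deleting %q: %w", key, err))
		}
	}
	return errors.Join(errs...)
}

func main() {
	var attempted []string
	err := deleteAll([]string{"a", "missing", "b"}, func(key string) error {
		attempted = append(attempted, key)
		if key == "missing" {
			return errNotFound
		}
		return nil
	})
	fmt.Println("attempted:", attempted)
	fmt.Println("error:", err)
}
```

With this shape, one missing object no longer prevents the remaining artifacts from being deleted, while the caller still learns that something failed.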

I did start to create an e2e test for this issue, but with MinIO I can't reproduce the same failure on the "main" branch. For some reason, attempting to delete an artifact which doesn't exist doesn't seem to result in a failure.

…utputs

Signed-off-by: Julie Vogelman <julievogelman0@gmail.com>
}
we.Template.Outputs.Artifacts[i] = art
Contributor Author

instead of updating this directly, create a new list of the artifacts that we actually saved and return that

Contributor

@Garett-MacGowan Garett-MacGowan Jun 8, 2024

LGTM. So before we were using the we.Template.Outputs.Artifacts to pass to ReportOutputs but now we are creating a fresh wfv1.Artifacts object that will only include the successfully saved artifacts. This makes sense.

Member

@agilgur5 agilgur5 Jun 10, 2024

Hmmm I do see a small issue here, but I think this already existed beforehand:
If the wait container is interrupted during this loop, some artifacts may save to S3 (etc) but not be reported

But since adding artifacts to Outputs.Artifacts wouldn't report them until the very end anyway, I think this doesn't change any existing behavior

Contributor Author

> If the wait container is interrupted during this loop, some artifacts may save to S3 (etc) but not be reported

How do you mean it wouldn't be reported? Since the background context is passed into both SaveArtifacts() and ReportOutputs() below, I believe all of that logic is still supposed to be executed. Is there something different you're referring to besides the context being cancelled?

Member

@agilgur5 agilgur5 Jun 10, 2024

If the loop is long or the container is otherwise non-gracefully terminated, the request context won't matter.

Also realized that saving artifacts can be parallelized here to reduce that chance and improve throughput (related: #12442). Similarly pre-existing logic though.
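
A rough sketch of what parallelized saving could look like, purely as an illustration of the suggestion above and not taken from this PR or from #12442. `Artifact`, `errSaveFailed`, and `saveAllParallel` are hypothetical; the sketch starts one goroutine per artifact with no concurrency limit, which a real implementation would probably want to bound.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Artifact is a hypothetical stand-in for the real artifact type.
type Artifact struct{ Name string }

// errSaveFailed is a hypothetical upload error.
var errSaveFailed = errors.New("save failed")

// saveAllParallel saves every declared artifact concurrently and returns,
// in declaration order, only the artifacts whose save succeeded.
func saveAllParallel(declared []Artifact, save func(Artifact) error) []Artifact {
	ok := make([]bool, len(declared)) // each goroutine writes only its own index
	var wg sync.WaitGroup
	for i, art := range declared {
		wg.Add(1)
		go func(i int, art Artifact) {
			defer wg.Done()
			ok[i] = save(art) == nil
		}(i, art)
	}
	wg.Wait()

	saved := make([]Artifact, 0, len(declared))
	for i, art := range declared {
		if ok[i] {
			saved = append(saved, art)
		}
	}
	return saved
}

func main() {
	declared := []Artifact{{Name: "a"}, {Name: "b"}, {Name: "c"}}
	saved := saveAllParallel(declared, func(a Artifact) error {
		if a.Name == "b" {
			return errSaveFailed
		}
		return nil
	})
	fmt.Println("saved:", saved)
}
```

Shortening the loop this way narrows the window in which an interruption can leave saved-but-unreported artifacts, but it does not close it: the list is still only reported after all saves finish.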

Signed-off-by: Julie Vogelman <julievogelman0@gmail.com>
Signed-off-by: Julie Vogelman <julievogelman0@gmail.com>
@agilgur5 agilgur5 added the area/artifacts S3/GCP/OSS/Git/HDFS etc label May 29, 2024
@juliev0 juliev0 marked this pull request as ready for review June 1, 2024 23:55
Contributor

@Garett-MacGowan Garett-MacGowan left a comment

LGTM

@agilgur5 agilgur5 self-assigned this Jun 8, 2024
…, including when no artifact is written

Signed-off-by: Julie Vogelman <julievogelman0@gmail.com>
Member

agilgur5 commented Jun 9, 2024

> (as @agilgur5 pointed out, this could be another improvement, to continue the deletion process independent of failure)

I was thinking about this and it will probably be fixed by parallel artifact GC (#11768), as that inherently has to process each artifact independently.

Vaguely related to this, I was wondering if creating more WorkflowTaskResults, e.g. child results, might help with some of these races, i.e. via create-only immutable records of progress. Particularly it could resolve a security issue with malicious Workflows as mentioned in #12783 (comment) as the Executor would then only need create permissions, and not patch.

Signed-off-by: Julie Vogelman <julievogelman0@gmail.com>
@@ -352,7 +368,7 @@ func (s *ArtifactsSuite) TestArtifactGC() {
 }
 })

-if when.WorkflowCondition(func(wf *wfv1.Workflow) bool {
+if tt.workflowShouldSucceed && when.WorkflowCondition(func(wf *wfv1.Workflow) bool {
Contributor Author

This condition was previously there to avoid intermittent test failures caused by the workflow itself failing when it shouldn't. Now that one of the test Workflows is expected to fail, tt.workflowShouldSucceed is how I'm allowing that one to fail.

Signed-off-by: Julie Vogelman <julievogelman0@gmail.com>
@agilgur5 agilgur5 added this to the v3.5.x patches milestone Jun 10, 2024
@agilgur5 agilgur5 linked an issue Jun 10, 2024 that may be closed by this pull request
4 tasks
Member

@agilgur5 agilgur5 left a comment

Mostly LGTM. I left a few style comments on the tests, one suggestion for the test case, and a question on the saving + reporting. Still learning my way around the artifacts part of the codebase

test/e2e/artifacts_test.go (outdated, resolved)
Comment on lines 346 to 347
expectedGCPodsOnWFCompletion: 0,
expectedArtifacts: []artifactState{},
Member

hmm to ensure that GC works in the presence of a missing artifact, should we make the Workflow have a normal, successful artifact as well? i.e. the same case as the bug report; some successful artifacts, some unsuccessful

in this current test case, GC doesn't run at all, right?

Contributor Author

Okay, I've modified the test Workflow so that it also successfully saves an artifact file and garbage collects it. I had to update the test to look for artifact key names that are auto-generated.

@agilgur5 agilgur5 added the area/gc Garbage collection, such as TTLs, retentionPolicy, delays, and more label Jun 10, 2024
juliev0 and others added 4 commits June 11, 2024 05:29
Co-authored-by: Anton Gilgur <4970083+agilgur5@users.noreply.github.com>
Signed-off-by: Julie Vogelman <julievogelman0@gmail.com>
Co-authored-by: Anton Gilgur <4970083+agilgur5@users.noreply.github.com>
Signed-off-by: Julie Vogelman <julievogelman0@gmail.com>
Co-authored-by: Anton Gilgur <4970083+agilgur5@users.noreply.github.com>
Signed-off-by: Julie Vogelman <julievogelman0@gmail.com>
Co-authored-by: Anton Gilgur <4970083+agilgur5@users.noreply.github.com>
Signed-off-by: Julie Vogelman <julievogelman0@gmail.com>
Signed-off-by: Julie Vogelman <julievogelman0@gmail.com>
Signed-off-by: Julie Vogelman <julievogelman0@gmail.com>
Labels
area/artifacts S3/GCP/OSS/Git/HDFS etc area/gc Garbage collection, such as TTLs, retentionPolicy, delays, and more
Development

Successfully merging this pull request may close these issues.

ArtifactGC fails when workflows are retried
3 participants