fix: hanging wait container on save artifact to GCS bucket artifactRepository #7536
Summary
argo-workflows version: v3.2.4
executor: emissary
K8s version: 1.20.10-gke.1600
We have wait containers intermittently hanging indefinitely. From the logs of one of the affected wait containers (see below), it's clear that the last event before the stats.StartStatsTicker output starts repeating indefinitely every 5 minutes is the attempt to save the main container's logs to our GCS bucket artifactRepository.
I have applied the proposed fix and waited for a recurrence of this issue; the affected pod now fails with this error instead:
I believe the root cause is some underlying issue with Workload Identity, and I am still investigating why it happens. Regardless, there should be a definite limit on how long GCS failures are retried.
Diagnostics
Example error logs from affected wait container before making this fix: