Intermittent errors when saving artifacts to GCS when using GKE with Workload Identity #10282
Comments
This might be related to #10174.
I can confirm that when moving our workflows back to using a GCP Service Account key, these failures/errors appear to go away. I ran 2,000 workflows that generated 4,000 pods passing artifacts to one another and didn't see any failures with a slightly modified GCS artifact driver I forked from 3.4.1. With Workload Identity we'd typically see around 2-10% of these fail. Rolling back to v3.4.1 from your quay repos, I'm seeing a 99.7% success rate with a simple 1k-workflow test.
During my testing with Workload Identity enabled, I also added the Google-suggested initContainers workaround to my simple artifact-passing test, and it did seem to handle a number of the initial transient errors (404s from the curl'd metadata endpoint). I would still find the occasional failure, though; a rough sketch of that initContainer is below.
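For anyone else trying this, here is roughly what that initContainer looks like. Treat it as a sketch: the image, endpoint, and sleep interval are what I'd reach for based on the GKE docs, not an exact copy of the manifest I ran.

```yaml
# Sketch of the Google-suggested Workload Identity init container: hold pod
# startup until the GKE metadata server can mint a token for the bound KSA.
initContainers:
  - name: wait-for-workload-identity
    image: gcr.io/google.com/cloudsdktool/cloud-sdk:alpine  # any image with curl works
    command:
      - sh
      - -c
      - |
        until curl -sf -H "Metadata-Flavor: Google" \
            "http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/token" \
            > /dev/null; do
          echo "waiting for GKE metadata server / Workload Identity..."
          sleep 2
        done
```

This only papers over the startup race, though; it doesn't help with transient errors that show up later in the pod's life.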
I am also using Workload Identity with GKE and have initContainers set up, but log uploads still fail sometimes. (Perhaps I only see it occasionally because I'm not running thousands of large workflows like you are.)
@laughingman7743 I can confirm that I saw a few instances of that as well. I have also received a reply from GCP/GKE support: they are escalating this issue to the Workload Identity team for review, as they believe it's an issue with that project rather than with GKE itself. I can work on a small PR here in the coming days to update the transient-error function to include my fixes.
…driver Fixes #10282 #10174 (#10292) Signed-off-by: Kevin Holmes <kholmes@synack.com> Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Saravanan Balasubramanian <33908564+sarabala1979@users.noreply.github.com>
…driver Fixes argoproj#10282 argoproj#10174 (argoproj#10292) Signed-off-by: Kevin Holmes <kholmes@synack.com> Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Saravanan Balasubramanian <33908564+sarabala1979@users.noreply.github.com> Signed-off-by: Rajshekar Reddy <reddymh@gmail.com>
Pre-requisites
What happened / what did you expect to happen?
We are seeing a number of different errors when attempting to load/store workflow artifacts in GCS. If I run 1,000 identical test workflows, around 20-100 of them will fail due to errors returned from GCP. It appears the transient errors are not being fully caught, so they are not retried with the exponential backoff Google recommends for situations like this.
What's interesting is that the 504 doesn't appear to be caught and retried either. I wonder if it's because of how the error comes back: it isn't being picked up by the
case *googleapi.Error:
branch, possibly? I also have a ticket open with GCP Support to see if they can shed any light on this from their end. If we can determine what's actually causing it, I'd love to submit a PR with a fix.
Relevant code: https://github.com/argoproj/argo-workflows/blob/master/workflow/artifacts/gcs/gcs.go#L41-L99
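To make the idea concrete, here's a rough sketch of the direction I'm thinking of for the transient-error check. The helper name, backoff numbers, and exact set of retried status codes are my assumptions, not the current implementation:

```go
package gcs

import (
	"errors"
	"net/http"
	"time"

	"google.golang.org/api/googleapi"
	"k8s.io/apimachinery/pkg/util/wait"
)

// isTransientGCSErr reports whether err looks retryable. It uses errors.As so
// a wrapped *googleapi.Error is still recognised, and treats 429 and all 5xx
// responses (including the 502/503/504 we see intermittently) as transient.
func isTransientGCSErr(err error) bool {
	if err == nil {
		return false
	}
	var gErr *googleapi.Error
	if errors.As(err, &gErr) {
		return gErr.Code == http.StatusTooManyRequests || gErr.Code >= 500
	}
	return false
}

// withRetry runs fn with exponential backoff and retries only transient errors.
func withRetry(fn func() error) error {
	backoff := wait.Backoff{Duration: time.Second, Factor: 2.0, Jitter: 0.5, Steps: 5}
	return wait.ExponentialBackoff(backoff, func() (bool, error) {
		err := fn()
		switch {
		case err == nil:
			return true, nil
		case isTransientGCSErr(err):
			return false, nil // transient: back off and retry
		default:
			return false, err // permanent: give up immediately
		}
	})
}
```

The key difference from a plain type switch is errors.As: a 504 that comes back wrapped (e.g. via fmt.Errorf with %w) would still be classified as transient. The driver's save/load paths would then wrap their client calls in withRetry.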
Version
v3.4.1
Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.
Logs from the workflow controller
Logs from your workflow's wait container