Intermittent errors when saving artifacts to GCS when using GKE with Workload Identity #10282

Closed
3 tasks done
kevholmes opened this issue Dec 27, 2022 · 5 comments · Fixed by #10292

@kevholmes
Contributor

kevholmes commented Dec 27, 2022

Pre-requisites

  • I have double-checked my configuration
  • I can confirm the issue exists when I tested with :latest
  • I'd like to contribute the fix myself (see contributing guide)

What happened/what you expected to happen?

We are seeing several different errors when loading and storing workflow artifacts in GCS. If I run 1,000 identical test workflows, around 20-100 of them fail due to errors returned from GCP. It appears that these transient errors are not all being caught, so they are never retried following Google's exponential backoff guidance.

What's interesting is that the 504 doesn't appear to be caught and retried either. I wonder if it's because of how the error comes back: possibly it isn't being matched by the case *googleapi.Error: branch?

I also have a ticket open with GCP Support to see if they can shed light on this from their end. If we can determine what's actually causing this, I'd love to submit a PR with a fix.

Relevant code: https://github.com/argoproj/argo-workflows/blob/master/workflow/artifacts/gcs/gcs.go#L41-L99

Version

v3.4.1

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

https://raw.githubusercontent.com/argoproj/argo-workflows/master/examples/artifact-passing.yaml

This example workflow will net us around 20-100 failures for every 1,000 executed.

Logs from the workflow controller

The controller output doesn't appear to have anything relevant at this time.

Logs from your workflow's wait container

time="2022-12-22T21:34:02.919Z" level=warning msg="Non-transient error: upload /tmp/argo/outputs/artifacts/hello-art.tgz: writer close: Post \"https://storage.googleapis.com/upload/storage/v1/b/${BUCKET_REDACTED}/o?alt=json&name=default%2Fns%3Dargo%2Fdt%3D2022-12-22%2Fartifact-passing-nodeselector-d9pdg%2Fartifact-passing-nodeselector-d9pdg-whalesay-1608025048%2Fhello-art.tgz&prettyPrint=false&projection=full&uploadType=multipart\": metadata: GCE metadata \"instance/service-accounts/default/token?scopes=https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdevstorage.full_control%2Chttps%3A%2F%2Fwww.googleapis.com%2Fauth%2Fcloud-platform\" not defined"

time="2022-12-27T15:45:19.319Z" level=warning msg="Non-transient error: upload /tmp/argo/outputs/artifacts/hello-art.tgz: writer close: Post \"https://storage.googleapis.com/upload/storage/v1/b/${BUCKET_REDACTED}/o?alt=json&name=default%2Fns%3Dargo%2Fdt%3D2022-12-27%2Fartifact-passing-nodeselector-7vfsj%2Fartifact-passing-nodeselector-7vfsj-whalesay-3609688236%2Fhello-art.tgz&prettyPrint=false&projection=full&uploadType=multipart\": compute: Received 504 `Gateway Timeout\n`"

Non-transient error: upload /tmp/argo/outputs/artifacts/hello-art.tgz: writer close: Post \"https://storage.googleapis.com/upload/storage/v1/b/${BUCKET_REDACTED}/o?alt=json&name=default%2Fns%3Dargo%2Fdt%3D2022-12-27%2Fartifact-passing-nodeselector-7vfsj%2Fartifact-passing-nodeselector-7vfsj-whalesay-3609688236%2Fhello-art.tgz&prettyPrint=false&projection=full&uploadType=multipart\": http2: client connection lost

time="2022-12-28T21:23:40.614Z" level=info msg="gcs.go/isTransientGCSErr() err: writer close: Post \"https://storage.googleapis.com/upload/storage/v1/b/${BUCKET_REDACTED}/o?alt=json&name=default%2Fns%3Dargo%2Fdt%3D2022-12-28%2Fartifact-passing-nodeselector-dqq9d%2Fartifact-passing-nodeselector-dqq9d-whalesay-2111961005%2Fhello-art.tgz&prettyPrint=false&projection=full&uploadType=multipart\": compute: Received 500 `Unable to generate access token; IAM returned \n`"

@kevholmes
Contributor Author

This might be related to #10174.

@kevholmes
Contributor Author

kevholmes commented Dec 29, 2022

I can confirm that when moving our workflows back to a GCP Service Account key, these failures appear to go away. I ran 2,000 workflows that generated 4,000 pods passing artifacts to one another and didn't see any failures with a slightly modified GCS artifact driver that I forked from 3.4.1. Typically we'd see around 2-10% of these fail with Workload Identity. After rolling back to the v3.4.1 image from your quay repos, I am seeing a 99.7% success rate with a simple 1k-workflow test.

@kevholmes
Contributor Author

kevholmes commented Dec 30, 2022

During my testing with Workload Identity enabled, I also added the Google-suggested initContainers solution to my simple artifact-passing test, and it did seem to handle a number of initial transient errors (404s from the curl'd endpoint). I still find the "not defined" error in my description above striking, and it causes a sub-optimal workflow success ratio for those who use the GCS driver.

@laughingman7743

I am also using Workload Identity with GKE and have initContainers set up, but log uploads still fail sometimes (#10174). (Perhaps it only happens occasionally for me because I'm not running thousands of workflows like you are.)
In my case, I often get timeouts at the SSL layer during log uploads, so I think those need to be retried as well, in addition to retrying based on the HTTP status code.

@kevholmes
Contributor Author

@laughingman7743 I can confirm that I did see a few instances of net/http: TLS handshake timeout errors during my testing with WI + GKE. From what I have seen so far, the transient error handling code in the GCS driver isn't catching a number of cases. There are probably five or six error messages that slip through the cracks when using Argo WF + WI + the GCS artifact driver, and probably two without WI.

I have also received a reply from GCP/GKE support and they are escalating this issue to the Workload Identity team for review as they believe it's an issue with that project rather than GKE itself.

I can work on a small PR in the coming days to update the transient error function with my fixes.

sarabala1979 added a commit that referenced this issue Jan 5, 2023
…driver Fixes #10282 #10174 (#10292)

Signed-off-by: Kevin Holmes <kholmes@synack.com>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Saravanan Balasubramanian <33908564+sarabala1979@users.noreply.github.com>
reddymh pushed a commit to reddymh/argo-workflows that referenced this issue Jan 31, 2023
…driver Fixes argoproj#10282 argoproj#10174 (argoproj#10292)

Signed-off-by: Kevin Holmes <kholmes@synack.com>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Saravanan Balasubramanian <33908564+sarabala1979@users.noreply.github.com>
Signed-off-by: Rajshekar Reddy <reddymh@gmail.com>