fix: hanging wait container on save artifact to GCS bucket artifactRepository #7536

Merged
merged 4 commits into from
Jan 19, 2022
@kostas-theo (Contributor) commented Jan 10, 2022

Summary

argo-workflows version: v3.2.4
executor: emissary
K8s version: 1.20.10-gke.1600

We have wait containers intermittently hanging indefinitely. The logs of one affected wait container (see below) make it clear that the last event, before the stats.StartStatsTicker output starts repeating every 5 minutes indefinitely, is the attempt to save the main container's logs to our GCS bucket artifactRepository.

I applied the proposed fix and waited for the issue to recur. The affected pod now fails with this error instead of hanging:

Error (exit code 1): GCS storage.NewClient: dialing: google: could not find default credentials. See https://developers.google.com/accounts/docs/application-default-credentials for more information.

I believe the root cause is an underlying issue with Workload Identity, and I am still investigating why it happens. Regardless, there should be a definite limit on how long GCS failures are retried.

Diagnostics

Example error logs from affected wait container before making this fix:

time="2021-12-19T15:25:13.569Z" level=info msg="Starting Workflow Executor" executorType=emissary version=v3.2.4
time="2021-12-19T15:25:13.574Z" level=info msg="Creating a emissary executor"
time="2021-12-19T15:25:13.574Z" level=info msg="Executor initialized" deadline="0001-01-01 00:00:00 +0000 UTC" includeScriptOutput=true namespace=flood podName=flood-pi
peline-live-zwdld-2936040236 template="{\"name\":\"get-urban-start-time-utc\",\"inputs\":{\"parameters\":[{\"name\":\"run-id\",\"value\":\"27229680\"}]},\"
outputs\":{},\"nodeSelector\":{\"role\":\"flood-urban\"},\"metadata\":{},\"script\":{\"name\":\"\",\"image\":\"python:alpine3.6\",\"command\":[\"python\"],\"resources\"
:{\"limits\":{\"cpu\":\"1\",\"memory\":\"1G\"},\"requests\":{\"cpu\":\"1\",\"memory\":\"1G\"}},\"volumeMounts\":[{\"name\":\"data\",\"mountPath\":\"/var/lib/oneconcern\
"}],\"source\":\"import json, os\\nurban_args_file = \\\"/var/lib/oneconcern/argo-artifacts/27229680/urban_args.json\\\"\\n\\nwith open(urban_args_file) as f:\\n  full_
args = json.load(f)\\n\\nprint(full_args['urban_start_time'])\\n\"},\"archiveLocation\":{\"archiveLogs\":true,\"gcs\":{\"bucket\":\"argo-dev-artifacts\",\"key\":\"flood
/flood-pipeline-live-zwdld/flood-pipeline-live-zwdld-2936040236/\"}},\"retryStrategy\":{\"limit\":\"3\",\"retryPolicy\":\"Always\"},\"tolerati
ons\":[{\"key\":\"role\",\"operator\":\"Equal\",\"value\":\"flood-urban\"}]}" version="&Version{Version:v3.2.4,BuildDate:2021-11-17T23:18:57Z,GitCommit:8771ca279c329753
e420dbdd986a9c914876b151,GitTag:v3.2.4,GitTreeState:clean,GoVersion:go1.16.10,Compiler:gc,Platform:linux/amd64,}"
time="2021-12-19T15:25:13.574Z" level=info msg="Starting deadline monitor"
time="2021-12-19T15:25:16.574Z" level=info msg="Main container completed"
time="2021-12-19T15:25:16.574Z" level=info msg="Capturing script output"
time="2021-12-19T15:25:16.574Z" level=info msg="Saving logs"
time="2021-12-19T15:25:16.575Z" level=info msg="GCS Save path: /tmp/argo/outputs/logs/main.log, key: flood/flood-pipeline-live-zwdld/flood-pipeline-live-zwdld-2936040236/main.log"
time="2021-12-19T15:30:13.575Z" level=info msg="Alloc=20445 TotalAlloc=26113 Sys=73297 NumGC=6 Goroutines=9"
time="2021-12-19T15:35:13.574Z" level=info msg="Alloc=20492 TotalAlloc=26293 Sys=73297 NumGC=8 Goroutines=9"
time="2021-12-19T15:40:13.575Z" level=info msg="Alloc=20460 TotalAlloc=26458 Sys=73297 NumGC=11 Goroutines=9"
(ongoing every 5 mins indefinitely until pod is killed)
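The diagnosis above (find the last real event before the stats ticker output takes over) can be sketched as a small log scan. This is only an illustration, not part of the fix: `lastEventBeforeStats` is a hypothetical helper, and it assumes that stats ticker lines are the ones whose `msg` starts with `Alloc=`, as in the logs shown here.

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"strings"
)

// msgRe extracts the msg="..." field from a logrus-style text log line.
var msgRe = regexp.MustCompile(`msg="([^"]*)"`)

// lastEventBeforeStats (hypothetical helper) returns the last message that is
// not a stats ticker line, plus the number of stats lines that follow it.
func lastEventBeforeStats(logs string) (string, int) {
	last, trailing := "", 0
	sc := bufio.NewScanner(strings.NewReader(logs))
	for sc.Scan() {
		m := msgRe.FindStringSubmatch(sc.Text())
		if m == nil {
			continue // no msg field, e.g. a wrapped continuation line
		}
		if strings.HasPrefix(m[1], "Alloc=") {
			trailing++ // assumed to be stats ticker output
			continue
		}
		last, trailing = m[1], 0
	}
	return last, trailing
}

func main() {
	// A few lines copied from the logs above.
	logs := `time="2021-12-19T15:25:16.574Z" level=info msg="Saving logs"
time="2021-12-19T15:25:16.575Z" level=info msg="GCS Save path: /tmp/argo/outputs/logs/main.log, key: flood/flood-pipeline-live-zwdld/flood-pipeline-live-zwdld-2936040236/main.log"
time="2021-12-19T15:30:13.575Z" level=info msg="Alloc=20445 TotalAlloc=26113 Sys=73297 NumGC=6 Goroutines=9"
time="2021-12-19T15:35:13.574Z" level=info msg="Alloc=20492 TotalAlloc=26293 Sys=73297 NumGC=8 Goroutines=9"
time="2021-12-19T15:40:13.575Z" level=info msg="Alloc=20460 TotalAlloc=26458 Sys=73297 NumGC=11 Goroutines=9"`

	last, n := lastEventBeforeStats(logs)
	fmt.Printf("last event: %q, followed by %d stats lines\n", last, n)
}
```

A growing count of trailing stats lines after a "GCS Save path" message is the pattern described in the summary.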

Signed-off-by: kostas-theo <ktheo@oneconcern.com>
@kostas-theo kostas-theo marked this pull request as ready for review January 12, 2022 14:54
@kostas-theo (Contributor, Author) commented:

Thanks for the review @sarabala1979 - Could you point me in the right direction as to why the Unit Tests have failed? Is this step known to be flaky?

@alexec alexec changed the title Fix hanging wait container on save artifact to GCS bucket artifactRepository fix: hanging wait container on save artifact to GCS bucket artifactRepository Jan 19, 2022
@alexec alexec enabled auto-merge (squash) January 19, 2022 16:08
@alexec alexec merged commit f1fe3be into argoproj:master Jan 19, 2022
yriveiro pushed a commit to yriveiro/argo-workflows that referenced this pull request Jan 27, 2022
…pository (argoproj#7536)

Signed-off-by: kostas-theo <ktheo@oneconcern.com>
@alexec alexec mentioned this pull request Jan 27, 2022
@sarabala1979 sarabala1979 mentioned this pull request Mar 1, 2022
@kostas-theo (Contributor, Author) commented:

@sarabala1979 - Can this change make its way into the next release please?
