Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add request timed out in errors.isTransientNetworkErr #4144

Closed
anggao opened this issue Sep 28, 2020 · 5 comments
Closed

Add request timed out in errors.isTransientNetworkErr #4144

anggao opened this issue Sep 28, 2020 · 5 comments
Labels
Milestone

Comments

@anggao
Copy link
Contributor

anggao commented Sep 28, 2020

Summary

What happened/what you expected to happen?

We got following errors from time to time:
time="2020-09-25T14:34:34.038Z" level=error msg="executor error: etcdserver: request timed out"

Looking at the ExponentialBackoff usage in executor codes, e.g. https://github.com/argoproj/argo/blob/master/workflow/executor/executor.go#L622

It will failed earlier if the error is not transient error, the method IsTransientErr: https://github.com/argoproj/argo/blob/master/util/errors/errors.go#L40
Seems missing case for request timed out

Diagnostics

What version of Argo Workflows are you running?

2.10.1
time="2020-09-25T14:06:14.611Z" level=info msg="Waiting on main container"
time="2020-09-25T14:06:15.541Z" level=info msg="main container started with container ID: b69687396a67edffa59337f84882f22ded284b44b05b61718578bd0c403f207e"
time="2020-09-25T14:06:15.541Z" level=info msg="Starting annotations monitor"
time="2020-09-25T14:06:15.544Z" level=info msg="Waiting for container b69687396a67edffa59337f84882f22ded284b44b05b61718578bd0c403f207e to complete"
time="2020-09-25T14:06:15.544Z" level=info msg="Starting to wait completion of containerID b69687396a67edffa59337f84882f22ded284b44b05b61718578bd0c403f207e ..."
time="2020-09-25T14:06:15.544Z" level=info msg="Starting deadline monitor"
time="2020-09-25T14:11:14.612Z" level=info msg="Alloc=6049 TotalAlloc=42920 Sys=71872 NumGC=12 Goroutines=10"
time="2020-09-25T14:16:14.611Z" level=info msg="Alloc=8286 TotalAlloc=70981 Sys=72384 NumGC=18 Goroutines=10"
time="2020-09-25T14:21:14.612Z" level=info msg="Alloc=6010 TotalAlloc=99087 Sys=72384 NumGC=25 Goroutines=10"
time="2020-09-25T14:26:14.611Z" level=info msg="Alloc=7707 TotalAlloc=127181 Sys=72640 NumGC=31 Goroutines=10"
time="2020-09-25T14:31:14.612Z" level=info msg="Alloc=5406 TotalAlloc=155271 Sys=72640 NumGC=38 Goroutines=10"
time="2020-09-25T14:34:34.038Z" level=warning msg="Failed to wait for container id 'b69687396a67edffa59337f84882f22ded284b44b05b61718578bd0c403f207e': etcdserver: request timed out"
time="2020-09-25T14:34:34.038Z" level=error msg="executor error: etcdserver: request timed out"
time="2020-09-25T14:34:34.038Z" level=info msg="No Script output reference in workflow. Capturing script output ignored"
time="2020-09-25T14:34:34.038Z" level=info msg="Capturing script exit code"
time="2020-09-25T14:34:34.038Z" level=info msg="Getting exit code of b69687396a67edffa59337f84882f22ded284b44b05b61718578bd0c403f207e"
time="2020-09-25T14:34:34.038Z" level=info msg="Annotations monitor stopped"
time="2020-09-25T14:34:34.044Z" level=info msg="No output parameters"
time="2020-09-25T14:34:34.044Z" level=info msg="No output artifacts"
time="2020-09-25T14:34:34.044Z" level=info msg="Killing sidecars"
time="2020-09-25T14:34:34.105Z" level=info msg="Alloc=6181 TotalAlloc=173280 Sys=72640 NumGC=42 Goroutines=9"


Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.

@alexec
Copy link
Contributor

alexec commented Sep 28, 2020

This is a great idea. I think you need to add this to isTransientNetworkErr. Would you like to submit a PR?

@alexec alexec added this to the v2.10 milestone Sep 28, 2020
@anggao
Copy link
Contributor Author

anggao commented Sep 28, 2020

@alexec yes that's seems the right place to add, sure I will provide a PR.
curious why we need this earlier exit, since ExecutorRetry config is already short enough, the pain point is looks like we have to enumerate all possible transient error cases ?

@alexec
Copy link
Contributor

alexec commented Sep 28, 2020

Good point. You're thinking just always retry rather that enumerate all transient errors. I think this is a good idea too. @jessesuen can you confirm what the reason for excluding (what was at the time) called as retry-able errors in #685

@alexec
Copy link
Contributor

alexec commented Sep 28, 2020

@jessesuen has agreed with me. You can retry any error and remove the IsTransientErr check.

@anggao
Copy link
Contributor Author

anggao commented Sep 28, 2020

@alexec great! I will update that

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants