wait container | Failed to establish pod watch ... dial tcp i/o timeout #4980
-
Hi all, we submit hundreds of workflows at specific times of the day. Some of them end up in the "Error/Failed" status, and the MESSAGE column shows errors like the one in the title (`Failed to establish pod watch ... dial tcp i/o timeout`).
At first I thought it was caused by overloading the cluster, but after observing it for a while I found that the problem always happens in the wait container. Is there a bug in the wait container, and is there any way to solve this problem? Any suggestion would be appreciated.
Replies: 2 comments 2 replies
-
I think this should be raised as an issue rather than a discussion.
-
Copying my response from #4993 to this discussion, in the hope that others will find it useful as well:

It's likely that your cluster/apiserver is super unstable. There's an environment variable `TRANSIENT_ERROR_PATTERN` you can use to specify a regular expression for additional errors that you'd like to retry. There are also variables like `RETRY_BACKOFF_*` that you can set for further customization of the retry behavior. More details in this doc: https://github.com/argoproj/argo-workflows/blob/master/docs/environment-variables.md
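
For anyone hitting the same `dial tcp ... i/o timeout` message, here is a minimal sketch of how those variables might be set, assuming the default `workflow-controller` Deployment in the `argo` namespace. The regular expression and backoff values below are illustrative assumptions only; check the environment-variables doc linked above for the variables your version supports and which component (controller or executor) reads each one.

```yaml
# Sketch: env-var overrides on the workflow-controller Deployment.
# The pattern and backoff values are illustrative assumptions, not recommendations.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: workflow-controller
  namespace: argo
spec:
  template:
    spec:
      containers:
        - name: workflow-controller
          env:
            # Treat "dial tcp ... i/o timeout" errors as transient so they are retried
            - name: TRANSIENT_ERROR_PATTERN
              value: "dial tcp .*: i/o timeout"
            # RETRY_BACKOFF_* tune how retries of Kubernetes API calls back off
            - name: RETRY_BACKOFF_DURATION
              value: "1s"
            - name: RETRY_BACKOFF_FACTOR
              value: "2"
            - name: RETRY_BACKOFF_STEPS
              value: "10"
```

If the error actually originates in the wait container (the executor) rather than the controller, `TRANSIENT_ERROR_PATTERN` would need to be set on the executor side instead, for example via the `executor` section of the workflow-controller ConfigMap; treat that placement as an assumption and verify it against the docs for your version.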