Add config to control Kubernetes Client retry behaviour #26710
hterik wants to merge 1 commit into apache:main
Occasionally, a request to the Kubernetes API fails because of a temporary network glitch. By default, such requests are retried 3 times with no delay between attempts. On the final failure, the entire scheduler crashes. This configuration allows the urllib3 retry behaviour to be adjusted, mainly to add some backoff between retries, giving the network time to recover before the final attempt. Fixes apache#24748
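To illustrate why a delay between attempts helps, here is a minimal standard-library sketch of retrying with exponential backoff. It is not the code in this PR and not urllib3's implementation; `call_with_backoff`, its parameters, and the use of `ConnectionError` as the "temporary glitch" signal are all hypothetical names chosen for the example. urllib3's `Retry` offers a comparable schedule through its `backoff_factor` argument.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(
    request: Callable[[], T],
    retries: int = 3,
    backoff_factor: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `request`, retrying on ConnectionError with exponential backoff.

    Hypothetical helper for illustration; not part of Airflow or urllib3.
    """
    if retries < 0:
        raise ValueError("retries must be >= 0")
    for attempt in range(retries + 1):
        try:
            return request()
        except ConnectionError:
            if attempt == retries:
                # Out of retries: let the final failure propagate.
                raise
            # Wait 1x, 2x, 4x, ... the backoff factor before the next attempt.
            sleep(backoff_factor * (2 ** attempt))
    raise AssertionError("unreachable")
```

With `retries=3` and `backoff_factor=0.5`, the waits before the three retries are 0.5, 1 and 2 seconds, so a glitch lasting a second or two no longer exhausts every attempt immediately. The `sleep` parameter is injectable only so the schedule can be checked without actually waiting.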
hterik
left a comment
I still need to add tests and run the existing ones. I haven't done that yet; I would first like some feedback on the design.
    else:
        configuration = Configuration()
        configuration.verify_ssl = False
        Configuration.set_default(configuration)
Is it OK to remove this set_default call and rely on every other code path going through get_kube_client?
I see that pod_generator and TaskInstance create an ApiClient in many places, but only use it for offline operations.
It is also created by hooks.kubernetes; I haven't looked into what that does yet.
Maybe it is safer to keep set_default and apply the new config through that same mechanism?
    retryparams = conf.getjson('kubernetes', 'client_retry_configuration_kwargs', fallback={})
    if retryparams != {}:
        client_config.retries = urllib3.util.Retry(**retryparams)
Is this level of configuration granularity good? Or is it enough to expose only the backoff and the number of retries?
I could even go as far as saying some kind of backoff should be enabled by default, without any configuration.
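For comparison, the narrower alternative could look roughly like the sketch below: accept the same JSON value but allow only the retry count and the backoff. `parse_retry_kwargs` and `ALLOWED_RETRY_KEYS` are hypothetical names, not part of this PR or of Airflow; `total` and `backoff_factor` are existing keyword arguments of `urllib3.util.Retry`, and the resulting dict would be passed on as `Retry(**kwargs)` as in the diff above.

```python
import json
from typing import Any, Dict

# Assumption: only these two urllib3 Retry keyword arguments are exposed.
ALLOWED_RETRY_KEYS = ("total", "backoff_factor")


def parse_retry_kwargs(raw: str) -> Dict[str, Any]:
    """Parse a JSON object and reject anything but the allowed retry settings.

    Hypothetical helper for illustration; an empty string means "use defaults".
    """
    parsed = json.loads(raw) if raw.strip() else {}
    if not isinstance(parsed, dict):
        raise ValueError("retry configuration must be a JSON object")
    unknown = sorted(set(parsed) - set(ALLOWED_RETRY_KEYS))
    if unknown:
        raise ValueError(f"unsupported retry options: {unknown}")
    return parsed
```

The trade-off is the one raised in the question: a whitelist keeps the config surface small and easy to document, while passing arbitrary kwargs through exposes every `Retry` option at the cost of coupling the config to urllib3's signature.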
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 5 days if no further activity occurs. Thank you for your contributions.
stale ping
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 5 days if no further activity occurs. Thank you for your contributions.
Hi, this is still a problem for us in production. Can we make forward progress here?
It was fixed in #29809 instead.