
Add config to control Kubernetes Client retry behaviour #26710

Closed

hterik wants to merge 1 commit into apache:main from hterik:k8sretrybackoff

Conversation

@hterik (Contributor) commented Sep 27, 2022

Occasionally, a request to the Kubernetes API might fail due to temporary network glitches. By default, such requests are retried 3 times, with no delay between attempts. On the final failure, the entire scheduler crashes.

This configuration allows the urllib3 retry behaviour to be adjusted, mainly to allow some backoff between retries, giving the network time to recover before the final attempt.

Fixes #24748
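As a sketch of how the proposed option might be set in airflow.cfg (the option name matches the diff discussed in this PR; the values are illustrative, not defaults):

```ini
[kubernetes]
# JSON kwargs forwarded to urllib3.util.Retry, e.g. 5 attempts with
# exponential backoff starting at 0.5s. Values here are illustrative.
client_retry_configuration_kwargs = {"total": 5, "backoff_factor": 0.5}
```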

@boring-cyborg boring-cyborg bot added the provider:cncf-kubernetes Kubernetes (k8s) provider related issues label Sep 27, 2022
@hterik (Contributor, Author) left a comment

Will need to work on adding tests and running the existing tests. Haven't done that yet; I would first like some feedback on the design.

else:
    configuration = Configuration()
    configuration.verify_ssl = False
    Configuration.set_default(configuration)

Is it ok to remove this set_default and rely only on every other code path going through get_kube_client?

I see that pod_generator and TaskInstance create ApiClient in many places, but only use it for offline operations.

It's also created by hooks.kubernetes; I haven't looked into what that does yet.


Maybe it's safer to keep it this way and incorporate the new config using this same method?


retryparams = conf.getjson('kubernetes', 'client_retry_configuration_kwargs', fallback={})
if retryparams != {}:
    client_config.retries = urllib3.util.Retry(**retryparams)

Is this level of configuration granularity good, or is it enough to only expose the backoff and the number of retries?
I could even go as far as saying some kind of backoff should be enabled by default, without any configuration.
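For intuition about what the backoff would buy: urllib3's Retry sleeps roughly backoff_factor * (2 ** (number of previous retries)) between attempts (the exact formula and whether the first retry sleeps vary slightly across urllib3 versions, and real urllib3 caps the sleep at a maximum). A small sketch of those delays, assuming the plain exponential form:

```python
def backoff_delays(retries: int, backoff_factor: float) -> list[float]:
    """Approximate urllib3-style exponential backoff delays (a sketch).

    Assumes sleep = backoff_factor * (2 ** (retry_number - 1)) before each
    retry; real urllib3 also caps each sleep at Retry's backoff maximum.
    """
    return [backoff_factor * (2 ** (attempt - 1)) for attempt in range(1, retries + 1)]

# With backoff_factor=0.5 and 3 retries, the delays would be 0.5s, 1s, 2s,
# giving the network time to recover before the final attempt.
print(backoff_delays(3, 0.5))
```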

@github-actions

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 5 days if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale Stale PRs per the .github/workflows/stale.yml policy file label Nov 12, 2022
@hterik (Contributor, Author) commented Nov 14, 2022

stale ping

@github-actions github-actions bot removed the stale Stale PRs per the .github/workflows/stale.yml policy file label Nov 15, 2022
@github-actions

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 5 days if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale Stale PRs per the .github/workflows/stale.yml policy file label Dec 31, 2022
@github-actions github-actions bot closed this Jan 5, 2023
@dinedal commented Jun 8, 2023

Hi, this is a problem for us in production still, can we make forward progress here?

@hterik (Contributor, Author) commented Oct 5, 2023

Hi, this is a problem for us in production still, can we make forward progress here?

It was fixed in #29809 instead


Labels

provider:cncf-kubernetes (Kubernetes (k8s) provider related issues)
stale (Stale PRs per the .github/workflows/stale.yml policy file)


Development

Successfully merging this pull request may close these issues.

Configuring retry policy of the kubernetes CoreV1Api ApiClient

2 participants