Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Increase the max wait time for EndpointCreate API requests #25805

Merged
merged 1 commit into from Jun 20, 2023

Conversation

squeed
Copy link
Contributor

@squeed squeed commented May 31, 2023

In the event that a node has a burst of PodSanboxCreate requests, API requests will pile up. However, we should wait longer for the queue to clear before short-circuting and returing failure.

This is because the kubelet has a relatively relaxed timeout for PodSandbox creation -- 4 minutes. Furthermore, if we return a failure here, it is propagated all the way back through containerd to kubelet, which will tear down the entire PodSandbox and try again, which can be expensive.

So, increase the maximum queue duration time to 1 minute. That should hopefully give enough time for the queue to clear.

Fixes: #24361

Cilium now waits longer before returning a failure in the event of a pod creation burst.

@squeed squeed added area/cni Impacts the Container Networking Interface between Cilium and the orchestrator. release-note/minor This PR changes functionality that users may find relevant to operating Cilium. labels May 31, 2023
@squeed squeed requested a review from a team as a code owner May 31, 2023 21:04
@squeed squeed requested a review from thorn3r May 31, 2023 21:04
@squeed
Copy link
Contributor Author

squeed commented Jun 1, 2023

/test

Copy link
Contributor

@thorn3r thorn3r left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice investigation on this one 👍

@squeed squeed force-pushed the endpointcreate-wait-longer branch from 5a9f0af to cd0b6d9 Compare June 9, 2023 07:52
@squeed
Copy link
Contributor Author

squeed commented Jun 9, 2023

/test

@christarazi christarazi added kind/enhancement This would improve or streamline existing functionality. area/daemon Impacts operation of the Cilium daemon. labels Jun 15, 2023
In the event that a node has a burst of PodSanboxCreate requests, API
requests will pile up. However, we should wait longer for the queue to
clear before short-circuting and returing failure.

This is because the kubelet has a relatively relaxed timeout for
PodSandbox creation -- 4 minutes. Furthermore, if we return a failure
here, it is propagated all the way back through containerd to kubelet,
which will tear down the entire PodSandbox and try again, which can be
expensive.

So, increase the maximum queue duration time to 1 minute. That should
hopefully give enough time for the queue to clear.

Signed-off-by: Casey Callendrello <cdc@isovalent.com>
@squeed squeed force-pushed the endpointcreate-wait-longer branch from cd0b6d9 to 4ff1ef8 Compare June 19, 2023 14:03
@squeed
Copy link
Contributor Author

squeed commented Jun 19, 2023

/test

@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Jun 19, 2023
@squeed squeed added the kind/bug This is a bug in the Cilium logic. label Jun 20, 2023
@squeed
Copy link
Contributor Author

squeed commented Jun 20, 2023

With regards to feature freeze, I argue this is fixing a known bug (and it tweaks a few configs, rather than introducing any sort of new behavior).

@borkmann borkmann merged commit c8eeebe into cilium:main Jun 20, 2023
61 of 62 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/cni Impacts the Container Networking Interface between Cilium and the orchestrator. area/daemon Impacts operation of the Cilium daemon. kind/bug This is a bug in the Cilium logic. kind/enhancement This would improve or streamline existing functionality. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/minor This PR changes functionality that users may find relevant to operating Cilium.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

CI: flakes due to putEndpointIdTooManyRequests
4 participants