ipam: Add exponential backoff when pool maintenance fails #21473
Merged
ti-mo
merged 3 commits into
cilium:master
from
gandro:pr/gandro/ipam-add-maintenance-backoff
Sep 30, 2022
Conversation
This adds a new optional callback to the trigger mechanism which will be called if a trigger is stopped via Trigger.Shutdown. Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
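For illustration, a minimal self-contained sketch of a trigger with an optional shutdown callback; the type, field names, and channel-based loop below are assumptions for this sketch, not Cilium's actual pkg/trigger implementation:

```go
// Minimal sketch of a trigger with an optional shutdown callback.
// All names here are illustrative, not Cilium's pkg/trigger API.
package main

import (
	"fmt"
	"time"
)

type Trigger struct {
	minInterval  time.Duration
	triggerFunc  func()
	shutdownFunc func() // optional; invoked once when the trigger is stopped
	wakeup       chan struct{}
	stop         chan struct{}
}

func NewTrigger(minInterval time.Duration, triggerFunc, shutdownFunc func()) *Trigger {
	t := &Trigger{
		minInterval:  minInterval,
		triggerFunc:  triggerFunc,
		shutdownFunc: shutdownFunc,
		wakeup:       make(chan struct{}, 1),
		stop:         make(chan struct{}),
	}
	go t.loop()
	return t
}

// Trigger requests a run; repeated calls before the next run are coalesced.
func (t *Trigger) Trigger() {
	select {
	case t.wakeup <- struct{}{}:
	default:
	}
}

// Shutdown stops the background loop; the loop then invokes shutdownFunc.
func (t *Trigger) Shutdown() { close(t.stop) }

func (t *Trigger) loop() {
	for {
		select {
		case <-t.stop:
			if t.shutdownFunc != nil {
				t.shutdownFunc()
			}
			return
		case <-t.wakeup:
			t.triggerFunc()
			time.Sleep(t.minInterval) // enforce the minimum interval between runs
		}
	}
}

func main() {
	tr := NewTrigger(10*time.Millisecond,
		func() { fmt.Println("maintaining pool") },
		func() { fmt.Println("trigger stopped") })
	tr.Trigger()
	time.Sleep(50 * time.Millisecond)
	tr.Shutdown()
	time.Sleep(10 * time.Millisecond) // give the loop time to run the callback
}
```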
This commit extracts the main ClusterSizeDependantInterval computation so it can be used by different node managers. It will be used in a subsequent commit. Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
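A rough sketch of how a cluster-size-dependent interval can be computed by scaling a base interval logarithmically with the number of managed nodes; the signature and scaling factor are illustrative assumptions, not necessarily the exact code extracted by this commit:

```go
package backoff

import (
	"math"
	"time"
)

// ClusterSizeDependantInterval returns a wait interval that grows
// logarithmically with the number of managed nodes, so that larger
// clusters spread out their API calls instead of retrying in lockstep.
// (Sketch; the exact scaling used by Cilium may differ.)
func ClusterSizeDependantInterval(baseInterval time.Duration, numNodes int) time.Duration {
	if numNodes == 0 {
		// Nothing is being managed; just use the base interval.
		return baseInterval
	}
	// Scale the base interval by log(1 + numNodes).
	wait := float64(baseInterval.Nanoseconds()) * math.Log1p(float64(numNodes))
	return time.Duration(int64(wait))
}
```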
When pool maintenance fails, the pool maintenance trigger is triggered such that the logic is executed again. However, if maintenance fails, for example because of external reasons, the default retry interval of 10 milliseconds is way too short. Especially if the cloud provider API is overloaded, multiple nodes can be stuck in a 10 millisecond retry loop, which makes the situation even worse. Therefore, this commit introduces an exponential backoff if the pool maintenance function fails with an error. The minimum trigger interval remains 10 milliseconds so that other trigger reasons (e.g. a resync) are not delayed as long as the node is healthy. Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
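A hedged sketch of the retry behavior described above: on failure the wait before the next maintenance attempt doubles up to a cap, and on success it resets to the 10 millisecond minimum so that other trigger reasons are not delayed. The names (nodeMaintainer, runMaintenance, maxBackoff) are hypothetical and not Cilium's actual implementation:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

const (
	minInterval = 10 * time.Millisecond // default trigger interval when healthy
	maxBackoff  = 5 * time.Minute       // cap on the failure backoff (assumed value)
)

type nodeMaintainer struct {
	backoff time.Duration // current failure backoff; 0 while the node is healthy
}

// maintainPool is a placeholder for the real pool maintenance logic
// (e.g. allocating IPs via the cloud provider API).
func (n *nodeMaintainer) maintainPool() error {
	return errors.New("cloud provider API overloaded")
}

// runMaintenance performs one maintenance attempt and returns how long
// to wait before the trigger may fire again.
func (n *nodeMaintainer) runMaintenance() time.Duration {
	if err := n.maintainPool(); err != nil {
		// Double the backoff on every consecutive failure, up to the cap.
		if n.backoff == 0 {
			n.backoff = minInterval
		} else {
			n.backoff *= 2
		}
		if n.backoff > maxBackoff {
			n.backoff = maxBackoff
		}
		fmt.Printf("maintenance failed (%v), retrying in %v\n", err, n.backoff)
		return n.backoff
	}
	// Success: reset so other trigger reasons (e.g. a resync) are not delayed.
	n.backoff = 0
	return minInterval
}

func main() {
	m := &nodeMaintainer{}
	for i := 0; i < 4; i++ {
		time.Sleep(m.runMaintenance())
	}
}
```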
Force-pushed from 2a1065b to f6d8d26
christarazi
approved these changes
Sep 29, 2022
/test
jibi
approved these changes
Sep 29, 2022
Failing CI pipelines are not required and the failures are unrelated (infra issue in both cases). Marking ready-to-merge.
ti-mo
approved these changes
Sep 30, 2022
Labels
affects/v1.10
This issue affects v1.10 branch
affects/v1.11
This issue affects v1.11 branch
affects/v1.12
This issue affects v1.12 branch
ready-to-merge
This PR has passed all tests and received consensus from code owners to merge.
release-note/minor
This PR changes functionality that users may find relevant to operating Cilium.
sig/ipam
IP address management, including cloud IPAM