ipam: Add exponential backoff when pool maintenance fails #21473

Merged 3 commits on Sep 30, 2022

Conversation

gandro (Member) commented Sep 28, 2022

When pool maintenance fails, the pool maintenance trigger is triggered
again so that the logic is re-executed. However, if maintenance fails
for external reasons, the default retry interval of 10 milliseconds is
far too short. In particular, if the cloud provider API is overloaded,
multiple nodes can get stuck in a 10 millisecond retry loop, which
makes the situation even worse.

Therefore, this commit introduces an exponential backoff if the pool
maintenance function fails with an error. The minimum trigger interval
remains 10 milliseconds, so that other trigger reasons (e.g. a resync)
are not delayed as long as the node is healthy.

This adds a new optional callback to the trigger mechanism, which is
called when a trigger is stopped via Trigger.Shutdown.

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
This commit extracts the main ClusterSizeDependantInterval computation
so it can be used by different node managers. It will be used in a
subsequent commit.

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Sep 28, 2022
@gandro gandro added release-note/minor This PR changes functionality that users may find relevant to operating Cilium. sig/ipam IP address management, including cloud IPAM affects/v1.11 This issue affects v1.11 branch affects/v1.12 This issue affects v1.12 branch affects/v1.10 This issue affects v1.10 branch labels Sep 28, 2022
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Sep 28, 2022
@gandro gandro marked this pull request as ready for review September 28, 2022 14:08
@gandro gandro requested review from a team as code owners September 28, 2022 14:08
@gandro gandro requested a review from jibi September 28, 2022 14:08
@gandro gandro force-pushed the pr/gandro/ipam-add-maintenance-backoff branch from 2a1065b to f6d8d26 Compare September 28, 2022 14:11
gandro (Member, Author) commented Sep 29, 2022

/test

gandro (Member, Author) commented Sep 29, 2022

Failing CI pipelines are not required and the failures are unrelated (infra issue in both cases). Marking ready-to-merge.

@gandro gandro added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Sep 29, 2022

4 participants