Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cilium-health: Fix broken retry loop in cilium-health-ep controller #31622

Merged

Conversation

gandro
Copy link
Member

@gandro gandro commented Mar 26, 2024

This commit fixes a bug in the cilium-health-ep controller restart logic where it did not give the cilium-health endpoint enough time to startup before it was re-created.

For context, the cilium-health-ep performs two tasks:

  1. Launch the cilium-health endpoint when the controller is started for the first time.
  2. Ping the cilium-health endpoint, and if it does not reply, destroy and re-create it.

The controller has a RunInterval of 60 seconds and a default ErrorRetryBaseDuration of 1 second. This means that after launching the initial cilium-health endpoint, we wait for 60 seconds before we attempt to ping it. If that ping succeeds, we then keep pinging the health endpoint every 60 seconds.

However, if a ping fails, the controller deletes the existing endpoint and creates a new one. Because the controller then also returns an error, it is immediately re-run after one second, because in the failure case a controller retries with an interval of consecutiveErrors * ErrorRetryBaseDuration.

This meant that after a failed ping, we deleted the unreachable endpoint, recreated a new one, and after 1s would immediately try to ping it. Because the newly launched endpoint will is unlikely to be reachable after just one second (it requires a full endpoint regeneration with BPF compilation), the cilium-health-ep logic would declare the still starting endpoint as dead and re-create it. This loop would continue endlessly, causing lots of unnecessary CPU churn, until enough consecutive errors have happened for the wait time between launch and the first ping to be long enough for a cilium-health endpoint to be fully regenerated.

This commit attempts to fix the logic by not immediately killing a unreachable health endpoint and instead waiting for three minutes to pass before we attempt to try again. Three minutes should hopefully be enough time for the initial endpoint regeneration to succeed.

@gandro gandro added release-note/bug This PR fixes an issue in a previous release of Cilium. area/health Relates to the cilium-health component needs-backport/1.13 This PR / issue needs backporting to the v1.13 branch needs-backport/1.14 This PR / issue needs backporting to the v1.14 branch needs-backport/1.15 This PR / issue needs backporting to the v1.15 branch labels Mar 26, 2024
@gandro gandro requested a review from a team as a code owner March 26, 2024 16:13
@gandro gandro requested a review from danehans March 26, 2024 16:13
This commit fixes a bug in the `cilium-health-ep` controller restart
logic where it did not give the cilium-health endpoint enough time to
startup before it was re-created.

For context, the `cilium-health-ep` performs two tasks:

  1. Launch the cilium-health endpoint when the controller is started
     for the first time.
  2. Ping the cilium-health endpoint, and if it does not reply, destroy
     and re-create it.

The controller has a `RunInterval` of 60 seconds and a default
`ErrorRetryBaseDuration` of 1 second. This means that after launching
the initial cilium-health endpoint, we wait for 60 seconds before we
attempt to ping it. If that ping succeeds, we then keep pinging the
health endpoint every 60 seconds.

However, if a ping fails, the controller deletes the existing endpoint
and creates a new one. Because the controller then also returns an
error, it is immediately re-run after one second, because in the failure
case a controller retries with an interval of `consecutiveErrors *
ErrorRetryBaseDuration`.

This meant that after a failed ping, we deleted the unreachable
endpoint, recreated a new one, and after 1s would immediately try to
ping it. Because the newly launched endpoint will is unlikely to be
reachable after just one second (it requires a full endpoint
regeneration with BPF compilation), the `cilium-health-ep` logic would
declare the still starting endpoint as dead and re-create it. This loop
would continue endlessly, causing lots of unnecessary CPU churn, until
enough consecutive errors have happened for the wait time between launch
and the first ping to be long enough for a cilium-health endpoint to be
fully regenerated.

This commit attempts to fix the logic by not immediately killing a
unreachable health endpoint and instead waiting for three minutes to
pass before we attempt to try again. Three minutes should hopefully be
enough time for the initial endpoint regeneration to succeed.

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
@gandro gandro force-pushed the pr/gandro/fix-cilium-health-restart-logic branch from e35b861 to aad9783 Compare March 26, 2024 16:16
@gandro
Copy link
Member Author

gandro commented Mar 26, 2024

The reason for back-porting is that this also causes CI flakes sometimes, as in CI the endpoint regeneration at startup can take up some time. For example:

@gandro
Copy link
Member Author

gandro commented Mar 26, 2024

/test

Copy link
Contributor

@michi-covalent michi-covalent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure if my approval counts but LGTM!

Copy link
Member

@tklauser tklauser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice find!

@gandro gandro removed the request for review from danehans March 27, 2024 08:56
@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Mar 27, 2024
@gandro gandro added this pull request to the merge queue Mar 27, 2024
Merged via the queue into cilium:main with commit 43bd8c1 Mar 27, 2024
62 checks passed
@gandro gandro deleted the pr/gandro/fix-cilium-health-restart-logic branch March 27, 2024 10:05
@joamaki joamaki mentioned this pull request Apr 2, 2024
8 tasks
@joamaki joamaki added backport-pending/1.13 The backport for Cilium 1.13.x for this PR is in progress. and removed needs-backport/1.13 This PR / issue needs backporting to the v1.13 branch labels Apr 2, 2024
@joamaki joamaki mentioned this pull request Apr 2, 2024
10 tasks
@joamaki joamaki added backport-pending/1.14 The backport for Cilium 1.14.x for this PR is in progress. and removed needs-backport/1.14 This PR / issue needs backporting to the v1.14 branch labels Apr 2, 2024
@joamaki joamaki mentioned this pull request Apr 2, 2024
13 tasks
@joamaki joamaki added backport-pending/1.15 The backport for Cilium 1.15.x for this PR is in progress. and removed needs-backport/1.15 This PR / issue needs backporting to the v1.15 branch labels Apr 2, 2024
@github-actions github-actions bot removed the backport-pending/1.14 The backport for Cilium 1.14.x for this PR is in progress. label Apr 4, 2024
@github-actions github-actions bot added backport-done/1.14 The backport for Cilium 1.14.x for this PR is done. backport-done/1.13 The backport for Cilium 1.13.x for this PR is done. backport-done/1.15 The backport for Cilium 1.15.x for this PR is done. and removed backport-pending/1.13 The backport for Cilium 1.13.x for this PR is in progress. backport-pending/1.15 The backport for Cilium 1.15.x for this PR is in progress. labels Apr 4, 2024
brb added a commit that referenced this pull request Apr 11, 2024
Until #31622 has been released.

Signed-off-by: Martynas Pumputis <m@lambda.lt>
brb added a commit that referenced this pull request Apr 11, 2024
Until #31622 has been released.

Signed-off-by: Martynas Pumputis <m@lambda.lt>
brb added a commit that referenced this pull request Apr 11, 2024
Until #31622 has been released.

Signed-off-by: Martynas Pumputis <m@lambda.lt>
github-merge-queue bot pushed a commit that referenced this pull request Apr 11, 2024
Until #31622 has been released.

Signed-off-by: Martynas Pumputis <m@lambda.lt>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/health Relates to the cilium-health component backport-done/1.13 The backport for Cilium 1.13.x for this PR is done. backport-done/1.14 The backport for Cilium 1.14.x for this PR is done. backport-done/1.15 The backport for Cilium 1.15.x for this PR is done. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/bug This PR fixes an issue in a previous release of Cilium.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants