cilium-health: Fix broken retry loop in `cilium-health-ep` controller #31622

gandro · 2024-03-26T16:13:40Z

This commit fixes a bug in the `cilium-health-ep` controller restart logic where it did not give the cilium-health endpoint enough time to start up before it was re-created.

For context, the `cilium-health-ep` controller performs two tasks:

1. Launch the cilium-health endpoint when the controller is started for the first time.
2. Ping the cilium-health endpoint, and if it does not reply, destroy and re-create it.

The controller has a `RunInterval` of 60 seconds and a default `ErrorRetryBaseDuration` of 1 second. This means that after launching the initial cilium-health endpoint, we wait for 60 seconds before we attempt to ping it. If that ping succeeds, we then keep pinging the health endpoint every 60 seconds.

However, if a ping fails, the controller deletes the existing endpoint and creates a new one. The controller then also returns an error, so it is re-run after just one second: in the failure case, a controller retries with an interval of `consecutiveErrors * ErrorRetryBaseDuration`.

This meant that after a failed ping, we deleted the unreachable endpoint, created a new one, and tried to ping it after just 1s. Because the newly launched endpoint is unlikely to be reachable after just one second (it requires a full endpoint regeneration with BPF compilation), the `cilium-health-ep` logic would declare the still-starting endpoint dead and re-create it. This loop would keep repeating, causing lots of unnecessary CPU churn, until enough consecutive errors had accumulated for the wait time between launch and the first ping to be long enough for a cilium-health endpoint to be fully regenerated.

This commit attempts to fix the logic by not immediately killing an unreachable health endpoint and instead waiting for three minutes to pass before we try again. Three minutes should hopefully be enough time for the initial endpoint regeneration to succeed.

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>

gandro · 2024-03-26T16:30:15Z

The reason for back-porting is that this also sometimes causes CI flakes, since endpoint regeneration at startup can take some time in CI. For example:

CI: Cilium not ready due to failures of cilium-health-ep controller #28321
CI: Jenkins: Test runs time out: cilium-health-ep reports "connect: no route to host" #18913

gandro · 2024-03-26T16:34:59Z

/test

michi-covalent

not sure if my approval counts but LGTM!

tklauser

Nice find!

gandro requested a review from a team as a code owner March 26, 2024 16:13

gandro requested a review from danehans March 26, 2024 16:13

gandro force-pushed the pr/gandro/fix-cilium-health-restart-logic branch from e35b861 to aad9783 Compare March 26, 2024 16:16

michi-covalent approved these changes Mar 27, 2024

tklauser approved these changes Mar 27, 2024

gandro removed the request for review from danehans March 27, 2024 08:56

maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Mar 27, 2024

gandro added this pull request to the merge queue Mar 27, 2024

Merged via the queue into cilium:main with commit 43bd8c1 Mar 27, 2024
62 checks passed

gandro deleted the pr/gandro/fix-cilium-health-restart-logic branch March 27, 2024 10:05

squeed mentioned this pull request Mar 27, 2024

ICMP health-check timeouts after upgrading to cilium1.15.2 #31567

Open

3 tasks

joamaki mentioned this pull request Apr 2, 2024

v1.13 Backports 2024-04-02 #31722

Merged

8 tasks

joamaki added backport-pending/1.13 The backport for Cilium 1.13.x for this PR is in progress. and removed needs-backport/1.13 This PR / issue needs backporting to the v1.13 branch labels Apr 2, 2024

joamaki mentioned this pull request Apr 2, 2024

v1.14 Backports 2024-04-02 #31724

Merged

10 tasks

joamaki added backport-pending/1.14 The backport for Cilium 1.14.x for this PR is in progress. and removed needs-backport/1.14 This PR / issue needs backporting to the v1.14 branch labels Apr 2, 2024

joamaki mentioned this pull request Apr 2, 2024

v1.15 Backports 2024-04-02 #31727

Merged

13 tasks

joamaki added backport-pending/1.15 The backport for Cilium 1.15.x for this PR is in progress. and removed needs-backport/1.15 This PR / issue needs backporting to the v1.15 branch labels Apr 2, 2024

github-actions bot removed the backport-pending/1.14 The backport for Cilium 1.14.x for this PR is in progress. label Apr 4, 2024

brb added a commit that referenced this pull request Apr 11, 2024

ci-e2e-upgrade: Disable 12th config

ddad2c8

Until #31622 has been released. Signed-off-by: Martynas Pumputis <m@lambda.lt>

brb added a commit that referenced this pull request Apr 11, 2024

ci-e2e-upgrade: Disable 12th config

337b8de

Until #31622 has been released. Signed-off-by: Martynas Pumputis <m@lambda.lt>

brb added a commit that referenced this pull request Apr 11, 2024

ci-e2e-upgrade: Disable 12th config

f4739ad

Until #31622 has been released. Signed-off-by: Martynas Pumputis <m@lambda.lt>

This was referenced Apr 11, 2024

Prepare for release v1.13.15 asauber/cilium#1

Open

Prepare for release v1.13.15 #31906

Merged

github-merge-queue bot pushed a commit that referenced this pull request Apr 11, 2024

ci-e2e-upgrade: Disable 12th config

3823873

Until #31622 has been released. Signed-off-by: Martynas Pumputis <m@lambda.lt>

This was referenced Apr 11, 2024

Prepare for release v1.15.4 #31908

Merged

Prepare for release v1.14.10 #31910

Merged
