cilium-health: Fix broken retry loop in `cilium-health-ep` controller #31622

This commit fixes a bug in the `cilium-health-ep` controller restart logic where it did not give the cilium-health endpoint enough time to startup before it was re-created. For context, the `cilium-health-ep` performs two tasks: 1. Launch the cilium-health endpoint when the controller is started for the first time. 2. Ping the cilium-health endpoint, and if it does not reply, destroy and re-create it. The controller has a `RunInterval` of 60 seconds and a default `ErrorRetryBaseDuration` of 1 second. This means that after launching the initial cilium-health endpoint, we wait for 60 seconds before we attempt to ping it. If that ping succeeds, we then keep pinging the health endpoint every 60 seconds. However, if a ping fails, the controller deletes the existing endpoint and creates a new one. Because the controller then also returns an error, it is immediately re-run after one second, because in the failure case a controller retries with an interval of `consecutiveErrors * ErrorRetryBaseDuration`. This meant that after a failed ping, we deleted the unreachable endpoint, recreated a new one, and after 1s would immediately try to ping it. Because the newly launched endpoint will is unlikely to be reachable after just one second (it requires a full endpoint regeneration with BPF compilation), the `cilium-health-ep` logic would declare the still starting endpoint as dead and re-create it. This loop would continue endlessly, causing lots of unnecessary CPU churn, until enough consecutive errors have happened for the wait time between launch and the first ping to be long enough for a cilium-health endpoint to be fully regenerated. This commit attempts to fix the logic by not immediately killing a unreachable health endpoint and instead waiting for three minutes to pass before we attempt to try again. Three minutes should hopefully be enough time for the initial endpoint regeneration to succeed. Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

cilium-health: Fix broken retry loop in `cilium-health-ep` controller #31622

cilium-health: Fix broken retry loop in `cilium-health-ep` controller #31622

Commits on Mar 26, 2024

cilium-health: Fix broken retry loop in cilium-health-ep controller #31622

cilium-health: Fix broken retry loop in cilium-health-ep controller #31622

Commits on Mar 26, 2024

cilium-health: Fix broken retry loop in `cilium-health-ep` controller #31622

cilium-health: Fix broken retry loop in `cilium-health-ep` controller #31622