cilium-health: Fix broken retry loop in cilium-health-ep controller #31622
This commit fixes a bug in the `cilium-health-ep` controller restart logic where it did not give the cilium-health endpoint enough time to start up before it was re-created.

For context, the `cilium-health-ep` controller performs two tasks:

1. Launch the cilium-health endpoint when the controller is started for the first time.
2. Ping the cilium-health endpoint, and if it does not reply, destroy and re-create it.

The controller has a `RunInterval` of 60 seconds and a default `ErrorRetryBaseDuration` of 1 second. This means that after launching the initial cilium-health endpoint, we wait for 60 seconds before we attempt to ping it. If that ping succeeds, we then keep pinging the health endpoint every 60 seconds.

However, if a ping fails, the controller deletes the existing endpoint and creates a new one. Because the controller then also returns an error, it is re-run after just one second: in the failure case a controller retries with an interval of `consecutiveErrors * ErrorRetryBaseDuration`.

This meant that after a failed ping, we deleted the unreachable endpoint, created a new one, and tried to ping it after only 1s. Because the newly launched endpoint is unlikely to be reachable after just one second (it requires a full endpoint regeneration with BPF compilation), the `cilium-health-ep` logic would declare the still-starting endpoint dead and re-create it. This loop would repeat, causing lots of unnecessary CPU churn, until enough consecutive errors had happened for the wait time between launch and the first ping to be long enough for a cilium-health endpoint to be fully regenerated.

This commit attempts to fix the logic by not immediately killing an unreachable health endpoint and instead waiting for three minutes to pass before we try again. Three minutes should hopefully be enough time for the initial endpoint regeneration to succeed.

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
e35b861 to aad9783
The reason for back-porting is that this also sometimes causes CI flakes, because in CI the endpoint regeneration at startup can take some time. For example:
/test
not sure if my approval counts but LGTM!
Nice find!
Until #31622 has been released. Signed-off-by: Martynas Pumputis <m@lambda.lt>