cilium-health: Fix broken retry loop in cilium-health-ep controller #31622

Merged

Commits on Mar 26, 2024

  1. cilium-health: Fix broken retry loop in cilium-health-ep controller

    This commit fixes a bug in the `cilium-health-ep` controller restart
    logic where it did not give the cilium-health endpoint enough time to
    start up before it was re-created.
    
    For context, the `cilium-health-ep` controller performs two tasks:
    
      1. Launch the cilium-health endpoint when the controller is started
         for the first time.
      2. Ping the cilium-health endpoint, and if it does not reply, destroy
         and re-create it.
    
    The controller has a `RunInterval` of 60 seconds and a default
    `ErrorRetryBaseDuration` of 1 second. This means that after launching
    the initial cilium-health endpoint, we wait for 60 seconds before we
    attempt to ping it. If that ping succeeds, we then keep pinging the
    health endpoint every 60 seconds.
    
    However, if a ping fails, the controller deletes the existing endpoint
    and creates a new one. Because the controller then also returns an
    error, it is re-run after just one second: in the failure case, a
    controller retries with an interval of `consecutiveErrors *
    ErrorRetryBaseDuration`.
    
    This meant that after a failed ping, we deleted the unreachable
    endpoint, created a new one, and tried to ping it just 1s later.
    Because the newly launched endpoint is unlikely to be reachable after
    just one second (it requires a full endpoint regeneration with BPF
    compilation), the `cilium-health-ep` logic would declare the
    still-starting endpoint dead and re-create it. This loop would
    continue, causing lots of unnecessary CPU churn, until enough
    consecutive errors had happened for the wait time between launch and
    the first ping to be long enough for a cilium-health endpoint to be
    fully regenerated.
    
    This commit attempts to fix the logic by not immediately killing an
    unreachable health endpoint and instead waiting for three minutes to
    pass before we try again. Three minutes should hopefully be enough
    time for the initial endpoint regeneration to succeed.
    
    Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
    gandro committed Mar 26, 2024
    aad9783