clustermesh: ensure that the status of the remote clusters controller is correctly reported #26271
150de13 ("clustermesh: delete stale node/service entries on reconnect/disconnect") and follow-ups changed the behavior of the controller which wraps the logic used to connect to the kvstore in a remote cluster. More specifically, we used to return from the controller DoFunc after completing the setup process, while executing the actual synchronization tasks in the background. With that commit, instead, we don't return until the context is closed (which means that the connection needs to be restarted or stopped).
While this simplifies the implementation of the cleanup logic, the change turned out to cause issues in the controller health reporting logic. In particular, given that we don't return on success, a previous failure is never cleared out, which leads to incorrect metrics and incorrect status reporting through the `cilium status` commands.

This PR fixes the issue by reworking the logic so that we return from the controller DoFunc as soon as the initialization tasks have completed, while executing the long-running logic in the background.
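The pattern described above can be illustrated with a minimal, self-contained sketch. This is not the code from this PR: `health`, `remoteCluster`, `doFunc`, `simulateReconnect` and `errUnreachable` are hypothetical names introduced only for illustration, and the sketch assumes that the controller status is updated only when the DoFunc returns.

```go
package main

import (
	"context"
	"errors"
	"sync"
)

// errUnreachable simulates a failed connection attempt to the remote kvstore.
var errUnreachable = errors.New("remote kvstore unreachable")

// health is a hypothetical stand-in for the controller status bookkeeping.
// Assumption: it is only updated when the DoFunc returns.
type health struct {
	mu      sync.Mutex
	lastErr error
}

func (h *health) record(err error) {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.lastErr = err
}

func (h *health) lastError() error {
	h.mu.Lock()
	defer h.mu.Unlock()
	return h.lastErr
}

// remoteCluster is a hypothetical, heavily simplified remote cluster handle.
type remoteCluster struct {
	connect func(ctx context.Context) error // initialization step
	watch   func(ctx context.Context)       // long-running synchronization
	wg      sync.WaitGroup
}

// doFunc runs the initialization synchronously and returns its outcome,
// leaving the long-running work to a goroutine bound to the same context.
func (rc *remoteCluster) doFunc(ctx context.Context) error {
	if err := rc.connect(ctx); err != nil {
		return err
	}
	rc.wg.Add(1)
	go func() {
		defer rc.wg.Done()
		rc.watch(ctx) // exits once ctx is cancelled
	}()
	return nil
}

// simulateReconnect runs one failing and one successful attempt and returns
// the error recorded after each of them.
func simulateReconnect() (afterFirst, afterSecond error) {
	attempts := 0
	rc := &remoteCluster{
		connect: func(context.Context) error {
			attempts++
			if attempts == 1 {
				return errUnreachable
			}
			return nil
		},
		watch: func(ctx context.Context) { <-ctx.Done() },
	}
	ctx, cancel := context.WithCancel(context.Background())
	h := &health{}

	h.record(rc.doFunc(ctx)) // first attempt fails
	afterFirst = h.lastError()

	h.record(rc.doFunc(ctx)) // second attempt succeeds and returns promptly
	afterSecond = h.lastError()

	cancel()     // stop the background synchronization
	rc.wg.Wait() // wait for the goroutine to exit
	return afterFirst, afterSecond
}
```

In this sketch the second call returns `nil` as soon as the setup step succeeds, so the earlier failure is cleared. If `doFunc` instead blocked until the context was cancelled, the second `record` call would not happen until shutdown and the stale error would keep being reported.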