Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

clustermesh: ensure that the status of the remote clusters controller is correcty reported #26271

Merged
merged 1 commit into from Jun 16, 2023

Conversation

giorio94
Copy link
Member

150de13 ("clustermesh: delete stale node/service entries on reconnect/disconnect") and follow-ups changed the behavior of the controller which wraps the logic used to connect to the kvstore in a remote cluster. More specifically, we previously used to return from the controller DoFunc after completing the setup process (while executing in background the actual synchronization tasks). With that commit, instead, we don't return until the context is closed (which means that the connection needed to be restarted/stopped).

While this simplifies the implementation of the cleanup logic, the change turned out to cause issues in the controller health reporting logic. In particular, given that we don't return on success, a previous failure is never cleared out. Which means incorrect metrics and status reporting through the cilium status commands.

This PR fixes this issue reworking the logic so that we return from the controller DoFunc as soon as the initialization tasks completed, while executing the long running logic in background.

@giorio94 giorio94 added kind/bug This is a bug in the Cilium logic. area/clustermesh Relates to multi-cluster routing functionality in Cilium. release-note/misc This PR makes changes that have no direct user impact. labels Jun 15, 2023
@giorio94 giorio94 requested a review from a team as a code owner June 15, 2023 15:07
@giorio94 giorio94 changed the title clustermesh: ensure that the status of the remote clusters controller is reset on success clustermesh: ensure that the status of the remote clusters controller is correcty reported Jun 15, 2023
150de13 ("clustermesh: delete stale node/service entries on
reconnect/disconnect") and follow-ups changed the behavior of the
controller which wraps the logic used to connect to the kvstore in
a remote cluster. More specifically, we previously used to return
from the controller DoFunc after completing the setup process (while
executing in background the actual synchronization tasks). With that
commit, instead, we don't return until the context is closed (which
means that the connection needed to be restarted/stopped).

While this simplifies the implementation of the cleanup logic, the
change turned out to cause issues in the controller health reporting
logic. In particular, given that we don't return on success, a previous
failure is never cleared out. Which means incorrect metrics and status
reporting through the `cilium status` commands.

This commit fixes this issue reworking the logic so that we return from
the controller DoFunc as soon as the initialization tasks completed,
while executing the long running logic in background.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
@giorio94
Copy link
Member Author

/test

@giorio94
Copy link
Member Author

giorio94 commented Jun 16, 2023

/ci-external-workloads

Failed because one spot instance got rotated while performing the tests.

@giorio94
Copy link
Member Author

giorio94 commented Jun 16, 2023

/ci-gke

Same as above

@giorio94
Copy link
Member Author

giorio94 commented Jun 16, 2023

/ci-l4lb

Failure while pulling an image from Dockerhub

@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Jun 16, 2023
@giorio94
Copy link
Member Author

ConformanceK8sKind hit #25758

@borkmann borkmann merged commit 019eac8 into cilium:main Jun 16, 2023
62 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/clustermesh Relates to multi-cluster routing functionality in Cilium. kind/bug This is a bug in the Cilium logic. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/misc This PR makes changes that have no direct user impact.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants