clustermesh: ensure that the status of the remote clusters controller is correctly reported #26271

giorio94 · 2023-06-15T15:07:48Z

Commit 150de13 ("clustermesh: delete stale node/service entries on reconnect/disconnect") and its follow-ups changed the behavior of the controller that wraps the logic used to connect to the kvstore of a remote cluster. Previously, the controller DoFunc returned after completing the setup process, while the actual synchronization tasks ran in the background. Since that commit, the DoFunc does not return until the context is closed, that is, until the connection needs to be restarted or stopped.

While this simplifies the implementation of the cleanup logic, the change turned out to cause issues in the controller health reporting logic. In particular, because the DoFunc no longer returns on success, a previous failure is never cleared. This results in incorrect metrics and incorrect status reporting through the `cilium status` commands.

This PR fixes the issue by reworking the logic so that the controller DoFunc returns as soon as the initialization tasks have completed, while the long-running logic executes in the background.

Commit 150de13 ("clustermesh: delete stale node/service entries on reconnect/disconnect") and its follow-ups changed the behavior of the controller that wraps the logic used to connect to the kvstore of a remote cluster. Previously, the controller DoFunc returned after completing the setup process, while the actual synchronization tasks ran in the background. Since that commit, the DoFunc does not return until the context is closed, that is, until the connection needs to be restarted or stopped.

While this simplifies the implementation of the cleanup logic, the change turned out to cause issues in the controller health reporting logic. In particular, because the DoFunc no longer returns on success, a previous failure is never cleared. This results in incorrect metrics and incorrect status reporting through the `cilium status` commands.

This commit fixes the issue by reworking the logic so that the controller DoFunc returns as soon as the initialization tasks have completed, while the long-running logic executes in the background.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>

giorio94 · 2023-06-15T16:43:33Z

/test

giorio94 · 2023-06-16T08:42:58Z

/ci-external-workloads

Failed because one spot instance got rotated while performing the tests.

giorio94 · 2023-06-16T08:47:07Z

/ci-gke

Same as above

giorio94 · 2023-06-16T08:47:53Z

/ci-l4lb

Failed while pulling an image from Docker Hub.

giorio94 · 2023-06-16T10:02:52Z

ConformanceK8sKind hit #25758

giorio94 added labels Jun 15, 2023: kind/bug (This is a bug in the Cilium logic.), area/clustermesh (Relates to multi-cluster routing functionality in Cilium.), release-note/misc (This PR makes changes that have no direct user impact.)

giorio94 requested a review from a team as a code owner June 15, 2023 15:07

giorio94 requested a review from YutaroHayakawa June 15, 2023 15:07

giorio94 force-pushed the mio/clustermesh-controller branch from 8304298 to b229448 June 15, 2023 15:20

giorio94 changed the title ~~clustermesh: ensure that the status of the remote clusters controller is reset on success~~ clustermesh: ensure that the status of the remote clusters controller is correctly reported Jun 15, 2023

giorio94 force-pushed the mio/clustermesh-controller branch from b229448 to ebebf06 June 15, 2023 16:42

YutaroHayakawa approved these changes Jun 16, 2023

maintainer-s-little-helper bot added the ready-to-merge label (This PR has passed all tests and received consensus from code owners to merge.) Jun 16, 2023

borkmann merged commit 019eac8 into cilium:main Jun 16, 2023
62 checks passed
