clustermesh: split the generic logic from the specific part #25921
Conversation
```go
// SPDX-License-Identifier: Apache-2.0
// Copyright Authors of Cilium

package clustermesh
```
`pkg/clustermesh/clustermesh/clustermesh.go` isn't great. Would it be possible to make this an internal package, with the public APIs it implements exposed from `pkg/clustermesh` as interface types, for example? Or wait, is it the other way around?
Yeah, I also personally find `pkg/clustermesh/clustermesh` not that great. I'm not sure about a better one though. One possibility could be to just keep everything in the base `clustermesh` package, losing a bit of separation (and have the kvstoremesh stuff in `pkg/clustermesh/kvstoremesh`). Otherwise I could do it the other way round and move the generic bits to `pkg/clustermesh/internal` or `pkg/clustermesh/generic` (is this what you were proposing?). WDYT?
An `internal`, `shared`, or `generic` package seems like it could work nicely. Hard to say what's best though, so I'll leave it to you.
I've moved the generic part to the `pkg/clustermesh/internal` package, and restored the business logic in `pkg/clustermesh`, adapting variable names accordingly. I've additionally extracted a few clusterconfig-related utilities into a `pkg/clustermesh/utils` package, as they are also used by the clustermesh-apiserver.
LGTM with nits 👍
Force-pushed `8524fa6` to `16c4793`.
Force-pushed `16c4793` to `03c285b`.
@YutaroHayakawa @joamaki PTAL

/test
Force-pushed `03c285b` to `f7f9c74`.
Force-pushed to rebase onto main.

/test
Looks good overall. I think I might have found one bug around restarting remote connections.
```go
for {
	select {
	// terminate routine when remote cluster is removed
	case _, ok := <-rc.changed:
```
Please correct me if I am wrong, but I think that this channel receive will race with the receive from line 273. If this goroutine happens to enter the `select` while the goroutine above is not executing and there is a message in the `rc.changed` channel buffer, it will take the message from the channel, find `ok` to be `true` (since the channel is not closed), and this goroutine will return, while `restartRemoteConnection` would not be called above.

To fix that behaviour, I would suggest merging both goroutines into one, so that there is only a single `select`.
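To illustrate the suggestion, here is a minimal, hypothetical sketch (the `remoteCluster` fields and the `watch`/`restart` names are made up for the example, not the actual Cilium code): with a single `select` there is exactly one receiver on the `changed` channel, so a buffered update cannot be stolen by a competing goroutine, while a closed channel still terminates the routine.

```go
package main

import "fmt"

// remoteCluster is a hypothetical stand-in for the real struct: changed
// signals configuration updates (and is closed on cluster removal), while
// errors reports kvstore connectivity failures.
type remoteCluster struct {
	changed chan bool
	errors  chan error
}

// watch merges both receive paths into one select loop.
func (rc *remoteCluster) watch(restart func()) {
	for {
		select {
		case _, ok := <-rc.changed:
			if !ok {
				return // remote cluster removed: terminate the routine
			}
			restart() // configuration changed: reconnect
		case <-rc.errors:
			restart() // connectivity error: reconnect
		}
	}
}

func main() {
	rc := &remoteCluster{changed: make(chan bool, 1), errors: make(chan error)}
	restarts := 0
	rc.changed <- true // a buffered update must trigger a restart...
	close(rc.changed)  // ...before the close terminates the routine
	rc.watch(func() { restarts++ })
	fmt.Println(restarts) // prints 1: the update was not lost
}
```

Because a closed channel still delivers its buffered values with `ok == true` before yielding `ok == false`, the single receiver processes the pending update exactly once and then terminates.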
Yes, you are right. It seems to me that this logic is also possibly prone to other race conditions, as I've seen connections being restarted multiple times in a row following a single failure. I'll refactor it to make it more robust.
faebb6f5368610a78c56bd59626dc37e088cc864 addresses this issue by refactoring the watchdog logic. I've still maintained the existing `changed` channel logic when an update is detected, to avoid introducing too many changes in this PR, which is already large due to moving things around.
🤔 There's an issue with the new logic. Let me fix it.
Error cases should be correctly handled now.
🚀
Force-pushed `0102736` to `faebb6f`.
Rebased onto main to pick up the CI fixes.
Force-pushed `faebb6f` to `09d9368`.
/test — Job 'Cilium-PR-K8s-1.16-kernel-4.19' hit: #25964 (97.42% similarity)

/test-1.16-4.19
Force-pushed `09d9368` to `7b42930`.
Rebased onto main to fix conflicts.

/test

/ci-aks
LGTM!
This commit refactors the clustermesh subsystem, splitting it into a generic part (`pkg/clustermesh/internal`), which is responsible for watching the configuration directory and establishing the connections to remote etcd endpoints, and a clustermesh-specific part (`pkg/clustermesh`), which handles the actual synchronization of the required information. Additionally, the helper methods dealing with the setting and retrieval of the CiliumClusterConfig are moved to a separate `clustermesh/utils` package.

This is a preparatory change to allow reusing the generic part for the upcoming kvstoremesh implementation, reducing code duplication and maintenance burden.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
The cilium agent implements a watchdog mechanism to restart the connection to remote kvstores in case of connectivity errors. Yet, the current implementation is prone to a possible race condition causing spurious restarts. Specifically, when an error is detected, it triggers the restart of the controller responsible for connecting to the kvstore, and then iterates again, supposedly waiting until the connection is established (i.e., `rc.backend` is no longer nil). Still, `rc.backend` is asynchronously reset to nil when the controller is executed, causing the new iteration of the watchdog to possibly still use the errors channel of the old connection.

This commit fixes this issue by moving the watchdog logic to a separate function, which is started by the controller and terminates when the corresponding context is canceled (i.e., when the controller is restarted). This also removes the race condition between the two goroutines reading from the `rc.changed` channel.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
Force-pushed `7b42930` to `9d7289a`.
Rebased onto main to fix a conflict.

/test
Structure looks good!
/ci-aks
This PR refactors the clustermesh subsystem, splitting it into a generic part (`pkg/clustermesh`), which is responsible for watching the configuration directory and establishing the connections to remote etcd endpoints, and a clustermesh-specific part (`pkg/clustermesh/clustermesh`), which handles the actual synchronization of the required information. This is a preparatory change to allow reusing the generic part for the upcoming kvstoremesh implementation, reducing code duplication and maintenance burden.