Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

clustermesh: split the generic logic from the specific part #25921

Merged
merged 2 commits into from Jun 9, 2023

Conversation

giorio94
Copy link
Member

@giorio94 giorio94 commented Jun 6, 2023

This PR refactors the clustermesh subsystem, splitting it into a generic part (pkg/clustermesh), which is responsible for watching the configuration directory and establishing the connections to remote etcd endpoints, and a clustermesh specific part (pkg/clustermesh/clustermesh), which handles the actual synchronization of the required information). This is a preparatory change to allow reusing the generic part for the upcoming kvstoremesh implementation, reducing code duplication and maintenance burden.

@giorio94 giorio94 added area/clustermesh Relates to multi-cluster routing functionality in Cilium. release-note/misc This PR makes changes that have no direct user impact. labels Jun 6, 2023
@giorio94 giorio94 requested review from a team as code owners June 6, 2023 07:27
@joamaki joamaki self-requested a review June 6, 2023 08:48
pkg/clustermesh/clustermesh/clustermesh.go Outdated Show resolved Hide resolved
// SPDX-License-Identifier: Apache-2.0
// Copyright Authors of Cilium

package clustermesh
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pkg/clustermesh/clustermesh/clustermesh.go isn't great. Would it be possible to make this internal package with public APIs it implements exposed from pkg/clustermesh as interface types for example? Or wait, is it the other way around?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I also personally find pkg/clustermesh/clustermesh to be not that great. I'm not sure about a better one though. One possibility could be to just keep everything in the base clustermesh package, losing a bit of separation (and have the kvstoremesh stuff in pkg/clustermesh/kvstoremesh). Otherwise I could do the other way round and move the generic bits in pkg/clustermesh/internal or pkg/clustermesh/generic (is this what you were proposing?). WDYT?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

internal or shared or generic package seems like it could work nicely. Hard to say what's best though, so I'll leave it to you.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've moved the generic part to the pkg/clustermesh/internal package, and restored back the business logic in pkg/clustermesh, adapting variables names accordingly. I've additionally extracted a few clusterconfig-related utilities in a pkg/clustermesh/utils package, as they are also used by the clustermesh-apiserver.

Copy link
Member

@YutaroHayakawa YutaroHayakawa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with nits 👍

@giorio94
Copy link
Member Author

giorio94 commented Jun 7, 2023

@YutaroHayakawa @joamaki PTAL

@giorio94
Copy link
Member Author

giorio94 commented Jun 7, 2023

/test

@giorio94
Copy link
Member Author

giorio94 commented Jun 7, 2023

Force pushed to rebase onto main.

@giorio94
Copy link
Member Author

giorio94 commented Jun 7, 2023

/test

Copy link
Member

@nebril nebril left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good overall, I think I might have found one bug around restarting remote connections.

pkg/clustermesh/internal/remote_cluster.go Outdated Show resolved Hide resolved
for {
select {
// terminate routine when remote cluster is removed
case _, ok := <-rc.changed:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please correct me if I am wrong, but I think that this channel receive will race with the receive from line 273. If this goroutine happens to get in select while the goroutine above is not executing and there is a message in rc.changed channel buffer, it will get the message from channel, check ok value to be true (since the channel is not closed) and this goroutine will return while restartRemoteConnection would not be called above.

To fix that behaviour, I would suggest merging both goroutines into one so that there is only one select.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you are right. It seems to me that this logic is also possibly prone to other race conditions, as I've seen connections being restarted multiple times in a row following a single failure. I'll refactor it make it more robust.

Copy link
Member Author

@giorio94 giorio94 Jun 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

9668440e8b98e32cd4a5281e8cd25e689a07bafc faebb6f5368610a78c56bd59626dc37e088cc864 addresses this issue refactoring the watchdog logic. I've still maintained the existing changed channel logic when an update is detected to avoid introducing too many changes in this PR which is already large due to moving things around.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤔 There's an issue with the new logic. Let me fix it

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Error cases should be correctly handled now.

Copy link
Member

@YutaroHayakawa YutaroHayakawa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚀

@giorio94 giorio94 force-pushed the mio/clustermesh-split branch 2 times, most recently from 0102736 to faebb6f Compare June 7, 2023 12:34
@giorio94 giorio94 requested a review from nebril June 7, 2023 12:44
@giorio94
Copy link
Member Author

giorio94 commented Jun 7, 2023

Rebased onto main to pick the CI fixes.

@giorio94
Copy link
Member Author

giorio94 commented Jun 7, 2023

/test

Job 'Cilium-PR-K8s-1.16-kernel-4.19' hit: #25964 (97.42% similarity)

@giorio94
Copy link
Member Author

giorio94 commented Jun 8, 2023

/test-1.16-4.19

@giorio94
Copy link
Member Author

giorio94 commented Jun 8, 2023

Rebased onto main to fix conflicts

@giorio94
Copy link
Member Author

giorio94 commented Jun 8, 2023

/test

@giorio94
Copy link
Member Author

giorio94 commented Jun 8, 2023

/ci-aks

@giorio94
Copy link
Member Author

giorio94 commented Jun 8, 2023

Conformance ginkgo hit #22749 and #25964

Copy link
Member

@nebril nebril left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

This commit refactors the clustermesh subsystem, splitting it into a
generic part (pkg/clustermesh/internal), which is responsible for watching
the configuration directory and establishing the connections to remote
etcd endpoints, and a clustermesh specific part (pkg/clustermesh),
which handles the actual synchronization of the required information).
Additionally, the helper methods dealing with the setting and retrieval
of the CiliumClusterConfig are moved to a separate `clustermesh/utils`
package. This is a preparatory change to allow reusing the generic part
for the upcoming kvstoremesh implementation, reducing code duplication and
maintenance burden.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
The cilium agent implements a watchdog mechanism to restart the
connection to remote kvstores in case of connectivity errors. Yet, the
current implementation is prone to a possible race conditions causing
spurious restarts. Specifically, when an error is detected, it triggers
the restart of the controller responsible for connecting to the kvstore,
and then iterates again, supposedly waiting until the connection is
established (i.e., `rc.backend` is no longer nil). Still, `rc.backend`
is asynchronously reset to nil when the controller is executed, causing
the new iteration of the watchdog to possibly still use the errors
channel of the old connection. This commit fixes this issue moving the
watchdog logic to a separate function which is started by the
controller, and terminates when the corresponding context is canceled
(i.e., the controller is restarted). This also removes the race
condition between the two goroutines reading from the `rc.changed`
channel.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
@giorio94
Copy link
Member Author

giorio94 commented Jun 9, 2023

Rebased onto main to fix a conflict.

@giorio94
Copy link
Member Author

giorio94 commented Jun 9, 2023

/test

Copy link
Contributor

@joamaki joamaki left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Structure looks good!

@giorio94
Copy link
Member Author

giorio94 commented Jun 9, 2023

/ci-aks

@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Jun 9, 2023
@dylandreimerink dylandreimerink merged commit 1505363 into cilium:main Jun 9, 2023
63 of 64 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/clustermesh Relates to multi-cluster routing functionality in Cilium. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/misc This PR makes changes that have no direct user impact.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants