clustermesh: split the generic logic from the specific part #25921
Conversation
```go
// SPDX-License-Identifier: Apache-2.0
// Copyright Authors of Cilium

package clustermesh
```
`pkg/clustermesh/clustermesh/clustermesh.go` isn't great. Would it be possible to make this an internal package, with the public APIs it implements exposed from `pkg/clustermesh` as interface types, for example? Or wait, is it the other way around?
Yeah, I also personally find `pkg/clustermesh/clustermesh` not that great. I'm not sure about a better one though. One possibility could be to just keep everything in the base `clustermesh` package, losing a bit of separation (and have the kvstoremesh stuff in `pkg/clustermesh/kvstoremesh`). Otherwise I could do it the other way round and move the generic bits to `pkg/clustermesh/internal` or `pkg/clustermesh/generic` (is this what you were proposing?). WDYT?
An `internal`, `shared`, or `generic` package seems like it could work nicely. Hard to say what's best though, so I'll leave it to you.
I've moved the generic part to the `pkg/clustermesh/internal` package, and restored the business logic in `pkg/clustermesh`, adapting variable names accordingly. I've additionally extracted a few clusterconfig-related utilities into a `pkg/clustermesh/utils` package, as they are also used by the clustermesh-apiserver.
LGTM with nits 👍
Force-pushed `8524fa6` to `16c4793`.
Force-pushed `16c4793` to `03c285b`.
@YutaroHayakawa @joamaki PTAL

/test
Force-pushed `03c285b` to `f7f9c74`.
Force-pushed to rebase onto main.

/test
Looks good overall. I think I might have found one bug around restarting remote connections.
```go
for {
	select {
	// terminate routine when remote cluster is removed
	case _, ok := <-rc.changed:
```
Please correct me if I am wrong, but I think that this channel receive will race with the receive from line 273. If this goroutine happens to enter the `select` while the goroutine above is not executing and there is a message in the `rc.changed` channel buffer, it will take the message from the channel, find `ok` to be `true` (since the channel is not closed), and this goroutine will return, while `restartRemoteConnection` would not be called above.

To fix that behaviour, I would suggest merging both goroutines into one, so that there is only a single `select`.
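To illustrate the suggestion, here is a minimal, hypothetical sketch (the `remoteCluster` fields and the `watch`/`restart` names are made up for the example, not the actual Cilium code): with a single `select` there is exactly one receiver on the `changed` channel, so a buffered update cannot be stolen by a competing goroutine, while a closed channel still terminates the routine.

```go
package main

import "fmt"

// remoteCluster is a hypothetical stand-in for the real struct: changed
// signals configuration updates (and is closed on cluster removal), while
// errors reports kvstore connectivity failures.
type remoteCluster struct {
	changed chan bool
	errors  chan error
}

// watch merges both receive paths into one select loop.
func (rc *remoteCluster) watch(restart func()) {
	for {
		select {
		case _, ok := <-rc.changed:
			if !ok {
				return // remote cluster removed: terminate the routine
			}
			restart() // configuration changed: reconnect
		case <-rc.errors:
			restart() // connectivity error: reconnect
		}
	}
}

func main() {
	rc := &remoteCluster{changed: make(chan bool, 1), errors: make(chan error)}
	restarts := 0
	rc.changed <- true // a buffered update must trigger a restart...
	close(rc.changed)  // ...before the close terminates the routine
	rc.watch(func() { restarts++ })
	fmt.Println(restarts) // prints 1: the update was not lost
}
```

Because a closed channel still delivers its buffered values with `ok == true` before yielding `ok == false`, the single receiver processes the pending update exactly once and then terminates.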
Yes, you are right. It seems to me that this logic is also possibly prone to other race conditions, as I've seen connections being restarted multiple times in a row following a single failure. I'll refactor it to make it more robust.
faebb6f5368610a78c56bd59626dc37e088cc864 addresses this issue by refactoring the watchdog logic. I've still maintained the existing `changed` channel logic when an update is detected, to avoid introducing too many changes in this PR, which is already large due to moving things around.
🤔 There's an issue with the new logic. Let me fix it.
Error cases should be correctly handled now.
🚀
Force-pushed `0102736` to `faebb6f`.
Rebased onto main to pick up the CI fixes.
Force-pushed `faebb6f` to `09d9368`.
/test — Job 'Cilium-PR-K8s-1.16-kernel-4.19' hit: #25964 (97.42% similarity)

/test-1.16-4.19
Force-pushed `09d9368` to `7b42930`.
Rebased onto main to fix conflicts.

/test

/ci-aks
LGTM!
This commit refactors the clustermesh subsystem, splitting it into a generic part (`pkg/clustermesh/internal`), which is responsible for watching the configuration directory and establishing the connections to remote etcd endpoints, and a clustermesh-specific part (`pkg/clustermesh`), which handles the actual synchronization of the required information. Additionally, the helper methods dealing with the setting and retrieval of the CiliumClusterConfig are moved to a separate `clustermesh/utils` package.

This is a preparatory change to allow reusing the generic part for the upcoming kvstoremesh implementation, reducing code duplication and maintenance burden.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
The cilium agent implements a watchdog mechanism to restart the connection to remote kvstores in case of connectivity errors. Yet, the current implementation is prone to a possible race condition causing spurious restarts. Specifically, when an error is detected, it triggers the restart of the controller responsible for connecting to the kvstore, and then iterates again, supposedly waiting until the connection is established (i.e., `rc.backend` is no longer nil). Still, `rc.backend` is asynchronously reset to nil when the controller is executed, causing the new iteration of the watchdog to possibly still use the errors channel of the old connection.

This commit fixes this issue by moving the watchdog logic to a separate function, which is started by the controller and terminates when the corresponding context is canceled (i.e., when the controller is restarted). This also removes the race condition between the two goroutines reading from the `rc.changed` channel.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
Force-pushed `7b42930` to `9d7289a`.
Rebased onto main to fix a conflict.

/test
Structure looks good!
/ci-aks
This PR refactors the clustermesh subsystem, splitting it into a generic part (`pkg/clustermesh`), which is responsible for watching the configuration directory and establishing the connections to remote etcd endpoints, and a clustermesh-specific part (`pkg/clustermesh/clustermesh`), which handles the actual synchronization of the required information. This is a preparatory change to allow reusing the generic part for the upcoming kvstoremesh implementation, reducing code duplication and maintenance burden.