Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for multiple clustermesh-apiserver replicas (ClusterMesh HA) #31677

Merged
merged 3 commits into from
Apr 9, 2024

Conversation

thorn3r
Copy link
Contributor

@thorn3r thorn3r commented Mar 29, 2024

This adds support for running clustermesh-apiserver deployments with multiple replicas for high availability.

Each clustermesh-apiserver pod runs its own etcd cluster. Depending on configuration, either the Cilium Agent or KVStoreMesh instance watches etcd in a remote cluster. All responses from the remote etcd cluster are intercepted and the header is inspected to retrieve the etcd cluster ID. If a failover event occurs and the cluster ID has changed, the remote connection is restarted to ensure that no events are missed and that no invalid data is retained. See individual commit messages for additional details.

Add support for deploying clustermesh-apiserver with multiple replicas for high availability.

@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Mar 29, 2024
@thorn3r
Copy link
Contributor Author

thorn3r commented Mar 29, 2024

/test

@thorn3r
Copy link
Contributor Author

thorn3r commented Mar 29, 2024

/test

@thorn3r thorn3r added the release-note/major This PR introduces major new functionality to Cilium. label Apr 2, 2024
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Apr 2, 2024
@thorn3r
Copy link
Contributor Author

thorn3r commented Apr 2, 2024

/test

@thorn3r thorn3r marked this pull request as ready for review April 2, 2024 14:52
@thorn3r thorn3r requested review from a team as code owners April 2, 2024 14:52
@thorn3r thorn3r mentioned this pull request Apr 2, 2024
@thorn3r thorn3r added release-note/minor This PR changes functionality that users may find relevant to operating Cilium. and removed release-note/major This PR introduces major new functionality to Cilium. labels Apr 2, 2024
@thorn3r thorn3r changed the title ClusterMesh HA Add support for multiple clustermesh-apiserver replicas (ClusterMesh HA) Apr 2, 2024
@giorio94 giorio94 self-requested a review April 4, 2024 08:39
Copy link
Member

@giorio94 giorio94 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great to me! Just a bunch of minor comments and nits inline.

pkg/clustermesh/common/interceptor.go Outdated Show resolved Hide resolved
pkg/clustermesh/common/interceptor.go Outdated Show resolved Hide resolved
pkg/clustermesh/common/interceptor.go Outdated Show resolved Hide resolved
pkg/clustermesh/common/interceptor.go Outdated Show resolved Hide resolved
pkg/clustermesh/common/interceptor.go Outdated Show resolved Hide resolved
pkg/kvstore/etcd.go Outdated Show resolved Hide resolved
.github/workflows/tests-clustermesh-upgrade.yaml Outdated Show resolved Hide resolved
.github/workflows/tests-clustermesh-upgrade.yaml Outdated Show resolved Hide resolved
.github/workflows/tests-clustermesh-upgrade.yaml Outdated Show resolved Hide resolved
@thorn3r thorn3r requested a review from a team as a code owner April 4, 2024 20:34
@thorn3r
Copy link
Contributor Author

thorn3r commented Apr 4, 2024

/test

@thorn3r
Copy link
Contributor Author

thorn3r commented Apr 5, 2024

/test

Update the clustermesh-upgrade CI workflow to test rolling-
upgrades/failover by setting maxSurge=1 and maxUnavailable=0 and adding
a new test using a global service.

The test restarts the clustermesh-apiserver deployment on cluster2 and
deploys a global service to both clusters. Cluster1 should be able to
properly handle the failover event, establish a connection with the
new clustermesh-apiserver pods, and synchronize the endpoints from
cluster2 for the service. To ensure ClusterMesh both with and without
KVStoreMesh enabled can handle failover events, separate upgrading
Cilium and enabling KVStoreMesh for cluster1 into separate steps.

Signed-off-by: Tim Horner <timothy.horner@isovalent.com>
@thorn3r
Copy link
Contributor Author

thorn3r commented Apr 5, 2024

/ci-clustermesh

@thorn3r
Copy link
Contributor Author

thorn3r commented Apr 5, 2024

/test

@thorn3r
Copy link
Contributor Author

thorn3r commented Apr 5, 2024

TIL that comments in a multi-line command in bash are tricky

@giorio94
Copy link
Member

giorio94 commented Apr 8, 2024

TIL that comments in a multi-line command in bash are tricky

Yep, I typically just add them above the entire command, to avoid these kinds of problems.

Copy link
Member

@nbusseneau nbusseneau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Apr 9, 2024
@joamaki joamaki added this pull request to the merge queue Apr 9, 2024
Merged via the queue into main with commit 0426636 Apr 9, 2024
251 checks passed
@joamaki joamaki deleted the pr/thorn3r/clustermeshHA branch April 9, 2024 09:14
@marseel marseel added needs-backport/1.15 This PR / issue needs backporting to the v1.15 branch and removed needs-backport/1.15 This PR / issue needs backporting to the v1.15 branch labels Apr 15, 2024
@giorio94
Copy link
Member

Marking for backport to v1.15 to address #30964. I'm going to backport a reduced version which only includes the configuration of the unique etcd Cluster ID and the interceptor logic, fixing a bug potentially causing Cilium agents to incorrectly restart an etcd watch against a different clustermesh-apiserver instance.

@giorio94 giorio94 added the needs-backport/1.15 This PR / issue needs backporting to the v1.15 branch label Apr 16, 2024
@giorio94 giorio94 added backport-pending/1.15 The backport for Cilium 1.15.x for this PR is in progress. backport-done/1.15 The backport for Cilium 1.15.x for this PR is done. and removed needs-backport/1.15 This PR / issue needs backporting to the v1.15 branch backport-pending/1.15 The backport for Cilium 1.15.x for this PR is in progress. labels Apr 16, 2024
@thorn3r thorn3r mentioned this pull request Apr 22, 2024
8 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-done/1.15 The backport for Cilium 1.15.x for this PR is done. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/minor This PR changes functionality that users may find relevant to operating Cilium.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants