Add support for multiple clustermesh-apiserver replicas (ClusterMesh HA) #31677

thorn3r · 2024-03-29T16:02:09Z

This adds support for running clustermesh-apiserver deployments with multiple replicas for high availability.

Each clustermesh-apiserver pod runs its own etcd cluster. Depending on configuration, either the Cilium agent or the KVStoreMesh instance watches etcd in a remote cluster. Every response from the remote etcd cluster is intercepted, and its header is inspected to retrieve the etcd cluster ID. If a failover event occurs and the cluster ID changes, the remote connection is restarted so that no events are missed and no invalid data is retained. See the individual commit messages for additional details.
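The cluster ID check described above can be illustrated with a minimal Go sketch. This is not the code in `pkg/clustermesh/common/interceptor.go`; the type `clusterIDTracker` and its methods are hypothetical names chosen for illustration. The only assumption is that each etcd response header carries a numeric cluster ID. The tracker remembers the first ID it sees and returns an error when a later response carries a different one, which a caller could treat as the signal to restart the remote connection.

```go
package main

import (
	"fmt"
	"sync"
)

// clusterIDTracker is a hypothetical helper (not Cilium's actual type).
// It remembers the first etcd cluster ID observed on a connection and
// reports when a later response carries a different ID.
type clusterIDTracker struct {
	mu    sync.Mutex
	known uint64
	seen  bool
}

// observe records the cluster ID taken from a response header.
// The first call stores the ID. Later calls return an error if the ID
// differs from the stored one, which would indicate that the client is
// now talking to a different etcd cluster (for example after a failover).
func (t *clusterIDTracker) observe(id uint64) error {
	t.mu.Lock()
	defer t.mu.Unlock()

	if !t.seen {
		t.known, t.seen = id, true
		return nil
	}
	if id != t.known {
		return fmt.Errorf("etcd cluster ID changed: expected %x, got %x", t.known, id)
	}
	return nil
}

// reset forgets the stored ID. A caller would invoke it when it tears
// down and re-establishes the remote connection.
func (t *clusterIDTracker) reset() {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.known, t.seen = 0, false
}
```

In this sketch the caller would invoke `observe` for every intercepted response and, on a non-nil error, restart the connection and call `reset` so that the next response establishes the new cluster ID.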

Add support for deploying clustermesh-apiserver with multiple replicas for high availability.

thorn3r · 2024-03-29T16:11:38Z

/test

thorn3r · 2024-03-29T20:09:11Z

/test

thorn3r · 2024-04-02T14:51:15Z

/test

giorio94

Looks great to me! Just a bunch of minor comments and nits inline.

pkg/clustermesh/common/interceptor.go

pkg/kvstore/etcd.go

.github/workflows/tests-clustermesh-upgrade.yaml

thorn3r · 2024-04-04T20:48:49Z

/test

thorn3r · 2024-04-05T19:08:53Z

/test

Update the clustermesh-upgrade CI workflow to test rolling upgrades and failover by setting maxSurge=1 and maxUnavailable=0 and adding a new test that uses a global service. The test restarts the clustermesh-apiserver deployment on cluster2 and deploys a global service to both clusters. Cluster1 should handle the failover event properly, establish a connection with the new clustermesh-apiserver pods, and synchronize the service's endpoints from cluster2. To ensure that ClusterMesh can handle failover events both with and without KVStoreMesh enabled, upgrading Cilium and enabling KVStoreMesh for cluster1 are split into separate steps.

Signed-off-by: Tim Horner <timothy.horner@isovalent.com>

thorn3r · 2024-04-05T19:19:08Z

/ci-clustermesh

thorn3r · 2024-04-05T19:52:29Z

/test

thorn3r · 2024-04-05T19:53:36Z

TIL that comments in a multi-line command in bash are tricky

giorio94 · 2024-04-08T07:28:37Z

> TIL that comments in a multi-line command in bash are tricky

Yep, I typically just add them above the entire command, to avoid these kinds of problems.

nbusseneau

Thanks!

giorio94 · 2024-04-16T13:48:40Z

Marking for backport to v1.15 to address #30964. I'm going to backport a reduced version which only includes the configuration of the unique etcd Cluster ID and the interceptor logic, fixing a bug potentially causing Cilium agents to incorrectly restart an etcd watch against a different clustermesh-apiserver instance.

maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Mar 29, 2024

thorn3r force-pushed the pr/thorn3r/clustermeshHA branch from 0ea2d4e to b855a05 Compare March 29, 2024 19:38

thorn3r force-pushed the pr/thorn3r/clustermeshHA branch from b855a05 to 660dc0f Compare April 2, 2024 13:54

thorn3r added the release-note/major This PR introduces major new functionality to Cilium. label Apr 2, 2024

maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Apr 2, 2024

thorn3r marked this pull request as ready for review April 2, 2024 14:52

thorn3r requested review from a team as code owners April 2, 2024 14:52

thorn3r requested review from marseel, joamaki, squeed and viktor-kurchenko April 2, 2024 14:52

thorn3r mentioned this pull request Apr 2, 2024

ClusterMesh HA #31132

Closed

thorn3r added release-note/minor This PR changes functionality that users may find relevant to operating Cilium. and removed release-note/major This PR introduces major new functionality to Cilium. labels Apr 2, 2024

thorn3r changed the title ~~ClusterMesh HA~~ Add support for multiple clustermesh-apiserver replicas (ClusterMesh HA) Apr 2, 2024

This was referenced Apr 3, 2024

CI:Conformance Cluster Mesh: DNS resolution timeout #30251

Open

CI: Cilium IPsec upgrade - unexpected packet drops on downgrade #29987

Open

giorio94 self-requested a review April 4, 2024 08:39

giorio94 reviewed Apr 4, 2024


thorn3r force-pushed the pr/thorn3r/clustermeshHA branch from 660dc0f to ee4fb0a Compare April 4, 2024 20:34

thorn3r requested a review from a team as a code owner April 4, 2024 20:34

thorn3r force-pushed the pr/thorn3r/clustermeshHA branch from 571dc26 to bbf4488 Compare April 5, 2024 19:14

This was referenced Apr 5, 2024

CI: Runtime Test: TestIpamManyNodes: error: Node node2 allocation mismatch. expected: 10 allocated: 20 #31466

Closed

CI: Conformance IPsec E2E: no-policies/pod-to-world/https-to-one.one.one.one-index-0: Resolving timed out after 2000 milliseconds #27799

Open

nbusseneau approved these changes Apr 8, 2024


squeed approved these changes Apr 8, 2024


viktor-kurchenko approved these changes Apr 8, 2024


joamaki approved these changes Apr 9, 2024


maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Apr 9, 2024

joamaki added this pull request to the merge queue Apr 9, 2024

Merged via the queue into main with commit 0426636 Apr 9, 2024
251 checks passed

joamaki deleted the pr/thorn3r/clustermeshHA branch April 9, 2024 09:14

giorio94 mentioned this pull request Apr 12, 2024

[CI] Cluster Mesh upgrade: no-interrupted-connections: test-conn-disrupt-client failed due to interrupted traffic during downgrade #30964

Closed

marseel added needs-backport/1.15 This PR / issue needs backporting to the v1.15 branch and removed needs-backport/1.15 This PR / issue needs backporting to the v1.15 branch labels Apr 15, 2024

giorio94 added the needs-backport/1.15 This PR / issue needs backporting to the v1.15 branch label Apr 16, 2024

giorio94 mentioned this pull request Apr 16, 2024

[v1.15] Prevent Cilium agents from incorrectly restarting an etcd watch against a different clustermesh-apiserver instance. #32005

Merged

thorn3r mentioned this pull request Apr 22, 2024

Remove CiliumOperatorName constant #31597

Merged

8 tasks

thorn3r mentioned this pull request May 21, 2024

clustermesh-apiserver service sessionAffinity regression #32646

Closed

3 tasks
