Add support for multiple clustermesh-apiserver replicas (ClusterMesh HA) #31677
Conversation
/test

Force-pushed from 0ea2d4e to b855a05.
/test

Force-pushed from b855a05 to 660dc0f.
/test
Looks great to me! Just a bunch of minor comments and nits inline.
Force-pushed from 660dc0f to ee4fb0a.
/test

/test
Update the clustermesh-upgrade CI workflow to test rolling upgrades/failover by setting maxSurge=1 and maxUnavailable=0, and add a new test using a global service. The test restarts the clustermesh-apiserver deployment on cluster2 and deploys a global service to both clusters. Cluster1 should properly handle the failover event, establish a connection with the new clustermesh-apiserver pods, and synchronize the endpoints from cluster2 for the service.

To ensure that ClusterMesh can handle failover events both with and without KVStoreMesh enabled, upgrading Cilium and enabling KVStoreMesh for cluster1 are split into separate steps.

Signed-off-by: Tim Horner <timothy.horner@isovalent.com>
Force-pushed from 571dc26 to bbf4488.
/ci-clustermesh
/test
TIL that comments in a multi-line command in bash are tricky.

Yep, I typically just add them above the entire command, to avoid these kinds of problems.
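The pitfall discussed above is easy to reproduce: a `#` comment placed after a backslash continuation ends the line there, so the intended continuation runs as a separate command. A minimal sketch of the failing and the safe pattern (the `printf` command here is just an illustration, not from the PR):

```shell
# BROKEN: the comment ends the line, so "bar" would run as its own command:
#   printf '%s\n' foo \   # this comment breaks the continuation
#     bar

# WORKS: keep comments on their own lines above the command.
# Print two values, one per line.
printf '%s\n' \
  foo \
  bar
```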
Thanks!
Marking for backport to v1.15 to address #30964. I'm going to backport a reduced version that only includes the configuration of the unique etcd cluster ID and the interceptor logic, fixing a bug that could cause Cilium agents to incorrectly restart an etcd watch against a different clustermesh-apiserver instance.
This adds support for running clustermesh-apiserver deployments with multiple replicas for high availability.
Each clustermesh-apiserver pod runs its own etcd cluster. Depending on the configuration, either the Cilium agent or a KVStoreMesh instance watches etcd in a remote cluster. Every response from the remote etcd cluster is intercepted and its header inspected to retrieve the etcd cluster ID. If a failover occurs and the cluster ID has changed, the remote connection is restarted to ensure that no events are missed and that no stale data is retained. See the individual commit messages for additional details.
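The cluster-ID check described above can be sketched as follows. This is a minimal illustration, not Cilium's actual code: the real implementation hooks into the etcd client's gRPC interceptors and uses `etcdserverpb.ResponseHeader`, whereas the type and method names here are hypothetical stand-ins:

```go
package main

import "fmt"

// ResponseHeader mimics the header carried by every etcd response;
// only the cluster ID matters for failover detection. (Sketch only:
// the real type is etcdserverpb.ResponseHeader from go.etcd.io/etcd.)
type ResponseHeader struct {
	ClusterId uint64
}

// clusterIDInterceptor records the cluster ID seen on the first
// response and reports whether a later response originates from a
// different etcd cluster, i.e. a failover to another
// clustermesh-apiserver replica whose watch must be restarted.
type clusterIDInterceptor struct {
	expected uint64
}

// Check returns true when the remote connection must be restarted
// because the observed etcd cluster ID no longer matches the one
// recorded from the first response.
func (i *clusterIDInterceptor) Check(h ResponseHeader) bool {
	if i.expected == 0 {
		i.expected = h.ClusterId // first response: remember the ID
		return false
	}
	return h.ClusterId != i.expected
}

func main() {
	ic := &clusterIDInterceptor{}
	fmt.Println(ic.Check(ResponseHeader{ClusterId: 0xaaa})) // first response
	fmt.Println(ic.Check(ResponseHeader{ClusterId: 0xaaa})) // same cluster
	fmt.Println(ic.Check(ResponseHeader{ClusterId: 0xbbb})) // failover detected
}
```

Restarting the watch on an ID change is what guarantees that no events are missed: a new replica's etcd cluster has no shared revision history with the old one, so resuming the old watch revision against it would be invalid.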