Fix possible cross-cluster connection drops on agents restart when clustermesh is enabled #27575

giorio94 · 2023-08-17T17:04:25Z

This PR modifies the endpoint regeneration logic to additional wait for remote clusters ipcache synchronization before performing the datapath regeneration, to make sure that the new ipcache map starts to be used only when it has been completely populated.

A user-configurable timeout (defaulting to one minute) limits the maximum wait interval, to prevent blocking the local endpoints regeneration process for an extended period of time if a remote cluster is not ready (which could also lead to checken-and-egg dependencies). A special case is IPSec enabled, as this currently requires a synchronous regeneration of the host endpoint: here, we limit the timeout to at most 5 seconds, to not block the agent startup process.

In addition, when WireGuard is enabled, the deletion of obsolete peers is now performed after listing all nodes from remote clusters, again to prevent possible connectivity drops on agent restart.

Please refer to the individual commit messages for additional details.

Tested with #27232: https://github.com/cilium/cilium/actions/runs/5949436519
The downgrade test was based on the current tip of v1.14 with these changes on top; hence, I'm marking this PR for backport to v1.14 (and backport/author given that I've already the PR for that).

Fixes: #26462

Fix possible cross-cluster connection drops on agents restart when clustermesh is enabled

giorio94 · 2023-08-21T12:48:51Z

/test

giorio94 · 2023-08-22T13:49:50Z

/test

giorio94 · 2023-08-22T15:43:06Z

Converting back to draft, as the remaining part of the fix for #26462 (when endpoints are selected by netpols) should be simple enough to be included here.

giorio94 · 2023-08-22T17:05:55Z

/test

giorio94 · 2023-08-22T17:08:03Z

/test

Add a new IPIdentitiesSynced method to the clustermesh object to allow waiting until the initial list of ipcache entries and identities has been received from all known remote clusters. This mimics the ServicesSynced() method, and enables to postpone disruptive operations to prevent dropping valid long-running connections on agent restart. Differently from the services case, we don't need to separately track the propagation of the single entries to the datapath, as performed synchronously. Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>

Add a new NodesSynced method to the clustermesh object to allow waiting until the initial list of nodes has been received from all known remote clusters. This mimics the ServicesSynced() method, and enables to postpone disruptive operations to prevent dropping valid long-running connections on agent restart. Differently from the services case, we don't need to separately track the propagation of the single entries to the datapath, as performed synchronously. Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>

A new ipcache map is recreated from scratch and populated by each cilium agent upon restart, and the existing endpoints are switched to use it during the regeneration process of the endpoint itself. Before performing this operation, the new ipcache map needs to have already been fully synchronized, otherwise we might drop long running connections. Currently we wait for synchronization from k8s (when in CRD mode) and the kvstore (when in kvstore mode). Yet, when clustermesh is enabled, we don't wait for synchronization from all the remote clusters, possibly disrupting existing cross-cluster connections. The same also applies to identities in case network policies targeting the given endpoint are present. This commit introduces additional logic to wait for ipcache and identities synchronization from remote clusters before triggering endpoint regeneration. The wait period is bound by a user-configurable timeout, defaulting to one minute. The idea being that we don't block the local endpoints regeneration for an extended period of time if a given remote cluster is not ready. This also prevents possible inter dependency problems, with the regeneration in one cluster being required to allow other clusters to successfully connect to it. Nonetheless, the timeout can be configured through a dedicated --clustermesh-ip-identities-sync-timeout flag, to select the desired trade-off between possible endpoints regeneration delay and likelihood of connection drops if a given remote clusters is not ready (e.g., because the corresponding clustermesh-apiserver is restarting). A special case is represented by the host endpoint if IPSec is enabled, because it is currently restored synchronously to ensure ordering. Given that blocking this operation prevents the agent from becoming ready, and in turn new pods from being scheduled, we forcefully cap the wait timeout to 5 seconds. Empirically, this value has proved to be sufficient to retrieve a high number of entries if the remote kvstore is ready (e.g., the clustermesh-apiserver pod is running, and already synchronized with Kubernetes). Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>

Currently, the WireGuard subsystem triggers the deletion of obsolete peers when the daemon restoration process has completed. Yet, at this point the information about nodes belonging to remote clusters might not yet have been received, causing the disruption of cross-cluster connections. Hence, let's first wait for the synchronization of all remote nodes before kicking off the removal process. Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>

giorio94 · 2023-08-30T17:39:16Z

Rebased onto main to pick CI changes

giorio94 · 2023-08-30T17:52:07Z

/test

michi-covalent · 2023-09-09T20:37:33Z

temporarily moving this back to needs/backport/1.14 to make the release script happy.

giorio94 added kind/bug This is a bug in the Cilium logic. release-note/bug This PR fixes an issue in a previous release of Cilium. area/clustermesh Relates to multi-cluster routing functionality in Cilium. labels Aug 17, 2023

giorio94 changed the title ~~Fix possible connection drops on agents restart when clustermesh is enabled~~ Fix possible cross-cluster connection drops on agents restart when clustermesh is enabled Aug 17, 2023

giorio94 mentioned this pull request Aug 21, 2023

Fix possible cross-cluster connection drops on agents restart when clustermesh is enabled #27611

Merged

1 task

giorio94 force-pushed the mio/fix-clustermesh-drops-on-restart branch from 28ed271 to d4c61f7 Compare August 22, 2023 13:11

giorio94 mentioned this pull request Aug 22, 2023

CI: Runtime Test (privileged): Test/linuxPrivilegedIPv6OnlyTestSuite/TestArpPingHandlingForMultiDevice #26669

Closed

giorio94 marked this pull request as ready for review August 22, 2023 14:27

giorio94 requested review from a team as code owners August 22, 2023 14:27

giorio94 requested review from brb, jrajahalme, ti-mo and thorn3r August 22, 2023 14:27

giorio94 added the needs-backport/1.14 This PR / issue needs backporting to the v1.14 branch label Aug 22, 2023

giorio94 marked this pull request as draft August 22, 2023 15:43

giorio94 force-pushed the mio/fix-clustermesh-drops-on-restart branch from d4c61f7 to 67f8e99 Compare August 22, 2023 17:05

giorio94 force-pushed the mio/fix-clustermesh-drops-on-restart branch from 67f8e99 to 98f1dd5 Compare August 22, 2023 17:07

ti-mo approved these changes Aug 23, 2023

View reviewed changes

giorio94 force-pushed the mio/fix-clustermesh-drops-on-restart branch from 98f1dd5 to 0acd226 Compare August 23, 2023 08:35

giorio94 added the backport/author The backport will be carried out by the author of the PR. label Aug 23, 2023

giorio94 added 4 commits August 30, 2023 19:35

giorio94 force-pushed the mio/fix-clustermesh-drops-on-restart branch from 3b69304 to e0b482a Compare August 30, 2023 17:39

giorio94 removed the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Aug 30, 2023

This was referenced Aug 31, 2023

CI: K8sAgentPolicyTest Multi-node policy test validates ingress CIDR-dependent L4 connectivity works from the outside before any policies #27052

Closed

CI: Multicluster: client-egress-l7-named-port failed #27672

Open

maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Aug 31, 2023

borkmann merged commit 6a8f5ab into cilium:main Aug 31, 2023
59 checks passed

giorio94 added backport-pending/1.14 The backport for Cilium 1.14.x for this PR is in progress. and removed needs-backport/1.14 This PR / issue needs backporting to the v1.14 branch labels Sep 1, 2023

maintainer-s-little-helper bot moved this from Needs backport from main to Backport pending to v1.14 in 1.14.2 Sep 1, 2023

michi-covalent added needs-backport/1.14 This PR / issue needs backporting to the v1.14 branch and removed backport-pending/1.14 The backport for Cilium 1.14.x for this PR is in progress. labels Sep 9, 2023

maintainer-s-little-helper bot moved this from Backport pending to v1.14 to Needs backport from main in 1.14.2 Sep 9, 2023

michi-covalent added this to Needs backport from main in 1.14.3 Sep 9, 2023

michi-covalent removed this from Needs backport from main in 1.14.2 Sep 9, 2023

michi-covalent added backport-pending/1.14 The backport for Cilium 1.14.x for this PR is in progress. and removed needs-backport/1.14 This PR / issue needs backporting to the v1.14 branch labels Sep 9, 2023

michi-covalent moved this from Needs backport from main to Backport pending to v1.14 in 1.14.3 Sep 9, 2023

giorio94 added backport-done/1.14 The backport for Cilium 1.14.x for this PR is done. and removed backport-pending/1.14 The backport for Cilium 1.14.x for this PR is in progress. labels Sep 22, 2023

giorio94 deleted the mio/fix-clustermesh-drops-on-restart branch September 22, 2023 16:45

jrajahalme moved this from Backport pending to v1.14 to Backport done to v1.14 in 1.14.3 Oct 18, 2023

giorio94 mentioned this pull request Dec 4, 2023

Fix possible disruption of long running, cross-cluster, pod to node traffic on agent restart #29613

Merged

giorio94 mentioned this pull request Jan 9, 2024

Possible circular dependency preventing to connect to the clustermesh-apiserver NodePort (with tunneling and KPR=true) in rare conditions #30156

Open

2 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix possible cross-cluster connection drops on agents restart when clustermesh is enabled #27575

Fix possible cross-cluster connection drops on agents restart when clustermesh is enabled #27575

giorio94 commented Aug 17, 2023 •

edited

giorio94 commented Aug 21, 2023

giorio94 commented Aug 22, 2023

giorio94 commented Aug 22, 2023

giorio94 commented Aug 22, 2023

giorio94 commented Aug 22, 2023

giorio94 commented Aug 30, 2023

giorio94 commented Aug 30, 2023

michi-covalent commented Sep 9, 2023

Fix possible cross-cluster connection drops on agents restart when clustermesh is enabled #27575

Fix possible cross-cluster connection drops on agents restart when clustermesh is enabled #27575

Conversation

giorio94 commented Aug 17, 2023 • edited

giorio94 commented Aug 21, 2023

giorio94 commented Aug 22, 2023

giorio94 commented Aug 22, 2023

giorio94 commented Aug 22, 2023

giorio94 commented Aug 22, 2023

giorio94 commented Aug 30, 2023

giorio94 commented Aug 30, 2023

michi-covalent commented Sep 9, 2023

giorio94 commented Aug 17, 2023 •

edited