clustermesh: correctly remove remoteCache on connection disruption #23532

Merged: 3 commits into cilium:master on Feb 15, 2023

Conversation

@squeed (Contributor) commented Feb 2, 2023

Upon reconnect, we failed to remove the old remoteCache (we were looking at the wrong Allocator on cleanup), meaning that every time we reconnected, all old remoteCaches were kept around.

This was, at best, a memory leak, and at worst meant that we continued to read stale data even after reconnecting, depending on the ordering of a map iteration.
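To make the failure mode concrete, here is a minimal, self-contained sketch (stand-in types and lowercase names, an assumption rather than the actual Cilium allocator code): remote caches are tracked in a map keyed by cluster name, re-watching a cluster replaces the existing entry, and cleanup deletes from the owner that actually holds the entry, so reconnects no longer strand old caches.

// Sketch only: stand-in types illustrating the keyed-cache bookkeeping
// this PR fixes; names and structure are assumptions, not Cilium code.
package main

import (
	"fmt"
	"sync"
)

type remoteCache struct{ cluster string }

type identityAllocator struct {
	mu           sync.Mutex
	remoteCaches map[string]*remoteCache // keyed by remote cluster name
}

// watchRemoteIdentities registers the cache for a cluster, replacing any
// previous entry for the same name instead of accumulating a new one.
func (a *identityAllocator) watchRemoteIdentities(name string) *remoteCache {
	a.mu.Lock()
	defer a.mu.Unlock()
	rc := &remoteCache{cluster: name}
	a.remoteCaches[name] = rc
	return rc
}

// removeRemoteIdentities drops the cache for a cluster whose connection
// has been torn down, so neither memory nor stale entries linger.
func (a *identityAllocator) removeRemoteIdentities(name string) {
	a.mu.Lock()
	defer a.mu.Unlock()
	delete(a.remoteCaches, name)
}

func main() {
	a := &identityAllocator{remoteCaches: map[string]*remoteCache{}}
	for i := 0; i < 3; i++ { // three reconnects to the same remote cluster
		a.watchRemoteIdentities("cluster-2")
	}
	a.removeRemoteIdentities("cluster-2")
	fmt.Println("caches left:", len(a.remoteCaches)) // 0, nothing stranded
}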

Many thanks to @oblazek, who found the root cause.

Fixes: #22988
Fixes: #13446

Signed-off-by: Casey Callendrello <cdc@isovalent.com>

Fixes a memory leak and (possible) source of stale data for Clustermesh whenever the connection to the remote cluster is disrupted or restarted.

@squeed requested review from a team as code owners on February 2, 2023 10:23
maintainer-s-little-helper bot added the dont-merge/needs-release-note-label label (The author needs to describe the release impact of these changes.) on Feb 2, 2023
@squeed added the kind/bug (This is a bug in the Cilium logic.) and release-note/bug (This PR fixes an issue in a previous release of Cilium.) labels on Feb 2, 2023
maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label label on Feb 2, 2023
@squeed requested a review from a team as a code owner on February 2, 2023 10:31
@squeed (Contributor, Author) commented Feb 2, 2023

/ci-multicluster

@squeed (Contributor, Author) commented Feb 2, 2023

/ci-multicluster

@squeed (Contributor, Author) commented Feb 2, 2023

Something seems wrong; this failure doesn't seem like it should be happening. And this test has been pretty stable.

@christarazi added the area/clustermesh (Relates to multi-cluster routing functionality in Cilium.), affects/v1.11 (This issue affects v1.11 branch), affects/v1.12 (This issue affects v1.12 branch), and needs-backport/1.13 (This PR / issue needs backporting to the v1.13 branch) labels on Feb 2, 2023
christarazi: this comment was marked as duplicate.

@christarazi (Member) left a comment

Thanks for submitting the fix and thanks to @oblazek for finding it!

@squeed (Contributor, Author) commented Feb 2, 2023

/ci-multicluster

@aditighag (Member) left a comment

Good find! The refactoring and typo fixes could have been split into a separate commit. LGTM.
Some of the functions need updated descriptions.

@@ -440,18 +440,24 @@ func (m *CachingIdentityAllocator) ReleaseSlice(ctx context.Context, owner Ident
 // WatchRemoteIdentities starts watching for identities in another kvstore and
 // syncs all identities to the local identity cache.
-func (m *CachingIdentityAllocator) WatchRemoteIdentities(backend kvstore.BackendOperations) (*allocator.RemoteCache, error) {
+func (m *CachingIdentityAllocator) WatchRemoteIdentities(name string, backend kvstore.BackendOperations) (*allocator.RemoteCache, error) {
Review comment (Member):

Nit: Please update the function description.
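One possible updated description for the new signature, based on how the name parameter is used elsewhere in this PR (a suggestion only, not the wording that was actually merged):

// WatchRemoteIdentities starts watching for identities in the kvstore of
// the remote cluster identified by name and syncs all identities to the
// local identity cache. The provided name is also the key under which the
// remote cache is tracked, so it can later be dropped again via
// RemoveRemoteIdentities.
func (m *CachingIdentityAllocator) WatchRemoteIdentities(name string, backend kvstore.BackendOperations) (*allocator.RemoteCache, error) {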

@@ -269,6 +269,7 @@ func (rc *remoteCluster) restartRemoteConnection(allocator RemoteIdentityWatcher
 	rc.releaseOldConnection()
 	rc.mesh.metricTotalNodes.WithLabelValues(rc.mesh.conf.Name, rc.mesh.conf.NodeName, rc.name).Set(float64(rc.remoteNodes.NumEntries()))
 	rc.mesh.metricReadinessStatus.WithLabelValues(rc.mesh.conf.Name, rc.mesh.conf.NodeName, rc.name).Set(metrics.BoolToFloat64(rc.isReadyLocked()))
+	allocator.RemoveRemoteIdentities(rc.name)
@oblazek (Contributor) commented:

Can we also add this somewhere here? Otherwise it does not really fix the memory-leak issue: the controller itself is not stopped, so this cleanup is never run in my case.

@squeed (Contributor, Author) replied:

@oblazek There should be only one controller per remote cluster, so as soon as a new connection is made, it will replace the old cache in the allocator. I decided to preserve the old one, so that connections aren't disrupted while etcd blips. Otherwise all identities are deleted and re-created, meaning connectivity is lost.

@oblazek (Contributor) replied:

ah yes, got it
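A short sketch of the behaviour described in this thread, using stand-in types rather than the real remoteCluster controller: with one cache slot per remote cluster, a reconnect overwrites the old cache in place, the stale-but-usable cache keeps serving lookups during an etcd blip (so identities are not deleted and re-created), and the explicit removal only matters when the connection to the remote cluster is actually being removed.

// Sketch only (hypothetical names): one cache slot per remote cluster.
package main

import (
	"fmt"
	"sync"
)

type idCache struct{ generation int }

type remoteCaches struct {
	mu     sync.RWMutex
	byName map[string]*idCache
}

// replace installs the cache produced by a (re)established connection,
// overwriting whatever was registered under the same cluster name.
func (r *remoteCaches) replace(name string, next *idCache) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.byName[name] = next
}

// lookup keeps answering from the existing cache, even while the remote
// kvstore is unreachable and a reconnect is in progress.
func (r *remoteCaches) lookup(name string) *idCache {
	r.mu.RLock()
	defer r.mu.RUnlock()
	return r.byName[name]
}

func main() {
	r := &remoteCaches{byName: map[string]*idCache{"cluster-2": {generation: 1}}}
	fmt.Println("during etcd blip:", r.lookup("cluster-2").generation) // still 1
	r.replace("cluster-2", &idCache{generation: 2})                    // reconnect finished
	fmt.Println("after reconnect:", r.lookup("cluster-2").generation)  // now 2
}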

@squeed (Contributor, Author) commented Feb 3, 2023

@oblazek found another memory leak :-). Fix coming shortly.

@squeed (Contributor, Author) commented Feb 6, 2023

OK, I've banged on this pretty hard and couldn't get the bug to reappear. This seems good to go.

@pchaigno (Member) commented Feb 8, 2023

/test

@squeed (Contributor, Author) commented Feb 9, 2023

ci-multicluster has failed. Investigating.

@squeed (Contributor, Author) commented Feb 13, 2023

Found the issue; we were starting the connectivity test too soon after restarting the agents. I have a test run in #23716, which I'll include here when finished. I believe this change is safe.

@YutaroHayakawa (Member) left a comment

Sorry for the late response. Changes look good to me (including the test changes), thanks!

Upon reconnect, we failed to remove the old remoteCache (we were looking
at the wrong Allocator on cleanup), meaning that every time we
reconnected, all old remoteCaches were kept around.

This was, at best, a memory leak, and at worst meant that we continued
to read stale data even after reconnecting, depending on the ordering of
a map iteration.

Fixes: cilium#22988
Fixes: cilium#13446

Signed-off-by: Casey Callendrello <cdc@isovalent.com>
Before this change, RemoteCache would create and start a second cache on
an existing Allocator, which is a waste of resources.

So, just pass through to the Allocator's existing cache directly.
There's no need to duplicate etcd watches or the in-memory cache.

Signed-off-by: Casey Callendrello <cdc@isovalent.com>
When two clusters are connected, it causes a full agent rollout. So,
wait for that to finish with "cilium status" before proceeding with
connectivity tests.

Signed-off-by: Casey Callendrello <cdc@isovalent.com>
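The second commit message above describes removing a duplicated cache; below is a minimal sketch of the pass-through pattern it hints at, using hypothetical names rather than the real RemoteCache and Allocator types.

// Sketch only: the remote view holds no cache of its own and starts no
// second kvstore watch; it delegates reads to the cache the allocator
// already keeps in sync.
package main

import "fmt"

type allocatorSketch struct {
	// mainCache stands in for the identity cache the allocator already
	// maintains from its existing kvstore watch.
	mainCache map[int]string
}

type remoteCacheSketch struct {
	parent *allocatorSketch
}

// lookupByID passes straight through to the allocator's existing cache.
func (rc *remoteCacheSketch) lookupByID(id int) (string, bool) {
	v, ok := rc.parent.mainCache[id]
	return v, ok
}

func main() {
	a := &allocatorSketch{mainCache: map[int]string{16777217: "k8s:app=frontend"}}
	rc := &remoteCacheSketch{parent: a}
	fmt.Println(rc.lookupByID(16777217)) // k8s:app=frontend true
}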
@squeed (Contributor, Author) commented Feb 14, 2023

Passing run for proposed CI changes here: https://github.com/cilium/cilium/actions/runs/4164795509/jobs/7206931563

@squeed (Contributor, Author) commented Feb 14, 2023

/ci-multicluster

@squeed (Contributor, Author) commented Feb 14, 2023

CI is all green, including multicluster (the flake fixed by this CI change didn't hit us this time).

Comment on lines +325 to +326
cilium status --wait --context ${{ steps.contexts.outputs.context1 }}
cilium status --wait --context ${{ steps.contexts.outputs.context2 }}
Review comment (Member):

While this is fine for now, I suppose a more proper fix would be to have the clustermesh status --wait command automatically also imply a status --wait directly in CLI code. cc @tklauser for thoughts.

@squeed (Contributor, Author) replied:

Well, ultimately, they do different things. Ideally there would be a --wait in clustermesh join.

@squeed (Contributor, Author) commented Feb 14, 2023

/test

@squeed (Contributor, Author) commented Feb 15, 2023

/ci-verifier

maintainer-s-little-helper bot added the ready-to-merge (This PR has passed all tests and received consensus from code owners to merge.) label on Feb 15, 2023
@pchaigno merged commit 26e6bcf into cilium:master on Feb 15, 2023
@pchaigno mentioned this pull request on Feb 16, 2023
@pchaigno added the backport-pending/1.13 (The backport for Cilium 1.13.x for this PR is in progress.) label and removed the needs-backport/1.13 (This PR / issue needs backporting to the v1.13 branch) label on Feb 16, 2023
@julianwiedmann added the backport-done/1.13 (The backport for Cilium 1.13.x for this PR is done.) label and removed the backport-pending/1.13 (The backport for Cilium 1.13.x for this PR is in progress.) label on Apr 10, 2023
Labels
affects/v1.11: This issue affects v1.11 branch
affects/v1.12: This issue affects v1.12 branch
area/clustermesh: Relates to multi-cluster routing functionality in Cilium.
backport-done/1.13: The backport for Cilium 1.13.x for this PR is done.
kind/bug: This is a bug in the Cilium logic.
ready-to-merge: This PR has passed all tests and received consensus from code owners to merge.
release-note/bug: This PR fixes an issue in a previous release of Cilium.