clustermesh: correctly remove remoteCache on connection disruption #23532

Merged: 3 commits into cilium:master on Feb 15, 2023

Conversation

@squeed (Contributor) commented Feb 2, 2023

Upon reconnect, we failed to remove the old remoteCache (we were looking at the wrong Allocator on cleanup), meaning that every time we reconnected, all old remoteCaches were kept around.

This was, at best, a memory leak, and at worst meant that we continued to read stale data even after reconnecting, depending on the ordering of a map iteration.
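To make the failure mode concrete, here is a minimal, self-contained sketch (stand-in types and lowercase names, an assumption rather than the actual Cilium allocator code): remote caches are tracked in a map keyed by cluster name, re-watching a cluster replaces the existing entry, and cleanup deletes from the owner that actually holds the entry, so reconnects no longer strand old caches.

// Sketch only: stand-in types illustrating the keyed-cache bookkeeping
// this PR fixes; names and structure are assumptions, not Cilium code.
package main

import (
	"fmt"
	"sync"
)

type remoteCache struct{ cluster string }

type identityAllocator struct {
	mu           sync.Mutex
	remoteCaches map[string]*remoteCache // keyed by remote cluster name
}

// watchRemoteIdentities registers the cache for a cluster, replacing any
// previous entry for the same name instead of accumulating a new one.
func (a *identityAllocator) watchRemoteIdentities(name string) *remoteCache {
	a.mu.Lock()
	defer a.mu.Unlock()
	rc := &remoteCache{cluster: name}
	a.remoteCaches[name] = rc
	return rc
}

// removeRemoteIdentities drops the cache for a cluster whose connection
// has been torn down, so neither memory nor stale entries linger.
func (a *identityAllocator) removeRemoteIdentities(name string) {
	a.mu.Lock()
	defer a.mu.Unlock()
	delete(a.remoteCaches, name)
}

func main() {
	a := &identityAllocator{remoteCaches: map[string]*remoteCache{}}
	for i := 0; i < 3; i++ { // three reconnects to the same remote cluster
		a.watchRemoteIdentities("cluster-2")
	}
	a.removeRemoteIdentities("cluster-2")
	fmt.Println("caches left:", len(a.remoteCaches)) // 0, nothing stranded
}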

Many thanks to @oblazek, who found the root cause.

Fixes: #22988
Fixes: #13446

Signed-off-by: Casey Callendrello <cdc@isovalent.com>

Fixes a memory leak and (possible) source of stale data for Clustermesh whenever the connection to the remote cluster is disrupted or restarted.

@squeed requested review from a team as code owners on February 2, 2023 10:23
maintainer-s-little-helper bot added the dont-merge/needs-release-note-label label (The author needs to describe the release impact of these changes.) on Feb 2, 2023
@squeed added the kind/bug (This is a bug in the Cilium logic.) and release-note/bug (This PR fixes an issue in a previous release of Cilium.) labels on Feb 2, 2023
maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label label on Feb 2, 2023
@squeed requested a review from a team as a code owner on February 2, 2023 10:31
@squeed (Contributor, Author) commented Feb 2, 2023

/ci-multicluster

@squeed (Contributor, Author) commented Feb 2, 2023

/ci-multicluster

@squeed (Contributor, Author) commented Feb 2, 2023

Something seems wrong; this failure doesn't seem like it should be happening. And this test has been pretty stable.

@christarazi added the area/clustermesh (Relates to multi-cluster routing functionality in Cilium.), affects/v1.11 (This issue affects v1.11 branch), affects/v1.12 (This issue affects v1.12 branch), and needs-backport/1.13 (This PR / issue needs backporting to the v1.13 branch) labels on Feb 2, 2023
christarazi: this comment was marked as duplicate.

@christarazi (Member) left a comment

Thanks for submitting the fix and thanks to @oblazek for finding it!

@squeed (Contributor, Author) commented Feb 2, 2023

/ci-multicluster

@aditighag (Member) left a comment

Good find! The refactoring and typo fixes could have been split into a separate commit. LGTM.
Some of the functions need updated descriptions.

@@ -440,18 +440,24 @@ func (m *CachingIdentityAllocator) ReleaseSlice(ctx context.Context, owner Ident
 // WatchRemoteIdentities starts watching for identities in another kvstore and
 // syncs all identities to the local identity cache.
-func (m *CachingIdentityAllocator) WatchRemoteIdentities(backend kvstore.BackendOperations) (*allocator.RemoteCache, error) {
+func (m *CachingIdentityAllocator) WatchRemoteIdentities(name string, backend kvstore.BackendOperations) (*allocator.RemoteCache, error) {
Review comment (Member):

Nit: Please update the function description.
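One possible updated description for the new signature, based on how the name parameter is used elsewhere in this PR (a suggestion only, not the wording that was actually merged):

// WatchRemoteIdentities starts watching for identities in the kvstore of
// the remote cluster identified by name and syncs all identities to the
// local identity cache. The provided name is also the key under which the
// remote cache is tracked, so it can later be dropped again via
// RemoveRemoteIdentities.
func (m *CachingIdentityAllocator) WatchRemoteIdentities(name string, backend kvstore.BackendOperations) (*allocator.RemoteCache, error) {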

@@ -269,6 +269,7 @@ func (rc *remoteCluster) restartRemoteConnection(allocator RemoteIdentityWatcher
 	rc.releaseOldConnection()
 	rc.mesh.metricTotalNodes.WithLabelValues(rc.mesh.conf.Name, rc.mesh.conf.NodeName, rc.name).Set(float64(rc.remoteNodes.NumEntries()))
 	rc.mesh.metricReadinessStatus.WithLabelValues(rc.mesh.conf.Name, rc.mesh.conf.NodeName, rc.name).Set(metrics.BoolToFloat64(rc.isReadyLocked()))
+	allocator.RemoveRemoteIdentities(rc.name)
@oblazek (Contributor) commented:

Can we also add this somewhere here? Otherwise it does not really fix the memory-leak issue: the controller itself is not stopped, so this cleanup is never run in my case.

@squeed (Contributor, Author) replied:

@oblazek There should be only one controller per remote cluster, so as soon as a new connection is made, it will replace the old cache in the allocator. I decided to preserve the old one, so that connections aren't disrupted while etcd blips. Otherwise all identities are deleted and re-created, meaning connectivity is lost.

@oblazek (Contributor) replied:

ah yes, got it
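A short sketch of the behaviour described in this thread, using stand-in types rather than the real remoteCluster controller: with one cache slot per remote cluster, a reconnect overwrites the old cache in place, the stale-but-usable cache keeps serving lookups during an etcd blip (so identities are not deleted and re-created), and the explicit removal only matters when the connection to the remote cluster is actually being removed.

// Sketch only (hypothetical names): one cache slot per remote cluster.
package main

import (
	"fmt"
	"sync"
)

type idCache struct{ generation int }

type remoteCaches struct {
	mu     sync.RWMutex
	byName map[string]*idCache
}

// replace installs the cache produced by a (re)established connection,
// overwriting whatever was registered under the same cluster name.
func (r *remoteCaches) replace(name string, next *idCache) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.byName[name] = next
}

// lookup keeps answering from the existing cache, even while the remote
// kvstore is unreachable and a reconnect is in progress.
func (r *remoteCaches) lookup(name string) *idCache {
	r.mu.RLock()
	defer r.mu.RUnlock()
	return r.byName[name]
}

func main() {
	r := &remoteCaches{byName: map[string]*idCache{"cluster-2": {generation: 1}}}
	fmt.Println("during etcd blip:", r.lookup("cluster-2").generation) // still 1
	r.replace("cluster-2", &idCache{generation: 2})                    // reconnect finished
	fmt.Println("after reconnect:", r.lookup("cluster-2").generation)  // now 2
}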

@squeed (Contributor, Author) commented Feb 3, 2023

@oblazek found another memory leak :-). Fix coming shortly.

@squeed (Contributor, Author) commented Feb 6, 2023

OK, I've banged on this pretty hard and couldn't get the bug to reappear. This seems good to go.

@pchaigno (Member) commented Feb 8, 2023

/test

@squeed (Contributor, Author) commented Feb 9, 2023

ci-multicluster has failed. Investigating.

@squeed (Contributor, Author) commented Feb 13, 2023

Found the issue; we were starting the connectivity test too soon after restarting the agents. I have a test run in #23716, which I'll include here when finished. I believe this change is safe.

@YutaroHayakawa (Member) left a comment

Sorry for the late response. Changes look good to me (including the test changes), thanks!

Upon reconnect, we failed to remove the old remoteCache (we were looking
at the wrong Allocator on cleanup), meaning that every time we
reconnected, all old remoteCaches were kept around.

This was, at best, a memory leak, and at worst meant that we continued
to read stale data even after reconnecting, depending on the ordering of
a map iteration.

Fixes: cilium#22988
Fixes: cilium#13446

Signed-off-by: Casey Callendrello <cdc@isovalent.com>
Before this change, RemoteCache would create and start a second cache on
an existing Allocator, which is a waste of resources.

So, just pass through to the Allocator's existing cache directly.
There's no need to duplicate etcd watches or the in-memory cache.

Signed-off-by: Casey Callendrello <cdc@isovalent.com>
When two clusters are connected, it causes a full agent rollout. So,
wait for that to finish with "cilium status" before proceeding with
connectivity tests.

Signed-off-by: Casey Callendrello <cdc@isovalent.com>
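The second commit message above describes removing a duplicated cache; below is a minimal sketch of the pass-through pattern it hints at, using hypothetical names rather than the real RemoteCache and Allocator types.

// Sketch only: the remote view holds no cache of its own and starts no
// second kvstore watch; it delegates reads to the cache the allocator
// already keeps in sync.
package main

import "fmt"

type allocatorSketch struct {
	// mainCache stands in for the identity cache the allocator already
	// maintains from its existing kvstore watch.
	mainCache map[int]string
}

type remoteCacheSketch struct {
	parent *allocatorSketch
}

// lookupByID passes straight through to the allocator's existing cache.
func (rc *remoteCacheSketch) lookupByID(id int) (string, bool) {
	v, ok := rc.parent.mainCache[id]
	return v, ok
}

func main() {
	a := &allocatorSketch{mainCache: map[int]string{16777217: "k8s:app=frontend"}}
	rc := &remoteCacheSketch{parent: a}
	fmt.Println(rc.lookupByID(16777217)) // k8s:app=frontend true
}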
@squeed (Contributor, Author) commented Feb 14, 2023

Passing run for proposed CI changes here: https://github.com/cilium/cilium/actions/runs/4164795509/jobs/7206931563

@squeed (Contributor, Author) commented Feb 14, 2023

/ci-multicluster

@squeed (Contributor, Author) commented Feb 14, 2023

CI is all green, including multicluster (the flake fixed by this CI change didn't hit us this time).

Comment on lines +325 to +326
cilium status --wait --context ${{ steps.contexts.outputs.context1 }}
cilium status --wait --context ${{ steps.contexts.outputs.context2 }}
Review comment (Member):

While this is fine for now, I suppose a more proper fix would be to have the clustermesh status --wait command automatically also imply a status --wait directly in CLI code. cc @tklauser for thoughts.

@squeed (Contributor, Author) replied:

Well, ultimately, they do different things. Ideally there would be a --wait in clustermesh join.

@squeed (Contributor, Author) commented Feb 14, 2023

/test

@squeed (Contributor, Author) commented Feb 15, 2023

/ci-verifier

maintainer-s-little-helper bot added the ready-to-merge (This PR has passed all tests and received consensus from code owners to merge.) label on Feb 15, 2023
@pchaigno merged commit 26e6bcf into cilium:master on Feb 15, 2023
@pchaigno mentioned this pull request on Feb 16, 2023
@pchaigno added the backport-pending/1.13 (The backport for Cilium 1.13.x for this PR is in progress.) label and removed the needs-backport/1.13 (This PR / issue needs backporting to the v1.13 branch) label on Feb 16, 2023
@julianwiedmann added the backport-done/1.13 (The backport for Cilium 1.13.x for this PR is done.) label and removed the backport-pending/1.13 (The backport for Cilium 1.13.x for this PR is in progress.) label on Apr 10, 2023
Labels
affects/v1.11: This issue affects v1.11 branch
affects/v1.12: This issue affects v1.12 branch
area/clustermesh: Relates to multi-cluster routing functionality in Cilium.
backport-done/1.13: The backport for Cilium 1.13.x for this PR is done.
kind/bug: This is a bug in the Cilium logic.
ready-to-merge: This PR has passed all tests and received consensus from code owners to merge.
release-note/bug: This PR fixes an issue in a previous release of Cilium.