Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix missed deletion events when reconnecting to/disconnecting from remote clusters (identities) #25677

Conversation

giorio94
Copy link
Member

Follow up of #25499 targeting identities synchronization. Please refer to the above PR and the commit descriptions for additional information.

Related: #24740
Related: #25499

Fix missed deletion events when reconnecting to/disconnecting from remote clusters (identities)

@giorio94 giorio94 added kind/bug This is a bug in the Cilium logic. release-note/bug This PR fixes an issue in a previous release of Cilium. area/clustermesh Relates to multi-cluster routing functionality in Cilium. sig/kvstore Impacts the KVStore package interactions. labels May 25, 2023
@giorio94 giorio94 force-pushed the mio/clustermesh-reconnect-deletion-events-identities branch 5 times, most recently from a32c53a to 055bd44 Compare May 29, 2023 12:02
@giorio94 giorio94 force-pushed the mio/clustermesh-reconnect-deletion-events-identities branch from 055bd44 to c1b23e2 Compare June 1, 2023 10:42
@giorio94
Copy link
Member Author

giorio94 commented Jun 1, 2023

/test

@giorio94 giorio94 added the release-blocker/1.14 This issue will prevent the release of the next version of Cilium. label Jun 1, 2023
@giorio94 giorio94 marked this pull request as ready for review June 1, 2023 10:44
@giorio94 giorio94 requested review from a team as code owners June 1, 2023 10:44
@giorio94 giorio94 force-pushed the mio/clustermesh-reconnect-deletion-events-identities branch from c1b23e2 to 886b27d Compare June 8, 2023 08:44
@giorio94
Copy link
Member Author

giorio94 commented Jun 8, 2023

Rebased onto main to fix conflicts.

@giorio94
Copy link
Member Author

giorio94 commented Jun 8, 2023

/test

Job 'Cilium-PR-K8s-1.26-kernel-net-next' failed:

Click to show.

Test Name

K8sUpdates Tests upgrade and downgrade from a Cilium stable image to master

Failure Output

FAIL: Unexpected missed tail calls

Jenkins URL: https://jenkins.cilium.io/job/Cilium-PR-K8s-1.26-kernel-net-next/599/

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-1.26-kernel-net-next so I can create one.

Then please upload the Jenkins artifacts to that issue.

Copy link
Member

@tklauser tklauser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code changes LGTM.

Given this fixes issues in bfe7b88 ("clustermesh: correctly remove remoteCache on connection disruption") from PR #23532 which was backported to 1.13 (and also affects 1.11 and 1.12) I wonder whether this PR should be labeled backported accordingly as well?

@giorio94
Copy link
Member Author

giorio94 commented Jun 8, 2023

Code changes LGTM.

Given this fixes issues in bfe7b88 ("clustermesh: correctly remove remoteCache on connection disruption") from PR #23532 which was backported to 1.13 (and also affects 1.11 and 1.12) I wonder whether this PR should be labeled backported accordingly as well?

I personally think that backporting is quite risky in this case, since the overall fix is a fairly large change which also depends on #25388, #25499 and #25675 (and likely more). I'd be more comfortable not marking it as needs-backport for the moment, and re-evaluate once all pieces are in.

@giorio94
Copy link
Member Author

giorio94 commented Jun 8, 2023

/ci-aks

@giorio94
Copy link
Member Author

giorio94 commented Jun 8, 2023

/test-1.26-net-next

Hit known flake #24514

Copy link
Member

@YutaroHayakawa YutaroHayakawa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍 Nice to see beautiful lines of mgr.Register on restartRemoteConnection.

Currently, the RemoteIdentityWatcher interface includes the Close()
method, which is never used, and shall never used in the clustermesh
context, since it would close the main allocator instance.
Hence, let's drop it.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
bfe7b88 ("clustermesh: correctly remove remoteCache on connection
disruption") modified the handling of remote identities upon kvstore
reconnection to fix a memory leak. Still, it didn't address the
possibility that identities may be deleted in the time window between
the termination of the previous connection and the establishment of a
new one. If this happens, the stale identities will be removed when the
map entry gets overwritten, but the corresponding deletion event will
never be propagated to the rest of the agent processing logic.

This commit fixes this issue draining all stale identities once the new
cache for remote identities has initialized correctly.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
bfe7b88 ("clustermesh: correctly remove remoteCache on
connection disruption") modified the handling of remote identities
when a remote cluster is disconnected, removing the corresponding
cache from the global allocator map. Still, it didn't emit a deletion
event to propagate this information to the rest of the agent processing
logic. This commit implements the missing logic to drain all remote
identities when a remote kvstore is unregistered.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
@giorio94 giorio94 force-pushed the mio/clustermesh-reconnect-deletion-events-identities branch from 886b27d to 9380a6c Compare June 9, 2023 16:08
@giorio94
Copy link
Member Author

giorio94 commented Jun 9, 2023

Rebased onto main to fix conflicts.

@giorio94
Copy link
Member Author

giorio94 commented Jun 9, 2023

/test

Job 'Cilium-PR-K8s-1.26-kernel-net-next' hit: #25958 (87.33% similarity)

@giorio94
Copy link
Member Author

/test-1.26-net-next

@giorio94
Copy link
Member Author

giorio94 commented Jun 12, 2023

/ci-ginkgo

Restarted to pick the fix for the matrix generation

@giorio94
Copy link
Member Author

/ci-aks

@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Jun 12, 2023
@giorio94
Copy link
Member Author

Conformance ginkgo failed with #25964 (I'm not rerunning it since it is failing almost consistently in all PRs, and the workflow is not marked as required at the moment). All other checks are green.

@tklauser tklauser merged commit 64f6cb0 into cilium:main Jun 13, 2023
63 of 64 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/clustermesh Relates to multi-cluster routing functionality in Cilium. kind/bug This is a bug in the Cilium logic. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-blocker/1.14 This issue will prevent the release of the next version of Cilium. release-note/bug This PR fixes an issue in a previous release of Cilium. sig/kvstore Impacts the KVStore package interactions.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants