Fix missed deletion events when reconnecting to/disconnecting from remote clusters (identities) #25677
a32c53a to 055bd44
055bd44 to c1b23e2
/test
c1b23e2 to 886b27d
Rebased onto main to fix conflicts.
/test
Job 'Cilium-PR-K8s-1.26-kernel-net-next' failed.
Jenkins URL: https://jenkins.cilium.io/job/Cilium-PR-K8s-1.26-kernel-net-next/599/
If it is a flake and a GitHub issue doesn't already exist to track it, create one and upload the Jenkins artifacts to that issue.
I personally think that backporting is quite risky in this case, since the overall fix is a fairly large change which also depends on #25388, #25499 and #25675 (and likely more). I'd be more comfortable not marking it as needs-backport for the moment, and re-evaluating once all pieces are in.
/ci-aks
/test-1.26-net-next Hit known flake #24514
LGTM 👍 Nice to see beautiful lines of mgr.Register on restartRemoteConnection.
Currently, the RemoteIdentityWatcher interface includes the Close() method, which is never used, and shall never be used in the clustermesh context, since it would close the main allocator instance. Hence, let's drop it.
Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
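To illustrate the reasoning in this commit message, here is a minimal sketch of handing clustermesh a narrowed interface that cannot reach Close(). It is an assumption-laden simplification, not the code from this PR: `sharedAllocator`, `remoteIdentityWatcher`, `clusterMesh` and the single-string method signatures are hypothetical stand-ins.

```go
package main

// sharedAllocator is a hypothetical stand-in for the agent-wide identity
// allocator. It is NOT Cilium's actual type.
type sharedAllocator struct {
	closed  bool
	watched map[string]bool
}

func newSharedAllocator() *sharedAllocator {
	return &sharedAllocator{watched: make(map[string]bool)}
}

// WatchRemoteIdentities starts tracking identities of a remote cluster.
func (a *sharedAllocator) WatchRemoteIdentities(cluster string) {
	a.watched[cluster] = true
}

// RemoveRemoteIdentities stops tracking identities of a remote cluster.
func (a *sharedAllocator) RemoveRemoteIdentities(cluster string) {
	delete(a.watched, cluster)
}

// Close shuts down the whole allocator, local identities included.
func (a *sharedAllocator) Close() { a.closed = true }

// remoteIdentityWatcher is the narrowed view handed to clustermesh.
// It deliberately omits Close, so clustermesh code cannot shut down
// the shared allocator by accident (simplified, assumed signatures).
type remoteIdentityWatcher interface {
	WatchRemoteIdentities(cluster string)
	RemoveRemoteIdentities(cluster string)
}

// clusterMesh only ever sees the narrowed interface.
type clusterMesh struct{ identities remoteIdentityWatcher }

func (cm *clusterMesh) connect(cluster string) {
	cm.identities.WatchRemoteIdentities(cluster)
}

func (cm *clusterMesh) disconnect(cluster string) {
	cm.identities.RemoveRemoteIdentities(cluster)
}
```

With this shape, a call such as cm.identities.Close() does not compile, which is the property the commit is after.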
bfe7b88 ("clustermesh: correctly remove remoteCache on connection disruption") modified the handling of remote identities upon kvstore reconnection to fix a memory leak. Still, it didn't address the possibility that identities may be deleted in the time window between the termination of the previous connection and the establishment of a new one. If this happens, the stale identities will be removed when the map entry gets overwritten, but the corresponding deletion event will never be propagated to the rest of the agent processing logic. This commit fixes this issue by draining all stale identities once the new cache for remote identities has initialized correctly.
Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
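The reconnection case described in this commit message can be pictured as a set difference between the old and the new cache. The sketch below is a simplified illustration, not the code from this PR: `remoteCache`, `event`, `drainStale` and `reconnect` are hypothetical names, identities are reduced to integer IDs, and the new cache is assumed to be fully synchronized before it is passed in.

```go
package main

import "sort"

// remoteCache maps an identity ID to its labels (hypothetical, simplified).
type remoteCache map[int]string

// event is a simplified notification sent to the rest of the agent.
type event struct {
	deleted bool
	id      int
}

// drainStale emits a deletion event for every identity that was present in
// the old cache but is missing from the freshly synchronized one.
func drainStale(old, fresh remoteCache, emit func(event)) {
	stale := make([]int, 0, len(old))
	for id := range old {
		if _, ok := fresh[id]; !ok {
			stale = append(stale, id)
		}
	}
	sort.Ints(stale) // deterministic order, for readability only
	for _, id := range stale {
		emit(event{deleted: true, id: id})
	}
}

// reconnect swaps in the new cache for a cluster and only then drains the
// identities that disappeared while the connection was down. Without the
// drain step, overwriting the map entry would drop them silently.
func reconnect(caches map[string]remoteCache, cluster string, fresh remoteCache, emit func(event)) {
	old := caches[cluster]
	caches[cluster] = fresh
	drainStale(old, fresh, emit)
}
```

For example, if the old cache held identities 100, 101 and 102 and the new one holds 100, 102 and 103, only a deletion for 101 is emitted. The upsert for 103 would come from the normal watch path, which this sketch leaves out.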
bfe7b88 ("clustermesh: correctly remove remoteCache on connection disruption") modified the handling of remote identities when a remote cluster is disconnected, removing the corresponding cache from the global allocator map. Still, it didn't emit a deletion event to propagate this information to the rest of the agent processing logic. This commit implements the missing logic to drain all remote identities when a remote kvstore is unregistered.
Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
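The disconnection case is the simpler of the two: every identity known for the cluster is stale once the cluster is unregistered. A minimal sketch follows, assuming a hypothetical `allocator` type with a per-cluster map and an `onDelete` callback. None of these names are taken from the PR.

```go
package main

import (
	"sort"
	"sync"
)

// allocator is a hypothetical, heavily simplified global allocator that
// keeps one cache of remote identities (ID -> labels) per remote cluster.
type allocator struct {
	mu           sync.Mutex
	remoteCaches map[string]map[int]string
	// onDelete stands in for whatever propagates deletions to the agent.
	onDelete func(cluster string, id int)
}

// removeRemoteIdentities drops the cache of the given cluster and emits a
// deletion event for every identity it contained. Removing the map entry
// alone would free the memory but leave the rest of the agent unaware.
func (a *allocator) removeRemoteIdentities(cluster string) {
	a.mu.Lock()
	cache, ok := a.remoteCaches[cluster]
	delete(a.remoteCaches, cluster)
	a.mu.Unlock()

	if !ok {
		return // unknown or already removed cluster: nothing to drain
	}

	ids := make([]int, 0, len(cache))
	for id := range cache {
		ids = append(ids, id)
	}
	sort.Ints(ids) // deterministic order, for readability only
	for _, id := range ids {
		a.onDelete(cluster, id)
	}
}
```

Calling removeRemoteIdentities twice for the same cluster emits the deletions only once, since the second call finds no cache to drain. The callback runs outside the lock so that a slow consumer cannot block other allocator operations; that is a design choice of this sketch, not a statement about the actual implementation.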
886b27d to 9380a6c
Rebased onto main to fix conflicts.
/test Job 'Cilium-PR-K8s-1.26-kernel-net-next' hit: #25958 (87.33% similarity)
/test-1.26-net-next
/ci-ginkgo Restarted to pick up the fix for the matrix generation
/ci-aks
Conformance ginkgo failed with #25964 (I'm not rerunning it since it is failing almost consistently in all PRs, and the workflow is not marked as required at the moment). All other checks are green.
Follow-up to #25499, targeting the synchronization of identities. Please refer to that PR and to the commit descriptions for additional information.
Related: #24740
Related: #25499