egressgw: fix race with endpoint deletion #26901
With the current implementation of the workqueue used in the egressgw manager to handle CiliumEndpoint events and retries, it's possible to trigger a race where, by the time we process an add/update event, the related CiliumEndpoint has already been deleted from pendingEndpointEvents, resulting in the agent panicking due to a nil access.

This commit simplifies how the delete events are handled in order to eliminate the race.

Fixes: 9fb24de ("egressgw: retry getIdentityLabels on failure")
Co-authored-by: André Martins <andre@cilium.io>
Signed-off-by: Gilberto Bertin <jibi@cilium.io>
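The race described in this commit message can be illustrated with a minimal, hypothetical sketch. It is not the egressgw manager's actual code: the `manager` and `endpoint` types, the `pending` map, and the `onUpdate`, `onDelete`, and `process` methods are all names assumed for illustration. It only shows the general pattern of a queued key whose backing entry is removed before the worker runs, and one way to make that case harmless.

```go
package main

import (
	"fmt"
	"sync"
)

// endpoint is a hypothetical stand-in for a CiliumEndpoint.
type endpoint struct {
	ID     string
	Labels []string
}

// manager is a hypothetical, simplified event manager (not Cilium's type).
type manager struct {
	mu      sync.Mutex
	pending map[string]*endpoint // latest add/update event per endpoint ID
	known   map[string][]string  // state built from processed events
}

func newManager() *manager {
	return &manager{
		pending: map[string]*endpoint{},
		known:   map[string][]string{},
	}
}

// onUpdate records the latest object and returns the key a queue would carry.
func (m *manager) onUpdate(ep *endpoint) string {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.pending[ep.ID] = ep
	return ep.ID
}

// onDelete drops both the pending event and any processed state.
func (m *manager) onDelete(id string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	delete(m.pending, id)
	delete(m.known, id)
}

// process handles a queued key. It reports false when the endpoint was
// deleted between enqueueing and processing.
func (m *manager) process(id string) bool {
	m.mu.Lock()
	defer m.mu.Unlock()
	ep, ok := m.pending[id]
	if !ok {
		// Without this check ep would be nil here, and reading
		// ep.Labels below would panic with a nil pointer dereference.
		return false
	}
	m.known[id] = ep.Labels
	delete(m.pending, id)
	return true
}

func main() {
	m := newManager()
	key := m.onUpdate(&endpoint{ID: "ep-1", Labels: []string{"app=web"}})
	m.onDelete("ep-1")          // the delete arrives before the worker runs
	fmt.Println(m.process(key)) // false: the stale key is skipped safely
}
```

In this sketch the stale key is simply dropped. Whether the real fix guards the lookup, changes when entries are removed, or both is not stated in the commit message.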
After the switch to trigger-based reconciliation, the reconciliation can handle multiple types of events in one batch. Update the logic so that we don't end up calling updatePoliciesBySourceIP() twice when a batch contains both endpoint and policy events.

Co-authored-by: André Martins <andre@cilium.io>
Signed-off-by: Gilberto Bertin <jibi@cilium.io>
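The batching concern in this commit message can be sketched as follows. This is a hypothetical illustration, not the actual reconciler: the `reconciler` type, the `eventBitmask` flags, and the body of `updatePoliciesBySourceIP` (here just a call counter) are assumptions. Only the function name comes from the commit message. The idea shown is to collect pending event types in a bitmask and run the shared step once per batch.

```go
package main

import "fmt"

// eventBitmask records which kinds of events are pending (hypothetical).
type eventBitmask uint8

const (
	eventEndpoint eventBitmask = 1 << iota // an endpoint changed
	eventPolicy                            // a policy changed
)

// reconciler is a hypothetical, simplified reconciler.
type reconciler struct {
	pending eventBitmask
	updates int // number of times the shared step has run
}

// updatePoliciesBySourceIP stands in for the real function; here it only
// counts how often it is called.
func (r *reconciler) updatePoliciesBySourceIP() { r.updates++ }

// reconcile consumes the pending batch. The shared step is needed for both
// endpoint and policy events, so it runs once if either flag is set.
func (r *reconciler) reconcile() {
	batch := r.pending
	r.pending = 0
	if batch&(eventEndpoint|eventPolicy) != 0 {
		r.updatePoliciesBySourceIP()
	}
}

func main() {
	r := &reconciler{}
	r.pending |= eventEndpoint
	r.pending |= eventPolicy
	r.reconcile()
	fmt.Println(r.updates) // 1: one call for a batch with both event types
}
```

A naive version with one `if` per event type, each calling the shared step, would run it twice for the same batch.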
5ed7160 to 728e577
Tried pretty hard to find another race, but I think this is safe. My assumption is that OnUpdateEndpoint and OnDeleteEndpoint are not called concurrently (otherwise it's possible to reorder Add/Delete in such a way that the EP should be there, but isn't). I think that's an OK assumption though.
728e577 to 5ed7160
removed the explicit deletion of the event from |
/test |
marking this as ready even though the sig-datapath review is missing: