Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

egressgw: fix race with endpoint deletion #26901

Merged
merged 2 commits into from Jul 20, 2023

Conversation

jibi
Copy link
Member

@jibi jibi commented Jul 18, 2023

with the current implementation of the workqueue used in the egressgw
manager to handle CiliumEndpoint events and retries it's possible to
trigger a race where by the time we process an add/update event, the
related CiliumEndpoint has already been deleted from the
pendingEndpointEvents, resulting in the agent panicing due to a nil
access.

This commit simplifies how the delete events are handled in order to
eliminate the race.

Fixes: 9fb24de ("egressgw: retry getIdentityLabels on failure")

@jibi jibi added kind/bug This is a bug in the Cilium logic. release-note/bug This PR fixes an issue in a previous release of Cilium. feature/egress-gateway Impacts the egress IP gateway feature. sig/agent Cilium agent related. labels Jul 18, 2023
jibi and others added 2 commits July 19, 2023 14:26
with the current implementation of the workqueue used in the egressgw
manager to handle CiliumEndpoint events and retries it's possible to
trigger a race where by the time we process an add/update event, the
related CiliumEndpoint has already been deleted from the
pendingEndpointEvents, resulting in the agent panicing due to a nil
access.

This commit simplifies how the delete events are handled in order to
eliminate the race.

Fixes: 9fb24de ("egressgw: retry getIdentityLabels on failure")

Co-authored-by: André Martins <andre@cilium.io>
Signed-off-by: Gilberto Bertin <jibi@cilium.io>
after the switch to the trigger based reconciliation, the reconciliation
can handle multiple type of events in one batch.

Update the logic so that we don't end up calling
updatePoliciesBySourceIP() twice in case we are processing a batch for
both endpoint and policy events.

Co-authored-by: André Martins <andre@cilium.io>
Signed-off-by: Gilberto Bertin <jibi@cilium.io>
@jibi jibi force-pushed the pr/jibi/fix-egressgw-async-eps branch 2 times, most recently from 5ed7160 to 728e577 Compare July 19, 2023 14:14
@jibi jibi requested review from aanm and bimmlerd July 19, 2023 14:40
@jibi jibi marked this pull request as ready for review July 19, 2023 14:40
@jibi jibi requested a review from a team as a code owner July 19, 2023 14:40
@jibi jibi requested a review from ldelossa July 19, 2023 14:40
Copy link
Member

@bimmlerd bimmlerd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tried pretty hard to find another race, but I think this is safe - my assumption is that OnUpdateEndpoint and OnDeleteEndpoint are not called concurrently (otherwise it's possible to reorder Add/Delete in such a way that the EP should be there, but isn't). I think that's an ok assumption though.

@jibi jibi force-pushed the pr/jibi/fix-egressgw-async-eps branch from 728e577 to 5ed7160 Compare July 19, 2023 18:49
@jibi
Copy link
Member Author

jibi commented Jul 19, 2023

removed the explicit deletion of the event from pendingEndpointEvents https://github.com/cilium/cilium/blob/728e5777b804d268043b58a719a80380870c367d/pkg/egressgateway/manager.go#L333-L340 as discussed offline

@jibi
Copy link
Member Author

jibi commented Jul 20, 2023

/test

@jibi jibi added affects/v1.12 This issue affects v1.12 branch affects/v1.13 This issue affects v1.13 branch needs-backport/1.14 This PR / issue needs backporting to the v1.14 branch labels Jul 20, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from main in 1.14.0 Jul 20, 2023
@jibi jibi added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Jul 20, 2023
@jibi
Copy link
Member Author

jibi commented Jul 20, 2023

marking this as ready even though the sig-datapath review is missing:

@jibi jibi removed the request for review from ldelossa July 20, 2023 14:49
@aditighag aditighag merged commit 72b560b into main Jul 20, 2023
257 of 344 checks passed
@aditighag aditighag deleted the pr/jibi/fix-egressgw-async-eps branch July 20, 2023 17:58
@nbusseneau nbusseneau mentioned this pull request Jul 24, 2023
21 tasks
@nbusseneau nbusseneau added the backport-pending/1.14 The backport for Cilium 1.14.x for this PR is in progress. label Jul 24, 2023
@nbusseneau nbusseneau removed the needs-backport/1.14 This PR / issue needs backporting to the v1.14 branch label Jul 24, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from main to Backport pending to v1.14 in 1.14.0 Jul 24, 2023
@nbusseneau nbusseneau added backport-done/1.14 The backport for Cilium 1.14.x for this PR is done. and removed backport-pending/1.14 The backport for Cilium 1.14.x for this PR is in progress. labels Jul 25, 2023
@aanm aanm moved this from Backport pending to v1.14 to Backport done to v1.14 in 1.14.0 Jul 26, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
affects/v1.12 This issue affects v1.12 branch affects/v1.13 This issue affects v1.13 branch backport-done/1.14 The backport for Cilium 1.14.x for this PR is done. feature/egress-gateway Impacts the egress IP gateway feature. kind/bug This is a bug in the Cilium logic. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/bug This PR fixes an issue in a previous release of Cilium. sig/agent Cilium agent related.
Projects
No open projects
1.14.0
Backport done to v1.14
Development

Successfully merging this pull request may close these issues.

None yet

5 participants