Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

egressgw: retry getIdentityLabels on failure #26457

Merged
merged 2 commits into from Jul 10, 2023
Merged

Conversation

jibi
Copy link
Member

@jibi jibi commented Jun 23, 2023

when the manager fails to resolve the identity of an endpoint to a set of labels, instead of discarding the endpoint update, enqueue the event and retry identity labels resolution following the same exponential backoff logic used by the identity allocator

testing

I tested this locally by forcing the identity resolution to randomly fail:

diff --git a/pkg/egressgateway/manager.go b/pkg/egressgateway/manager.go
index ed7f19399c..0b867b2d12 100644
--- a/pkg/egressgateway/manager.go
+++ b/pkg/egressgateway/manager.go
@@ -7,6 +7,7 @@ import (
        "container/heap"
        "context"
        "fmt"
+       "math/rand"
        "net"
        "sort"
        "time"
@@ -171,6 +172,10 @@ func NewEgressGatewayManager(p Params) *Manager {
 // getIdentityLabels waits for the global identities to be populated to the cache,
 // then looks up identity by ID from the cached identity allocator and return its labels.
 func (manager *Manager) getIdentityLabels(securityIdentity uint32) (labels.Labels, error) {
+       if rand.Intn(16) > 0 {
+               return nil, fmt.Errorf("fake error")
+       }
+
➜  cilium git:(pr/jibi/egressgw-retry-identity) ks logs cilium-2dc5v --timestamps | grep subsys=egress | grep coredns-5d78c9869d-44vlf
Defaulted container "cilium-agent" out of: cilium-agent, config (init), mount-cgroup (init), apply-sysctl-overwrites (init), mount-bpf-fs (init), clean-cilium-state (init), install-cni-binaries (init)
2023-06-23T10:22:36.698593375Z level=warning msg="Failed to get identity labels for endpoint, will retry." error="fake error" k8sEndpointName=coredns-5d78c9869d-44vlf k8sNamespace=kube-system subsys=egressgateway
2023-06-23T10:22:37.590584472Z level=warning msg="Failed to get identity labels for endpoint, will retry." error="fake error" k8sEndpointName=coredns-5d78c9869d-44vlf k8sNamespace=kube-system subsys=egressgateway
2023-06-23T10:22:42.630015783Z level=warning msg="Failed to get identity labels for endpoint, will retry." error="fake error" k8sEndpointName=coredns-5d78c9869d-44vlf k8sNamespace=kube-system subsys=egressgateway
2023-06-23T10:22:44.710650728Z level=warning msg="Failed to get identity labels for endpoint, will retry." error="fake error" k8sEndpointName=coredns-5d78c9869d-44vlf k8sNamespace=kube-system subsys=egressgateway
2023-06-23T10:22:46.871488814Z level=warning msg="Failed to get identity labels for endpoint, will retry." error="fake error" k8sEndpointName=coredns-5d78c9869d-44vlf k8sNamespace=kube-system subsys=egressgateway
2023-06-23T10:22:52.190670670Z level=warning msg="Failed to get identity labels for endpoint, will retry." error="fake error" k8sEndpointName=coredns-5d78c9869d-44vlf k8sNamespace=kube-system subsys=egressgateway
2023-06-23T10:22:55.113841789Z level=warning msg="Failed to get identity labels for endpoint, will retry." error="fake error" k8sEndpointName=coredns-5d78c9869d-44vlf k8sNamespace=kube-system subsys=egressgateway
2023-06-23T10:22:58.952487945Z level=debug msg="Added CiliumEndpoint" k8sEndpointName=coredns-5d78c9869d-44vlf k8sNamespace=kube-system subsys=egressgateway

Fixes: #26158 (hopefully 🤞)

@jibi jibi added feature/egress-gateway Impacts the egress IP gateway feature. sig/agent Cilium agent related. labels Jun 23, 2023
@jibi jibi requested review from a team as code owners June 23, 2023 10:45
@maintainer-s-little-helper maintainer-s-little-helper bot added dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. labels Jun 23, 2023
@jibi jibi added the release-note/bug This PR fixes an issue in a previous release of Cilium. label Jun 23, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Jun 23, 2023
@ti-mo
Copy link
Contributor

ti-mo commented Jun 23, 2023

I got pinged for TH review on pkg/backoff, could you assign a CODEOWNER to that package in this PR? 🙏

@jibi jibi requested a review from a team as a code owner June 23, 2023 12:43
@jibi jibi requested a review from bimmlerd June 23, 2023 12:43
Copy link
Member

@bimmlerd bimmlerd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CODEOWNERS change lgtm

pkg/backoff/backoff.go Outdated Show resolved Hide resolved
pkg/egressgateway/manager.go Show resolved Hide resolved
pkg/egressgateway/manager.go Outdated Show resolved Hide resolved
pkg/egressgateway/manager.go Outdated Show resolved Hide resolved
pkg/egressgateway/manager.go Outdated Show resolved Hide resolved
@jibi jibi marked this pull request as draft June 26, 2023 12:59
@jibi jibi force-pushed the pr/jibi/egressgw-retry-identity branch 5 times, most recently from 69ebbc3 to a0dd816 Compare June 30, 2023 08:33
@jibi
Copy link
Member Author

jibi commented Jun 30, 2023

Thanks for the pointers @giorio94 , I switched this PR to a workqueue.

The approach I'm taking is the follow: there's an endpointEventRetryQueue workqueue that simply tracks the ID of the CiliumEndpoint that should be processed/reprocessed, and then in a separate pendingEndpointEvents object we store the ID -> CiliumEndpoint mapping.

This ensures we can deal nicely with CiliumEndpoints that get updated/deleted while in the workqueue: whenever we receive a new event, any existing one for the same ID gets removed from the queue (as otherwise we could end up retrying processing old revisions of a given CiliumEndpoint).

tested with a random fake error:

2023-06-30T08:04:33.847665474Z level=debug msg="Added CiliumEgressGatewayPolicy" ciliumEgressGatewayPolicyName=egress subsys=egressgateway
2023-06-30T08:04:34.300917210Z level=warning msg="Failed to get identity labels for endpoint" error="fake error" k8sEndpointName=ubuntu-1 k8sNamespace=default subsys=egressgateway
2023-06-30T08:04:34.300955321Z level=warning msg="Failed to get identity labels for endpoint" error="fake error" k8sEndpointName=ubuntu-1 k8sNamespace=default subsys=egressgateway
2023-06-30T08:04:34.321451715Z level=warning msg="Failed to get identity labels for endpoint" error="fake error" k8sEndpointName=ubuntu-1 k8sNamespace=default subsys=egressgateway
2023-06-30T08:04:34.361744042Z level=warning msg="Failed to get identity labels for endpoint" error="fake error" k8sEndpointName=ubuntu-1 k8sNamespace=default subsys=egressgateway
2023-06-30T08:04:34.442374668Z level=warning msg="Failed to get identity labels for endpoint" error="fake error" k8sEndpointName=ubuntu-1 k8sNamespace=default subsys=egressgateway
2023-06-30T08:04:34.602854941Z level=warning msg="Failed to get identity labels for endpoint" error="fake error" k8sEndpointName=ubuntu-1 k8sNamespace=default subsys=egressgateway
2023-06-30T08:04:34.923616503Z level=warning msg="Failed to get identity labels for endpoint" error="fake error" k8sEndpointName=ubuntu-1 k8sNamespace=default subsys=egressgateway
2023-06-30T08:04:35.564425666Z level=warning msg="Failed to get identity labels for endpoint" error="fake error" k8sEndpointName=ubuntu-1 k8sNamespace=default subsys=egressgateway
2023-06-30T08:04:36.845466966Z level=warning msg="Failed to get identity labels for endpoint" error="fake error" k8sEndpointName=ubuntu-1 k8sNamespace=default subsys=egressgateway
2023-06-30T08:04:39.405585974Z level=debug msg="Added CiliumEndpoint" k8sEndpointName=ubuntu-1 k8sNamespace=default subsys=egressgateway
2023-06-30T08:04:39.405952893Z level=debug msg="Egress gateway policy applied" destinationCIDR=0.0.0.0/0 egressIP=0.0.0.0 gatewayIP=0.0.0.0 sourceIP=10.244.2.32 subsys=egressgateway
2023-06-30T08:04:39.405959833Z level=debug msg="Egress gateway policy applied" destinationCIDR=8.8.8.8/32 egressIP=0.0.0.0 gatewayIP=0.0.0.1 sourceIP=10.244.2.32 subsys=egressgateway

@jibi jibi marked this pull request as ready for review June 30, 2023 08:41
@jibi jibi requested review from giorio94 and removed request for a team June 30, 2023 08:41
@jibi jibi force-pushed the pr/jibi/egressgw-retry-identity branch from a0dd816 to f7f6218 Compare June 30, 2023 09:50
Copy link
Member

@giorio94 giorio94 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The workqueue implementation looks good to me.
I wonder whether it could make sense to leverage it also during insertion for increased uniformity. WDYT?

pkg/egressgateway/manager.go Outdated Show resolved Hide resolved
pkg/egressgateway/manager.go Outdated Show resolved Hide resolved
@jibi jibi force-pushed the pr/jibi/egressgw-retry-identity branch from f7f6218 to 777cb59 Compare July 6, 2023 12:18
@jibi jibi force-pushed the pr/jibi/egressgw-retry-identity branch from 777cb59 to e550bda Compare July 6, 2023 12:19
@jibi jibi requested a review from giorio94 July 6, 2023 12:20
@jibi jibi added the needs-backport/1.14 This PR / issue needs backporting to the v1.14 branch label Jul 6, 2023
Copy link
Member

@giorio94 giorio94 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great to me!

pkg/egressgateway/manager.go Outdated Show resolved Hide resolved
@jibi jibi force-pushed the pr/jibi/egressgw-retry-identity branch 2 times, most recently from 68793a4 to 654c0c0 Compare July 8, 2023 12:44
jibi added 2 commits July 8, 2023 14:47
when the manager fails to resolve the identity of an endpoint to a set
of labels, instead of discarding the endpoint update, enqueue the event
and retry identity labels resolution following the same exponential
backoff logic used by the identity allocator

Signed-off-by: Gilberto Bertin <jibi@cilium.io>
inside getEndpointMetadata as that's where the rest of the checks for
the received CiliumEndpoint event live

Signed-off-by: Gilberto Bertin <jibi@cilium.io>
@jibi jibi force-pushed the pr/jibi/egressgw-retry-identity branch from 654c0c0 to 9fa34ab Compare July 8, 2023 12:48
@jibi
Copy link
Member Author

jibi commented Jul 8, 2023

/test

@jibi jibi merged commit a1007ef into main Jul 10, 2023
179 checks passed
@jibi jibi deleted the pr/jibi/egressgw-retry-identity branch July 10, 2023 06:33
@jibi jibi mentioned this pull request Jul 10, 2023
19 tasks
@jibi jibi added backport-pending/1.14 The backport for Cilium 1.14.x for this PR is in progress. affects/v1.13 This issue affects v1.13 branch and removed needs-backport/1.14 This PR / issue needs backporting to the v1.14 branch labels Jul 10, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Backport pending to v1.14 in 1.14.0 Jul 11, 2023
@julianwiedmann julianwiedmann added backport-done/1.14 The backport for Cilium 1.14.x for this PR is done. and removed backport-pending/1.14 The backport for Cilium 1.14.x for this PR is in progress. labels Jul 11, 2023
@aanm aanm moved this from Backport pending to v1.14 to Backport done to v1.14 in 1.14.0 Jul 26, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
affects/v1.13 This issue affects v1.13 branch backport-done/1.14 The backport for Cilium 1.14.x for this PR is done. feature/egress-gateway Impacts the egress IP gateway feature. release-note/bug This PR fixes an issue in a previous release of Cilium. sig/agent Cilium agent related.
Projects
No open projects
1.14.0
Backport done to v1.14
Development

Successfully merging this pull request may close these issues.

CI: Conformance E2E - egress-gateway
5 participants