datapath: Don't enforce ingress policies at overlay if endpoint routes are enabled #22333
Conversation
(force-pushed from ea66944 to 835ae0e)
(force-pushed from 4812c09 to f09c44a)
Does this change the situation as described below? (Lines 1859 to 1876 in b9f7292)

Thinking if we can move some of that logic into Lines 1010 to 1030 in b9f7292 (behind HAVE_ENCAP / IS_BPF_OVERLAY?).

I don't think it does. We still want to avoid bypassing the stack in case of kube-proxy, and we still need to mark the packet as
(force-pushed from f09c44a to 53dbf49)
If I understand the code/comment correctly, it's built for the case where So I think we could skip that.

No, it's orthogonal to the tail-call. It's for the case where we would use a bpf_redirect to jump directly to the destination, regardless of whether we enforced policies at the source or the destination. This pull request doesn't change whether or not we do a bpf_redirect; it only changes whether or not we enforce ingress policies at the source for the overlay->pod path.
Auto-acking for ci-structure, trivial changes.
I haven't checked out the PR, and tested yet. But IIRC, service rev translation happens in the overlay program, so with this fix, won't we end up enforcing an ingress policy on the translated address? Did you manually test this fix for pod -> svc ip (selected backend on remote node) + policy case with the tunneling and ep routes combination? (I see that tests have passed, but I'm not sure about the test coverage for this case.)
No. The reverse translation happens before the local delivery. See handle_ipv4 in bpf_overlay.c.
> But IIRC, service rev translation happens in the overlay program, so with this fix, won't we end up enforcing an ingress policy on the translated address?

No. The reverse translation happens before the local delivery. See handle_ipv4 in bpf_overlay.c.
👍 You are right. Any chance we can combine this with the per-endpoint check - https://github.com/cilium/cilium/blob/master/pkg/datapath/linux/config/config.go#L862-L862? This is needed in general whenever endpoint routes are enabled, no?
I'm not sure I'm following what you're proposing here 😕

Can we consolidate the per-endpoint and netdev template configs at a single place -
You already mentioned that the endpoint routes config is either enabled or disabled for all endpoints -

Just to summarize our offline conversation:
(force-pushed from 53dbf49 to 82789fe)
When endpoint routes are enabled, we should enforce ingress policies at the destination lxc interface, to which a BPF program will be attached. Nevertheless, today, for packets coming from the overlay, we enforce ingress policies twice: once at the e.g. cilium_vxlan interface and a second time at the lxc device. This is happening for two reasons:

1. bpf_overlay is not aware of the endpoint routes setting, so it doesn't even know that it's not responsible for enforcing ingress policies.
2. We have a flag to force the enforcement of ingress policies at the source in this case. This flag exists for historic reasons that are not valid anymore.

A separate patch will fix reason 2 above. This commit fixes reason 1 by telling bpf_overlay to *not* enforce ingress policies when endpoint routes are enabled.

Note that we do not support the case where some endpoints have endpoint routes enabled and others don't. If we did, additional logic would be required.

Fixes: 3179a47 ("datapath: Support enable-endpoint-routes with encapsulation")
Signed-off-by: Paul Chaignon <paul@cilium.io>
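The division of responsibility this commit establishes can be sketched as a small model in plain C. This is illustrative only — the function names are hypothetical, not Cilium's actual identifiers — but it captures the invariant: exactly one of bpf_overlay and the per-endpoint lxc program enforces ingress policy, depending on the endpoint routes setting.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model (not Cilium code): with endpoint routes enabled,
 * the destination lxc program owns ingress policy enforcement, so
 * bpf_overlay must skip it to avoid enforcing the policy twice. */
static bool overlay_enforces_ingress(bool endpoint_routes_enabled)
{
    /* After this commit: bpf_overlay enforces ingress policy only
     * when endpoint routes are disabled. */
    return !endpoint_routes_enabled;
}

static bool lxc_enforces_ingress(bool endpoint_routes_enabled)
{
    /* The per-endpoint (lxc) program enforces ingress policy exactly
     * when endpoint routes are enabled. */
    return endpoint_routes_enabled;
}
```

In both configurations, `overlay_enforces_ingress(x) != lxc_enforces_ingress(x)`, i.e. ingress policy is enforced exactly once on the overlay->pod path.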
The previous commit changed the packet handling on the overlay->lxc path to fix a bug. More precisely, when endpoint routes are enabled, we no longer enforce ingress policies on both the overlay and the lxc devices but only on the latter. However, as a consequence of that patch, we don't go through the policy-only program in bpf_lxc, and we therefore changed the way the packet is transmitted between overlay and lxc devices in some cases.

As a summary of the changes made in the previous patch, consider the following tables for the overlay -> lxc path.

Before the previous patch:

| Endpoint routes | Enforcement     | Path                 |
|-----------------|-----------------|----------------------|
| Enabled         | overlay AND lxc | bpf_redirect if KPR; |
|                 |                 | stack otherwise      |
| Disabled        | overlay         | bpf_redirect         |

Now:

| Endpoint routes | Enforcement | Path         |
|-----------------|-------------|--------------|
| Enabled         | lxc         | bpf_redirect |
| Disabled        | overlay     | bpf_redirect |

The previous patch intended to avoid the double policy enforcement, but it also changed the packet path in case endpoint routes are enabled. This patch fixes that by adding the same exception we have in bpf_lxc to the logic in l3.h. Hence, with the current patch, the table will look like:

| Endpoint routes | Enforcement | Path                 |
|-----------------|-------------|----------------------|
| Enabled         | lxc         | bpf_redirect if KPR; |
|                 |             | stack otherwise      |
| Disabled        | overlay     | bpf_redirect         |

I've kept this in a separate commit from the previous one in an attempt to split up the logic and more clearly show the deltas.

Signed-off-by: Paul Chaignon <paul@cilium.io>
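The final table above can be expressed as a tiny decision function. This is a sketch with hypothetical names, not the actual l3.h logic: it only models which delivery path the packet takes once this patch is applied.

```c
#include <assert.h>
#include <stdbool.h>

enum delivery_path { VIA_BPF_REDIRECT, VIA_STACK };

/* Illustrative model of the post-patch table (hypothetical names):
 * the overlay->lxc path falls back to the network stack only when
 * endpoint routes are enabled AND kube-proxy replacement (KPR) is
 * disabled, because kube-proxy may still need to reverse-translate
 * service replies. In every other case we bpf_redirect directly. */
static enum delivery_path overlay_to_lxc_path(bool endpoint_routes, bool kpr)
{
    if (endpoint_routes && !kpr)
        return VIA_STACK;
    return VIA_BPF_REDIRECT;
}
```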
(force-pushed from 82789fe to eccdf95)
Job 'Cilium-PR-K8s-1.26-kernel-net-next' failed.
…erlay

cilium#22333 fixed a bug for configs with tunnel-routing and per-EP routes. Here ingress policy was applied twice: first via tail-call, and then a second time by the to-container program as the packet traverses the veth pair. The fix was to avoid the tail-call, and only apply policy with the to-container program.

But the tail-call also contains a kube-proxy workaround (potential service replies need to pass through kube-proxy for RevDNAT, so the tail-call punts them to the stack instead of calling redirect_ep() to forward them straight to the endpoint). So we copied that workaround into the l3_local_delivery() path.

The tail-call is compiled as part of bpf_lxc, and thus couldn't easily tell if a packet was received from the tunnel. But as l3_local_delivery() is inlined into bpf_overlay, we can now limit the workaround to IS_BPF_OVERLAY. This ensures that the workaround is not applied to e.g. plain pod-to-pod traffic, where bpf_lxc also calls l3_local_delivery().

Fixes: 3d2ceaf ("bpf: Preserve overlay->lxc path with kube-proxy")
Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
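The scoping change described in this commit message can be sketched as follows. The names are hypothetical (this is not the actual l3_local_delivery() code): the point is that the kube-proxy RevDNAT punt-to-stack workaround only fires when the delivery code is inlined into bpf_overlay, never for plain pod-to-pod traffic delivered from bpf_lxc.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the workaround gating (not Cilium code):
 * punt a packet to the stack (instead of redirect_ep()) only when
 *  - the delivering program is bpf_overlay (IS_BPF_OVERLAY), and
 *  - endpoint routes are enabled, and
 *  - KPR is disabled, so kube-proxy may still need to RevDNAT
 *    a service reply. */
static bool punt_to_stack_for_kube_proxy(bool is_bpf_overlay,
                                         bool endpoint_routes,
                                         bool kpr)
{
    return is_bpf_overlay && endpoint_routes && !kpr;
}
```

With this gating, a bpf_lxc caller (`is_bpf_overlay == false`) always takes the direct redirect path, which is exactly the behavior the commit restores for pod-to-pod traffic.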
…erlay

[ upstream commit 334f7f0 ]

#22333 fixed a bug for configs with tunnel-routing and per-EP routes. Here ingress policy was applied twice: first via tail-call, and then a second time by the to-container program as the packet traverses the veth pair. The fix was to avoid the tail-call, and only apply policy with the to-container program.

But the tail-call also contains a kube-proxy workaround (potential service replies need to pass through kube-proxy for RevDNAT, so the tail-call punts them to the stack instead of calling redirect_ep() to forward them straight to the endpoint). So we copied that workaround into the l3_local_delivery() path.

The tail-call is compiled as part of bpf_lxc, and thus couldn't easily tell if a packet was received from the tunnel. But as l3_local_delivery() is inlined into bpf_overlay, we can now limit the workaround to IS_BPF_OVERLAY. This ensures that the workaround is not applied to e.g. plain pod-to-pod traffic, where bpf_lxc also calls l3_local_delivery().

Fixes: 3d2ceaf ("bpf: Preserve overlay->lxc path with kube-proxy")
Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
…erlay

[ upstream commit 334f7f0 ]

#22333 fixed a bug for configs with tunnel-routing and per-EP routes. Here ingress policy was applied twice: first via tail-call, and then a second time by the to-container program as the packet traverses the veth pair. The fix was to avoid the tail-call, and only apply policy with the to-container program.

But the tail-call also contains a kube-proxy workaround (potential service replies need to pass through kube-proxy for RevDNAT, so the tail-call punts them to the stack instead of calling redirect_ep() to forward them straight to the endpoint). So we copied that workaround into the l3_local_delivery() path.

The tail-call is compiled as part of bpf_lxc, and thus couldn't easily tell if a packet was received from the tunnel. But as l3_local_delivery() is inlined into bpf_overlay, we can now limit the workaround to IS_BPF_OVERLAY. This ensures that the workaround is not applied to e.g. plain pod-to-pod traffic, where bpf_lxc also calls l3_local_delivery().

Fixes: 3d2ceaf ("bpf: Preserve overlay->lxc path with kube-proxy")
Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
Signed-off-by: Gilberto Bertin <jibi@cilium.io>
This fixes bug #14657. See commits for details.
There's probably no point in backporting because to fully fix this issue we would also need to backport #22190.
Fixes: #13346.
Fixes: #14657.