
datapath: Don't enforce ingress policies at overlay if endpoint routes are enabled #22333

Merged
pchaigno merged 2 commits into cilium:master from fix-vxlan-with-ep-routes on Dec 14, 2022

Conversation

@pchaigno (Member) commented Nov 23, 2022

This fixes bug #14657. See commits for details.

There's probably no point in backporting because to fully fix this issue we would also need to backport #22190.

Fixes: #13346.
Fixes: #14657.

Release note: Fix bug that caused ingress policies to be enforced twice when running with tunneling and endpoint routes.

@pchaigno pchaigno added kind/bug This is a bug in the Cilium logic. sig/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages. release-note/bug This PR fixes an issue in a previous release of Cilium. labels Nov 23, 2022
@pchaigno pchaigno changed the title datapath: Don't enforce policies at overlay if ep routes are enabled datapath: Don't enforce policies at overlay if endpoint routes are enabled Nov 24, 2022
@pchaigno pchaigno changed the title datapath: Don't enforce policies at overlay if endpoint routes are enabled datapath: Don't enforce ingress policies at overlay if endpoint routes are enabled Nov 24, 2022
@pchaigno pchaigno force-pushed the fix-vxlan-with-ep-routes branch 3 times, most recently from 4812c09 to f09c44a on November 24, 2022 23:38
@pchaigno pchaigno marked this pull request as ready for review November 24, 2022 23:40
@pchaigno pchaigno requested review from a team as code owners November 24, 2022 23:40
@julianwiedmann (Member) commented:

Does this change the situation as described below?

cilium/bpf/bpf_lxc.c, lines 1859–1876 at b9f7292:

#if !defined(ENABLE_ROUTING) && defined(TUNNEL_MODE) && !defined(ENABLE_NODEPORT)
	/* In tunneling mode, we execute this code to send the packet from
	 * cilium_vxlan to lxc*. If we're using kube-proxy, we don't want to use
	 * redirect() because that would bypass conntrack and the reverse DNAT.
	 * Thus, we send packets to the stack, but since they have the wrong
	 * Ethernet addresses, we need to mark them as PACKET_HOST or the kernel
	 * will drop them.
	 * See #14646 for details.
	 */
	ctx_change_type(ctx, PACKET_HOST);
#else
	ifindex = ctx_load_meta(ctx, CB_IFINDEX);
	if (ifindex)
		return redirect_ep(ctx, ifindex, from_host);
#endif /* !ENABLE_ROUTING && TUNNEL_MODE && !ENABLE_NODEPORT */
	return CTX_ACT_OK;
}

Thinking if we can move some of that logic into

cilium/bpf/lib/common.h, lines 1010–1030 at b9f7292:

static __always_inline int redirect_ep(struct __ctx_buff *ctx __maybe_unused,
				       int ifindex __maybe_unused,
				       bool needs_backlog __maybe_unused)
{
	/* Going via CPU backlog queue (aka needs_backlog) is required
	 * whenever we cannot do a fast ingress -> ingress switch but
	 * instead need an ingress -> egress netns traversal or vice
	 * versa.
	 */
	if (needs_backlog || !is_defined(ENABLE_HOST_ROUTING)) {
		return ctx_redirect(ctx, ifindex, 0);
	} else {
# ifdef HAVE_ENCAP
		/* When coming from overlay, we need to set packet type
		 * to HOST as otherwise we might get dropped in IP layer.
		 */
		ctx_change_type(ctx, PACKET_HOST);
# endif /* HAVE_ENCAP */
		return ctx_redirect_peer(ctx, ifindex, 0);
	}
}
now? (And also do a s/HAVE_ENCAP/IS_BPF_OVERLAY/?)
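For illustration, a rough sketch of the consolidation being floated here — hypothetical, not the merged code. It folds the kube-proxy workaround from the bpf_lxc.c snippet above into redirect_ep() and applies the suggested s/HAVE_ENCAP/IS_BPF_OVERLAY/; the exact preprocessor guard is an assumption:

static __always_inline int redirect_ep(struct __ctx_buff *ctx,
				       int ifindex, bool needs_backlog)
{
#if defined(IS_BPF_OVERLAY) && !defined(ENABLE_NODEPORT)
	/* With kube-proxy (no BPF NodePort), service replies must traverse
	 * the stack for reverse DNAT, so don't bypass it with a redirect.
	 * Mark the packet as PACKET_HOST or the kernel drops it due to the
	 * wrong Ethernet addresses.
	 */
	ctx_change_type(ctx, PACKET_HOST);
	return CTX_ACT_OK;
#else
	if (needs_backlog || !is_defined(ENABLE_HOST_ROUTING))
		return ctx_redirect(ctx, ifindex, 0);

# ifdef IS_BPF_OVERLAY
	/* Coming from overlay: set packet type to HOST, otherwise the
	 * packet might get dropped in the IP layer.
	 */
	ctx_change_type(ctx, PACKET_HOST);
# endif
	return ctx_redirect_peer(ctx, ifindex, 0);
#endif
}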

@pchaigno (Member, Author) commented:

> Does this change the situation as described below?

I don't think it does. We still want to avoid bypassing the stack in case of kube-proxy and we still need to mark the packet as PACKET_HOST.

@julianwiedmann (Member) commented Dec 1, 2022

> > Does this change the situation as described below?
>
> I don't think it does. We still want to avoid bypassing the stack in case of kube-proxy and we still need to mark the packet as PACKET_HOST.

If I understand the code/comment correctly, it's built for the case where from-overlay would tail-call into ipv4_policy() (with per-EP routes enabled), so that we could return CTX_ACT_OK and pass the packet to the host stack. With your fixes (and in the default config) that's no longer the case: we redirect from from-overlay to the endpoint and run ipv4_policy() in the attached BPF program.

So I think we could skip that ctx_change_type(ctx, PACKET_HOST) in a few more scenarios. Or at least update the comment to reflect that we typically don't hand over to the host stack afterwards.

@pchaigno (Member, Author) commented Dec 1, 2022

> If I understand the code/comment correctly, it's built for the case where from-overlay would tail-call into ipv4_policy() (with per-EP routes enabled).

No, it's orthogonal to the tail call. It's for the case where we would use a bpf_redirect to jump directly to the destination, regardless of whether we enforce policies at the source or the destination. This pull request doesn't change whether or not we do a bpf_redirect; it only changes whether or not we enforce ingress policies at the source for the overlay->pod path.

@nbusseneau (Member) left a comment:

Auto-acking for ci-structure, trivial changes.

@aditighag (Member) left a comment:

I haven't checked out the PR and tested yet, but IIRC service rev translation happens in the overlay program, so with this fix, won't we end up enforcing an ingress policy on the translated address? Did you manually test this fix for the pod -> svc ip (backend selected on a remote node) + policy case with the tunneling and endpoint routes combination? (I see that tests have passed, but I'm not sure about the test coverage for this case.)

@pchaigno (Member, Author) commented Dec 1, 2022

> But IIRC, service rev translation happens in the overlay program, so with this fix, won't we end up enforcing an ingress policy on the translated address?

No. The reverse translation happens before the local delivery. See handle_ipv4 in bpf_overlay.c.

@aditighag (Member) left a comment:

> > But IIRC, service rev translation happens in the overlay program, so with this fix, won't we end up enforcing an ingress policy on the translated address?
>
> No. The reverse translation happens before the local delivery. See handle_ipv4 in bpf_overlay.c.

👍 You are right. Any chance we can combine this with the per-endpoint check (https://github.com/cilium/cilium/blob/master/pkg/datapath/linux/config/config.go#L862-L862)? This is needed in general whenever endpoint routes are enabled, no?

@pchaigno (Member, Author) commented Dec 1, 2022

> Any chance we can combine this with the per-endpoint check (https://github.com/cilium/cilium/blob/master/pkg/datapath/linux/config/config.go#L862-L862)? This is needed in general whenever endpoint routes are enabled, no?

I'm not sure I'm following what you're proposing here 😕

@aditighag (Member) commented:

> > Any chance we can combine this with the per-endpoint check (https://github.com/cilium/cilium/blob/master/pkg/datapath/linux/config/config.go#L862-L862)? This is needed in general whenever endpoint routes are enabled, no?
>
> I'm not sure I'm following what you're proposing here 😕

Can we consolidate the per-endpoint and netdev template configs in a single place?

if e.RequireEgressProg() {
	fmt.Fprintf(fw, "#define USE_BPF_PROG_FOR_INGRESS_POLICY 1\n")
}

You already mentioned that the endpoint routes config is either enabled or disabled for all endpoints:

> Note that we do not support the case where some endpoints have endpoint routes enabled and others don't. If we did, additional logic would be required.

@julianwiedmann (Member) commented:

> > If I understand the code/comment correctly, it's built for the case where from-overlay would tail-call into ipv4_policy() (with per-EP routes enabled).
>
> No, it's orthogonal to the tail call. It's for the case where we would use a bpf_redirect to jump directly to the destination, regardless of whether we enforce policies at the source or the destination. This pull request doesn't change whether or not we do a bpf_redirect; it only changes whether or not we enforce ingress policies at the source for the overlay->pod path.

Just to summarize our offline conversation:

  • Currently, from-overlay tail-calls into bpf_lxc's policy code, which can then decide whether to pass the packet to the stack (return CTX_ACT_OK) or bpf_redirect() to the pod's interface.
  • With this PR, from-overlay redirects to the pod's interface, and the egress prog runs the same policy code. If the egress prog now returns CTX_ACT_OK, the packet will not go to the stack.
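
A tiny stand-alone model of that summary in plain C — illustrative only, not Cilium datapath code; the function and parameter names here are made up for the illustration:

#include <stdbool.h>
#include <stdio.h>

/* overlay_to_pod() models the two bullets above: who runs the policy code
 * for overlay -> pod traffic with per-EP routes, and where a CTX_ACT_OK
 * verdict sends the packet.
 */
static void overlay_to_pod(bool pr_applied, bool policy_redirects)
{
	if (!pr_applied) {
		/* Before: from-overlay tail-calls into bpf_lxc's policy
		 * code, which either redirects or punts to the stack.
		 */
		puts(policy_redirects
		     ? "policy tail-call: bpf_redirect() to pod interface"
		     : "policy tail-call: CTX_ACT_OK -> host stack");
	} else {
		/* After: from-overlay redirects first; the lxc egress
		 * program runs the same policy code, and its CTX_ACT_OK no
		 * longer hands the packet to the host stack.
		 */
		puts("redirect to pod; policy in egress prog, no stack handoff");
	}
}

int main(void)
{
	overlay_to_pod(false, false);
	overlay_to_pod(false, true);
	overlay_to_pod(true, false);
	return 0;
}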

@pchaigno pchaigno added the release-blocker/1.13 This issue will prevent the release of the next version of Cilium. label Dec 2, 2022
@aditighag aditighag self-requested a review December 2, 2022 15:09
@aanm aanm added the needs-backport/1.13 This PR / issue needs backporting to the v1.13 branch label Dec 9, 2022
@pchaigno pchaigno requested a review from a team as a code owner December 14, 2022 16:01
When endpoint routes are enabled, we should enforce ingress policies at
the destination lxc interface, to which a BPF program will be attached.
Today, however, for packets coming from the overlay, we enforce ingress
policies twice: once at the overlay interface (e.g., cilium_vxlan) and a
second time at the lxc device.

This is happening for two reasons:
  1. bpf_overlay is not aware of the endpoint routes setting, so it
     doesn't even know that it's not responsible for enforcing ingress
     policies.
  2. We have a flag to force the enforcement of ingress policies at the
     source in this case. This flag exists for historic reasons that are
     no longer valid.

A separate patch will fix reason 2 above. This commit fixes reason 1 by
telling bpf_overlay to *not* enforce ingress policies when endpoint
routes are enabled.

Note that we do not support the case where some endpoints have endpoint
routes enabled and others don't. If we did, additional logic would be
required.

Fixes: 3179a47 ("datapath: Support enable-endpoint-routes with encapsulation")
Signed-off-by: Paul Chaignon <paul@cilium.io>
The previous commit changed the packet handling on the overlay->lxc
path to fix a bug. More precisely, when endpoint routes are enabled, we
no longer enforce ingress policies on both the overlay and the lxc
devices, but only on the latter.

However, as a consequence of that patch, we don't go through the
policy-only program in bpf_lxc, and we therefore changed the way the
packet is transmitted between overlay and lxc devices in some cases. As
a summary of the changes made in the previous patch, consider the
following table for the path overlay -> lxc.

Before the previous patch:
| Endpoint routes | Enforcement     | Path                 |
|-----------------|-----------------|----------------------|
| Enabled         | overlay AND lxc | bpf_redirect if KPR; |
|                 |                 | stack otherwise      |
| Disabled        | overlay         | bpf_redirect         |

Now:
| Endpoint routes | Enforcement | Path         |
|-----------------|-------------|--------------|
| Enabled         | lxc         | bpf_redirect |
| Disabled        | overlay     | bpf_redirect |

The previous patch intended to fix the enforcement to avoid the double
policy enforcement, but it also changed the packet path in case endpoint
routes are enabled.

This patch fixes that by adding the same exception we have in bpf_lxc
to the l3.h logic. Hence, with the current patch, the table looks like:
| Endpoint routes | Enforcement | Path                 |
|-----------------|-------------|----------------------|
| Enabled         | lxc         | bpf_redirect if KPR; |
|                 |             | stack otherwise      |
| Disabled        | overlay     | bpf_redirect         |

I've kept this in a separate commit from the previous one in an attempt
to split up the logic and more clearly show the deltas.

Signed-off-by: Paul Chaignon <paul@cilium.io>
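
For reference, a hedged sketch of the l3.h exception this commit describes: the preprocessor guard is copied from the bpf_lxc.c snippet quoted earlier in the thread, while the function shape and names here are assumptions rather than the merged diff.

static __always_inline int l3_local_delivery_tail(struct __ctx_buff *ctx,
						  int ifindex, bool from_host)
{
#if !defined(ENABLE_ROUTING) && defined(TUNNEL_MODE) && !defined(ENABLE_NODEPORT)
	/* Same kube-proxy workaround as in bpf_lxc.c: potential service
	 * replies must pass through the stack for reverse DNAT, so mark
	 * the packet as PACKET_HOST and return CTX_ACT_OK instead of
	 * redirecting straight to the endpoint.
	 */
	ctx_change_type(ctx, PACKET_HOST);
	return CTX_ACT_OK;
#else
	/* Default: jump straight to the endpoint's device. */
	return redirect_ep(ctx, ifindex, from_host);
#endif
}

This restores the "bpf_redirect if KPR; stack otherwise" row of the table above for the endpoint-routes case.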
@maintainer-s-little-helper commented:

Job 'Cilium-PR-K8s-1.26-kernel-net-next' failed:

Test Name: K8sDatapathConfig AutoDirectNodeRoutes Check direct connectivity with per endpoint routes

Failure Output: FAIL: Found 1 k8s-app=cilium logs matching list of errors that must be investigated:

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-1.26-kernel-net-next so I can create one.

@pchaigno (Member, Author) commented:

ConformanceAKS failed with known flake #22162. k8s-1.26-kernel-net-next failed with known flake #22601.

@pchaigno pchaigno merged commit 3d2ceaf into cilium:master Dec 14, 2022
@pchaigno pchaigno deleted the fix-vxlan-with-ep-routes branch December 14, 2022 18:46
@joestringer joestringer added backport-pending/1.13 The backport for Cilium 1.13.x for this PR is in progress. and removed needs-backport/1.13 This PR / issue needs backporting to the v1.13 branch labels Dec 21, 2022
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Backport pending to v1.13 in 1.13.0-rc4 Dec 21, 2022
@joestringer joestringer added backport-done/1.13 The backport for Cilium 1.13.x for this PR is done. and removed backport-pending/1.13 The backport for Cilium 1.13.x for this PR is in progress. labels Dec 22, 2022
@joestringer joestringer moved this from Backport pending to v1.13 to Backport done to v1.10 in 1.13.0-rc4 Dec 22, 2022
julianwiedmann added a commit to julianwiedmann/cilium that referenced this pull request Sep 4, 2023
…erlay

cilium#22333 fixed a bug for configs with
tunnel-routing and per-EP routes. Here ingress policy was applied twice:
first via tail-call, and then a second time by the to-container program as
the packet traverses the veth pair.

The fix was to avoid the tail-call, and only apply policy with the
to-container program. But the tail-call also contains a kube-proxy
workaround (potential service replies need to pass through kube-proxy for
RevDNAT, so the tail-call punts them to the stack instead of calling
redirect_ep() to forward them straight to the endpoint). So we copied that
workaround into the l3_local_delivery() path.

The tail-call is compiled as part of bpf_lxc, and thus couldn't easily tell
if a packet was received from the tunnel. But as l3_local_delivery() is
inlined into bpf_overlay, we can now limit the workaround to
IS_BPF_OVERLAY. This ensures that the workaround is not applied to e.g.
plain pod-to-pod traffic, where bpf_lxc also calls l3_local_delivery().

Fixes: 3d2ceaf ("bpf: Preserve overlay->lxc path with kube-proxy")
Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
youngnick pushed a commit that referenced this pull request Sep 5, 2023 (same commit message as above)
gandro pushed a commit that referenced this pull request Sep 12, 2023 (backport of the commit above, "[ upstream commit 334f7f0 ]", additionally Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>)
gandro pushed a commit that referenced this pull request Sep 25, 2023 (same backported commit message as above)
jibi pushed a commit that referenced this pull request Jan 10, 2024 (same backported commit message, additionally Signed-off-by: Gilberto Bertin <jibi@cilium.io>)
dylandreimerink pushed a commit that referenced this pull request Jan 15, 2024 (same backported commit message as above)
Labels
backport-done/1.13 The backport for Cilium 1.13.x for this PR is done.
kind/bug This is a bug in the Cilium logic.
release-blocker/1.13 This issue will prevent the release of the next version of Cilium.
release-note/bug This PR fixes an issue in a previous release of Cilium.
sig/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages.
Projects
No open projects
1.13.0-rc4
Backport done to v1.13
Development

Successfully merging this pull request may close these issues.

Packets to pods processed twice with per-endpoint routes + VXLAN
7 participants