bpf: correctly encapsulate pod to node traffic with kube-proxy+hostfw #30818
Conversation
/test
Force-pushed from 254d6d3 to 181e215
Force pushed to clarify the commit message around the possibility of traffic with destination identity matching a remote node needing to go through the tunnel if the destination is a CiliumInternalIP. Marking ready for review again.
/test
Force-pushed from 181e215 to 6a24a39
Force-pushed from 6a24a39 to 28c4a49
/test
Thanks for tackling this! It's been known broken since forever 😞
EDIT: I meant to comment instead of approving 🤷
LGTM!
Is there an end-to-end test coming to cover this?
Yep, I've opened cilium/cilium-cli#2306 to introduce basic host firewall connectivity tests as part of the Cilium CLI, and once that's merged I'll update the conformance-e2e workflow to also enable the host firewall for the KPR=disabled case (either as part of this PR or a separate one, as you prefer). Do you have opinions on backporting this change to stable branches? I've currently marked it for v1.15, but I wonder whether it could make sense to backport it to older ones as well.
I wouldn't backport any further for now, unless users ask for it. We can always backport later if needed and, in the meantime, we gain some confidence that everything works as expected.
I've opened #30914 to switch two existing Conformance E2E configurations (with KPR disabled) to also enable the host firewall. I couldn't add a commit to this PR as the branch is from a fork.
CI is green, reviews are in, marking ready to merge. |
When the host firewall is enabled in tunneling mode, pod to node traffic needs to be forwarded through the tunnel in order to preserve the security identity (as otherwise the source IP address would be SNATted), which is required to enforce ingress host policies.
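For background (not part of this change): with VXLAN, Cilium carries the source security identity across the tunnel, encoded in the 24-bit VNI field, which is why encapsulation preserves it while native routing plus SNAT would not. A rough sketch of that encoding, using an illustrative header struct rather than the actual datapath definition:

```c
#include <arpa/inet.h> /* htonl */
#include <stdint.h>

/* Illustrative VXLAN header layout; not Cilium's own definition. */
struct vxlan_hdr {
	uint32_t flags;        /* I flag set when a valid VNI is present */
	uint32_t vni_reserved; /* VNI in the upper 24 bits */
};

/* Encode the source security identity into the VNI, so that the remote
 * node can enforce ingress host policies on the decapsulated packet. */
static void set_identity_in_vni(struct vxlan_hdr *hdr, uint32_t sec_identity)
{
	hdr->vni_reserved = htonl((sec_identity & 0xFFFFFF) << 8);
}
```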
One tricky case is node (or hostns pod) to pod traffic via services with externalTrafficPolicy=Local, when KPR is disabled. In this case, the SYN packet is routed natively (as both the source and the destination are node IPs) to the destination node, and then DNATted to one of the backend IPs, without being SNATted at the same time. Yet, the SYN+ACK packet would then be incorrectly redirected through the tunnel (as the destination is a node IP, associated with a tunnel endpoint in the ipcache), breaking the connection, while it should instead be passed to the stack to be rev DNATted and then forwarded accordingly.
In detail, quoting the description from c8052a1, the broken packet path is node1 --VIP--> pod@node2 (VIP is node2IP):
- SYN leaves node1 via native device with node1IP -> VIP
- SYN is DNATed on node2 to node1IP -> podIP
- SYN is delivered to lxc device with node1IP -> podIP
- SYN+ACK is sent from lxc device with podIP -> node1IP
- SYN+ACK is redirected in BPF directly to cilium_vxlan
- SYN+ACK arrives on node1 via tunnel with podIP -> node1IP
- RST is sent because podIP doesn't match VIP
c8052a1 attempted to fix this issue for the kube-proxy+hostfw (and IPsec) scenarios by always passing the packets to the stack, so that they don't bypass conntrack. The IPsec-specific workaround was then removed in 0a8f2c4, as that path asymmetry is no longer present. However, always passing packets to the stack breaks host firewall policy enforcement for pod to node traffic: at that point there's no route redirecting these packets back to the tunnel to preserve the security identity, so they are simply masqueraded and routed natively.
To prevent this issue, let's pass packets to the stack only if they are a reply whose destination identity matches a remote node, as in that case they may need to be rev DNATted. There are two possibilities at that point: (a) the destination is a CiliumInternalIP address, and the reply needs to go through the tunnel -- node routes ensure that the packet is first forwarded to cilium_host, before being redirected through the tunnel; (b) the destination is one of the other node addresses, and the reply needs to be forwarded natively according to the local routing table (as node to pod/node traffic never goes through the tunnel unless the source is a CiliumInternalIP address).
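A minimal sketch of that gating condition, assuming simplified stand-ins for the datapath definitions (the enum, the reserved remote-node identity value, and the helper names below are illustrative, not the actual Cilium symbols):

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-ins for the Cilium datapath definitions. */
enum ct_status { CT_NEW, CT_ESTABLISHED, CT_REPLY };

#define REMOTE_NODE_ID 6 /* Cilium's reserved identity for remote nodes */

static bool identity_is_remote_node(uint32_t sec_identity)
{
	return sec_identity == REMOTE_NODE_ID;
}

/* Pass the packet to the host stack only when it is a reply whose
 * destination identity is a remote node, i.e. when it may still need
 * rev DNAT (the externalTrafficPolicy=Local case). Everything else
 * keeps the direct tunnel redirect, preserving the source security
 * identity for ingress host policy enforcement. */
static bool should_pass_to_stack(enum ct_status status, uint32_t dst_sec_identity)
{
	return status == CT_REPLY && identity_is_remote_node(dst_sec_identity);
}
```

Note that whether the reply then re-enters the tunnel (case (a), CiliumInternalIP) or is routed natively (case (b)) is decided by the host routing table once the packet is in the stack, not by this check.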
Overall, this change addresses the externalTrafficPolicy=Local service case, while still preserving encapsulation in all other cases. As a side effect, it also improves performance in the kube-proxy + hostfw case, as pod to pod traffic now also gets redirected immediately through the tunnel, instead of being sent via the stack.
Fixes: c8052a1 ("bpf: Do not bypass conntrack if running kube-proxy+hostfw or IPSec")
Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>