bpf: correctly encapsulate pod to node traffic with kube-proxy+hostfw #30818
Conversation
/test
Force-pushed from 254d6d3 to 181e215
Force pushed to clarify the commit message around the possibility of traffic with destination identity matching a remote node needing to go through the tunnel if the destination is a CiliumInternalIP. Marking ready for review again.
/test
Force-pushed from 181e215 to 6a24a39
Force-pushed from 6a24a39 to 28c4a49
/test
Thanks for tackling this! It's been known broken since forever 😞
EDIT: I meant to comment instead of approving 🤷
LGTM!
Is there an end-to-end test coming to cover this?
Yep, I've opened cilium/cilium-cli#2306 to introduce basic host firewall connectivity tests as part of the Cilium CLI, and once that's merged I'll update the conformance-e2e workflow to also enable the host firewall for the KPR=disabled case (either as part of this PR or a separate one, as you prefer). Do you have opinions on backporting this change to stable branches? I've currently marked it for v1.15, but I wonder whether it could make sense to backport it to older ones as well.
I wouldn't backport any further for now, unless users ask for it. We can always backport later if needed and, in the meantime, we gain some confidence that everything works as expected.
I've opened #30914 to switch two existing Conformance E2E configurations (with KPR disabled) to also enable the host firewall. I couldn't add a commit to this PR as the branch is from a fork.
CI is green, reviews are in, marking ready to merge. |
When the host firewall is enabled in tunneling mode, pod to node traffic needs to be forwarded through the tunnel in order to preserve the security identity (as otherwise the source IP address would be SNATted), which is required to enforce ingress host policies.
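For background (not part of this change): with VXLAN, Cilium carries the source security identity across the tunnel, encoded in the 24-bit VNI field, which is why encapsulation preserves it while native routing plus SNAT would not. A rough sketch of that encoding, using an illustrative header struct rather than the actual datapath definition:

```c
#include <arpa/inet.h> /* htonl */
#include <stdint.h>

/* Illustrative VXLAN header layout; not Cilium's own definition. */
struct vxlan_hdr {
	uint32_t flags;        /* I flag set when a valid VNI is present */
	uint32_t vni_reserved; /* VNI in the upper 24 bits */
};

/* Encode the source security identity into the VNI, so that the remote
 * node can enforce ingress host policies on the decapsulated packet. */
static void set_identity_in_vni(struct vxlan_hdr *hdr, uint32_t sec_identity)
{
	hdr->vni_reserved = htonl((sec_identity & 0xFFFFFF) << 8);
}
```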
One tricky case is node (or hostns pod) to pod traffic via services with externalTrafficPolicy=Local, when KPR is disabled. In this case, the SYN packet is routed natively (as both the source and the destination are node IPs) to the destination node, and then DNATted to one of the backend IPs, without being SNATted at the same time. Yet, the SYN+ACK packet would then be incorrectly redirected through the tunnel (as the destination is a node IP, associated with a tunnel endpoint in the ipcache), breaking the connection, while it should instead be passed to the stack to be rev DNATted and then forwarded accordingly.
In detail, quoting the description from c8052a1, the broken packet path is node1 --VIP--> pod@node2 (VIP is node2IP):
- SYN leaves node1 via native device with node1IP -> VIP
- SYN is DNATed on node2 to node1IP -> podIP
- SYN is delivered to lxc device with node1IP -> podIP
- SYN+ACK is sent from lxc device with podIP -> node1IP
- SYN+ACK is redirected in BPF directly to cilium_vxlan
- SYN+ACK arrives on node1 via tunnel with podIP -> node1IP
- RST is sent because podIP doesn't match VIP
c8052a1 attempted to fix this issue for the kube-proxy+hostfw (and IPsec) scenarios by always passing the packets to the stack, so that they don't bypass conntrack. The IPsec-specific workaround was then removed in 0a8f2c4, as that path asymmetry is no longer present. However, always passing packets to the stack breaks host firewall policy enforcement for pod to node traffic: at that point there's no route redirecting these packets back to the tunnel to preserve the security identity, so they are simply masqueraded and routed natively.
To prevent this issue, let's pass packets to the stack only if they are a reply whose destination identity matches a remote node, as in that case they may need to be rev DNATted. There are two possibilities at that point: (a) the destination is a CiliumInternalIP address, and the reply needs to go through the tunnel -- node routes ensure that the packet is first forwarded to cilium_host, before being redirected through the tunnel; (b) the destination is one of the other node addresses, and the reply needs to be forwarded natively according to the local routing table (as node to pod/node traffic never goes through the tunnel unless the source is a CiliumInternalIP address).
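A minimal sketch of that gating condition, assuming simplified stand-ins for the datapath definitions (the enum, the reserved remote-node identity value, and the helper names below are illustrative, not the actual Cilium symbols):

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-ins for the Cilium datapath definitions. */
enum ct_status { CT_NEW, CT_ESTABLISHED, CT_REPLY };

#define REMOTE_NODE_ID 6 /* Cilium's reserved identity for remote nodes */

static bool identity_is_remote_node(uint32_t sec_identity)
{
	return sec_identity == REMOTE_NODE_ID;
}

/* Pass the packet to the host stack only when it is a reply whose
 * destination identity is a remote node, i.e. when it may still need
 * rev DNAT (the externalTrafficPolicy=Local case). Everything else
 * keeps the direct tunnel redirect, preserving the source security
 * identity for ingress host policy enforcement. */
static bool should_pass_to_stack(enum ct_status status, uint32_t dst_sec_identity)
{
	return status == CT_REPLY && identity_is_remote_node(dst_sec_identity);
}
```

Note that whether the reply then re-enters the tunnel (case (a), CiliumInternalIP) or is routed natively (case (b)) is decided by the host routing table once the packet is in the stack, not by this check.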
Overall, this change addresses the externalTrafficPolicy=Local service case, while still preserving encapsulation in all other cases. As a side effect, it also improves performance in the kube-proxy + hostfw case, as pod to pod traffic now also gets redirected immediately through the tunnel, instead of being sent via the stack.
Fixes: c8052a1 ("bpf: Do not bypass conntrack if running kube-proxy+hostfw or IPSec")
Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>