
bpf: correctly encapsulate pod to node traffic with kube-proxy+hostfw #30818

Merged
merged 1 commit into cilium:main on Feb 26, 2024

Conversation

giorio94
Member

@giorio94 giorio94 commented Feb 16, 2024

When the host firewall is enabled in tunneling mode, pod to node traffic needs to be forwarded through the tunnel in order to preserve the source security identity, which is required to enforce ingress host policies; otherwise the source IP address would be SNATted and the identity lost.
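
For context, here is a minimal, hedged sketch of how encapsulation can preserve the identity. It is not the actual Cilium datapath code: it only assumes that the source security identity is carried in the VXLAN tunnel ID (VNI), which is how Cilium's tunnel mode propagates identities between nodes; the helper name and parameters are illustrative.

```c
/* Illustrative sketch only -- not bpf_lxc.c. Assumption: the source security
 * identity travels in the VXLAN tunnel ID (VNI), so the receiving node can
 * enforce ingress host policies. If the packet were routed natively instead,
 * it would be SNATted to the node IP and the pod's identity would be lost.
 */
#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>

static __always_inline int
encap_with_identity_sketch(struct __sk_buff *ctx, __u32 tunnel_endpoint_ipv4,
			   __u32 src_sec_identity, int tunnel_ifindex)
{
	struct bpf_tunnel_key key = {};

	key.tunnel_id = src_sec_identity;        /* identity encoded in the VNI */
	key.remote_ipv4 = tunnel_endpoint_ipv4;  /* remote node's tunnel endpoint */
	key.tunnel_ttl = 64;

	if (bpf_skb_set_tunnel_key(ctx, &key, sizeof(key), 0) < 0)
		return TC_ACT_SHOT;

	/* Hand the packet to the tunnel device (e.g. cilium_vxlan). */
	return bpf_redirect(tunnel_ifindex, 0);
}
```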

One tricky case is node (or hostns pod) to pod traffic via services with externalTrafficPolicy=Local when KPR is disabled. In this case, the SYN packet is routed natively to the destination node (as both the source and the destination are node IPs) and then DNATted to one of the backend IPs, without being SNATted at the same time. The SYN+ACK packet, however, would then be incorrectly redirected through the tunnel (as the destination is a node IP, associated with a tunnel endpoint in the ipcache), breaking the connection; it should instead be passed to the stack to be rev DNATted and then forwarded accordingly.

In detail, quoting the description from c8052a1, the broken packet path is node1 --VIP--> pod@node2 (where VIP is node2IP); the sketch after the list illustrates the redirect decision that breaks it:

  • SYN leaves node1 via native device with node1IP -> VIP
  • SYN is DNATed on node2 to node1IP -> podIP
  • SYN is delivered to lxc device with node1IP -> podIP
  • SYN+ACK is sent from lxc device with podIP -> node1IP
  • SYN+ACK is redirected in BPF directly to cilium_vxlan
  • SYN+ACK arrives on node1 via tunnel with podIP -> node1IP
  • RST is sent because podIP doesn't match VIP
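
To make the asymmetry concrete, here is a hedged sketch of the kind of egress decision that produces the broken SYN+ACK path above. All names are hypothetical and it reuses encap_with_identity_sketch() from the previous sketch; it is not the actual bpf_lxc.c logic.

```c
/* Illustrative sketch only (builds on the previous one); not bpf_lxc.c.
 * The destination of a packet leaving the pod is looked up in the ipcache;
 * a remote-node entry with a known tunnel endpoint triggers a direct
 * redirect to the tunnel device. For the SYN+ACK above (podIP -> node1IP)
 * this bypasses the host stack, so the reverse DNAT (podIP -> VIP) never
 * happens and node1 resets the connection.
 */
struct ipcache_entry_sketch {
	__u32 sec_identity;     /* e.g. the reserved remote-node identity */
	__u32 tunnel_endpoint;  /* 0 if no tunnel endpoint is known */
};

static __always_inline int
pod_egress_broken_sketch(struct __sk_buff *ctx,
			 const struct ipcache_entry_sketch *dst_info,
			 __u32 src_sec_identity, int tunnel_ifindex)
{
	if (dst_info->tunnel_endpoint)
		/* The behaviour c8052a1 addressed: always encapsulate when the
		 * destination has a tunnel endpoint, even for replies that
		 * still need reverse DNAT in the host stack. */
		return encap_with_identity_sketch(ctx, dst_info->tunnel_endpoint,
						  src_sec_identity, tunnel_ifindex);

	return TC_ACT_OK; /* pass to the stack, native routing applies */
}
```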

c8052a1 attempted to fix this issue for the kube-proxy+hostfw (and IPSec) scenarios by always passing the packets to the stack, so that they don't bypass conntrack. The IPSec-specific workaround was later removed in 0a8f2c4, as that path asymmetry is no longer present. However, always passing packets to the stack breaks host firewall policy enforcement for pod to node traffic: at that point there is no route that redirects these packets back to the tunnel to preserve the security identity, so they are simply masqueraded and routed natively.

To prevent this issue, let's pass packets to the stack only if they are a reply with destination identity matching a remote node, as in that case they may need to be rev DNATted. There are two possibilities at that point: (a) the destination is a CiliumInternalIP address, and the reply needs to go through the tunnel -- node routes ensure that the packet is first forwarded to cilium_host, before being redirected through the tunnel; (b) the destination is one of the other node addresses, and the reply needs to be forwarded natively according to the local routing table (as node to pod/node traffic never goes through the tunnel unless the source is a CiliumInternalIP address).
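
A hedged sketch of the decision described above, continuing the previous sketches with hypothetical names: the real check in bpf_lxc.c uses Cilium's own conntrack and identity helpers, and the remote-node identity value used here is an assumption, not something stated in this PR.

```c
/* Illustrative sketch only; not the actual bpf_lxc.c code.
 * Assumption: 6 is Cilium's reserved "remote-node" security identity. */
#include <stdbool.h>

#define REMOTE_NODE_IDENTITY_SKETCH 6

static __always_inline int
pod_egress_fixed_sketch(struct __sk_buff *ctx, bool is_reply,
			const struct ipcache_entry_sketch *dst_info,
			__u32 src_sec_identity, int tunnel_ifindex)
{
	/* A reply towards a remote node may still need rev-DNAT (e.g. the
	 * externalTrafficPolicy=Local case with KPR disabled), so hand it to
	 * the host stack. Depending on the destination address, node routes
	 * then either send it back via cilium_host into the tunnel
	 * (CiliumInternalIP) or route it natively (other node IPs). */
	if (is_reply && dst_info->sec_identity == REMOTE_NODE_IDENTITY_SKETCH)
		return TC_ACT_OK;

	/* All other traffic towards a known tunnel endpoint keeps being
	 * encapsulated, preserving the source security identity. */
	if (dst_info->tunnel_endpoint)
		return encap_with_identity_sketch(ctx, dst_info->tunnel_endpoint,
						  src_sec_identity, tunnel_ifindex);

	return TC_ACT_OK;
}
```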

Overall, this change addresses the externalTrafficPolicy=Local service case while still preserving encapsulation in all other cases. As a side effect, it also improves performance in the kube-proxy + hostfw case, as pod to pod traffic is now also redirected directly through the tunnel instead of being sent via the stack.

Fixes: c8052a1 ("bpf: Do not bypass conntrack if running kube-proxy+hostfw or IPSec")

Fix host firewall policy enforcement for pod to node traffic when tunneling is enabled and KPR is disabled

@giorio94 giorio94 added kind/bug This is a bug in the Cilium logic. sig/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages. release-note/bug This PR fixes an issue in a previous release of Cilium. needs-backport/1.15 This PR / issue needs backporting to the v1.15 branch labels Feb 16, 2024
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from main in 1.15.2 Feb 16, 2024
@giorio94
Member Author

/test

@giorio94 giorio94 marked this pull request as ready for review February 16, 2024 18:14
@giorio94 giorio94 requested a review from a team as a code owner February 16, 2024 18:14
@giorio94 giorio94 marked this pull request as draft February 19, 2024 15:18
@giorio94
Member Author

Force-pushed to clarify the commit message about the case where traffic with a destination identity matching a remote node may still need to go through the tunnel, namely when the destination is a CiliumInternalIP. Marking as ready for review again.

@giorio94 giorio94 marked this pull request as ready for review February 19, 2024 15:29
@giorio94
Member Author

/test

@julianwiedmann julianwiedmann added area/kube-proxy Issues related to kube-proxy (not the kube-proxy-free mode). area/host-firewall Impacts the host firewall or the host endpoint. labels Feb 19, 2024
Review comment on bpf/bpf_lxc.c (outdated, resolved)
Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
@giorio94
Member Author

/test

Member

@pchaigno pchaigno left a comment


Thanks for tackling this! It's been known broken since forever 😞

EDIT: I meant to comment instead of approving 🤷

Review comment on bpf/bpf_lxc.c (resolved)
Member

@pchaigno pchaigno left a comment


LGTM!

Is there an end-to-end test coming to cover this?

@giorio94
Member Author

Is there an end-to-end test coming to cover this?

Yep, I've opened cilium/cilium-cli#2306 to introduce basic host firewall connectivity tests as part of the Cilium CLI, and once merged I'll update the conformance-e2e workflow to enable host firewall also for the KPR=disabled case (either as part of this PR or a separate one, as you prefer).

Do you have opinions on backporting this change to stable branches? I've currently marked it for v1.15, but I wonder whether it could make sense to backport it to older ones as well.

@pchaigno
Member

Do you have opinions on backporting this change to stable branches? I've currently marked it for v1.15, but I wonder whether it could make sense to backport it to older ones as well.

I wouldn't backport further back for now, unless users ask for it. We can always backport later if needed and, in the meantime, we gain some confidence that everything works as expected.

@giorio94
Member Author

I've opened #30914 to switch two existing Conformance E2E configurations (with KPR disabled) to also enable host firewall. I couldn't add a commit to this PR as the branch is from a fork.

@giorio94
Member Author

CI is green, reviews are in, marking ready to merge.

@giorio94 giorio94 added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Feb 23, 2024
@aanm aanm added this pull request to the merge queue Feb 26, 2024
Merged via the queue into cilium:main with commit 8fcfad9 Feb 26, 2024
62 checks passed
@YutaroHayakawa YutaroHayakawa mentioned this pull request Feb 27, 2024
@YutaroHayakawa YutaroHayakawa added backport-pending/1.15 The backport for Cilium 1.15.x for this PR is in progress. and removed needs-backport/1.15 This PR / issue needs backporting to the v1.15 branch labels Feb 27, 2024
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from main to Backport pending to v1.15 in 1.15.2 Feb 27, 2024
@maintainer-s-little-helper maintainer-s-little-helper bot removed the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Feb 27, 2024
@github-actions github-actions bot added backport-done/1.15 The backport for Cilium 1.15.x for this PR is done. and removed backport-pending/1.15 The backport for Cilium 1.15.x for this PR is in progress. labels Feb 28, 2024
@maintainer-s-little-helper maintainer-s-little-helper bot removed this from Backport pending to v1.15 in 1.15.2 Feb 28, 2024
@giorio94 giorio94 mentioned this pull request Mar 12, 2024