bpf: correctly encapsulate pod to node traffic with kube-proxy+hostfw
When the host firewall is enabled in tunneling mode, pod to node traffic
needs to be forwarded through the tunnel in order to preserve the security
identity (as otherwise the source IP address would be SNATted), which is
required to enforce ingress host policies.

One tricky case is node (or hostns pod) to pod traffic via services with
externalTrafficPolicy=Local, when kube-proxy replacement (KPR) is
disabled. In this case, the SYN packet is routed natively (as both the
source and the destination are node IPs) to the destination node, and
then DNATted to one of the backend IPs, without being SNATted at the
same time. The SYN+ACK packet would then be incorrectly redirected
through the tunnel (as the destination is a node IP, associated with a
tunnel endpoint in the ipcache), breaking the connection. Instead, it
should be passed to the stack to be rev DNATted and then forwarded
accordingly.

In detail, quoting the description from c8052a1, the broken packet
path is node1 --VIP--> pod@node2 (VIP is node2IP):

- SYN leaves node1 via native device with  node1IP -> VIP
- SYN is DNATed on node2 to                node1IP -> podIP
- SYN is delivered to lxc device with      node1IP -> podIP
- SYN+ACK is sent from lxc device with     podIP   -> node1IP
- SYN+ACK is redirected in BPF directly to cilium_vxlan
- SYN+ACK arrives on node1 via tunnel with podIP   -> node1IP
- RST is sent because podIP doesn't match VIP
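
The mismatch in the last step can be modelled with a small sketch. It is an illustration only, not code from the Cilium tree: `struct pkt`, `reply_accepted()` and `rev_dnat()` are hypothetical names, and addresses are plain strings for readability. It shows that node1 only accepts a reply whose source is the address it sent the SYN to, which holds after reverse DNAT in the stack but not when the SYN+ACK goes straight into the tunnel.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical, simplified packet: only the two addresses matter here. */
struct pkt {
	const char *src;
	const char *dst;
};

/* node1 accepts a reply only if it mirrors the SYN it sent:
 * reply source == SYN destination and reply destination == SYN source.
 * Anything else does not match the connection and triggers a RST.
 */
bool reply_accepted(const struct pkt *syn, const struct pkt *reply)
{
	assert(syn && reply);
	return strcmp(reply->src, syn->dst) == 0 &&
	       strcmp(reply->dst, syn->src) == 0;
}

/* Reverse DNAT as the stack on node2 would apply it to a reply:
 * the backend address (podIP) is rewritten back to the VIP.
 */
struct pkt rev_dnat(struct pkt reply, const char *pod_ip, const char *vip)
{
	if (strcmp(reply.src, pod_ip) == 0)
		reply.src = vip;
	return reply;
}
```

With a SYN of `node1IP -> VIP`, the raw SYN+ACK `podIP -> node1IP` fails `reply_accepted()`, while the same packet passed through `rev_dnat()` first is accepted.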

c8052a1 attempted to fix this issue for the kube-proxy+hostfw (and
IPSec) scenarios by always passing the packets to the stack, so that
they don't bypass conntrack. The IPSec-specific workaround was later
removed in 0a8f2c4, as that path asymmetry is no longer present. However,
always passing packets to the stack breaks the host firewall policy
enforcement for pod to node traffic: at that point no route redirects
these packets back to the tunnel to preserve the security identity, so
they are simply masqueraded and routed natively.

To prevent this issue, let's pass packets to the stack only if they
are a reply with destination identity matching a remote node, as in
that case they may need to be rev DNATted. There are two possibilities
at that point: (a) the destination is a CiliumInternalIP address, and
the reply needs to go through the tunnel -- node routes ensure that
the packet is first forwarded to cilium_host, before being redirected
through the tunnel; (b) the destination is one of the other node
addresses, and the reply needs to be forwarded natively according
to the local routing table (as node to pod/node traffic never goes
through the tunnel unless the source is a CiliumInternalIP address).
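
The decision described above, together with cases (a) and (b), can be summarised in a standalone sketch. It assumes tunneling mode with the host firewall enabled and kube-proxy in use, and that the destination has a tunnel endpoint. The enums and function names are illustrative stand-ins, not the definitions used in bpf_lxc.c, and the step after the stack is performed by node routes and the local routing table rather than by the BPF program itself.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative stand-ins; not the definitions from the Cilium tree. */
enum ct_status { CT_NEW, CT_ESTABLISHED, CT_REPLY, CT_RELATED };

enum dst_kind {
	DST_POD,                     /* remote pod */
	DST_NODE_CILIUM_INTERNAL_IP, /* remote node, CiliumInternalIP */
	DST_NODE_OTHER_IP,           /* remote node, any other node address */
};

enum path {
	PATH_TUNNEL_DIRECT,     /* redirected to the tunnel from BPF */
	PATH_STACK_THEN_TUNNEL, /* case (a): stack, cilium_host, tunnel */
	PATH_STACK_THEN_NATIVE, /* case (b): stack, native routing */
};

/* Simplified: in this model every non-pod destination is a remote node. */
bool dst_is_remote_node(enum dst_kind dst)
{
	return dst != DST_POD;
}

/* Path taken by a packet leaving a pod towards a remote destination. */
enum path egress_path(enum ct_status ct, enum dst_kind dst)
{
	bool is_reply = ct == CT_REPLY || ct == CT_RELATED;

	assert(ct <= CT_RELATED);

	/* Only replies towards a remote node may need reverse DNAT. */
	if (!(is_reply && dst_is_remote_node(dst)))
		return PATH_TUNNEL_DIRECT;

	/* (a) node routes send the packet to cilium_host, then the tunnel. */
	if (dst == DST_NODE_CILIUM_INTERNAL_IP)
		return PATH_STACK_THEN_TUNNEL;

	/* (b) forwarded natively according to the local routing table. */
	return PATH_STACK_THEN_NATIVE;
}
```

In this model a new pod to node connection still goes directly through the tunnel, so the security identity is preserved, while only replies towards a remote node take the detour through the stack.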

Overall, this change addresses the externalTrafficPolicy=Local service
case, while still preserving encapsulation in all other cases. As a
side effect, it also improves performance in the kube-proxy + hostfw
case, as pod to pod traffic is now also redirected immediately through
the tunnel, instead of being sent via the stack.

Fixes: c8052a1 ("bpf: Do not bypass conntrack if running kube-proxy+hostfw or IPSec")
Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
giorio94 committed Feb 19, 2024
1 parent d1834ba commit 6a24a39
Showing 2 changed files with 22 additions and 18 deletions.
22 changes: 22 additions & 0 deletions bpf/bpf_lxc.c
@@ -650,6 +650,13 @@ static __always_inline int handle_ipv6_from_lxc(struct __ctx_buff *ctx, __u32 *d
 		key.ip6.p4 = 0;
 		key.family = ENDPOINT_KEY_IPV6;
 
+#if !defined(ENABLE_NODEPORT) && defined(ENABLE_HOST_FIREWALL)
+		/* See comment in handle_ipv4_from_lxc(). */
+		if ((ct_status == CT_REPLY || ct_status == CT_RELATED) &&
+		    identity_is_remote_node(*dst_sec_identity))
+			goto encrypt_to_stack;
+#endif /* !ENABLE_NODEPORT && ENABLE_HOST_FIREWALL */
+
 		/* Three cases exist here either (a) the encap and redirect could
 		 * not find the tunnel so fallthrough to nat46 and stack, (b)
 		 * the packet needs IPSec encap so push ctx to stack for encap, or
@@ -1192,6 +1199,21 @@ static __always_inline int handle_ipv4_from_lxc(struct __ctx_buff *ctx, __u32 *d
 		key.family = ENDPOINT_KEY_IPV4;
 		key.cluster_id = (__u8)cluster_id;
 
+#if !defined(ENABLE_NODEPORT) && defined(ENABLE_HOST_FIREWALL)
+		/*
+		 * For the host firewall, traffic from a pod to a remote node is sent
+		 * through the tunnel. In the case of node to remote pod traffic via
+		 * externalTrafficPolicy=Local services, packets may be DNATed when
+		 * they enter the remote node (without being SNATed at the same time).
+		 * If kube-proxy is used, the response needs to go through the stack
+		 * to apply the correct reverse DNAT, and then be routed accordingly.
+		 * See #14674 for details.
+		 */
+		if ((ct_status == CT_REPLY || ct_status == CT_RELATED) &&
+		    identity_is_remote_node(*dst_sec_identity))
+			goto encrypt_to_stack;
+#endif /* !ENABLE_NODEPORT && ENABLE_HOST_FIREWALL */
+
 #ifdef ENABLE_CLUSTER_AWARE_ADDRESSING
 		/*
 		 * The destination is remote node, but the connection is originated from tunnel.
18 changes: 0 additions & 18 deletions bpf/lib/encap.h
@@ -100,26 +100,8 @@ __encap_and_redirect_lxc(struct __ctx_buff *ctx, __be32 tunnel_endpoint,
 						      seclabel, false);
 #endif
 
-#if !defined(ENABLE_NODEPORT) && defined(ENABLE_HOST_FIREWALL)
-	/* For the host firewall, traffic from a pod to a remote node is sent
-	 * through the tunnel. In the case of node --> VIP@remote pod, packets may
-	 * be DNATed when they enter the remote node. If kube-proxy is used, the
-	 * response needs to go through the stack on the way to the tunnel, to
-	 * apply the correct reverse DNAT.
-	 * See #14674 for details.
-	 */
-	ret = __encap_with_nodeid(ctx, 0, 0, tunnel_endpoint, seclabel, dstid,
-				  NOT_VTEP_DST, trace->reason, trace->monitor,
-				  &ifindex);
-	if (ret != CTX_ACT_REDIRECT)
-		return ret;
-
-	/* tell caller that this packet needs to go through the stack: */
-	return CTX_ACT_OK;
-#else
 	return encap_and_redirect_with_nodeid(ctx, tunnel_endpoint, 0, seclabel,
 					      dstid, trace);
-#endif /* !ENABLE_NODEPORT && ENABLE_HOST_FIREWALL */
 }
 
 #if defined(TUNNEL_MODE) || defined(ENABLE_HIGH_SCALE_IPCACHE)
