bpf: Remove link scope of cilium_host's IPv4 address
Kube-proxy always masquerades DNATed packets going to NodePort services.
This is to ensure that reply packets always flow through the
intermediate, DNATing node. Consider the following path:

    pod@node1 -> nodeport@node2 -> backend@node3

A packet is sent from pod@node1 to a NodePort service with node2's IP
address. Node2 DNATs the packet and forwards it to the backend on node3.
If node2 doesn't also masquerade the packet, the reply packet will be
sent directly to node1, bypassing the reverse DNAT.

In tunneling mode, however, kube-proxy appears unable to pick the correct
source IP for masquerading. Consider the following packet flow (under
VXLAN + endpoint routes + IPsec [1]):

    <- endpoint 656 flow 0x5c7eb4 , identity 20590->unknown state unknown ifindex 0 orig-ip 0.0.0.0: 10.0.1.172:57110 -> 192.168.56.12:30656 tcp SYN
    -> stack flow 0x5c7eb4 , identity 20590->host state new ifindex 0 orig-ip 0.0.0.0: 10.0.1.172:57110 -> 192.168.56.12:30656 tcp SYN
    <- host flow 0x5c7eb4 , identity 20590->unknown state unknown ifindex lxc7e0fe2229abe orig-ip 0.0.0.0: 10.0.2.15:45035 -> 10.0.0.165:8080 tcp SYN
    -> stack flow 0x5c7eb4 , identity 20590->unknown state unknown ifindex cilium_host orig-ip 0.0.0.0: 10.0.2.15:45035 -> 10.0.0.165:8080 tcp SYN
    <- stack encrypted  flow 0x5c7eb4 , identity 20590->unknown state new ifindex cilium_net orig-ip 0.0.0.0: 10.0.2.15:45035 -> 10.0.0.165:8080 tcp SYN
    -> overlay encrypted  flow 0x5c7eb4 , identity 20590->unknown state new ifindex cilium_vxlan orig-ip 0.0.0.0: 10.0.2.15:45035 -> 10.0.0.165:8080 tcp SYN

Client pod 10.0.1.172 sends a packet to NodePort 30656 on node 2. That
packet is masqueraded to 10.0.2.15 (line 3), the IP on the default
interface. This choice is incorrect as the packet will then go through
the tunnel and not the underlay. The reply will therefore not be sent
through the tunnel and may even fail if 10.0.2.15 isn't routable from
the backend's node (as is the case in our testing setup).

Instead, kube-proxy should pick the IP address of cilium_host, which
belongs to the node's pod CIDR, thus ensuring the reply will be routed
through the tunnel. Why doesn't it?

Checking the kernel's source code [2], we can see that the scope of IP
addresses on the interfaces is taken into account in addition to the
destination IP (and other packet information in case of source routing,
etc.). Specifically, in the case of netfilter's masquerading,
inet_select_addr is called with a scope of RT_SCOPE_UNIVERSE (0).
Therefore, only IP addresses with a scope equal to RT_SCOPE_UNIVERSE
will be picked.
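
The effect of that scope filter can be approximated from the shell. The
sketch below is a simplification of the rule described above, not the
kernel's code: it reads `ip -o -4 addr show` style output on stdin and
prints the first address with scope `global` (iproute2's name for
RT_SCOPE_UNIVERSE), preferring the given device and falling back to any
other device. The function name select_universe_addr and the sample
addresses used with it are hypothetical.

```shell
# Hypothetical helper: approximate source selection for scope universe.
# stdin: lines as printed by `ip -o -4 addr show`
# $1:    preferred (outgoing) device name
select_universe_addr() {
    awk -v dev="$1" '
        {
            # Find the value following the "scope" keyword on this line.
            scope = ""
            for (i = 5; i < NF; i++)
                if ($i == "scope") scope = $(i + 1)
            # Link- and host-scoped addresses are not eligible.
            if (scope != "global") next
            split($4, a, "/")   # strip the prefix length
            if ($2 == dev && on_dev == "") on_dev = a[1]
            if (any == "") any = a[1]
        }
        END {
            if (on_dev != "") print on_dev
            else if (any != "") print any
        }
    '
}
```

On a live node this could be fed with
`ip -o -4 addr show | select_universe_addr cilium_host`. With a
link-scoped cilium_host address the function falls through to another
interface's address, which matches the 10.0.2.15 choice seen in the
trace.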

This commit thus removes the link scope on the IPv4 address of
cilium_host, such that the address now has the default scope,
RT_SCOPE_UNIVERSE.
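
As a rough illustration of the change, the guarded add from bpf/init.sh
can be written as a small function. This is a sketch only: the function
name add_host_addr is hypothetical, and the quoting is added for the
example. The point is that no `scope` argument is passed, so iproute2
assigns the default (global) scope to a non-loopback IPv4 address.

```shell
# Hypothetical helper mirroring the init.sh idiom.
# $1: IPv4 address to assign, $2: device name
add_host_addr() {
    # Skip the add if the address is already present on the device.
    if [ -n "$(ip -4 addr show to "$1" dev "$2")" ]; then
        return 0
    fi
    # No "scope link": the address gets the default scope.
    ip -4 addr add "$1" dev "$2"
}
```

Because the guard only checks for presence, an address that already
exists on the device is left as it is, whatever its scope.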

This will be tested in the Cilium Datapath workflow via a subsequent
pull request, but we need to fix one other bug before we can do that.

1 - IPsec doesn't matter to the bug here. Endpoint routes, however, do.
    If endpoint routes are enabled, Cilium adds a masquerading rule in
    front of kube-proxy's to always masquerade DNATed pod traffic to
    the cilium_host IP address. See [3] for details.
2 - https://github.com/torvalds/linux/blob/v5.19/net/ipv4/devinet.c#L1324
3 - https://github.com/cilium/cilium/blob/v1.13.0-rc4/pkg/datapath/iptables/iptables.go#L1216-L1242
Co-authored-by: Liu Xu <liuxu623@gmail.com>
Signed-off-by: Paul Chaignon <paul@cilium.io>
pchaigno authored and ldelossa committed Jan 25, 2023
1 parent 4944457 commit 92a3e31
Showing 1 changed file with 1 addition and 1 deletion.
2 changes: 1 addition & 1 deletion bpf/init.sh
@@ -327,7 +327,7 @@ esac
 	[ -n "$(ip -6 addr show to $IP6_HOST dev $HOST_DEV1)" ] || ip -6 addr add $IP6_HOST dev $HOST_DEV1
 fi
 if [ "$IP4_HOST" != "<nil>" ]; then
-	[ -n "$(ip -4 addr show to $IP4_HOST dev $HOST_DEV1)" ] || ip -4 addr add $IP4_HOST dev $HOST_DEV1 scope link
+	[ -n "$(ip -4 addr show to $IP4_HOST dev $HOST_DEV1)" ] || ip -4 addr add $IP4_HOST dev $HOST_DEV1
 fi

 if [ "$PROXY_RULE" = "true" ]; then
