Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
bpf: Remove link scope of cilium_host's IPv4 address
Kube-proxy always masquerades DNATed packets going to NodePort services. This is to ensure that reply packets always flow through the intermediate, DNATing node. Consider the following path: pod@node1 -> nodeport@node2 -> backend@node3 A packet is sent from pod@node1 to a NodePort service with node2's IP address. Node2 DNATs the packet and forwards it to the backend on node3. If node2 doesn't also masquerade the packet, the reply packet will be sent directly to node1, bypassing the reverse DNAT. In tunneling mode however, kube-proxy appears unable to pick the correct source IP for masquerading. Consider the following packet flow (under VXLAN + endpoint routes + IPsec [1]): <- endpoint 656 flow 0x5c7eb4 , identity 20590->unknown state unknown ifindex 0 orig-ip 0.0.0.0: 10.0.1.172:57110 -> 192.168.56.12:30656 tcp SYN -> stack flow 0x5c7eb4 , identity 20590->host state new ifindex 0 orig-ip 0.0.0.0: 10.0.1.172:57110 -> 192.168.56.12:30656 tcp SYN <- host flow 0x5c7eb4 , identity 20590->unknown state unknown ifindex lxc7e0fe2229abe orig-ip 0.0.0.0: 10.0.2.15:45035 -> 10.0.0.165:8080 tcp SYN -> stack flow 0x5c7eb4 , identity 20590->unknown state unknown ifindex cilium_host orig-ip 0.0.0.0: 10.0.2.15:45035 -> 10.0.0.165:8080 tcp SYN <- stack encrypted flow 0x5c7eb4 , identity 20590->unknown state new ifindex cilium_net orig-ip 0.0.0.0: 10.0.2.15:45035 -> 10.0.0.165:8080 tcp SYN -> overlay encrypted flow 0x5c7eb4 , identity 20590->unknown state new ifindex cilium_vxlan orig-ip 0.0.0.0: 10.0.2.15:45035 -> 10.0.0.165:8080 tcp SYN Client pod 10.0.1.172 sends a packet to NodePort 30656 on node 2. That packet is masqueraded to 10.0.2.15 (line 3), the IP on the default interface. This choice is incorrect as the packet will then go through the tunnel and not the underlay. The reply will therefore not be sent through the tunnel and may even fail if 10.0.2.15 isn't routable from node 2 (as is the case in our testing setup). Instead, kube-proxy should pick the IP address of cilium_host, which belongs to the node's pod CIDR, thus ensuring the reply will be routed through the tunnel. Why isn't it? Checking the kernel's source code [2], we can see that the scope of IP addresses on the interfaces is taken into account in addition to the destination IP (and other packet information in case of source routing, etc.). Specifically, in the case of netfilter's masquerading, inet_select_addr is called with a scope of RT_SCOPE_UNIVERSE (0). Therefore, only IP addresses with a scope equal to RT_SCOPE_UNIVERSE will be picked. This commit thus removes the link scope on the IPv4 address of cilium_host, such that the address now has a RT_SCOPE_UNIVERSE scope (default). This will be tested in the Cilium Datapath workflow via a subsequent pull request, but we need to fix one other bug before we can do that. 1 - IPsec doesn't matter to the bug here. Endpoint routes however does. If endpoint routes is enabled, Cilium adds a masquerading rule in front of kube-proxy's to always masquerade DNATed pod traffic to cilium_host IP address. See [3] for details. 2 - https://github.com/torvalds/linux/blob/v5.19/net/ipv4/devinet.c#L1324 3 - https://github.com/cilium/cilium/blob/v1.13.0-rc4/pkg/datapath/iptables/iptables.go#L1216-L1242 Co-authored-by: Liu Xu <liuxu623@gmail.com> Signed-off-by: Paul Chaignon <paul@cilium.io>
- Loading branch information