
datapath: Change FIB lookups to enable NodePort multihoming #18585

Merged
merged 1 commit into master from pr/brb/bpf-multihoming-v2
Jan 27, 2022

Conversation

brb
Member

@brb brb commented Jan 21, 2022

Previously, the bpf_fib_lookup() helper was invoked in nodeport.h with
the BPF_FIB_LOOKUP_{OUTPUT,DIRECT} flags set. This commit drops both
flags to make NodePort BPF multihoming possible.

The former flag instructed the helper to do the FIB lookup from the egress
perspective, which required us to set the correct egress ifindex for the
lookup. In the NodePort BPF code we used DIRECT_ROUTING_DEV_IFINDEX, which
corresponds to the iface used to communicate between k8s Nodes (i.e., the
iface with the k8s Node InternalIP or ExternalIP). However, relying on
DIRECT_ROUTING_DEV_IFINDEX meant that when more than one iface connects the
nodes, only a single iface could carry the NodePort BPF traffic (forwarded
service requests).

To fix this, we drop BPF_FIB_LOOKUP_OUTPUT and set the ingress iface in the
FIB lookup parameters. The FIB lookup is then done from the perspective of
the device which received the packet (before forwarding to the egress
iface), which eliminates the need for DIRECT_ROUTING_DEV_IFINDEX in the
NodePort BPF code.

The latter flag was meant to speed up FIB lookups. The kernel commit which
introduced it [1] says:

BPF_FIB_LOOKUP_DIRECT to do a lookup that bypasses FIB rules and goes
straight to the table associated with the device (expert setting for
those looking to maximize throughput)

This means that we bypass IP rules. However, cilium-agent installs IP rules
in some setups (e.g., on EKS or with WireGuard). Therefore, to avoid any
potential pitfalls, we drop the latter flag so that the IP rules are
respected.
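
For illustration, below is a minimal sketch of the resulting lookup pattern
(this is not the actual nodeport.h code; only the helper, struct, and flag
names come from the kernel UAPI, everything else is simplified and
IPv4-only):

/* Hedged sketch: FIB lookup done from the perspective of the ingress device,
 * with no BPF_FIB_LOOKUP_* flags, so that IP rules are honoured and the
 * kernel may pick any egress device (multihoming).
 */
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>

#ifndef AF_INET
#define AF_INET 2
#endif

static __always_inline int redirect_via_fib_v4(struct __sk_buff *ctx,
					       __be32 saddr, __be32 daddr)
{
	struct bpf_fib_lookup params = {};
	long ret;

	params.family   = AF_INET;
	params.ifindex  = ctx->ingress_ifindex; /* lookup from the receiving iface */
	params.ipv4_src = saddr;
	params.ipv4_dst = daddr;

	/* flags == 0: neither BPF_FIB_LOOKUP_OUTPUT nor BPF_FIB_LOOKUP_DIRECT */
	ret = bpf_fib_lookup(ctx, &params, sizeof(params), 0);
	if (ret != BPF_FIB_LKUP_RET_SUCCESS)
		return TC_ACT_SHOT; /* no route, no neighbour, helper error, ... */

	/* The helper filled in the egress ifindex and both MAC addresses. */
	if (bpf_skb_store_bytes(ctx, 0, params.dmac, ETH_ALEN, 0) < 0 ||
	    bpf_skb_store_bytes(ctx, ETH_ALEN, params.smac, ETH_ALEN, 0) < 0)
		return TC_ACT_SHOT;

	return bpf_redirect(params.ifindex, 0);
}

With BPF_FIB_LOOKUP_OUTPUT, the same call would instead have required
params.ifindex to be a pre-chosen egress device such as
DIRECT_ROUTING_DEV_IFINDEX.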

To measure a potential perf regression, I used the scripts from [2] on
three AWS EC2 c5a.8xlarge machines (10Gbit, Ubuntu 20.04 LTS, 5.11
kernel). The netperf client was sending requests to the LB node, which
was forwarding them to the destination node. The only Helm setting for
Cilium was KPR=strict. The results from three runs of each setup are the
following:

---------------------+-----------------+-------------------------
                     | v1.11.1         | v1.11.1 with FIB changes
---------------------+-----------------+-------------------------
TCP_RR MEAN_LATENCY  | 358.74 +- 27.08 | 359.56 +- 30.50
with STDDEV_LATENCY  | 360.28 +- 26.01 | 361.76 +- 26.71
                     | 368.18 +- 28.22 | 365.33 +- 35.83
---------------------+-----------------+-------------------------

As you can see, the increase in latency is negligible. Therefore, this
commit drops the latter flag unconditionally, i.e., without checking
whether any non-default IP rules are present or whether Cilium is running
in a multi-homing setup.

[1]: torvalds/linux@87f5fc7
[2]: https://github.com/brb/cilium-without-netfilter

Fix #18199

@brb brb added the sig/datapath (Impacts bpf/ or low-level forwarding details, including map management and monitor messages.), kind/feature (This introduces new functionality.) and release-note/misc (This PR makes changes that have no direct user impact.) labels on Jan 21, 2022
@brb
Member Author

brb commented Jan 21, 2022

/ci-l4lb

@brb
Member Author

brb commented Jan 21, 2022

/test-1.23-net-next

@brb
Member Author

brb commented Jan 22, 2022

Oh wow, on net-next everything except the flaky L7 test has passed. Running one more time to gain more confidence.

@brb
Member Author

brb commented Jan 22, 2022

/test

brb added a commit that referenced this pull request Jan 24, 2022
#18585

Signed-off-by: Martynas Pumputis <m@lambda.lt>
@brb brb mentioned this pull request Jan 24, 2022
2 tasks
@brb
Member Author

brb commented Jan 24, 2022

/test

@brb brb force-pushed the pr/brb/bpf-multihoming-v2 branch 2 times, most recently from 0739392 to d39fe4c on January 25, 2022 20:36
@brb
Member Author

brb commented Jan 25, 2022

/test

@brb brb force-pushed the pr/brb/bpf-multihoming-v2 branch from d39fe4c to 9e6dc66 on January 26, 2022 09:18
@brb brb changed the title from "WIP: datapath: Add multihoming to NodePort BPF" to "datapath: Change FIB lookups to enable NodePort multihoming" on Jan 26, 2022
@brb brb force-pushed the pr/brb/bpf-multihoming-v2 branch from 9e6dc66 to 70ea6c7 on January 26, 2022 09:23
@brb
Member Author

brb commented Jan 26, 2022

Previously, all relevant unquarantined tests passed (net-next https://jenkins.cilium.io/job/Cilium-PR-K8s-1.23-kernel-net-next/313/ hit the known flakes). Dropped the unquarantine CI commit.

@brb brb requested a review from borkmann January 26, 2022 09:24
@brb brb marked this pull request as ready for review January 26, 2022 09:24
@brb brb requested a review from a team January 26, 2022 09:24
@brb brb added this to the 1.12 milestone Jan 26, 2022
@maintainer-s-little-helper maintainer-s-little-helper bot added this to "Needs backport from master" in 1.11.2 on Jan 26, 2022
@brb
Member Author

brb commented Jan 26, 2022

/test

@borkmann
Member

borkmann commented Jan 26, 2022

Therefore, to avoid any potential pitfalls, we drop the latter flag so that the IP rules are respected.

Given we know exactly when we install IP rules in the agent, I would suggest having the agent transparently opt out of BPF_FIB_LOOKUP_DIRECT when we install them, and otherwise not.

Regarding BPF_FIB_LOOKUP_OUTPUT, should we retain it for hair-pinning LB? If the oif is set, only routes (nexthops really) using that device are considered (see fib_lookup_good_nhc()), which makes a big difference (e.g. XDP). (For egress when we do not know the egress device, set the OUTPUT flag and make sure params->ifindex = 0 to allow the lookup to determine nexthop and device. Wouldn't that work given we know at this point that the pkt is not for local delivery anyway..?)

@brb
Member Author

brb commented Jan 26, 2022

Given we know exactly when we install ip rules in the agent, I would suggest we should have the agent transparently opt-out of BPF_FIB_LOOKUP_DIRECT when we install them, and otherwise not.

I think it's not straightforward to determine whether cilium-agent is going to install IP rules. Also, we don't have any control over IP rules users might install. Not sure whether the perf benefits could offset the complexity in the agent and UX.

(For egress when we do not know the egress device, set the OUTPUT flag and make sure params->ifindex = 0 to allow the lookup to determine nexthop and device. Wouldn't that work given we know at this point that the pkt is not for local delivery anyway..?)

Unfortunately, we cannot set params->ifindex to 0, as the lookup will then fail due to the check at https://github.com/torvalds/linux/blob/v5.16/net/core/filter.c#L5394 (my attempt to drop this check was NACK-ed).

fib_params.ipv4_dst = ip4->daddr;

fib_ret = fib_lookup(ctx, &fib_params, sizeof(fib_params), 0);

Member

Ok, lgtm, tiny nit: let's remove the newline above and in the v6 part.

@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge (This PR has passed all tests and received consensus from code owners to merge.) label on Jan 27, 2022
@borkmann
Member

Not sure whether the perf benefits could offset the complexity in the agent and UX.

Ok, fair enough, but agree that if needed we can optimize it later on.

@glibsm glibsm merged commit 96152dc into master Jan 27, 2022
@glibsm glibsm deleted the pr/brb/bpf-multihoming-v2 branch January 27, 2022 16:28
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from "Needs backport from master" to "Backport pending to v1.10" in 1.11.2 on Jan 31, 2022
@joamaki joamaki added the backport-done/1.11 (The backport for Cilium 1.11.x for this PR is done.) label and removed the backport-pending/1.11 label on Feb 8, 2022
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from "Backport pending to v1.10" to "Backport done to v1.11" in 1.11.2 on Feb 8, 2022
ysksuzuki added a commit to ysksuzuki/cilium that referenced this pull request Jun 21, 2022
Currently, neighbor discovery only supports a single device, specified
as the directRoutingDevice. That causes packet drops due to a "FIB lookup
failed" error in a multi-device environment. The BPF Nodeport
implementation no longer relies on the directRoutingDevice since cilium#18585.

The following example illustrates the issue, which occurs in a
multi-device environment. Each node has two devices connected to
different L3 networks (10.69.0.64/26 and 10.69.0.128/26), plus a global
scope address each (10.69.0.1/26 and 10.69.0.2/26). A nexthop from node1
to node2 is either 10.69.0.66 dev eno1 or 10.69.0.130 dev eno2.

+--------------+     +--------------+
|    node1     |     |    node2     |
| 10.69.0.1/26 |     | 10.69.0.2/26 |
|          eno1+-----+eno1          |
|          |   |     |   |          |
| 10.69.0.65/26|     |10.69.0.66/26 |
|              |     |              |
|          eno2+-----+eno2          |
|          |   |     |   |          |
|10.69.0.129/26|     |10.69.0.130/26|
+--------------+     +--------------+

(On node1)
$ ip route show
10.69.0.2
        nexthop via 10.69.0.66 dev eno1 weight 1
        nexthop via 10.69.0.130 dev eno2 weight 1

Assuming the directRoutingDevice is eno1, if the BPF Nodeport implementation
selects 10.69.0.130 dev eno2 as the nexthop when it redirects a packet to
the selected backend on an intermediate node, the redirect will fail with a
"FIB lookup failed" error because there is no neighbor entry for eno2, which
is not the directRoutingDevice. This PR fixes the issue by populating L2
addresses for all target devices.

Fixes: cilium#19908

Signed-off-by: Yusuke Suzuki <yusuke-suzuki@cybozu.co.jp>
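
As a hedged illustration (not Cilium's actual datapath code) of how such a
missing neighbor entry surfaces at the BPF level: the route via eno2 may
resolve fine, but bpf_fib_lookup() returns BPF_FIB_LKUP_RET_NO_NEIGH when
the kernel has no L2 address for the nexthop, so the program cannot fill in
the destination MAC and has to drop (or punt) the packet:

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Classify a bpf_fib_lookup() return value: only full success provides the
 * MAC addresses needed to forward from BPF. NO_NEIGH is the case described
 * above: a usable route exists, but there is no neighbor entry for its
 * nexthop on the chosen device (e.g. 10.69.0.130 via eno2).
 */
static __always_inline int fib_result_forwardable(long ret)
{
	switch (ret) {
	case BPF_FIB_LKUP_RET_SUCCESS:
		return 1;	/* smac/dmac filled in, safe to redirect */
	case BPF_FIB_LKUP_RET_NO_NEIGH:
		return 0;	/* route found, L2 address unknown */
	default:
		return 0;	/* unreachable, blackhole, helper error, ... */
	}
}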
borkmann pushed a commit that referenced this pull request Jun 24, 2022
pchaigno pushed a commit to pchaigno/cilium that referenced this pull request Jun 28, 2022
qmonnet pushed a commit that referenced this pull request Jul 5, 2022
gandro pushed a commit to gandro/cilium that referenced this pull request Aug 4, 2022
dezmodue pushed a commit to dezmodue/cilium that referenced this pull request Aug 10, 2022
Labels
backport-done/1.11 (The backport for Cilium 1.11.x for this PR is done.), kind/feature (This introduces new functionality.), ready-to-merge (This PR has passed all tests and received consensus from code owners to merge.), release-note/misc (This PR makes changes that have no direct user impact.), sig/datapath (Impacts bpf/ or low-level forwarding details, including map management and monitor messages.)
Projects
No open projects
1.11.2
Backport done to v1.11
Development

Successfully merging this pull request may close these issues.

DSR mode is broken
4 participants