datapath: Change FIB lookups to enable NodePort multihoming #18585
Conversation
/ci-l4lb
/test-1.23-net-next
Oh wow, on net-next everything except the flaky L7 test has passed. Running one more time to gain more confidence.
/test
#18585 Signed-off-by: Martynas Pumputis <m@lambda.lt>
/test
Force-pushed from 0739392 to d39fe4c.
/test
Force-pushed from d39fe4c to 9e6dc66.
Force-pushed from 9e6dc66 to 70ea6c7.
Previously, all relevant unquarantined tests passed (net-next https://jenkins.cilium.io/job/Cilium-PR-K8s-1.23-kernel-net-next/313/ hit the known flakes). Dropped the unquarantine CI commit.
/test
Given we know exactly when we install ip rules in the agent, I would suggest we have the agent transparently opt out of BPF_FIB_LOOKUP_DIRECT when we install them, and otherwise not. Regarding BPF_FIB_LOOKUP_OUTPUT, should we retain it for hair-pinning LB? If the oif is set, only routes (nexthops really) using that device are considered (see fib_lookup_good_nhc()), which makes a big difference (e.g. XDP). (For egress, when we do not know the egress device, set the OUTPUT flag and make sure params->ifindex = 0 to allow the lookup to determine nexthop and device. Wouldn't that work, given we know at this point that the pkt is not for local delivery anyway?)
I think it's not straightforward to determine whether cilium-agent is going to install IP rules. Also, we don't have any control over IP rules users might install. Not sure whether the perf benefits could offset the complexity in the agent and UX.
Unfortunately, we cannot set params->ifindex to 0, as the lookup would then fail due to the check at https://github.com/torvalds/linux/blob/v5.16/net/core/filter.c#L5394 (my attempt to drop this check was NACK-ed).
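For context, the check in question looks roughly like the following (a paraphrase of the bpf_ipv4_fib_lookup() path in net/core/filter.c, not an exact quote, and the exact line may shift between kernel versions):

/* Paraphrased sketch: the helper resolves params->ifindex to a net_device
 * before doing any route lookup, so an ifindex of 0 cannot name a device
 * and the call fails with -ENODEV.
 */
dev = dev_get_by_index_rcu(net, params->ifindex);
if (unlikely(!dev))
	return -ENODEV;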
fib_params.ipv4_dst = ip4->daddr;

fib_ret = fib_lookup(ctx, &fib_params, sizeof(fib_params), 0);
Ok, lgtm, tiny nit: let's remove the newline above and in the v6 part.
Ok, fair enough, but I agree that if needed we can optimize it later on.
Currently, the neighbor discovery only supports a single device specified as the directRoutingDevice. That causes packet drops due to a "FIB lookup failed" error in a multi-device environment. The BPF NodePort implementation no longer relies on the directRoutingDevice since #18585.

The following example illustrates the issue, which occurs in a multi-device environment. Each node has two devices connected to different L3 networks (10.69.0.64/26 and 10.69.0.128/26), and a global scope address each (10.69.0.1/26 and 10.69.0.2/26). A nexthop from node1 to node2 is either 10.69.0.66 dev eno1 or 10.69.0.130 dev eno2.

+--------------+     +--------------+
|    node1     |     |    node2     |
| 10.69.0.1/26 |     | 10.69.0.2/26 |
|          eno1+-----+eno1          |
|              |     |              |
| 10.69.0.65/26|     |10.69.0.66/26 |
|              |     |              |
|          eno2+-----+eno2          |
|              |     |              |
|10.69.0.129/26|     |10.69.0.130/26|
+--------------+     +--------------+

(On node1)
$ ip route show 10.69.0.2
nexthop via 10.69.0.66 dev eno1 weight 1
nexthop via 10.69.0.130 dev eno2 weight 1

Assuming the directRoutingDevice is eno1, if the BPF NodePort implementation selects 10.69.0.130 dev eno2 as the nexthop when it redirects a packet to the selected backend on an intermediate node, it will fail with "FIB lookup failed" because there is no neighbor entry for the device eno2, which is not the directRoutingDevice. This PR fixes the issue by populating L2 addresses with all target devices.

Fixes: #19908
Signed-off-by: Yusuke Suzuki <yusuke-suzuki@cybozu.co.jp>
Previously, the bpf_fib_lookup() helper was invoked in nodeport.h with
the BPF_FIB_LOOKUP_{OUTPUT,DIRECT} flags set. This commit drops both
flags to make the NodePort BPF multihoming possible.
The former flag instructed the helper to do the FIB lookup from the egress
perspective. This required us to set the correct ifindex for the lookup.
In the NodePort BPF code we used DIRECT_ROUTING_DEV_IFINDEX, which
corresponds to the iface used to communicate between k8s Nodes
(i.e., the iface with the k8s Node InternalIP or ExternalIP). However, the
reliance on DIRECT_ROUTING_DEV_IFINDEX meant that when more than a
single iface is used to connect the nodes, only one iface could
carry the NodePort BPF traffic (forwarded service requests).
To fix this, we can drop BPF_FIB_LOOKUP_OUTPUT and set the ingress
iface in the FIB lookup parameters. This makes the FIB lookup happen from
the perspective of the device which received the packet (before it is
forwarded to the egress iface), and it eliminates the need for
DIRECT_ROUTING_DEV_IFINDEX in the NodePort BPF.
The latter flag was meant to speed up FIB lookups. The kernel commit which
introduced it [1] says:

    BPF_FIB_LOOKUP_DIRECT to do a lookup that bypasses FIB rules and goes
    straight to the table associated with the device (expert setting for
    those looking to maximize throughput)
This means that we bypass ip rules. However, cilium-agent for some
setups installs IP rules (e.g., EKS, WireGuard, etc.). Therefore, to
avoid any potential pitfalls, we drop the latter flag so that the IP rules
can be respected.
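To make the resulting lookup pattern concrete, below is a minimal, self-contained sketch. It is not the actual nodeport.h code (fib_redirect_v4() and its error handling are made up for illustration); it only demonstrates the two changes described above: the ingress ifindex goes into the lookup params, and the flags argument is 0, so neither BPF_FIB_LOOKUP_OUTPUT nor BPF_FIB_LOOKUP_DIRECT is set.

#include <linux/types.h>
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <stddef.h>
#include <bpf/bpf_helpers.h>

#ifndef AF_INET
#define AF_INET 2
#endif

/* fib_redirect_v4() is a made-up name for illustration; the real logic in
 * bpf/lib/nodeport.h is more involved (IPv6, NAT, Cilium drop codes).
 */
static __always_inline int fib_redirect_v4(struct __sk_buff *ctx,
					   __be32 saddr, __be32 daddr)
{
	struct bpf_fib_lookup params = {};
	int ret;

	params.family	= AF_INET;
	/* Lookup from the perspective of the interface that received the
	 * packet, instead of pinning DIRECT_ROUTING_DEV_IFINDEX via
	 * BPF_FIB_LOOKUP_OUTPUT.
	 */
	params.ifindex	= ctx->ingress_ifindex;
	params.ipv4_src	= saddr;
	params.ipv4_dst	= daddr;

	/* flags == 0: no BPF_FIB_LOOKUP_OUTPUT (ingress-side lookup) and no
	 * BPF_FIB_LOOKUP_DIRECT (ip rules installed by the agent or the
	 * user are honoured).
	 */
	ret = bpf_fib_lookup(ctx, &params, sizeof(params), 0);
	if (ret != BPF_FIB_LKUP_RET_SUCCESS)
		return -1; /* the real code maps this to a drop reason */

	/* Use the L2 addresses chosen by the FIB lookup. */
	if (bpf_skb_store_bytes(ctx, offsetof(struct ethhdr, h_dest),
				params.dmac, ETH_ALEN, 0) < 0)
		return -1;
	if (bpf_skb_store_bytes(ctx, offsetof(struct ethhdr, h_source),
				params.smac, ETH_ALEN, 0) < 0)
		return -1;

	/* On success the helper fills params.ifindex with the egress device,
	 * so no direct-routing device has to be hard-coded.
	 */
	return bpf_redirect(params.ifindex, 0);
}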
To measure a potential perf regression, I provisioned [2] scripts on
three AWS EC2 c5a.8xlarge machines (10Gbit, Ubuntu 20.04 LTS, 5.11
kernel). The netperf client was sending requests to the LB node which
was forwarding them to the destination node. The only Helm setting for
Cilium was KPR=strict. The results from three runs of each setup are the
following:

---------------------+-----------------+-------------------------
                     | v1.11.1         | v1.11.1 with FIB changes
---------------------+-----------------+-------------------------
TCP_RR MEAN_LATENCY  | 358.74 +- 27.08 | 359.56 +- 30.50
with STDDEV_LATENCY  | 360.28 +- 26.01 | 361.76 +- 26.71
                     | 368.18 +- 28.22 | 365.33 +- 35.83
---------------------+-----------------+-------------------------
As you can see, the increase in latency is negligible. Therefore, this
commit drops the latter flag unconditionally, i.e., without checking
whether there are any non-default IP rules or whether Cilium is running
in a multi-homing setup.
[1]: torvalds/linux@87f5fc7
[2]: https://github.com/brb/cilium-without-netfilter
Fix #18199