
datapath: Change FIB lookups to enable NodePort multihoming #18585

Merged
merged 1 commit into master from pr/brb/bpf-multihoming-v2
Jan 27, 2022

Conversation

brb
Member

@brb brb commented Jan 21, 2022

Previously, the bpf_fib_lookup() helper was invoked in nodeport.h with
the BPF_FIB_LOOKUP_{OUTPUT,DIRECT} flags set. This commit drops both
flags to make NodePort BPF multihoming possible.

The former flag instructed the helper to do the FIB lookup from the egress
perspective, which required us to set the correct egress ifindex for the
lookup. In the NodePort BPF code we used DIRECT_ROUTING_DEV_IFINDEX, which
corresponds to the iface used to communicate between k8s Nodes (i.e., the
iface with the k8s Node InternalIP or ExternalIP). However, relying on
DIRECT_ROUTING_DEV_IFINDEX meant that when more than one iface connects the
nodes, only a single iface could carry the NodePort BPF traffic (forwarded
service requests).

To fix this, we drop BPF_FIB_LOOKUP_OUTPUT and set the ingress iface in the
FIB lookup parameters. The FIB lookup is then done from the perspective of
the device which received the packet (before forwarding to the egress
iface), which eliminates the need for DIRECT_ROUTING_DEV_IFINDEX in the
NodePort BPF code.

The latter flag was meant to speed up FIB lookups. The kernel commit which
introduced it [1] says:

BPF_FIB_LOOKUP_DIRECT to do a lookup that bypasses FIB rules and goes
straight to the table associated with the device (expert setting for
those looking to maximize throughput)

This means that we bypass IP rules. However, cilium-agent installs IP rules
in some setups (e.g., on EKS or with WireGuard). Therefore, to avoid any
potential pitfalls, we drop the latter flag so that the IP rules are
respected.
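
For illustration, below is a minimal sketch of the resulting lookup pattern
(this is not the actual nodeport.h code; only the helper, struct, and flag
names come from the kernel UAPI, everything else is simplified and
IPv4-only):

/* Hedged sketch: FIB lookup done from the perspective of the ingress device,
 * with no BPF_FIB_LOOKUP_* flags, so that IP rules are honoured and the
 * kernel may pick any egress device (multihoming).
 */
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>

#ifndef AF_INET
#define AF_INET 2
#endif

static __always_inline int redirect_via_fib_v4(struct __sk_buff *ctx,
					       __be32 saddr, __be32 daddr)
{
	struct bpf_fib_lookup params = {};
	long ret;

	params.family   = AF_INET;
	params.ifindex  = ctx->ingress_ifindex; /* lookup from the receiving iface */
	params.ipv4_src = saddr;
	params.ipv4_dst = daddr;

	/* flags == 0: neither BPF_FIB_LOOKUP_OUTPUT nor BPF_FIB_LOOKUP_DIRECT */
	ret = bpf_fib_lookup(ctx, &params, sizeof(params), 0);
	if (ret != BPF_FIB_LKUP_RET_SUCCESS)
		return TC_ACT_SHOT; /* no route, no neighbour, helper error, ... */

	/* The helper filled in the egress ifindex and both MAC addresses. */
	if (bpf_skb_store_bytes(ctx, 0, params.dmac, ETH_ALEN, 0) < 0 ||
	    bpf_skb_store_bytes(ctx, ETH_ALEN, params.smac, ETH_ALEN, 0) < 0)
		return TC_ACT_SHOT;

	return bpf_redirect(params.ifindex, 0);
}

With BPF_FIB_LOOKUP_OUTPUT, the same call would instead have required
params.ifindex to be a pre-chosen egress device such as
DIRECT_ROUTING_DEV_IFINDEX.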

To measure a potential perf regression, I used the scripts from [2] on
three AWS EC2 c5a.8xlarge machines (10Gbit, Ubuntu 20.04 LTS, 5.11
kernel). The netperf client was sending requests to the LB node, which
was forwarding them to the destination node. The only Helm setting for
Cilium was KPR=strict. The results from three runs of each setup are the
following:

---------------------+-----------------+-------------------------
                     | v1.11.1         | v1.11.1 with FIB changes
---------------------+-----------------+-------------------------
TCP_RR MEAN_LATENCY  | 358.74 +- 27.08 | 359.56 +- 30.50
with STDDEV_LATENCY  | 360.28 +- 26.01 | 361.76 +- 26.71
                     | 368.18 +- 28.22 | 365.33 +- 35.83
---------------------+-----------------+-------------------------

As you can see, the increase in latency is negligible. Therefore, this
commit drops the latter flag unconditionally, i.e., without checking
whether any non-default IP rules are present or whether Cilium is running
in a multi-homing setup.

[1]: torvalds/linux@87f5fc7
[2]: https://github.com/brb/cilium-without-netfilter

Fix #18199

@brb brb added the sig/datapath (Impacts bpf/ or low-level forwarding details, including map management and monitor messages.), kind/feature (This introduces new functionality.) and release-note/misc (This PR makes changes that have no direct user impact.) labels on Jan 21, 2022
@brb
Member Author

brb commented Jan 21, 2022

/ci-l4lb

@brb
Member Author

brb commented Jan 21, 2022

/test-1.23-net-next

@brb
Member Author

brb commented Jan 22, 2022

Oh wow, on net-next everything except the flaky L7 test has passed. Running one more time to gain more confidence.

@brb
Member Author

brb commented Jan 22, 2022

/test

brb added a commit that referenced this pull request Jan 24, 2022
#18585

Signed-off-by: Martynas Pumputis <m@lambda.lt>
@brb brb mentioned this pull request Jan 24, 2022
2 tasks
@brb
Member Author

brb commented Jan 24, 2022

/test

@brb brb force-pushed the pr/brb/bpf-multihoming-v2 branch 2 times, most recently from 0739392 to d39fe4c on January 25, 2022 20:36
@brb
Member Author

brb commented Jan 25, 2022

/test

@brb brb force-pushed the pr/brb/bpf-multihoming-v2 branch from d39fe4c to 9e6dc66 on January 26, 2022 09:18
@brb brb changed the title from "WIP: datapath: Add multihoming to NodePort BPF" to "datapath: Change FIB lookups to enable NodePort multihoming" on Jan 26, 2022
@brb brb force-pushed the pr/brb/bpf-multihoming-v2 branch from 9e6dc66 to 70ea6c7 on January 26, 2022 09:23
@brb
Member Author

brb commented Jan 26, 2022

Previously, all relevant unquarantined tests passed (net-next https://jenkins.cilium.io/job/Cilium-PR-K8s-1.23-kernel-net-next/313/ hit the known flakes). Dropped the unquarantine CI commit.

@brb brb requested a review from borkmann January 26, 2022 09:24
@brb brb marked this pull request as ready for review January 26, 2022 09:24
@brb brb requested a review from a team January 26, 2022 09:24
@brb brb added this to the 1.12 milestone Jan 26, 2022
@maintainer-s-little-helper maintainer-s-little-helper bot added this to "Needs backport from master" in 1.11.2 on Jan 26, 2022
@brb
Member Author

brb commented Jan 26, 2022

/test

@borkmann
Member

borkmann commented Jan 26, 2022

Therefore, to avoid any potential pitfalls, we drop the latter flag so that the IP rules are respected.

Given we know exactly when we install IP rules in the agent, I would suggest having the agent transparently opt out of BPF_FIB_LOOKUP_DIRECT when we install them, and otherwise not.

Regarding BPF_FIB_LOOKUP_OUTPUT, should we retain it for hair-pinning LB? If the oif is set, only routes (nexthops really) using that device are considered (see fib_lookup_good_nhc()), which makes a big difference (e.g. XDP). (For egress when we do not know the egress device, set the OUTPUT flag and make sure params->ifindex = 0 to allow the lookup to determine nexthop and device. Wouldn't that work given we know at this point that the pkt is not for local delivery anyway..?)

@brb
Member Author

brb commented Jan 26, 2022

Given we know exactly when we install ip rules in the agent, I would suggest we should have the agent transparently opt-out of BPF_FIB_LOOKUP_DIRECT when we install them, and otherwise not.

I think it's not straightforward to determine whether cilium-agent is going to install IP rules. Also, we don't have any control over IP rules users might install. Not sure whether the perf benefits could offset the complexity in the agent and UX.

(For egress when we do not know the egress device, set the OUTPUT flag and make sure params->ifindex = 0 to allow the lookup to determine nexthop and device. Wouldn't that work given we know at this point that the pkt is not for local delivery anyway..?)

Unfortunately, we cannot set params->ifindex to 0, as the lookup will then fail due to the check at https://github.com/torvalds/linux/blob/v5.16/net/core/filter.c#L5394 (my attempt to drop this check was NACK-ed).

fib_params.ipv4_dst = ip4->daddr;

fib_ret = fib_lookup(ctx, &fib_params, sizeof(fib_params), 0);

Member

Ok, lgtm, tiny nit: let's remove the newline above and in the v6 part.

@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge (This PR has passed all tests and received consensus from code owners to merge.) label on Jan 27, 2022
@borkmann
Member

Not sure whether the perf benefits could offset the complexity in the agent and UX.

Ok, fair enough, but agree that if needed we can optimize it later on.

@glibsm glibsm merged commit 96152dc into master Jan 27, 2022
@glibsm glibsm deleted the pr/brb/bpf-multihoming-v2 branch January 27, 2022 16:28
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from "Needs backport from master" to "Backport pending to v1.10" in 1.11.2 on Jan 31, 2022
@joamaki joamaki added the backport-done/1.11 (The backport for Cilium 1.11.x for this PR is done.) label and removed the backport-pending/1.11 label on Feb 8, 2022
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from "Backport pending to v1.10" to "Backport done to v1.11" in 1.11.2 on Feb 8, 2022
ysksuzuki added a commit to ysksuzuki/cilium that referenced this pull request Jun 21, 2022
Currently, neighbor discovery only supports a single device, specified
as the directRoutingDevice. That causes packet drops due to a "FIB lookup
failed" error in a multi-device environment. The BPF Nodeport
implementation no longer relies on the directRoutingDevice since cilium#18585.

The following example illustrates the issue, which occurs in a
multi-device environment. Each node has two devices connected to
different L3 networks (10.69.0.64/26 and 10.69.0.128/26), plus a global
scope address each (10.69.0.1/26 and 10.69.0.2/26). A nexthop from node1
to node2 is either 10.69.0.66 dev eno1 or 10.69.0.130 dev eno2.

+--------------+     +--------------+
|    node1     |     |    node2     |
| 10.69.0.1/26 |     | 10.69.0.2/26 |
|          eno1+-----+eno1          |
|          |   |     |   |          |
| 10.69.0.65/26|     |10.69.0.66/26 |
|              |     |              |
|          eno2+-----+eno2          |
|          |   |     |   |          |
|10.69.0.129/26|     |10.69.0.130/26|
+--------------+     +--------------+

(On node1)
$ ip route show
10.69.0.2
        nexthop via 10.69.0.66 dev eno1 weight 1
        nexthop via 10.69.0.130 dev eno2 weight 1

Assuming the directRoutingDevice is eno1, if the BPF Nodeport implementation
selects 10.69.0.130 dev eno2 as the nexthop when it redirects a packet to
the selected backend on an intermediate node, the redirect will fail with a
"FIB lookup failed" error because there is no neighbor entry for eno2, which
is not the directRoutingDevice. This PR fixes the issue by populating L2
addresses for all target devices.

Fixes: cilium#19908

Signed-off-by: Yusuke Suzuki <yusuke-suzuki@cybozu.co.jp>
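
As a hedged illustration (not Cilium's actual datapath code) of how such a
missing neighbor entry surfaces at the BPF level: the route via eno2 may
resolve fine, but bpf_fib_lookup() returns BPF_FIB_LKUP_RET_NO_NEIGH when
the kernel has no L2 address for the nexthop, so the program cannot fill in
the destination MAC and has to drop (or punt) the packet:

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Classify a bpf_fib_lookup() return value: only full success provides the
 * MAC addresses needed to forward from BPF. NO_NEIGH is the case described
 * above: a usable route exists, but there is no neighbor entry for its
 * nexthop on the chosen device (e.g. 10.69.0.130 via eno2).
 */
static __always_inline int fib_result_forwardable(long ret)
{
	switch (ret) {
	case BPF_FIB_LKUP_RET_SUCCESS:
		return 1;	/* smac/dmac filled in, safe to redirect */
	case BPF_FIB_LKUP_RET_NO_NEIGH:
		return 0;	/* route found, L2 address unknown */
	default:
		return 0;	/* unreachable, blackhole, helper error, ... */
	}
}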
borkmann pushed a commit that referenced this pull request Jun 24, 2022
pchaigno pushed a commit to pchaigno/cilium that referenced this pull request Jun 28, 2022
qmonnet pushed a commit that referenced this pull request Jul 5, 2022
gandro pushed a commit to gandro/cilium that referenced this pull request Aug 4, 2022
dezmodue pushed a commit to dezmodue/cilium that referenced this pull request Aug 10, 2022
Labels
backport-done/1.11 (The backport for Cilium 1.11.x for this PR is done.), kind/feature (This introduces new functionality.), ready-to-merge (This PR has passed all tests and received consensus from code owners to merge.), release-note/misc (This PR makes changes that have no direct user impact.), sig/datapath (Impacts bpf/ or low-level forwarding details, including map management and monitor messages.)
Projects
No open projects
1.11.2
Backport done to v1.11
Development

Successfully merging this pull request may close these issues.

DSR mode is broken
4 participants