New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
cilium: initial xdp-based nodeport implementation #10877
Conversation
test-docs-please |
test-me-please |
b661200
to
44a2e20
Compare
test-me-please |
test-docs-please |
44a2e20
to
3833c49
Compare
test-me-please |
(docs green, travis green, 4.19 green in prior run) |
d0655d0
to
8966b22
Compare
test-me-please |
test-focus K8sDatapathConfig.* |
1 similar comment
test-focus K8sDatapathConfig.* |
Commit 3d9ce09ddd31ef3ca29dccb40270310e5f9f19b2 does not contain "Signed-off-by". Please follow instructions provided in https://docs.cilium.io/en/stable/contributing/development/contributing_guide/#developer-s-certificate-of-origin |
test-focus K8sDatapathConfig.* |
3d9ce09
to
aace60d
Compare
test-focus K8sDatapathConfig.* |
1 similar comment
test-focus K8sDatapathConfig.* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just some minor nits below, this is awesome:)
@@ -292,7 +292,10 @@ function xdp_load() | |||
SEC=$6 | |||
CIDR_MAP=$7 | |||
|
|||
bpf_compile $IN $OUT obj "$OPTS" | |||
NODE_MAC=$(ip link show $DEV | grep ether | awk '{print $2}') | |||
NODE_MAC="{.addr=$(mac2array $NODE_MAC)}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NODE_MAC
is already defined in the endpoint's headers, can we reuse that one instead of extending the init.sh
script? It's easier to debug & introspect that way (not to mention reduces the number of things we'll need to change to eventually get rid of init.sh
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We already have it for bpf_load
, but I agree that generally, it might be good. Perhaps that would also make the code generation easier in Martynas' PR. Noting for follow-up.
xdp_load $XDP_DEV $XDP_MODE "$COPTS" bpf_prefilter.c bpf_prefilter.o from-netdev $CIDR_MAP | ||
COPTS="-DSECLABEL=${ID_WORLD} -DCALLS_MAP=cilium_calls_xdp" | ||
if [ "$NODE_PORT" = "true" ]; then | ||
COPTS="${COPTS} -DLB_L3 -DLB_L4 -DDISABLE_LOOPBACK_LB" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe a follow-up cleanup but I wonder if we should move towards having a dedicated xdp_config.h
header, ideally with as much code sharing with netdev_config.h
as possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, agree, any special casing for XDP should be avoided so we can have better integration with the rest of the code.
|
||
static __always_inline int check_v4_lb(struct __ctx_buff *ctx) | ||
{ | ||
ep_tail_call(ctx, CILIUM_CALL_IPV4_FROM_LXC); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need the tail calls here? (even if we don't, we can always follow up after this PR..)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From inside the nodeport handling we basically recirculate back to here (CILIUM_CALL_IPV4_FROM_LXC). Avoiding the tail call would be nice, but means we somehow need to get rid of the tail call chain inside our nodeport code. I can check if it's feasible in a follow-up.
Implement a minimal ctx_adjust_room() handler so that we can start to use NodePort DSR in XDP. The implementation is currently kept to a very minimum specifically to handle the DSR case (which is the only one in our code base right now). If we need other support in future, we can always extend it at the cost of verifier complexity. Note, the skb-based implementation uses skb->protocol to figure out where to insert room. In XDP, we don't have such facility, but given we do this in the code-base only in two locations, i) from v4 DSR and ii) from v6 DSR, I've used the len_diff as an indicator for the proto here. A bit of a hack, but given it's a constant, it allows for code optimization, can compile out either case and therefore reduce complexity. If we start using this in more locaitons, we need to parse the protocol from the eth header via xdp->data manually. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
187b1f1
to
b518ebe
Compare
Commit b518ebe66956a2af25e8e4df6c6dc85d5358c584 does not contain "Signed-off-by". Please follow instructions provided in https://docs.cilium.io/en/stable/contributing/development/contributing_guide/#developer-s-certificate-of-origin |
test-focus K8sDatapathConfig.* |
b518ebe
to
ef1d263
Compare
test-focus K8sDatapathConfig.* |
dc41469
to
af16d23
Compare
test-focus K8sDatapathConfig.* |
test-focus K8sDatapathConfig.* |
af16d23
to
e3fd29a
Compare
test-focus K8sDatapathConfig.* |
e3fd29a
to
e5be08a
Compare
test-focus K8sDatapathConfig.* |
Rework the XDP prefilter early drop to now also include NodePort service handling. Idea is that after prefilter phase, we can accelerate the case where the NodePort(/ExternalIP) backend is remote and we are forced to push the packet back out again via SNAT or DSR. Instead of first crafting an skb, pushing it through GRO engine, packet taps to eventually end up at tc ingress of the phy device, we can do all this right in the driver's XDP layer instead at a much higher per packet rate. This is the case where XDP can really help to help scaling up the backends. XDP won't be able to help much for the case where the backend is local to the node, and when we're hitting cases where we need to punt to remote backends, then can can do so at max the rate that tc ingress can handle, so moving this down into XDP will improve that limit. At the same time it will also create less burden given unneeded layers are bypassed. All major 10/40G NICs these days support XDP in the upstream kernel. This is also in particular useful e.g. in cloud-based environments with SRIO-V networking or on bare metal machines in general. This work piggy-back on the entire datapath conversion that has been done for Cilium's BPF code in order to be generic for skb and xdp. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
This add the necessary daemon-only parts to enable xdp-based nodeport. We add daemon config flag, so we can opt-into --node-port-acceleration={none, generic,driver} setting in order to enable NodePort via XDP. The default is 'none' which will then use the regular tc ingress based mechanism. So far it's opt-in, and at some point we can think about integrating it into the kube-proxy-free probe mode. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
e5be08a
to
8750cec
Compare
Add an initial paragraph on configuring nodeport XDP acceleration to the kube-proxy-free guide as well as Helm support. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
8750cec
to
65a1351
Compare
test-me-please |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good overall, discussed that we can follow up on the remaining items.
option.Config.NodePortAcceleration != option.NodePortAccelerationNone { | ||
if option.Config.XDPDevice != "undefined" && | ||
option.Config.XDPDevice != option.Config.Device { | ||
log.Fatalf("Cannot set NodePort acceleration device: mismatch between Prefilter device %s and NodePort device %s", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One extra fixup later, the check here is option.Config.Device
but the message says "Prefilter" (which I would expect to be option.Config.DevicePreFilter
).
See commits.