New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
High-scale IPCache: Nodeport LB support Part 2 #25854
Conversation
/test |
cd1b120
to
73019fc
Compare
…MODE The high-scale IP cache mode requires native routing. Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
When the TC path load-balances a request in the from-overlay program, the original SrcSecID (as extracted from the encap header) gets passed to nodeport_lb4() via the `src_identity` parameter. Pass this through to the DSR path, so that we can re-insert it into the fresh encapsulation. This preserves the client's SrcSecID all the way to the backend. Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
For in-cluster access to services, we require Geneve encapsulation to transport the source's security identity to the backend. We haven't looked at enabling this in the Nodeport NAT path, so require full-DSR mode. Also reflect this in the relevant BPF compile tests: - add ENABLE_DSR and DSR_ENCAP_GENEVE - remove ENABLE_DSR_HYBRID Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
Pods connecting to a remote service backend must consider that the LB needs to insert a GENEVE-DSR option. In practice this is only relevant for UDP packets, as for TCP we have the optimization to only insert the option into the SYN packet. Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
Add minimal support for inserting the GENEVE DSR IPv4 option into an existing GENEVE header. Note that this currently only supports packets that have fixed-length encap headers (ie. no IP or GENEVE options). Callers are expected to adjust the outer headers as needed (eg. update the length field in the UDP header). Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
nodeport_lb*() currently assumes that the L3 header is located inside the packet at offset ETH_HLEN. To enable load-balancing for the inner packet of an encapsulated connection, parameterize this offset and the actual ip4 pointer. For some programs it's an actual win to just pass through the ip4 pointer, as they have already validated it. For now only touch the IPv4 path to avoid churn. IPv6 will follow when needed. Also add variants of the L2 / L3 validation helpers that are able to deal with variable offsets. Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
Enable the XDP Loadbalancer to handle GENEVE-encapsulated service requests. Such traffic would usually get passed up to TC, have its encapsulation stripped off and then get load-balanced in the from-overlay program. But in combination with DSR_ENCAP_GENEVE, we can preserve the encap headers (and manually adjust them as needed), insert the DSR option (if needed) and redirect the packet to its backend right away. In detail: - the XDP program detects GENEVE encapsulation, and determines the l3_off for the inner packet. - nodeport_lb4() then transparently load-balances the inner packet, and passes the l3_off to the DSR stage. - the DSR stage for XDP has all the smarts to adjust the encapsulation headers, and insert the DSR option if necessary. The GENEVE detection in bpf_xdp is fairly minimalistic now. But this mostly matches the restrictions we have on the ctx_set_tunnel_opt() side for XDP. Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
73019fc
to
ebc6651
Compare
/test |
(the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Total noob as to sig-datapath changes, I'll try and find someone else to review. Other owned changes LGTM.
} | ||
} | ||
|
||
out: | ||
if (IS_ERR(ret)) | ||
return send_drop_notify_error_ext(ctx, 0, ret, ext_err, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: this will now send _ext
err if the tail call failed. Does that matter?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's fine. In the end this is for fine-granular debugging, so we would know whether ext_err
is even relevant for the reported DROP reason.
return true; | ||
} | ||
|
||
static __always_inline __maybe_unused bool | ||
__revalidate_data_pull(struct __ctx_buff *ctx, void **data, void **data_end, | ||
void **l3, const __u32 l3_len, const bool pull) | ||
void **l3, const __u32 l3_off, const __u32 l3_len, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: feels more conventional if l3
and l3_len
arent split in args.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤷 either way feels fine to me
struct genevehdr geneve; | ||
|
||
/* add free space after GENEVE header: */ | ||
if (ctx_adjust_hroom(ctx, opt_len, BPF_ADJ_ROOM_MAC, ctx_adjust_hroom_flags()) < 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Noob question: this doesn't take geneve_off
so how does the function know where to add space?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See the corresponding change in xdp.h
's ctx_adjust_hroom()
. It's completely custom and fragile - we just "know" how to behave for specific hroom
values.
if (ctx_load_bytes(ctx, geneve_off, &geneve, sizeof(geneve)) < 0) | ||
return DROP_INVALID; | ||
|
||
geneve.opt_len += (__u8)(opt_len >> 2); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess opt_len must be a multiple of 4 or something. Can we build_bug_on
or similar? Or document that somewhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To a certain extent I would hope that callers know what data format they need to follow for such an option (as it also needs to work for the TC variant of set_tunnel_opt()
), and get to keep the pieces if they don't.
But yep, a build_bug() should be doable - will queue that up as follow-on, thanks!
@@ -101,7 +105,7 @@ type Configuration struct { | |||
// specified, otherwise it will be automatically detected. if encapEnabled is | |||
// true, the MTU is adjusted to account for encapsulation overhead for all | |||
// routes involved in node to node communication. | |||
func NewConfiguration(authKeySize int, encryptEnabled bool, encapEnabled bool, wireguardEnabled bool, mtu int, mtuDetectIP net.IP) Configuration { | |||
func NewConfiguration(authKeySize int, encryptEnabled bool, encapEnabled bool, wireguardEnabled bool, hsIpcacheDSRenabled bool, mtu int, mtuDetectIP net.IP) Configuration { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Drive by comment: this really ought to take a struct. All of the bools makes this unreadable at callsites.
Nit: capitalization of hsIpcacheDSRenabled
seems wonky.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No disagreement from my side, this has grown way out of proportion 😬 .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My codeowners looks good to me. Seconding the proposal to refactor the MTU constructor at some point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All nits, LGTM.
@@ -32,6 +32,10 @@ const ( | |||
// Total extra bytes: 50B | |||
TunnelOverhead = 50 | |||
|
|||
// DsrTunnelOverhead is about the GENEVE DSR option that gets inserted | |||
// by the LB, when addressing a Service in hs-ipcache mode | |||
DsrTunnelOverhead = 12 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why this name?
If the comments immediately point out that this value is used for "GENEVE DSR" then why not make the value also contain this detail? Seems pretty specific to Geneve.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 probably just leftovers from the initial prototype, and missed while polishing.
@@ -101,7 +105,7 @@ type Configuration struct { | |||
// specified, otherwise it will be automatically detected. if encapEnabled is | |||
// true, the MTU is adjusted to account for encapsulation overhead for all | |||
// routes involved in node to node communication. | |||
func NewConfiguration(authKeySize int, encryptEnabled bool, encapEnabled bool, wireguardEnabled bool, mtu int, mtuDetectIP net.IP) Configuration { | |||
func NewConfiguration(authKeySize int, encryptEnabled bool, encapEnabled bool, wireguardEnabled bool, hsIpcacheDSRenabled bool, mtu int, mtuDetectIP net.IP) Configuration { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a nit but:
s/hsIpcacheDSRenabled/hsIPCacheDSREnabled
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🐫
@@ -150,12 +150,13 @@ union v6addr { | |||
__u8 addr[16]; | |||
} __packed; | |||
|
|||
static __always_inline bool validate_ethertype(struct __ctx_buff *ctx, | |||
__u16 *proto) | |||
static __always_inline bool validate_ethertype_l2_off(struct __ctx_buff *ctx, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's your thoughts on taking and optionally filling a pointer to an ethhdr
here, akin to how we fill in layer 3 headers in revalidate_data
.
Could be handy if we quickly need to inspect l2 attributes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have any paths that need more than just the ethertype today?
Reviews are in, tests are green (with one exception for #25854 (comment)) -> ready to merge. @dylandreimerink could you please do the honors? 🙏 |
What is the deal with the |
@dylandreimerink @julianwiedmann Why was this merged when it is known to break Coccicheck? Coccicheck is a Required CI job. |
It was a false-positive for this change, and we already had plenty of existing code paths that follow the same pattern. Why is coccicheck now complaining about this specific part, but the 20 others are fine? 😠 Either way, #25924. |
I understand all that, but it's still a CI breakage. |
ack, sorry. The expectation was that it would only check the actual patch diff 😞 |
Ah, I see 👍 Only checkpatch works like that today, which is part of the reason why checkpatch is not marked Required. |
Continuing from #25745, this PR is in the context of the high-scale ipcache feature described at cilium/design-cfps#7.
It enables the nodeport LB to handle an GENEVE-encapsulated service request (from inside the clustermesh), and forward the request with the client's original security identity to the backend using GENEVE-DSR. It also adds full XDP passthrough support for this feature.
As we want to support in-clustermesh traffic to dedicated LB nodes (where the service translation happens in the LB, and not at the source), reduce the MTU sufficiently to tolerate the GENEVE DSR option that gets added by the LB.
In detail without XDP:
from-netdev
. Gets manually decapsulated, and redirected tocilium_geneve
.from-overlay
,nodeport_lb4()
load-balances the inner packet. Tail-calls to the DSR egress path, and propagates the client's security identity along (that we previously obtained from the tunnel encapsulation).And with XDP:
nodeport_lb4()
now operates on the inner packet (service lookup, DNAT etc).