
High-scale IPCache: Nodeport LB support Part 2 #25854

Merged
merged 7 commits into cilium:main on Jun 6, 2023

Conversation

julianwiedmann
Member

@julianwiedmann julianwiedmann commented Jun 2, 2023

Continuing from #25745, this PR is in the context of the high-scale ipcache feature described at cilium/design-cfps#7.

It enables the nodeport LB to handle a GENEVE-encapsulated service request (from inside the clustermesh), and forward the request to the backend via GENEVE-DSR while preserving the client's original security identity. It also adds full XDP passthrough support for this feature.

As we want to support in-clustermesh traffic to dedicated LB nodes (where the service translation happens in the LB, and not at the source), the MTU is reduced sufficiently to tolerate the GENEVE DSR option that gets added by the LB.

In detail without XDP:

  • encapsulated packet enters from-netdev. Gets manually decapsulated, and redirected to cilium_geneve.
  • in from-overlay, nodeport_lb4() load-balances the inner packet. Tail-calls to the DSR egress path, and propagates the client's security identity along (that we previously obtained from the tunnel encapsulation).
  • The DSR-GENEVE egress code adds fresh encapsulation, and inserts the client's security identity here (instead of WORLD_ID).
  • the load-balanced request is forwarded to the backend node.

And with XDP:

  • as the packet enters the XDP program, we detect that it's GENEVE-encapsulated and locate the inner IPv4 header.
  • nodeport_lb4() now operates on the inner packet (service lookup, DNAT etc).
  • The DSR-GENEVE egress code learns how to update the existing encapsulation headers, and punches the DSR GENEVE option into the right location.
  • the load-balanced request is forwarded to the backend node from inside XDP.
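The offset computation behind the XDP flow above can be sketched as follows. This is a userspace illustration, not Cilium's actual code: the struct mirrors the on-wire GENEVE base header from RFC 8926, and the function name is hypothetical.

```c
#include <stdint.h>

/* Illustrative GENEVE base header layout (RFC 8926); Cilium has its own
 * struct genevehdr in the BPF tree, this just mirrors the wire format. */
struct geneve_base {
    uint8_t  opt_len:6, ver:2;   /* opt_len counted in 4-byte multiples */
    uint8_t  rsvd:6, critical:1, oam:1;
    uint16_t protocol_type;      /* e.g. ETH_P_TEB for an inner Ethernet frame */
    uint8_t  vni[3];
    uint8_t  reserved;
};

#define ETH_HLEN  14
#define IPV4_HLEN 20  /* outer IPv4 header without options */
#define UDP_HLEN   8

/* Offset of the inner L3 header, as the XDP program must derive it before
 * nodeport_lb4() can load-balance the inner packet. */
static inline int inner_l3_off(uint8_t geneve_opt_len)
{
    return ETH_HLEN + IPV4_HLEN + UDP_HLEN +
           (int)sizeof(struct geneve_base) + geneve_opt_len * 4 +
           ETH_HLEN; /* inner Ethernet header */
}
```

With no GENEVE options the inner IPv4 header starts at byte 64; every option word pushes it out by another 4 bytes.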
Add support for load-balancing encapsulated requests in a configuration with high-scale ipcache.

@julianwiedmann julianwiedmann added sig/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages. kind/feature This introduces new functionality. release-note/minor This PR changes functionality that users may find relevant to operating Cilium. feature/high-scale-ipcache Relates to the high-scale ipcache feature. labels Jun 2, 2023
@julianwiedmann julianwiedmann added dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. feature/lb-only Impacts cilium running in lb-only datapath mode labels Jun 2, 2023
@julianwiedmann julianwiedmann added dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. area/loadbalancing Impacts load-balancing and Kubernetes service implementations labels Jun 2, 2023
@julianwiedmann julianwiedmann added the release-blocker/1.14 This issue will prevent the release of the next version of Cilium. label Jun 2, 2023
@julianwiedmann
Member Author

/test

@julianwiedmann julianwiedmann force-pushed the 1.14-hs-ipcache branch 2 times, most recently from cd1b120 to 73019fc Compare June 2, 2023 08:53
…MODE

The high-scale IP cache mode requires native routing.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
When the TC path load-balances a request in the from-overlay program, the
original SrcSecID (as extracted from the encap header) gets passed to
nodeport_lb4() via the `src_identity` parameter.

Pass this through to the DSR path, so that we can re-insert it into the
fresh encapsulation. This preserves the client's SrcSecID all the way to
the backend.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
For in-cluster access to services, we require Geneve encapsulation to
transport the source's security identity to the backend.

We haven't looked at enabling this in the Nodeport NAT path, so require
full-DSR mode.

Also reflect this in the relevant BPF compile tests:
- add ENABLE_DSR and DSR_ENCAP_GENEVE
- remove ENABLE_DSR_HYBRID

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
Pods connecting to a remote service backend must consider that the LB needs
to insert a GENEVE-DSR option.

In practice this is only relevant for UDP packets, as for TCP we have the
optimization to only insert the option into the SYN packet.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
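The MTU adjustment described in this commit can be sketched with the two overhead constants that appear in this PR's diff (TunnelOverhead = 50, DsrTunnelOverhead = 12). The function name and signature are illustrative, not Cilium's mtu package API.

```c
#define TUNNEL_OVERHEAD     50 /* GENEVE: outer IPv4 + UDP + GENEVE + inner ETH */
#define DSR_TUNNEL_OVERHEAD 12 /* GENEVE DSR option inserted by the LB */

/* Route MTU for pods whose service requests may pass through a dedicated
 * LB node that punches in a GENEVE DSR option. */
static inline int route_mtu(int device_mtu, int hs_ipcache_dsr_enabled)
{
    int mtu = device_mtu - TUNNEL_OVERHEAD;

    /* Leave room for the DSR option added by the LB. In practice only UDP
     * flows need this headroom; for TCP the option rides on the SYN only. */
    if (hs_ipcache_dsr_enabled)
        mtu -= DSR_TUNNEL_OVERHEAD;
    return mtu;
}
```

So on a standard 1500-byte device, pods see a 1438-byte route MTU when the feature is enabled, rather than the usual 1450 bytes for plain GENEVE.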
Add minimal support for inserting the GENEVE DSR IPv4 option into an
existing GENEVE header. Note that this currently only supports packets that
have fixed-length encap headers (i.e. no IP or GENEVE options).

Callers are expected to adjust the outer headers as needed (e.g. update the
length field in the UDP header).

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
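The length bookkeeping this commit describes can be sketched as below. All names are illustrative: the real BPF code manipulates the packet via ctx_load_bytes()/ctx_store_bytes() and handles network byte order, which this userspace sketch omits.

```c
#include <stdint.h>

/* Outer header length fields that must grow when an opt_len-byte GENEVE
 * option is inserted (values in host byte order for illustration). */
struct hdr_lens {
    uint16_t ip_tot_len;     /* outer IPv4 total length */
    uint16_t udp_len;        /* outer UDP length */
    uint8_t  geneve_opt_len; /* GENEVE opt_len, in 4-byte multiples */
};

static inline void account_geneve_opt(struct hdr_lens *h, uint32_t opt_len)
{
    /* GENEVE counts options in 4-byte words, so opt_len must be a
     * multiple of 4 -- hence the build-time check suggested in review. */
    h->geneve_opt_len += (uint8_t)(opt_len >> 2);
    h->udp_len        += (uint16_t)opt_len;
    h->ip_tot_len     += (uint16_t)opt_len;
}
```

A 12-byte DSR option thus bumps the GENEVE opt_len by 3 words and both outer length fields by 12 bytes.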
nodeport_lb*() currently assumes that the L3 header is located inside the
packet at offset ETH_HLEN. To enable load-balancing for the inner packet
of an encapsulated connection, parameterize this offset and the actual
ip4 pointer. For some programs it's an actual win to just pass through the
ip4 pointer, as they have already validated it.

For now only touch the IPv4 path to avoid churn. IPv6 will follow when
needed.

Also add variants of the L2 / L3 validation helpers that are able to deal
with variable offsets.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
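The variable-offset revalidation pattern this commit introduces can be sketched in userspace as follows; in BPF the same comparison runs against ctx->data and ctx->data_end, and the function name here is illustrative.

```c
#include <stdbool.h>
#include <stdint.h>

/* Instead of assuming the L3 header sits at ETH_HLEN, the caller passes
 * l3_off, so the same helper works for the inner header of an
 * encapsulated packet. Returns false if the header would exceed the
 * packet bounds. */
static inline bool
revalidate_data_l3_off(const uint8_t *data, const uint8_t *data_end,
                       const uint8_t **l3, uint32_t l3_off, uint32_t l3_len)
{
    /* Compare lengths rather than raw pointers to avoid forming an
     * out-of-bounds pointer. */
    if (l3_off + l3_len > (uint32_t)(data_end - data))
        return false;
    *l3 = data + l3_off;
    return true;
}
```

Programs that already validated the header can simply pass their existing ip4 pointer through, skipping the re-check entirely.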
Enable the XDP Loadbalancer to handle GENEVE-encapsulated service requests.
Such traffic would usually get passed up to TC, have its encapsulation
stripped off and then get load-balanced in the from-overlay program.

But in combination with DSR_ENCAP_GENEVE, we can preserve the encap headers
(and manually adjust them as needed), insert the DSR option (if needed) and
redirect the packet to its backend right away.

In detail:
- the XDP program detects GENEVE encapsulation, and determines the l3_off
  for the inner packet.
- nodeport_lb4() then transparently load-balances the inner packet, and
  passes the l3_off to the DSR stage.
- the DSR stage for XDP has all the smarts to adjust the encapsulation
  headers, and insert the DSR option if necessary.

The GENEVE detection in bpf_xdp is fairly minimalistic now. But this
mostly matches the restrictions we have on the ctx_set_tunnel_opt() side
for XDP.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
@julianwiedmann
Member Author

/test

@julianwiedmann julianwiedmann marked this pull request as ready for review June 2, 2023 13:04
@julianwiedmann julianwiedmann requested review from a team as code owners June 2, 2023 13:04
@julianwiedmann
Member Author

(the coccicheck warning is a false-positive. We merely raise the DROP_MISSED_TAIL_CALL drop notification a bit later in the code).

Contributor

@lmb lmb left a comment

Total noob as to sig-datapath changes, I'll try and find someone else to review. Other owned changes LGTM.

}
}

out:
if (IS_ERR(ret))
return send_drop_notify_error_ext(ctx, 0, ret, ext_err,
Contributor

Nit: this will now send _ext err if the tail call failed. Does that matter?

Member Author

That's fine. In the end this is for fine-granular debugging, so we would know whether ext_err is even relevant for the reported DROP reason.

return true;
}

static __always_inline __maybe_unused bool
__revalidate_data_pull(struct __ctx_buff *ctx, void **data, void **data_end,
void **l3, const __u32 l3_len, const bool pull)
void **l3, const __u32 l3_off, const __u32 l3_len,
Contributor

Nit: feels more conventional if l3 and l3_len aren't split in args.

Member Author

🤷 either way feels fine to me

struct genevehdr geneve;

/* add free space after GENEVE header: */
if (ctx_adjust_hroom(ctx, opt_len, BPF_ADJ_ROOM_MAC, ctx_adjust_hroom_flags()) < 0)
Contributor

Noob question: this doesn't take geneve_off so how does the function know where to add space?

Member Author

See the corresponding change in xdp.h's ctx_adjust_hroom(). It's completely custom and fragile - we just "know" how to behave for specific hroom values.

if (ctx_load_bytes(ctx, geneve_off, &geneve, sizeof(geneve)) < 0)
return DROP_INVALID;

geneve.opt_len += (__u8)(opt_len >> 2);
Contributor

I guess opt_len must be a multiple of 4 or something. Can we build_bug_on or similar? Or document that somewhere.

Member Author

To a certain extent I would hope that callers know what data format they need to follow for such an option (as it also needs to work for the TC variant of set_tunnel_opt()), and get to keep the pieces if they don't.

But yep, a build_bug() should be doable - will queue that up as follow-on, thanks!
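The queued-up build-time check could look roughly like this. The sketch uses C11 _Static_assert (the BPF code would use a build_bug_on()-style macro instead), and the geneve_dsr_opt4 layout shown is illustrative: a 4-byte option header plus IPv4 address and port, padded to 12 bytes.

```c
#include <stdint.h>

/* Hypothetical GENEVE DSR IPv4 option: option header, service address,
 * service port, padded to a 4-byte boundary (12 bytes total). */
struct geneve_dsr_opt4 {
    uint16_t opt_class;
    uint8_t  type;
    uint8_t  rsvd_len;  /* low bits: option length in 4-byte multiples */
    uint32_t addr;      /* service IPv4 address */
    uint16_t port;      /* service port */
    uint16_t pad;
};

/* opt_len >> 2 silently drops bits unless the option size is 4-byte
 * aligned, so catch a misaligned struct at compile time. */
_Static_assert(sizeof(struct geneve_dsr_opt4) % 4 == 0,
               "GENEVE options are counted in 4-byte words");
```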

@@ -101,7 +105,7 @@ type Configuration struct {
// specified, otherwise it will be automatically detected. if encapEnabled is
// true, the MTU is adjusted to account for encapsulation overhead for all
// routes involved in node to node communication.
func NewConfiguration(authKeySize int, encryptEnabled bool, encapEnabled bool, wireguardEnabled bool, mtu int, mtuDetectIP net.IP) Configuration {
func NewConfiguration(authKeySize int, encryptEnabled bool, encapEnabled bool, wireguardEnabled bool, hsIpcacheDSRenabled bool, mtu int, mtuDetectIP net.IP) Configuration {
Contributor

Drive by comment: this really ought to take a struct. All of the bools makes this unreadable at callsites.

Nit: capitalization of hsIpcacheDSRenabled seems wonky.

Member Author

No disagreement from my side, this has grown way out of proportion 😬 .

@julianwiedmann
Member Author

Total noob as to sig-datapath changes, I'll try and find someone else to review. Other owned changes LGTM.

Thank you! Hopefully @ldelossa and/or @borkmann can have a look :).

Member

@gandro gandro left a comment

My codeowners looks good to me. Seconding the proposal to refactor the MTU constructor at some point.

Contributor

@ldelossa ldelossa left a comment

All nits, LGTM.

@@ -32,6 +32,10 @@ const (
// Total extra bytes: 50B
TunnelOverhead = 50

// DsrTunnelOverhead is about the GENEVE DSR option that gets inserted
// by the LB, when addressing a Service in hs-ipcache mode
DsrTunnelOverhead = 12
Contributor

Why this name?

If the comments immediately point out that this value is used for "GENEVE DSR" then why not make the value also contain this detail? Seems pretty specific to Geneve.

Member Author

👍 probably just leftovers from the initial prototype, and missed while polishing.

@@ -101,7 +105,7 @@ type Configuration struct {
// specified, otherwise it will be automatically detected. if encapEnabled is
// true, the MTU is adjusted to account for encapsulation overhead for all
// routes involved in node to node communication.
func NewConfiguration(authKeySize int, encryptEnabled bool, encapEnabled bool, wireguardEnabled bool, mtu int, mtuDetectIP net.IP) Configuration {
func NewConfiguration(authKeySize int, encryptEnabled bool, encapEnabled bool, wireguardEnabled bool, hsIpcacheDSRenabled bool, mtu int, mtuDetectIP net.IP) Configuration {
Contributor

Just a nit but:

s/hsIpcacheDSRenabled/hsIPCacheDSREnabled

Member Author

🐫

@@ -150,12 +150,13 @@ union v6addr {
__u8 addr[16];
} __packed;

static __always_inline bool validate_ethertype(struct __ctx_buff *ctx,
__u16 *proto)
static __always_inline bool validate_ethertype_l2_off(struct __ctx_buff *ctx,
Contributor

What's your thoughts on taking and optionally filling a pointer to an ethhdr here, akin to how we fill in layer 3 headers in revalidate_data.

Could be handy if we quickly need to inspect l2 attributes?

Member Author

Do we have any paths that need more than just the ethertype today?

@julianwiedmann julianwiedmann added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Jun 6, 2023
@julianwiedmann
Member Author

Reviews are in, tests are green (with one exception for #25854 (comment)) -> ready to merge.

@dylandreimerink could you please do the honors? 🙏

@dylandreimerink
Member

dylandreimerink commented Jun 6, 2023

What is the deal with the file ./bpf_xdp.c: DROP_MISSED_TAIL_CALL missing after tail call on line 189 error thrown by coccicheck? Seems like you are breaking a custom rule https://github.com/cilium/cilium/blob/main/contrib/coccinelle/tail_calls.cocci#L102

@dylandreimerink dylandreimerink merged commit ba78c61 into cilium:main Jun 6, 2023
60 of 61 checks passed
@julianwiedmann julianwiedmann deleted the 1.14-hs-ipcache branch June 6, 2023 07:58
@pchaigno
Member

pchaigno commented Jun 6, 2023

@dylandreimerink @julianwiedmann Why was this merged when it is known to break Coccicheck? Coccicheck is a Required CI job.

@julianwiedmann
Member Author

@dylandreimerink @julianwiedmann Why was this merged when it is known to break Coccicheck? Coccicheck is a Required CI job.

It was a false-positive for this change, and we already had plenty of existing code paths that follow the same pattern. Why is coccicheck now complaining about this specific part, but the 20 others are fine? 😠

Either way, #25924.

@pchaigno
Member

pchaigno commented Jun 6, 2023

I understand all that, but it's still a CI breakage.

@julianwiedmann
Member Author

I understand all that, but it's still a CI breakage.

ack, sorry. The expectation was that it would only check the actual patch diff 😞

@pchaigno
Member

pchaigno commented Jun 6, 2023

Ah, I see 👍 Only checkpatch works like that today, which is part of the reason why checkpatch is not marked Required.
