High-scale IPCache: Nodeport LB support Part 1 #25745

Merged
julianwiedmann merged 6 commits into cilium:main from 1.14-hsipcache-part1 on Jun 1, 2023

julianwiedmann
Member

@julianwiedmann julianwiedmann commented May 29, 2023

This PR is in the context of the high-scale ipcache feature described at cilium/design-cfps#7.

It enables the nodeport LB to handle an unencapsulated service request, and forward the request to the backend using GENEVE-DSR. It also adds handling for reply traffic.

In detail:

  • the request enters the LB in from-netdev (either XDP or TC), and nodeport_lb4() selects a backend.
  • the DNATed packet goes down the DSR egress code path, and has GENEVE encapsulation added (+ DSR option as needed). In the context of hs-ipcache, we use the backend's IP address as OuterDstIP (same as if it was a pod-to-pod connection). If the load-balancing is done in XDP, we punt up to TC for adding the tunnel headers.
  • the packet is sent to the backend node
  • at the backend node, the from-netdev program strips off the encapsulation and redirects it to from-overlay. We manually transfer the DSR info across this redirect. The from-overlay program processes the DSR info and creates a corresponding SNAT entry.
  • replies are revDNATed in to-overlay (when they go to a destination inside the Clustermesh), or to-netdev (when the client belongs to one of the configured WorldCIDRs).

Add support for load-balancing unencapsulated requests in a configuration with high-scale ipcache.
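The reply handling in the list above depends on whether the client address falls into a configured WorldCIDR. A minimal userspace sketch of that decision follows. It is not the datapath code: the `cidr4` struct, the `reply_hook` enum and both function names are hypothetical, and addresses are assumed to be in host byte order.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative model only: these types and names are hypothetical and
 * are not taken from the Cilium source. Addresses are host byte order. */
struct cidr4 {
	uint32_t addr;       /* network address */
	uint8_t  prefix_len; /* 0..32 */
};

enum reply_hook {
	REPLY_TO_OVERLAY, /* encapsulated reply, revDNAT in to-overlay */
	REPLY_TO_NETDEV,  /* native reply, revDNAT in to-netdev */
};

/* Return 1 if ip falls inside the CIDR, 0 otherwise. */
static int cidr4_contains(const struct cidr4 *c, uint32_t ip)
{
	uint32_t mask;

	if (c->prefix_len > 32)
		return 0;
	/* A shift by 32 is undefined in C, so handle /0 explicitly. */
	mask = c->prefix_len ? UINT32_MAX << (32 - c->prefix_len) : 0;
	return (ip & mask) == (c->addr & mask);
}

/* Pick the hook that would revDNAT a reply towards client_ip. */
static enum reply_hook reply_revdnat_hook(uint32_t client_ip,
					  const struct cidr4 *world, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (cidr4_contains(&world[i], client_ip))
			return REPLY_TO_NETDEV;
	return REPLY_TO_OVERLAY;
}
```

With a single WorldCIDR of 198.51.100.0/24, a client at 198.51.100.7 would map to `REPLY_TO_NETDEV`, while 10.0.0.1 would map to `REPLY_TO_OVERLAY`.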

@julianwiedmann julianwiedmann added sig/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages. release-note/minor This PR changes functionality that users may find relevant to operating Cilium. feature/high-scale-ipcache Relates to the high-scale ipcache feature. labels May 29, 2023
@julianwiedmann julianwiedmann changed the title High-scale IPCache: Nodeport support Part 1 High-scale IPCache: Nodeport LB support Part 1 May 29, 2023
@julianwiedmann
Member Author

julianwiedmann commented May 30, 2023

/test

Job 'Cilium-PR-K8s-1.26-kernel-net-next' failed:


Test Name

K8sAgentPolicyTest Basic Test Traffic redirections to proxy Tests proxy visibility interactions with policy lifecycle operations

Failure Output

FAIL: Failed to start hubble observe

Jenkins URL: https://jenkins.cilium.io/job/Cilium-PR-K8s-1.26-kernel-net-next/279/

If it is a flake and a GitHub issue doesn't already exist to track it, comment /mlh new-flake Cilium-PR-K8s-1.26-kernel-net-next so I can create one.

Then please upload the Jenkins artifacts to that issue.

@julianwiedmann julianwiedmann added kind/feature This introduces new functionality. feature/lb-only Impacts cilium running in lb-only datapath mode labels May 30, 2023
@julianwiedmann julianwiedmann marked this pull request as ready for review May 30, 2023 10:37
@julianwiedmann julianwiedmann requested a review from a team as a code owner May 30, 2023 10:37
Contributor

@bleggett bleggett left a comment

LGTM

The custom decap code for high-scale ipcache in from-netdev transfers
the source's sec_identity from the tunnel header into CB_SRC_LABEL.

But from-overlay initializes its src_sec_identity variable with 0, and only
loads from CB_SRC_LABEL inside handle_ipv4(). In a config with BPF
masquerading, the packet passes through nodeport_lb4() first - which stores
the passed-in src_sec_identity (== 0) into CB_SRC_LABEL, checks for revSNAT
and then tail-calls back to the start of the IPv4 from-overlay path.
Thus we currently lose the sec_identity for e.g. pod-to-pod connections when
BPF masquerading is enabled.

Align this more closely with how bpf_host works: load from
CB_SRC_LABEL at the beginning of the tail-call, and clear it. This way we
can feed the src_sec_identity into the call to nodeport_lb4(), where it
then gets restored to CB_SRC_LABEL before tail-calling back.

If needed, l3_local_delivery() will subsequently fill CB_SRC_LABEL again
before redirecting to the local endpoint.
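
The difference between the old and the new entry logic can be illustrated with a small userspace model. This is only a sketch: `model_ctx`, `load_and_clear()`, `model_nodeport_lb4()`, the slot index and both `from_overlay_*` functions are hypothetical stand-ins, and the tail-call back into the IPv4 path is reduced to reading the slot again.

```c
#include <stdint.h>

/* Userspace model, not the BPF code. The slot index and helper names
 * are hypothetical. */
enum { CB_SRC_LABEL = 0, CB_SLOTS = 5 };

struct model_ctx {
	uint32_t cb[CB_SLOTS];
};

/* Read a cb slot and reset it to 0 in one step. */
static uint32_t load_and_clear(struct model_ctx *ctx, int slot)
{
	uint32_t val = ctx->cb[slot];

	ctx->cb[slot] = 0;
	return val;
}

/* Stand-in for nodeport_lb4(): it stores the identity it was given
 * into CB_SRC_LABEL before tail-calling back. */
static void model_nodeport_lb4(struct model_ctx *ctx, uint32_t src_sec_identity)
{
	ctx->cb[CB_SRC_LABEL] = src_sec_identity;
}

/* Old behaviour: the local variable starts at 0, so the identity that
 * from-netdev left in CB_SRC_LABEL is overwritten. */
static uint32_t from_overlay_before(struct model_ctx *ctx)
{
	uint32_t src_sec_identity = 0;

	model_nodeport_lb4(ctx, src_sec_identity);
	return ctx->cb[CB_SRC_LABEL]; /* what handle_ipv4() would load later */
}

/* New behaviour: load and clear first, then pass the value along. */
static uint32_t from_overlay_after(struct model_ctx *ctx)
{
	uint32_t src_sec_identity = load_and_clear(ctx, CB_SRC_LABEL);

	model_nodeport_lb4(ctx, src_sec_identity);
	return ctx->cb[CB_SRC_LABEL];
}
```

In this model, a context that enters with identity 12345 in `CB_SRC_LABEL` leaves `from_overlay_before()` with 0 and `from_overlay_after()` with 12345.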

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>

Determine the tunnel endpoint earlier, and initialize the GENEVE option
struct earlier.

This is just prep work for a subsequent patch, no functional change.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>

When using the nodeport LB in GENEVE-DSR mode with hs-ipcache, don't rely
on the ipcache to select the GENEVE tunnel endpoint for the selected
backend.

Use the InnerDstIP (== backend IP) instead, same as we do for pod-to-pod
traffic.
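
As a rough illustration of the selection rule described here, the following userspace sketch picks the outer destination for a GENEVE-DSR request. The struct, the function name and its parameters are hypothetical; the real datapath decides this at compile time through its configuration rather than through a runtime flag.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the addresses used for one forwarded request. */
struct geneve_dsr_target {
	uint32_t inner_dst_ip; /* backend IP after DNAT */
	uint32_t outer_dst_ip; /* tunnel endpoint */
};

/* With hs-ipcache, the backend IP doubles as the tunnel endpoint.
 * Otherwise the endpoint comes from an ipcache lookup, which is passed
 * in here as ipcache_endpoint to keep the sketch self-contained. */
static struct geneve_dsr_target
select_dsr_target(bool hs_ipcache, uint32_t backend_ip, uint32_t ipcache_endpoint)
{
	struct geneve_dsr_target t = {
		.inner_dst_ip = backend_ip,
		.outer_dst_ip = hs_ipcache ? backend_ip : ipcache_endpoint,
	};

	return t;
}
```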

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>

Prefer DROP reasons over a raw CTX_ACT_DROP.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>

A backend node in hs-ipcache mode currently strips off the tunnel headers
and manually redirects the packet to cilium_geneve. The SrcSecID is
transferred via CB_SRC_LABEL.

For GENEVE-DSR we also need to transfer the DSR option, so that the
nodeport_lb4() call in from-overlay can process it as usual. But we can't
use ctx_set_tunnel_opt() for this, as the metadata_dst will be scrubbed
from the skb when redirecting to cilium_geneve's Ingress.

Transfer the DSR info via skb->cb instead, and copy it back to a
metadata_dst in from-overlay so that things look identical for the nodeport
DSR code.
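
The hand-off can be sketched in userspace as follows. Everything here is an assumption made for illustration: the `model_skb` and `dsr_info4` layouts, the cb slot indices and the three function names are hypothetical, and the redirect is modelled simply as dropping the tunnel metadata while leaving `cb` untouched.

```c
#include <stdint.h>

/* Userspace model only. Slot indices, struct layout and function names
 * are hypothetical and do not mirror the Cilium source. */
enum { CB_DSR_PRESENT = 1, CB_DSR_ADDR = 2, CB_DSR_PORT = 3, CB_SLOTS = 5 };

struct dsr_info4 {
	uint32_t addr; /* original service address */
	uint16_t port; /* original service port */
};

struct model_skb {
	uint32_t cb[CB_SLOTS];       /* survives the redirect */
	int has_metadata_dst;        /* scrubbed by the redirect */
	struct dsr_info4 tunnel_opt; /* valid only if has_metadata_dst */
};

/* from-netdev side: copy the DSR option out of the tunnel metadata. */
static void stash_dsr_in_cb(struct model_skb *skb)
{
	if (!skb->has_metadata_dst)
		return;
	skb->cb[CB_DSR_PRESENT] = 1;
	skb->cb[CB_DSR_ADDR] = skb->tunnel_opt.addr;
	skb->cb[CB_DSR_PORT] = skb->tunnel_opt.port;
}

/* Models the redirect to cilium_geneve's ingress: metadata is dropped. */
static void redirect_to_ingress(struct model_skb *skb)
{
	skb->has_metadata_dst = 0;
	skb->tunnel_opt = (struct dsr_info4){ 0 };
}

/* from-overlay side: rebuild the metadata so later code sees the same
 * state as for a regularly received tunnel packet, then clear cb. */
static void restore_dsr_from_cb(struct model_skb *skb)
{
	if (!skb->cb[CB_DSR_PRESENT])
		return;
	skb->tunnel_opt.addr = skb->cb[CB_DSR_ADDR];
	skb->tunnel_opt.port = (uint16_t)skb->cb[CB_DSR_PORT];
	skb->has_metadata_dst = 1;
	skb->cb[CB_DSR_PRESENT] = 0;
	skb->cb[CB_DSR_ADDR] = 0;
	skb->cb[CB_DSR_PORT] = 0;
}
```

Calling the three functions in order (stash, redirect, restore) leaves the model skb with the same DSR address and port it started with, even though the metadata was scrubbed in between.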

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>

When a DSR backend replies to the client in an hs-ipcache
configuration, it potentially uses tunnel encapsulation (based on the
configured WorldCIDR). RevDNAT for the reply is then handled in to-overlay.

To match the LB path (where both the inner and outer DstIP were set
to the service IP), we should also revDNAT the outer SrcIP. As we're in
the to-overlay program, the SrcIP is stored in the tunnel_key.
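
A simplified userspace model of this rewrite is shown below. The `rev_nat4` and `model_reply` structs and the `rev_dnat_reply()` function are hypothetical; in particular, `outer_src_ip` only stands in for the source address that the real program keeps in the tunnel key.

```c
#include <stdint.h>

/* Hypothetical reverse-NAT entry: maps a backend back to its service. */
struct rev_nat4 {
	uint32_t backend_ip;
	uint16_t backend_port;
	uint32_t svc_ip;
	uint16_t svc_port;
};

/* Hypothetical view of an encapsulated reply. outer_src_ip stands in
 * for the source address carried in the tunnel key. */
struct model_reply {
	uint32_t outer_src_ip;
	uint32_t inner_src_ip;
	uint16_t inner_src_port;
};

/* Rewrite the inner source to the service address and mirror the
 * change on the outer source. Returns 1 if the reply matched the
 * entry, 0 if it was left untouched. */
static int rev_dnat_reply(struct model_reply *r, const struct rev_nat4 *e)
{
	if (r->inner_src_ip != e->backend_ip ||
	    r->inner_src_port != e->backend_port)
		return 0;

	r->inner_src_ip = e->svc_ip;
	r->inner_src_port = e->svc_port;

	/* Only touch the outer header if it still carries the backend IP. */
	if (r->outer_src_ip == e->backend_ip)
		r->outer_src_ip = e->svc_ip;
	return 1;
}
```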

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>

@julianwiedmann
Member Author

Rebased on top of #24422.

@julianwiedmann
Member Author

/test

@julianwiedmann julianwiedmann added the release-blocker/1.14 This issue will prevent the release of the next version of Cilium. label May 31, 2023
@julianwiedmann
Member Author

net-next failed in K8sDatapathServicesTest Checks E/W loadbalancing (ClusterIP, NodePort from inside cluster, etc) Checks in-cluster KPR with L7 policy, but it looks like there's no dump because the Jenkins run timed out afterwards.

@julianwiedmann
Member Author

/test-1.26-net-next

Contributor

@ldelossa ldelossa left a comment

Changes look good to me. 👍

@julianwiedmann julianwiedmann merged commit f541499 into cilium:main Jun 1, 2023
61 checks passed
@julianwiedmann julianwiedmann deleted the 1.14-hsipcache-part1 branch June 1, 2023 17:14
Labels
feature/high-scale-ipcache Relates to the high-scale ipcache feature. feature/lb-only Impacts cilium running in lb-only datapath mode kind/feature This introduces new functionality. release-blocker/1.14 This issue will prevent the release of the next version of Cilium. release-note/minor This PR changes functionality that users may find relevant to operating Cilium. sig/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages.
3 participants