High-scale IPCache: Nodeport LB support Part 1 #25745
ddcd1aa to 44a1742
/test
Job 'Cilium-PR-K8s-1.26-kernel-net-next' failed.
Jenkins URL: https://jenkins.cilium.io/job/Cilium-PR-K8s-1.26-kernel-net-next/279/
If it is a flake and a GitHub issue doesn't already exist to track it, comment [...]. Then please upload the Jenkins artifacts to that issue.
LGTM
The custom decap code for high-scale ipcache in from-netdev transfers the source's sec_identity from the tunnel header into CB_SRC_LABEL. But from-overlay initializes its src_sec_identity variable with 0, and only loads from CB_SRC_LABEL inside handle_ipv4().

In a config with BPF masquerading, the packet passes through nodeport_lb4() first, which stores the passed-in src_sec_identity (== 0) into CB_SRC_LABEL, checks for revSNAT and then tail-calls back to the start of the IPv4 from-overlay path. Thus we currently lose the sec_identity, e.g. for pod-to-pod connections, when BPF masquerading is enabled.

Align this more closely with how bpf_host works: load from CB_SRC_LABEL at the beginning of the tail-call, and clear it. This way we can feed the src_sec_identity into the call to nodeport_lb4(), where it then gets restored to CB_SRC_LABEL before tail-calling back. If needed, l3_local_delivery() will subsequently fill CB_SRC_LABEL again before redirecting to the local endpoint.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
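The identity loss described in this commit message can be reproduced in a small userspace model. This is only a sketch: `mock_ctx`, `mock_nodeport_lb4()`, `from_overlay_before()` and `from_overlay_after()` are hypothetical stand-ins written for illustration, not the actual Cilium datapath code, and the slot layout is assumed.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins; this is not Cilium's real control-buffer layout. */
enum { CB_SRC_LABEL = 0, CB_SLOTS = 5 };

struct mock_ctx {
	uint32_t cb[CB_SLOTS];
};

/* Models the one aspect of nodeport_lb4() that matters here: it stores the
 * identity it was handed into CB_SRC_LABEL before tail-calling back. */
void mock_nodeport_lb4(struct mock_ctx *ctx, uint32_t src_sec_identity)
{
	assert(ctx);
	ctx->cb[CB_SRC_LABEL] = src_sec_identity;
}

/* Before the fix: the local variable starts at 0, so the LB call overwrites
 * the identity that from-netdev placed in CB_SRC_LABEL. Returns the value
 * that the IPv4 path would read after the tail-call. */
uint32_t from_overlay_before(struct mock_ctx *ctx)
{
	uint32_t src_sec_identity = 0;

	mock_nodeport_lb4(ctx, src_sec_identity);
	return ctx->cb[CB_SRC_LABEL];
}

/* After the fix: load from CB_SRC_LABEL first, clear the slot, and feed the
 * loaded identity into the LB call, which restores it. */
uint32_t from_overlay_after(struct mock_ctx *ctx)
{
	uint32_t src_sec_identity = ctx->cb[CB_SRC_LABEL];

	ctx->cb[CB_SRC_LABEL] = 0;
	mock_nodeport_lb4(ctx, src_sec_identity);
	return ctx->cb[CB_SRC_LABEL];
}
```

With an identity stored in `CB_SRC_LABEL` by the decap step, the first variant hands 0 to the rest of the path while the second preserves the stored value.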
Determine the tunnel endpoint earlier, and initialize the GENEVE option struct earlier. This is just prep work for a subsequent patch, no functional change.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
When using the nodeport LB in GENEVE-DSR mode with hs-ipcache, don't rely on the ipcache to select the GENEVE tunnel endpoint for the selected backend. Use the InnerDstIP (== backend IP) instead, same as we do for pod-to-pod traffic.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
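As a rough illustration of the endpoint selection described above, the sketch below models the two cases in plain C. The table type, `mock_ipcache_lookup()` and `select_tunnel_endpoint()` are hypothetical names invented for this example; the real datapath uses BPF maps and different helpers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified ipcache: maps an IP to the tunnel endpoint
 * of the node that hosts it. */
struct mock_ipcache_entry {
	uint32_t ip;
	uint32_t tunnel_endpoint;
};

/* Linear lookup; returns 0 when the IP has no entry. */
uint32_t mock_ipcache_lookup(const struct mock_ipcache_entry *tbl, size_t n,
			     uint32_t ip)
{
	assert(tbl || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (tbl[i].ip == ip)
			return tbl[i].tunnel_endpoint;
	}
	return 0;
}

/* With hs-ipcache, the outer destination is the backend IP itself
 * (InnerDstIP); otherwise the ipcache supplies the tunnel endpoint. */
uint32_t select_tunnel_endpoint(bool hs_ipcache, uint32_t backend_ip,
				const struct mock_ipcache_entry *tbl, size_t n)
{
	if (hs_ipcache)
		return backend_ip;
	return mock_ipcache_lookup(tbl, n, backend_ip);
}
```

In this model the hs-ipcache branch never consults the table, so the result does not depend on whether an entry for the backend exists.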
Prefer DROP reasons over a raw CTX_ACT_DROP.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
A backend node in hs-ipcache mode currently strips off the tunnel headers and manually redirects the packet to cilium_geneve. The SrcSecID is transferred via CB_SRC_LABEL.

For GENEVE-DSR we also need to transfer the DSR option, so that the nodeport_lb4() call in from-overlay can process it as usual. But we can't use ctx_set_tunnel_opt() for this, as the metadata_dst will be scrubbed from the skb when redirecting to cilium_geneve's Ingress.

Transfer the DSR info via skb->cb instead, and copy it back to a metadata_dst in from-overlay so that things look identical for the nodeport DSR code.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
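The hand-off described in this commit message can be modelled in userspace as follows. This is a sketch under assumptions: the slot names `CB_DSR_ADDR` and `CB_DSR_PORT`, the `mock_skb` layout and all three functions are hypothetical and only illustrate the idea of parking the DSR info in the control buffer across a redirect that scrubs tunnel metadata.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical slot names; the real patch may use different ones. */
enum { CB_SRC_LABEL, CB_DSR_ADDR, CB_DSR_PORT, CB_SLOTS };

struct mock_dsr_opt {
	uint32_t addr;
	uint16_t port;
};

struct mock_skb {
	uint32_t cb[CB_SLOTS];
	bool has_tunnel_opt;            /* models an attached metadata_dst */
	struct mock_dsr_opt tunnel_opt;
};

/* from-netdev side: save identity and DSR info into the control buffer. */
void from_netdev_decap(struct mock_skb *skb, uint32_t src_sec_id)
{
	assert(skb);
	skb->cb[CB_SRC_LABEL] = src_sec_id;
	skb->cb[CB_DSR_ADDR] = skb->has_tunnel_opt ? skb->tunnel_opt.addr : 0;
	skb->cb[CB_DSR_PORT] = skb->has_tunnel_opt ? skb->tunnel_opt.port : 0;
}

/* Models the redirect: tunnel metadata is scrubbed, the cb survives. */
void mock_redirect(struct mock_skb *skb)
{
	skb->has_tunnel_opt = false;
	memset(&skb->tunnel_opt, 0, sizeof(skb->tunnel_opt));
}

/* from-overlay side: rebuild the tunnel option from the control buffer so
 * the DSR code sees the same input as on a regular tunnel ingress. */
void from_overlay_restore(struct mock_skb *skb)
{
	if (!skb->cb[CB_DSR_ADDR])
		return; /* no DSR info was transferred */

	skb->tunnel_opt.addr = skb->cb[CB_DSR_ADDR];
	skb->tunnel_opt.port = (uint16_t)skb->cb[CB_DSR_PORT];
	skb->has_tunnel_opt = true;
	skb->cb[CB_DSR_ADDR] = 0;
	skb->cb[CB_DSR_PORT] = 0;
}
```

The model treats a zero address as "no DSR info", which is a simplification chosen for this example.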
When a DSR backend replies to the client in a hs-ipcache configuration, it potentially uses tunnel encapsulation (based on the configured WorldCIDR). RevDNAT for the reply is then handled in to-overlay.

To match the LB path (where both the inner and outer DstIP were set to the service IP), we should also revDNAT the outer SrcIP. As we're in the to-overlay program, the SrcIP is stored in the tunnel_key.

Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
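A minimal model of the reply handling described above: the inner source address and the outer source address (kept in the tunnel key) are both rewritten to the service IP. The struct layout and `to_overlay_rev_dnat()` are hypothetical, and the backend-to-service mapping is passed in directly here, whereas a real datapath would derive it from connection-tracking state.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, reduced tunnel key: only the outer addresses. */
struct mock_tunnel_key {
	uint32_t local_ipv4;   /* outer SrcIP */
	uint32_t remote_ipv4;  /* outer DstIP */
};

struct mock_reply {
	uint32_t inner_saddr;
	uint32_t inner_daddr;
	struct mock_tunnel_key key;
};

/* Rewrites a backend's reply so that both the inner and the outer source
 * address carry the service IP, mirroring the request direction. */
void to_overlay_rev_dnat(struct mock_reply *pkt, uint32_t backend_ip,
			 uint32_t service_ip)
{
	assert(pkt);
	if (pkt->inner_saddr != backend_ip)
		return; /* not a reply from this backend, leave untouched */

	pkt->inner_saddr = service_ip;
	pkt->key.local_ipv4 = service_ip;
}
```

Packets whose inner source does not match the backend pass through unchanged in this model.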
44a1742 to 34063bc
Rebased on top of #24422.
/test
net-next failed in
/test-1.26-net-next
Changes look good to me. 👍
This PR is in the context of the high-scale ipcache feature described at cilium/design-cfps#7.
It enables the nodeport LB to handle an unencapsulated service request, and forward the request to the backend using GENEVE-DSR. It also adds handling for reply traffic.
In detail:
- The service request is handled in from-netdev (either XDP or TC), and nodeport_lb4() selects a backend.
- On the backend node, the from-netdev program strips off the encapsulation and redirects it to from-overlay. We manually transfer the DSR info across this redirect. The from-overlay program processes the DSR info and creates a corresponding SNAT entry.
- Replies are handled in to-overlay (when they go to a destination inside the Clustermesh), or to-netdev (when the client belongs to one of the configured WorldCIDRs).