High-Scale IPcache: Chapter 3 #25438
8715b0b to 73980e5
One licensing aspect, and some (non-blocking) cleanups.
Config-related changes LGTM
73980e5 to dabc174
Looks all good, thank you!
dabc174 to 95cfc4d
LGTM for agent-related changes.
One question: have you considered adding the new worldcidrsmap to the maps cell in pkg/maps? This could ease the config options management and the testing. Might be something to consider for a follow-up PR.
95cfc4d to 31a4cf4
Yep, I've created #25552 to track this. It doesn't look like a trivial change, so I'd rather handle it as a follow-up than block this PR and subsequent ones on it.
31a4cf4 to ab87c2a
This commit adds a new LPM map for world CIDRs. Only IPv4 is supported for now. The map will be populated via a new CRD in a separate patchset. Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
Similarly to the egress gateway feature, the high-scale ipcache mode requires the tunneling device even if we're running with native routing. This is because we use that device to send pod-to-pod traffic encapsulated with the pod IPs (just to carry the identity). Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
In case the high-scale ipcache mode is enabled, we want to encapsulate pod-to-pod traffic with the pod IP address. The goal of this encapsulation is simply to carry the security identity for the source. Previous commits introduced a new CRD and its corresponding map. They tell us which IP addresses belong to entities outside of the cluster, or at least entities to which we shouldn't encapsulate traffic when in high-scale IPcache mode. This commit makes use of the new map. We perform an LPM lookup into the map to know if we should encapsulate traffic or not. We also need to take this into account when computing the MTU. Thus, the TunnelExists() function is updated to return true when the high-scale ipcache mode is enabled. Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
When the high-scale ipcache is enabled, we will receive traffic encapsulated with the pod IP addresses on the native device. Since the destination IP is not assigned to the host (but to a container), the Linux stack won't demultiplex it to the overlay device (e.g., cilium_vxlan). Instead, the packets will follow their normal way to the container, via cilium_host if endpoint routes are disabled. We want to decapsulate the packets before they reach the lxc devices. We could decapsulate in bpf_lxc, but then the packet paths would be asymmetric. This commit adds support for decapsulation in cilium_host. Note that this will only work when endpoint routes are disabled. Therefore, in bpf_host, we filter all incoming VXLAN traffic based on the UDP port and remove the first IP, UDP, VXLAN, and Ethernet headers. We also parse the source security identity from the VXLAN header. We then redirect the packet to its expected path, via cilium_vxlan, with the security identity in skb->cb. In bpf_overlay, we need a special case for the high-scale ipcache, to retrieve the security identity from skb->cb instead of getting it from the tunnel metadata as usual (since the packet is already decapsulated at this point). Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
This commit defines a new kernel probe to check that the bpf_skb_set_tunnel_key BPF helper can be used to set the outer source IP address. This new probe is then used to exit with a fatal error if the feature isn't supported and high-scale IPcache mode is enabled. High-scale IPcache mode will require this kernel feature (see subsequent commit). Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
We can use the recent extension of the bpf_skb_set_tunnel_key BPF helper to set the outer source IP of encapsulated packets. When the high-scale IPcache mode is enabled, we want to set that outer source IP to the source pod IP address. We don't set the outer source IP from XDP as this is only relevant for traffic from pods. For older kernels, we need to pass a smaller bpf_tunnel_key struct to the helper because they don't support the larger struct with the source IP. Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
Now that the datapath encapsulation and decapsulation are implemented, the end-to-end test will fail. To decide whether we should encapsulate or not, we look up the world CIDR map. If no match is found, we encapsulate. That means that we will currently always encapsulate (even when e.g. resolving a domain name from 8.8.8.8) because the world CIDR map is not currently populated. The world CIDR map will be populated once the patchset introducing the new CRD is merged. Until then, we can add a catch-all 0.0.0.0/0 entry to not encapsulate anything. This commit can be reverted once the CiliumWorldCIDRSet CRD is merged. Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
ab87c2a to 0df4db7
/test
This is in the context of the high-scale ipcache feature described at cilium/design-cfps#7.
This pull request extends the datapath to be able to encapsulate and decapsulate the special VXLAN tunnel of high-scale ipcache. In this special tunnel, the outer IP addresses are the same as the inner IP addresses; the encapsulation header is just used to carry the source security identity.
This encapsulation therefore allows us to enforce ingress network policies in high-scale ipcache mode. The end-to-end test will however be updated in a separate pull request to reflect that because we still need #25436 to populate the World CIDR map (see first commit).
Updates: #25243.