High-Scale IPcache: Chapter 3 #25438

Merged
merged 8 commits into from May 22, 2023

pchaigno
Member

This is in the context of the high-scale ipcache feature described at cilium/design-cfps#7.

This pull request extends the datapath to be able to encapsulate and decapsulate the special VXLAN tunnel of high-scale ipcache. In this special tunnel, the outer IP addresses are the same as the inner IP addresses; the encapsulation header is just used to carry the source security identity.
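
As a rough illustration of the header relationship described above, the outer headers can be derived entirely from the inner packet plus the source identity. This is a minimal userspace sketch, not the datapath code: `struct inner_ipv4`, `struct outer_hdr` and `hs_build_outer()` are hypothetical names, and carrying the identity in the 24-bit VXLAN VNI field is an assumption about the encoding.

```c
#include <stdint.h>

/* Hypothetical, simplified view of the inner IPv4 header. */
struct inner_ipv4 {
	uint32_t saddr; /* source pod IP */
	uint32_t daddr; /* destination pod IP */
};

/* Hypothetical, simplified view of the outer headers. */
struct outer_hdr {
	uint32_t saddr; /* outer IPv4 source */
	uint32_t daddr; /* outer IPv4 destination */
	uint32_t vni;   /* 24-bit VXLAN Network Identifier */
};

/*
 * Build the outer header description for the special tunnel: the outer
 * addresses are copies of the inner ones, so routing is unchanged and the
 * encapsulation only adds the source security identity.
 * Assumption: the identity travels in the VNI, hence the 24-bit mask.
 */
struct outer_hdr hs_build_outer(const struct inner_ipv4 *inner,
				uint32_t src_identity)
{
	struct outer_hdr outer = {
		.saddr = inner->saddr,
		.daddr = inner->daddr,
		.vni = src_identity & 0xffffff,
	};

	return outer;
}
```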

This encapsulation therefore allows us to enforce ingress network policies in high-scale ipcache mode. The end-to-end test will however be updated in a separate pull request to reflect that because we still need #25436 to populate the World CIDR map (see first commit).

Updates: #25243.

Support ingress network policies in high-scale ipcache mode.

@pchaigno pchaigno added sig/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages. release-note/minor This PR changes functionality that users may find relevant to operating Cilium. labels May 13, 2023
@pchaigno pchaigno force-pushed the high-scale-ipcache-encap-decap branch from 8715b0b to 73980e5 Compare May 13, 2023 16:49
@pchaigno pchaigno marked this pull request as ready for review May 14, 2023 14:40
@pchaigno pchaigno requested review from a team as code owners May 14, 2023 14:40
@pchaigno pchaigno requested review from pippolo84, asauber, YutaroHayakawa, rgo3 and julianwiedmann and removed request for YutaroHayakawa and rgo3 May 14, 2023 14:40
Member

@julianwiedmann julianwiedmann left a comment

One licensing aspect, and some (non-blocking) cleanups.

bpf/bpf_host.c (outdated, resolved)
bpf/bpf_overlay.c (outdated, resolved)
bpf/include/linux/vxlan.h (outdated, resolved)
bpf/lib/overloadable_skb.h (outdated, resolved)
Member

@asauber asauber left a comment

Config-related changes LGTM

@pchaigno pchaigno force-pushed the high-scale-ipcache-encap-decap branch from 73980e5 to dabc174 Compare May 16, 2023 22:17
@pchaigno pchaigno added the feature/high-scale-ipcache Relates to the high-scale ipcache feature. label May 16, 2023
Member

@julianwiedmann julianwiedmann left a comment

Looks all good, thank you!

pkg/probe/probe.go (outdated, resolved)
Member

@pippolo84 pippolo84 left a comment

LGTM for agent-related changes.
One question: have you considered adding the new worldcidrsmap to the maps cell in pkg/maps? This could ease the config options management and the testing. Might be something to consider for a follow-up PR.

@pchaigno pchaigno force-pushed the high-scale-ipcache-encap-decap branch from 95cfc4d to 31a4cf4 Compare May 19, 2023 10:56
@pchaigno
Member Author

pchaigno commented May 19, 2023

> LGTM for agent-related changes. One question: have you considered adding the new worldcidrsmap to the maps cell in pkg/maps? This could ease the config options management and the testing. Might be something to consider for a follow-up PR.

Yep, I've created #25552 to track this. It doesn't look like a trivial change, so I'd rather handle it as a follow-up than block this PR and subsequent ones on it.

@pchaigno pchaigno force-pushed the high-scale-ipcache-encap-decap branch from 31a4cf4 to ab87c2a Compare May 22, 2023 11:13
@pchaigno pchaigno requested a review from rgo3 May 22, 2023 11:13
This commit adds a new LPM map for world CIDRs. Only IPv4 is supported
for now. The map will be populated via a new CRD in a separate patchset.

Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
Similarly to the egress gateway feature, the high-scale ipcache mode
requires the tunneling device even if we're running with native routing.
This is because we use that device to send pod-to-pod traffic
encapsulated with the pod IPs (just to carry the identity).

Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
In case the high-scale ipcache mode is enabled, we want to encapsulate
pod-to-pod traffic with the pod IP address. The goal of this
encapsulation is simply to carry the security identity for the source.

Previous commits introduced a new CRD and its corresponding map. They
tell us which IP addresses belong to entities outside of the cluster, or
at least entities to which we shouldn't encapsulate traffic when in
high-scale IPcache mode.

This commit makes use of the new map. We perform an LPM lookup into the
map to know if we should encapsulate traffic or not.
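
The decision can be sketched in plain userspace C as follows. This is only an illustration under stated assumptions: the real datapath uses a BPF LPM map, whereas this sketch does a linear scan over an array, and `struct world_cidr`, `world_cidr_lookup()` and `should_encapsulate()` are hypothetical names. Addresses are in host byte order to keep the masking readable.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical world CIDR entry (IPv4 only, host byte order). */
struct world_cidr {
	uint32_t addr;      /* network address */
	uint8_t prefix_len; /* 0..32 */
};

/* Return true if addr falls inside the given CIDR. */
static bool cidr_contains(const struct world_cidr *c, uint32_t addr)
{
	uint32_t mask;

	if (c->prefix_len > 32)
		return false; /* invalid entry, never matches */
	/* A /0 prefix matches everything; avoid shifting by 32 bits. */
	mask = c->prefix_len == 0 ? 0 : ~(uint32_t)0 << (32 - c->prefix_len);
	return (addr & mask) == (c->addr & mask);
}

/* Longest-prefix match: return the most specific matching entry or NULL. */
const struct world_cidr *world_cidr_lookup(const struct world_cidr *cidrs,
					   size_t n, uint32_t addr)
{
	const struct world_cidr *best = NULL;
	size_t i;

	for (i = 0; i < n; i++) {
		if (!cidr_contains(&cidrs[i], addr))
			continue;
		if (!best || cidrs[i].prefix_len > best->prefix_len)
			best = &cidrs[i];
	}
	return best;
}

/* Encapsulate only when the destination is not a known world CIDR. */
bool should_encapsulate(const struct world_cidr *cidrs, size_t n,
			uint32_t daddr)
{
	return world_cidr_lookup(cidrs, n, daddr) == NULL;
}
```

With an empty set of CIDRs, `should_encapsulate()` returns true for every destination, and a single 0.0.0.0/0 entry makes it return false for every destination.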

We also need to take this into account when computing the MTU. Thus, the
TunnelExists() function is updated to return true when the high-scale
ipcache mode is enabled.
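
A simplified sketch of the MTU consequence, written in C for illustration only. The `struct mtu_config` type and the `tunnel_exists()` and `route_mtu()` helpers are hypothetical stand-ins for the agent logic, and other conditions that may require a tunnel are left out. The 50-byte figure is the usual VXLAN-over-IPv4 overhead.

```c
#include <stdbool.h>

/* Outer IPv4 (20) + UDP (8) + VXLAN (8) + inner Ethernet (14). */
#define VXLAN_OVERHEAD_IPV4 50

/* Hypothetical subset of the configuration relevant to the MTU. */
struct mtu_config {
	bool tunneling_enabled;  /* regular overlay mode */
	bool high_scale_ipcache; /* high-scale ipcache mode */
};

/* Packets may be encapsulated in either mode, even with native routing. */
bool tunnel_exists(const struct mtu_config *cfg)
{
	return cfg->tunneling_enabled || cfg->high_scale_ipcache;
}

/* Leave room for the encapsulation headers whenever a tunnel exists. */
int route_mtu(const struct mtu_config *cfg, int device_mtu)
{
	if (tunnel_exists(cfg))
		return device_mtu - VXLAN_OVERHEAD_IPV4;
	return device_mtu;
}
```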

Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
When the high-scale ipcache is enabled, we will receive traffic
encapsulated with the pod IP addresses on the native device. Since the
destination IP is not assigned to the host (but to a container), the
Linux stack won't demultiplex it to the overlay device (e.g.,
cilium_vxlan). Instead, the packets will follow their normal path to the
container, via cilium_host if endpoint routes are disabled.

We want to decapsulate the packets before they reach the lxc devices.
We could decapsulate in bpf_lxc, but then the packet paths would be
asymmetric. This commit adds support for decapsulation in cilium_host.
Note that this will only work when endpoint routes are disabled.

Therefore, in bpf_host, we filter all incoming VXLAN traffic based on
the UDP port and remove the first IP, UDP, VXLAN, and Ethernet headers.
We also parse the source security identity from the VXLAN header. We
then redirect the packet to its expected path, via cilium_vxlan, with
the security identity in skb->cb.
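
The parsing part of these steps can be sketched in userspace C as below. It is not the bpf_host code: `hs_decap_parse()` and `struct decap_result` are hypothetical names, the buffer is assumed to start at the outer IPv4 header, and reading the identity from the 24-bit VNI field of the VXLAN header is an assumption. The function only computes where the inner IP packet starts; actually removing the headers and redirecting is left out.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define IP_PROTO_UDP   17
#define IPV4_MIN_HLEN  20
#define UDP_HLEN       8
#define VXLAN_HLEN     8
#define ETH_HLEN       14
#define VXLAN_FLAG_VNI 0x08 /* "I" flag: the VNI field is valid */

struct decap_result {
	size_t inner_off;  /* offset of the inner IP header in the buffer */
	uint32_t identity; /* source security identity (assumed in the VNI) */
};

/*
 * Check whether pkt is VXLAN traffic for tunnel_port and, if so, report
 * the identity and the number of bytes to strip (outer IP, UDP, VXLAN and
 * inner Ethernet headers). Returns false for anything else.
 */
bool hs_decap_parse(const uint8_t *pkt, size_t len, uint16_t tunnel_port,
		    struct decap_result *out)
{
	const uint8_t *udp, *vxlan;
	size_t ihl;

	if (len < IPV4_MIN_HLEN || (pkt[0] >> 4) != 4)
		return false;
	ihl = (size_t)(pkt[0] & 0x0f) * 4;
	if (ihl < IPV4_MIN_HLEN || pkt[9] != IP_PROTO_UDP)
		return false;
	if (len < ihl + UDP_HLEN + VXLAN_HLEN + ETH_HLEN)
		return false;

	/* Filter on the UDP destination port (bytes 2-3, network order). */
	udp = pkt + ihl;
	if ((uint16_t)((udp[2] << 8) | udp[3]) != tunnel_port)
		return false;

	vxlan = udp + UDP_HLEN;
	if (!(vxlan[0] & VXLAN_FLAG_VNI))
		return false;

	out->identity = ((uint32_t)vxlan[4] << 16) |
			((uint32_t)vxlan[5] << 8) | vxlan[6];
	out->inner_off = ihl + UDP_HLEN + VXLAN_HLEN + ETH_HLEN;
	return true;
}
```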

In bpf_overlay, we need a special case for the high-scale ipcache, to
retrieve the security identity from skb->cb instead of getting it from
the tunnel metadata as usual (since the packet is already decapsulated
at this point).

Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
This commit defines a new kernel probe to check that the
bpf_skb_set_tunnel_key BPF helper can be used to set the outer source IP
address. The agent uses this new probe to exit with a fatal error if
high-scale IPcache mode is enabled and the kernel doesn't support it.
High-scale IPcache mode will require this kernel feature (see subsequent
commit).

Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
We can use the recent extension of the bpf_skb_set_tunnel_key BPF helper
to set the outer source IP of encapsulated packets. When the high-scale
IPcache mode is enabled, we want to set that outer source IP to the
source pod IP address.

We don't set the outer source IP from XDP as this is only relevant for
traffic from pods.

For older kernels, we need to pass a smaller bpf_tunnel_key struct to
the helper because they don't support the larger struct with the source
IP.
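
The size selection can be illustrated with the sketch below. `struct hs_tunnel_key` is a hypothetical, simplified mirror of the kernel's `struct bpf_tunnel_key` (the real definition in the UAPI header uses unions and should be taken as authoritative), and `tunnel_key_size()` is a made-up helper name. The idea is simply that the local (source) address sits at the end of the struct, so older kernels can be given the struct truncated just before it.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Simplified stand-in for struct bpf_tunnel_key. Assumption: the local
 * (outer source) address is the trailing member, added by the recent
 * kernel extension. IPv4 would use the first word of each address array.
 */
struct hs_tunnel_key {
	uint32_t tunnel_id;
	uint32_t remote_ip[4]; /* outer destination address */
	uint8_t tunnel_tos;
	uint8_t tunnel_ttl;
	uint16_t tunnel_ext;
	uint32_t tunnel_label;
	uint32_t local_ip[4];  /* outer source address (newer kernels) */
};

/*
 * Size to pass to the helper: the full struct when the kernel understands
 * the local address, otherwise only the part that precedes it.
 */
size_t tunnel_key_size(bool kernel_supports_local_ip)
{
	if (kernel_supports_local_ip)
		return sizeof(struct hs_tunnel_key);
	return offsetof(struct hs_tunnel_key, local_ip);
}
```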

Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
Now that the datapath encapsulation and decapsulation are implemented,
the end-to-end test will fail. To decide whether we should encapsulate
or not, we look up the destination in the world CIDR map. If no match is
found, we encapsulate. That means that we will currently always
encapsulate (even when e.g. resolving a domain name from 8.8.8.8)
because the world CIDR map is not currently populated.

The world CIDR map will be populated once the patchset introducing the
new CRD is merged. Until then, we can add a catch-all 0.0.0.0/0 entry to
not encapsulate anything. This commit can be reverted once the
CiliumWorldCIDRSet CRD is merged.

Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
@pchaigno pchaigno force-pushed the high-scale-ipcache-encap-decap branch from ab87c2a to 0df4db7 Compare May 22, 2023 12:17
@pchaigno
Member Author

/test

@pchaigno pchaigno merged commit 13aff18 into cilium:main May 22, 2023
57 of 58 checks passed
@pchaigno pchaigno deleted the high-scale-ipcache-encap-decap branch May 22, 2023 15:38
@sayboras sayboras mentioned this pull request Aug 2, 2023
3 tasks
Labels
feature/high-scale-ipcache Relates to the high-scale ipcache feature. release-note/minor This PR changes functionality that users may find relevant to operating Cilium. sig/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages.
6 participants