-
Notifications
You must be signed in to change notification settings - Fork 2.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
v1.13 backports 2023-03-01 #24086
v1.13 backports 2023-03-01 #24086
Conversation
This comment was marked as outdated.
This comment was marked as outdated.
0118787
to
7741e4f
Compare
This comment was marked as resolved.
This comment was marked as resolved.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My commits look good. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approved for my commit.
It seems that #23091 is not included |
[ upstream commit c4278e4 ] When Cilium resolves a src ID from ipcache with a remote node IP, the resolved ID is ignored and returns the world ID. This is because the resolve_srcid_ipv[4,6] function checks if the resolved src ID is a reserved ID, which includes the remote node ID, and if it's true, then returns the world ID. This PR fixes this problem by removing !identity_is_reserved check. This issue occurs if BPF host routing is in use because Cilium stores the src ID resolved by resolve_srcid_ipv[4,6] in ipv[4,6]_local_delivery and enforces the policy using the stored src id. While Cilium uses the src ID resolved in bpx_lxc tail_ipv4_to_endpoint if the legacy routing mode is enabled. There's no corresponding !identity_is_reserved check in bpf_lxc side. Therefore it works with the legacy routing mode. According to cilium#4874 cilium#6703, this check was added when those introduced the fallback to use the ipcache data if the packet info does not contain any useful information. As far as I look it into those PR, there's no solid reason to keep the !identity_is_reserved check, because resolve_srcid_ipv[4,6] works as follows without the check which is the same as bpx_lxc tail_ipv4_to_endpoint side. 1. the packet info, ctx->mark, cotaints the identity, then return it 2. the packet info does not contain any useful information, then resolved from ipcache and return it 3. the identity is not resolved from both the packet info and ipcache then return the world ID Fixes: cilium#18042 Signed-off-by: Yusuke Suzuki <yusuke-suzuki@cybozu.co.jp> Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit cce7976 ] Commit 16e132e ("bpf, maps: Give the tunnel map its own structs") gave the tunnel map its own structs for key and value. In that commit, I however forgot to replace the structs in the lookups, on the BPF side. CI passed nonetheless because the fields used in the former and new structs are currently identical (because the cluster ID isn't used). This commit fixes it by actually using the new tunnel_{key,value} structs. Fixes: 16e132e ("bpf, maps: Give the tunnel map its own structs") Signed-off-by: Paul Chaignon <paul@cilium.io> Signed-off-by: Tam Mach <tam.mach@cilium.io>
[ upstream commit 0685982 ] This commit is to make sure that users can enable/disable SA token auto mount, which is recommended in NSA security hardening guide. Kindly note that the default value hubble relay SA automount is false for consistency with current behavior. https://media.defense.gov/2022/Aug/29/2003066362/-1/-1/0/CTR_KUBERNETES_HARDENING_GUIDANCE_1.2_20220829.PDF Signed-off-by: Tam Mach <tam.mach@cilium.io>
[ upstream commit 5cc12b9 ] Most of the time, when we see a stale probe, it's due to a deadlock. So, write a stack dump to disk (since we're probably going to be restarted soon due to a liveness probe). To prevent any sort of excessive resource consumption, only dump stack once every 5 minutes, and always write to the same file. Also, let's make the check lock-free while we're at it. Also, make sure we capture this file in bugtool. Signed-off-by: Casey Callendrello <cdc@isovalent.com> Signed-off-by: Tam Mach <tam.mach@cilium.io>
[ upstream commit 7f1503f ] When attempting to take ownership of already created CiliumEndpoint resources, the endpoint sync controller will check that it is safe to do so. One condition is that if the endpoint is to take ownership of a already created CEP, that CEP needs to be "local". This comparison was being done in all cases, however in the case where the agent is restarted and given a different IP this would prevent restored endpoints from being managed correctly. To fix this, this change will skip all checks if the endpoint UID already matches the CEP UID - and allow the endpointsync to backfill the new IP. Fixes: cilium#23487 Signed-off-by: Tom Hadlaw <tom.hadlaw@isovalent.com> Signed-off-by: Tam Mach <tam.mach@cilium.io>
[ upstream commit d38093a ] Endpointsync does not delete/recreate CEPs anymore when doing endpoint sync init. Signed-off-by: Tom Hadlaw <tom.hadlaw@isovalent.com> Signed-off-by: Tam Mach <tam.mach@cilium.io>
[ upstream commit 5aa1c23 ] This commit is to make sure that users can have options to configure security contexts in pod and container levels for all cilium related workload. Signed-off-by: Tam Mach <tam.mach@cilium.io>
[ upstream commit 1e98776 ] Signed-off-by: Jed Salazar <jed@chainguard.dev> Signed-off-by: Tam Mach <tam.mach@cilium.io>
[ upstream commit 95ca167 ] Signed-off-by: Jed Salazar <jed@chainguard.dev> Signed-off-by: Tam Mach <tam.mach@cilium.io>
[ upstream commit c08f691 ] Signed-off-by: Jed Salazar <jed@chainguard.dev> Signed-off-by: Tam Mach <tam.mach@cilium.io>
[ upstream commit c67ede3 ] This reverts commit 88cbfdb. The above commit introduced two new configuration fields to tune the delay when handling deletion events in the shared store in the context of clustermesh, while changing the already existing SharedKeyDeleteDelay parameter to a pointer to distinguish unset from 0. Still, this approach has a few shortcomings: * By default, a 30 seconds delay is still applied to deletion events (i.e., if not specified differently). This defaulting logic (which was already present), is hidden in the code, leading to all shared store occurrences sticking to that; * The newly added configuration fields are never set, hence leading to the default behavior; * The usage of a pointer for that field is more cumbersome when it comes to assignment operations. The next commit will address the above issues removing the defaulting to 30 seconds (hence, a 0 value is no longer ambiguous), while explicitly setting the delay in case of node-related stores only. Signed-off-by: Marco Iorio <marco.iorio@isovalent.com> Signed-off-by: Tam Mach <tam.mach@cilium.io>
[ upstream commit 3edae01 ] 3137ad5 ("node: Delay handling of node delete events received via kvstore") introduced a 30 seconds delay when handling node deletion events, as a temporary workaround for connectivity issues towards etcd in case of flaky nodes. Fast-forward almost four years, the same shared store logic is now leveraged also by other components (e.g., to observe global service changes in the context of clustermesh). Yet, in these cases the delay leads to possible issues: for instance, if a global service is deleted in a remote cluster, the local one is not updated for 30 seconds, causing the remote backends to be still targeted during that time period (possibly leading to drops if they are restarted in the meanwhile). This commit applies the following modifications: * Do not apply any deletion delay in the shared store by default. This prevents new users of that logic from being subject to unexpected delays even if none had been configured explicitly. * Explicitly configure the deletion delay for node-related stores only. Signed-off-by: Marco Iorio <marco.iorio@isovalent.com> Signed-off-by: Tam Mach <tam.mach@cilium.io>
[ upstream commit 9790592 ] Signed-off-by: Maxim Mikityanskiy <maxim@isovalent.com> Signed-off-by: Tam Mach <tam.mach@cilium.io>
[ upstream commit 1f03fcd ] Previously when remote identity cache was closed it did not stop the stopGC channel, therefore it leaked the allocator resources. This commit fixes the issue and properly deletes all resources and closes the stopGC channel, but closing of events channel is moved outside of `allocator.Delete()` func as the channel must be closed only by the main CachingIdentityAllocator (it's shared by each remote allocator, but not owned). Signed-off-by: Ondrej Blazek <ondrej.blazek@firma.seznam.cz> Signed-off-by: Tam Mach <tam.mach@cilium.io>
[ upstream commit e337ebd ] Previously each remote allocator spawned a goroutine which is supposed to sync all local keys to the remote etcd. But this makes sense only for the main identityAllocator, not for remote one(s). This commit fixes it, uses a disableGC option to save the goroutine. Signed-off-by: Ondrej Blazek <ondrej.blazek@firma.seznam.cz> Signed-off-by: Tam Mach <tam.mach@cilium.io>
[ upstream commit 0084307 ] Previously all session/locksession controllers used the same name as a controller name (even the ones for remote clusters) and it was not possible to distinguish between them. This commit adds a suffix with a cluster name to each session controller name. Signed-off-by: Ondrej Blazek <ondrej.blazek@firma.seznam.cz> Signed-off-by: Tam Mach <tam.mach@cilium.io>
…d/write [ upstream commit 1b92a4d ] Cilium operator can crash in case when the identityToCES map is being accessed without a lock during a concurrent write. The error message is fatal error: concurrent map read and map write. The fix is just to add a read lock to the one location in code where the read is not guarded by a lock. Signed-off-by: Dorde Lapcevic <dordel@google.com> Signed-off-by: Tam Mach <tam.mach@cilium.io>
7741e4f
to
07e2714
Compare
@ysksuzuki Should we fixed now. Thanks for reporting! |
[ upstream commit 73c36d4 ] To decide which outer IP addresses to use in the XFRM states for IPsec encapsulation, we will use a node ID that identifies the remote node. That node ID will be written to the packet mark, which the XFRM states and policies can match against. This commit adapts the OUT XFRM state and policies to match on the packet marks. The node ID is carried in the 16 MSBs of the packet mark. Function enableSubnetIPsec isn't changed here because it will be almost entirely rewritten in a latter commit. Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit b596cea ] The previous commit made use of the node IDs for the IPsec encryption. Node IDs are encoded on 16-bits, therefore limiting the number of nodes in the cluster to 64k for anything that uses them. This commit documents that limitation. Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit 19a62da ] In a subsequent commit, we will need offset 4 of skb->cb. Unfortunately, this offset is currently used to store the tunnel endpoint IP. We store the tunnel endpoint IP before encryption and, after encryption, we pass it to the encapsulation functions to build the correct VXLAN or GENEVE outer IP header. (Remember we do ESP-on-VXLAN/GENEVE.) Thus, to free offset 4, we need another way to retrieve the tunnel endpoint IP. Currently, in Cilium with IPsec and tunneling, packets are encrypted with the CiliumInternalIPs (cilium_host IPs) and then encapsulated with the NodeInternalIPs. The CiliumInternalIPs are in the ipcache and have a corresponding tunnel endpoint, the corresponding NodeInternalIP. Therefore, when we receive the to-be-encapsulated, encrypted packets, we can lookup their destination IP in the ipcache to retrieve the tunnel endpoint IP to use for VXLAN or GENEVE encapsulation. This commit implements just that. If the ipcache lookup fails, we drop the packet with a DROP_NO_TUNNEL_ENDPOINT event. Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit 4c7cce1 ] Remove all code pertaining to IP_POOLS in the datapath, including the CB_ENCRYPT_DST skb->cb slot and the code to rewritte the outer IP header. As a result of this commit, IPsec will be broken on AKS, EKS, and anywhere using IP_POOLS (that is, whenever we don't have a single pod allocation CIDR per node). A subsequent commit in this series will fix that. I couldn't find a way to keep a clean commit history without breaking this temporarily. Compilation still works so most git bisect use cases should still work. Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit 4de20fc ] We need to free the packet mark between bpf_lxc and bpf_host so that it can carry the node ID when IPsec is enabled, to match against XFRM states. Instead of using the packet mark to pass the source security ID, we thus use a skb->cb field. Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit dadf041 ] Previous commits (1) populated the ipcache with the ID of the node hosting each remote endpoint and (2) updated XFRM marks to expect a node ID such that: 0xXXXXXE00 ^ ^^ | |+-- Set to 0xE to require encryption | +-- Set to the IPsec SPI to use +----- Set to the node ID This commit updates our BPF code to set the packet mark expected by the XFRM policies & states, using information from the ipcache. Signed-off-by: Paul Chaignon <paul@cilium.io>
Sorry about this, the concern we had before was resolved, so it will be part of the next round. |
[ upstream commit e625cde ] IPsec-encrypted packets coming from the overlay are recirculated to the overlay device and bpf_overlay after decryption. On the first pass, we can retrieve the source security identity from the tunnel metadata, but on the second pass, we don't have that metadata anymore. Therefore, today, we pass the security identity from the first bpf_overlay traversal to the second via the packet mark. Unfortunately, in the next commit, we need the packet mark's bits to match against XFRM IN states. This commit therefore frees those mark bits. To that end, we perform an ipcache lookup on the second bpf_overlay traversal to retrieve the security identity for the inner source IP address. That has obviously a higher cost than reading the packet mark, but given we're on the decryption path, it's probably negligible anyway. As a result of this change, we will also be missing the source identity in the very first TRACE_FROM_STACK packet trace in bpf_overlay. I think that's fine as it's usually the case for TRACE_FROM_* traces. Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit 3e59b68 ] On EKS & AKS, we have only one XFRM state and policy per direction. We install those at startup (vs. when receiving k8s node objects on e.g. GKE). This is a consequence of their IPAM modes (cannot predict node from pod IP as we don't have per-node pod CIDRs). To encode the outer IP addresses to use for IPsec encapsulation, this however needs to change. We need to install one XFRM state and policy per remote node. We therefore need to install those on reception of the k8s node objects, as with the cluster-pool IPAM. This commit implements this change. All XFRM state and policies will be installed by the same function, enableIPsec, regardless of the Kubernetes provider. Existing logic in enableIPsec is therefore pushed behind a !n.subnetEncryption() guard. Existing logic for EKS and AKS' IPAMs is moved from enableSubnetIPsec to enableIPsec. This commit fixes IPsec support on EKS and AKS that was broken by a previous commit ("bpf: Remove IP_POOLS IPsec code") in this series. A lot of code moves around, so here's a summary of code changes: 1. The four helper function at the beginning of changes are removed because they won't be used anymore. 2. All code to install the kernel's XFRM config for EKS & AKS is moved from enableSubnetIPsec to enableIPsec. This is because this code will now be executed in reaction to nodes being created, as on GKE, and not once and for all. 3. In enableIPsec, we differentiate between the AKS/EKS and the GKE cases with the n.subnetEncryption() condition (true if AKS or EKS). Signed-off-by: Paul Chaignon <paul@cilium.io>
6b0aae8
to
fafaa64
Compare
my commits lgtm |
/test-backport-1.13 Job 'Cilium-PR-K8s-1.21-kernel-4.9' hit: #22578 (95.32% similarity) |
I've tested the IPsec backports on 3-nodes GKE and EKS clusters with the connectivity tests. |
/test-1.21-4.9 |
IP_POOLS
logic #24030 -- IPsec: RemoveIP_POOLS
logic (@pchaigno)Skipped due to conflicts:
Once this PR is merged, you can update the PR labels via: