
v1.13 backports 2023-03-01 #24086

Merged · 25 commits into cilium:v1.13 from pr/v1.13-backport-2023-03-01 on Mar 3, 2023

Conversation

sayboras
Member

@sayboras sayboras commented Mar 1, 2023

Skipped due to conflicts:

Once this PR is merged, you can update the PR labels via:

$ for pr in 23469 23441 23915 23499 23443 24029 23745 24012 24030 23785 23928 23605; do contrib/backporting/set-labels.py $pr done 1.13; done

@sayboras sayboras requested a review from a team as a code owner March 1, 2023 02:04
@sayboras sayboras added kind/backports This PR provides functionality previously merged into master. backport/1.13 This PR represents a backport for Cilium 1.13.x of a PR that was merged to main. labels Mar 1, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-sign-off The author needs to add signoff to their commits before merge. label Mar 1, 2023
@sayboras sayboras force-pushed the pr/v1.13-backport-2023-03-01 branch from 0118787 to 7741e4f Compare March 1, 2023 02:08
@sayboras sayboras removed the dont-merge/needs-sign-off The author needs to add signoff to their commits before merge. label Mar 1, 2023
@sayboras sayboras marked this pull request as draft March 1, 2023 04:20
Member

@giorio94 giorio94 left a comment


My commits look good. Thanks!

Contributor

@gentoo-root gentoo-root left a comment


Approved for my commit.

@ysksuzuki
Member

It seems that #23091 is not included

ysksuzuki and others added 13 commits March 1, 2023 23:39
[ upstream commit c4278e4 ]

When Cilium resolves a source ID from the ipcache for a remote node IP, the resolved ID
is ignored and the world ID is returned instead. This is because the resolve_srcid_ipv[4,6]
functions check whether the resolved source ID is a reserved ID, which includes
the remote node ID, and if so return the world ID.
This PR fixes the problem by removing the !identity_is_reserved check.

This issue occurs when BPF host routing is in use, because Cilium stores the source ID
resolved by resolve_srcid_ipv[4,6] in ipv[4,6]_local_delivery and enforces policy
using the stored source ID. With the legacy routing mode, Cilium instead uses the source ID
resolved in bpf_lxc's tail_ipv4_to_endpoint, which has no corresponding
!identity_is_reserved check; that is why the legacy routing mode works.

According to cilium#4874 and cilium#6703, this check was added when those PRs introduced the
fallback to the ipcache data for packets whose info does not contain anything useful. As far
as I can tell from those PRs, there is no solid reason to keep the !identity_is_reserved
check, because without it resolve_srcid_ipv[4,6] behaves the same as bpf_lxc's
tail_ipv4_to_endpoint:

1. If the packet info (ctx->mark) contains the identity, return it.
2. If the packet info does not contain any useful information, resolve the identity from the ipcache and return it.
3. If the identity cannot be resolved from either the packet info or the ipcache, return the world ID.

Fixes: cilium#18042

Signed-off-by: Yusuke Suzuki <yusuke-suzuki@cybozu.co.jp>
Signed-off-by: Paul Chaignon <paul@cilium.io>
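The three-step resolution described above can be sketched as follows. This is a minimal Go stand-in for the BPF-side logic; the constants and function name are illustrative, not Cilium's real definitions:

```go
package main

import "fmt"

// Illustrative numeric identities, mirroring Cilium's reserved-identity
// scheme (assumed values, not the real constants).
const (
	identityWorld      uint32 = 2
	identityRemoteNode uint32 = 6
)

// resolveSrcID sketches the fixed resolve_srcid_ipv[4,6] behavior: prefer
// the identity carried in the packet mark, fall back to the ipcache, and
// only default to WORLD when neither yields an identity. The buggy variant
// additionally discarded reserved ipcache identities (such as remote-node),
// collapsing them to WORLD.
func resolveSrcID(markID, ipcacheID uint32) uint32 {
	if markID != 0 {
		return markID // 1. packet mark contains the identity
	}
	if ipcacheID != 0 {
		return ipcacheID // 2. ipcache fallback, reserved IDs included
	}
	return identityWorld // 3. nothing resolved: default to WORLD
}

func main() {
	// A remote node IP resolved from the ipcache now keeps its identity.
	fmt.Println(resolveSrcID(0, identityRemoteNode)) // 6
}
```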
[ upstream commit cce7976 ]

Commit 16e132e ("bpf, maps: Give the tunnel map its own structs") gave
the tunnel map its own structs for key and value. In that commit, I
however forgot to replace the structs in the lookups, on the BPF side.
CI passed nonetheless because the fields used in the former and new
structs are currently identical (because the cluster ID isn't used).

This commit fixes it by actually using the new tunnel_{key,value}
structs.

Fixes: 16e132e ("bpf, maps: Give the tunnel map its own structs")
Signed-off-by: Paul Chaignon <paul@cilium.io>
Signed-off-by: Tam Mach <tam.mach@cilium.io>
[ upstream commit 0685982 ]

This commit makes sure that users can enable/disable SA token auto-mount,
as recommended in the NSA security hardening guide. Note that the default
value for the Hubble Relay SA automount is false, for consistency with
current behavior.

https://media.defense.gov/2022/Aug/29/2003066362/-1/-1/0/CTR_KUBERNETES_HARDENING_GUIDANCE_1.2_20220829.PDF
Signed-off-by: Tam Mach <tam.mach@cilium.io>
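The knob maps onto the pod spec's `automountServiceAccountToken` field, which is a `*bool` so that "unset" is distinguishable from an explicit `false`. A minimal sketch, using a stand-in struct rather than the real `k8s.io/api` type:

```go
package main

import "fmt"

// podSpec is an illustrative stand-in for corev1.PodSpec; only the field
// relevant to this change is modeled.
type podSpec struct {
	AutomountServiceAccountToken *bool // nil = unset, inherit default
}

func boolPtr(b bool) *bool { return &b }

func main() {
	// Hubble Relay defaults to an explicit false, per the commit message.
	relay := podSpec{AutomountServiceAccountToken: boolPtr(false)}
	fmt.Println(*relay.AutomountServiceAccountToken) // false
}
```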
[ upstream commit 5cc12b9 ]

Most of the time, when we see a stale probe, it's due to a deadlock. So,
write a stack dump to disk (since we're probably going to be restarted
soon due to a liveness probe).

To prevent any sort of excessive resource consumption, only dump stack
once every 5 minutes, and always write to the same file. Also, let's
make the check lock-free while we're at it.

Also, make sure we capture this file in bugtool.

Signed-off-by: Casey Callendrello <cdc@isovalent.com>
Signed-off-by: Tam Mach <tam.mach@cilium.io>
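The throttled, lock-free dump can be sketched with an atomic timestamp: dump at most once every 5 minutes, always to the same file so bugtool can capture it. The file path and buffer size are assumptions, not Cilium's exact values:

```go
package main

import (
	"fmt"
	"os"
	"runtime"
	"sync/atomic"
	"time"
)

// lastDump holds the time of the last stack dump; an atomic CAS makes the
// check lock-free, which matters when we may already be deadlocked.
var lastDump atomic.Int64

const dumpInterval = 5 * time.Minute

// maybeDumpStack writes all goroutine stacks to path, at most once per
// dumpInterval. Returns true if a dump was written.
func maybeDumpStack(path string) bool {
	now := time.Now().UnixNano()
	last := lastDump.Load()
	if now-last < int64(dumpInterval) {
		return false // dumped recently; skip
	}
	if !lastDump.CompareAndSwap(last, now) {
		return false // another goroutine won the race
	}
	buf := make([]byte, 1<<20)
	n := runtime.Stack(buf, true) // true = all goroutines
	return os.WriteFile(path, buf[:n], 0o644) == nil
}

func main() {
	fmt.Println(maybeDumpStack("/tmp/agent.stack")) // first call dumps
	fmt.Println(maybeDumpStack("/tmp/agent.stack")) // throttled
}
```

Overwriting the same file keeps resource usage bounded even if the probe keeps firing.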
[ upstream commit 7f1503f ]

When attempting to take ownership of already created CiliumEndpoint
resources, the endpoint sync controller will check that it is safe to do so.

One condition is that, for the endpoint to take ownership of an already-created
CEP, that CEP must be "local".
This comparison was done in all cases; however, when the agent is restarted
and given a different IP, it would prevent restored endpoints from being
managed correctly.

To fix this, this change will skip all checks if the endpoint UID already
matches the CEP UID - and allow the endpointsync to backfill the new IP.

Fixes: cilium#23487

Signed-off-by: Tom Hadlaw <tom.hadlaw@isovalent.com>
Signed-off-by: Tam Mach <tam.mach@cilium.io>
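The fixed check can be sketched as: a UID match short-circuits the locality test, so a restored endpoint stays manageable even when the agent came back with a new IP. Types and field names below are illustrative, not the real endpointsync code:

```go
package main

import "fmt"

// cep is an illustrative stand-in for a CiliumEndpoint resource.
type cep struct {
	UID    string
	NodeIP string
}

// canTakeOwnership sketches the fixed condition: if the restored endpoint's
// UID already matches the CEP's UID, skip all other checks and let
// endpointsync backfill the new IP; otherwise the CEP must be local.
func canTakeOwnership(epUID, localNodeIP string, c cep) bool {
	if epUID == c.UID {
		return true // same endpoint restored after restart; always safe
	}
	return c.NodeIP == localNodeIP
}

func main() {
	existing := cep{UID: "abc-123", NodeIP: "10.0.0.1"}
	// Agent restarted with a new node IP, but the UID matches.
	fmt.Println(canTakeOwnership("abc-123", "10.0.0.2", existing)) // true
}
```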
[ upstream commit d38093a ]

Endpointsync does not delete/recreate CEPs anymore when doing
endpoint sync init.

Signed-off-by: Tom Hadlaw <tom.hadlaw@isovalent.com>
Signed-off-by: Tam Mach <tam.mach@cilium.io>
[ upstream commit 5aa1c23 ]

This commit makes sure that users have options to configure security
contexts at the pod and container levels for all Cilium-related
workloads.

Signed-off-by: Tam Mach <tam.mach@cilium.io>
[ upstream commit 1e98776 ]

Signed-off-by: Jed Salazar <jed@chainguard.dev>
Signed-off-by: Tam Mach <tam.mach@cilium.io>
[ upstream commit 95ca167 ]

Signed-off-by: Jed Salazar <jed@chainguard.dev>
Signed-off-by: Tam Mach <tam.mach@cilium.io>
[ upstream commit c08f691 ]

Signed-off-by: Jed Salazar <jed@chainguard.dev>
Signed-off-by: Tam Mach <tam.mach@cilium.io>
[ upstream commit c67ede3 ]

This reverts commit 88cbfdb.

The above commit introduced two new configuration fields to tune the delay
when handling deletion events in the shared store in the context of
clustermesh, while changing the already existing SharedKeyDeleteDelay
parameter to a pointer to distinguish unset from 0.

Still, this approach has a few shortcomings:
* By default, a 30-second delay is still applied to deletion events (i.e.,
  unless specified otherwise). This defaulting logic (which was already
  present) is hidden in the code, so all shared store occurrences end up
  sticking to it;
* The newly added configuration fields are never set, hence leading to the
  default behavior;
* The usage of a pointer for that field is more cumbersome when it comes
  to assignment operations.

The next commit will address the above issues removing the defaulting to
30 seconds (hence, a 0 value is no longer ambiguous), while explicitly
setting the delay in case of node-related stores only.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
Signed-off-by: Tam Mach <tam.mach@cilium.io>
[ upstream commit 3edae01 ]

3137ad5 ("node: Delay handling of node delete events received via kvstore")
introduced a 30-second delay when handling node deletion events, as a
temporary workaround for connectivity issues towards etcd on flaky
nodes.

Fast-forward almost four years, the same shared store logic is now
leveraged also by other components (e.g., to observe global service changes
in the context of clustermesh). Yet, in these cases the delay leads to
possible issues: for instance, if a global service is deleted in a remote
cluster, the local one is not updated for 30 seconds, causing the remote
backends to be still targeted during that time period (possibly leading to
drops if they are restarted in the meanwhile).

This commit applies the following modifications:
* Do not apply any deletion delay in the shared store by default. This
  prevents new users of that logic from being subject to unexpected delays
  even if none had been configured explicitly.
* Explicitly configure the deletion delay for node-related stores only.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
Signed-off-by: Tam Mach <tam.mach@cilium.io>
[ upstream commit 9790592 ]

Signed-off-by: Maxim Mikityanskiy <maxim@isovalent.com>
Signed-off-by: Tam Mach <tam.mach@cilium.io>
oblazek and others added 4 commits March 1, 2023 23:39
[ upstream commit 1f03fcd ]

Previously, when a remote identity cache was closed, it did not close
the stopGC channel, and therefore leaked the allocator resources.

This commit fixes the issue and properly deletes all resources and
closes the stopGC channel, but closing of events channel is moved
outside of `allocator.Delete()` func as the channel must be closed
only by the main CachingIdentityAllocator (it's shared by each
remote allocator, but not owned).

Signed-off-by: Ondrej Blazek <ondrej.blazek@firma.seznam.cz>
Signed-off-by: Tam Mach <tam.mach@cilium.io>
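The ownership split can be sketched as: each remote cache closes its own stopGC channel, while the shared events channel is closed only by the owning CachingIdentityAllocator. The types below are simplified stand-ins:

```go
package main

import "fmt"

// remoteCache sketches a per-cluster remote identity cache.
type remoteCache struct {
	stopGC chan struct{} // owned by this cache
	events chan string   // shared with other remote allocators; not owned
}

// close releases this cache's resources: it closes stopGC to stop the GC
// goroutine, but deliberately leaves events open because only the main
// CachingIdentityAllocator may close the shared channel.
func (r *remoteCache) close() {
	close(r.stopGC)
}

func main() {
	r := &remoteCache{stopGC: make(chan struct{}), events: make(chan string, 1)}
	r.close()
	_, open := <-r.stopGC
	fmt.Println(open) // false: stopGC is closed
	r.events <- "still usable"
	fmt.Println(<-r.events)
}
```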
[ upstream commit e337ebd ]

Previously, each remote allocator spawned a goroutine to sync all
local keys to the remote etcd. But this makes sense only for the
main identityAllocator, not for the remote ones.

This commit fixes it by using the disableGC option so that the
goroutine is not spawned.

Signed-off-by: Ondrej Blazek <ondrej.blazek@firma.seznam.cz>
Signed-off-by: Tam Mach <tam.mach@cilium.io>
[ upstream commit 0084307 ]

Previously, all session/lock-session controllers used the same
controller name (even the ones for remote clusters), making it
impossible to distinguish between them.

This commit adds a cluster-name suffix to each session controller
name.

Signed-off-by: Ondrej Blazek <ondrej.blazek@firma.seznam.cz>
Signed-off-by: Tam Mach <tam.mach@cilium.io>
…d/write

[ upstream commit 1b92a4d ]

The Cilium operator can crash when the identityToCES map is accessed without a lock during a concurrent write; the error message is "fatal error: concurrent map read and map write". The fix is to add a read lock to the one location in the code where the read is not guarded by a lock.

Signed-off-by: Dorde Lapcevic <dordel@google.com>
Signed-off-by: Tam Mach <tam.mach@cilium.io>
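The fix follows the standard `sync.RWMutex` pattern: the previously unguarded read takes a read lock, so it can run concurrently with other readers but not with writers. Names below mirror the description; the types themselves are illustrative:

```go
package main

import (
	"fmt"
	"sync"
)

// cesManager sketches the operator-side state; identityToCES maps an
// identity to the CiliumEndpointSlices referencing it.
type cesManager struct {
	mu            sync.RWMutex
	identityToCES map[int][]string
}

// getCES is the previously unguarded read: without RLock, a concurrent
// addCES triggers "fatal error: concurrent map read and map write".
func (m *cesManager) getCES(identity int) []string {
	m.mu.RLock()
	defer m.mu.RUnlock()
	return m.identityToCES[identity]
}

func (m *cesManager) addCES(identity int, ces string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.identityToCES[identity] = append(m.identityToCES[identity], ces)
}

func main() {
	m := &cesManager{identityToCES: map[int][]string{}}
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			m.addCES(i%4, "ces") // concurrent writes...
			_ = m.getCES(i % 4)  // ...and reads, now safe
		}(i)
	}
	wg.Wait()
	fmt.Println(len(m.getCES(0))) // 25
}
```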
@pchaigno pchaigno force-pushed the pr/v1.13-backport-2023-03-01 branch from 7741e4f to 07e2714 Compare March 1, 2023 22:39
@pchaigno
Member

pchaigno commented Mar 1, 2023

It seems that #23091 is not included

@ysksuzuki Should be fixed now. Thanks for reporting!

[ upstream commit 73c36d4 ]

To decide which outer IP addresses to use in the XFRM states for IPsec
encapsulation, we will use a node ID that identifies the remote node.
That node ID will be written to the packet mark, which the XFRM states
and policies can match against.

This commit adapts the OUT XFRM state and policies to match on the
packet marks. The node ID is carried in the 16 MSBs of the packet mark.

Function enableSubnetIPsec isn't changed here because it will be almost
entirely rewritten in a later commit.

Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit b596cea ]

The previous commit made use of the node IDs for the IPsec encryption.
Node IDs are encoded on 16-bits, therefore limiting the number of nodes
in the cluster to 64k for anything that uses them. This commit documents
that limitation.

Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit 19a62da ]

In a subsequent commit, we will need offset 4 of skb->cb. Unfortunately,
this offset is currently used to store the tunnel endpoint IP. We store
the tunnel endpoint IP before encryption and, after encryption, we pass
it to the encapsulation functions to build the correct VXLAN or GENEVE
outer IP header. (Remember we do ESP-on-VXLAN/GENEVE.)

Thus, to free offset 4, we need another way to retrieve the tunnel
endpoint IP. Currently, in Cilium with IPsec and tunneling, packets are
encrypted with the CiliumInternalIPs (cilium_host IPs) and then
encapsulated with the NodeInternalIPs. The CiliumInternalIPs are in the
ipcache and have a corresponding tunnel endpoint, the corresponding
NodeInternalIP. Therefore, when we receive the to-be-encapsulated,
encrypted packets, we can lookup their destination IP in the ipcache to
retrieve the tunnel endpoint IP to use for VXLAN or GENEVE
encapsulation.

This commit implements just that. If the ipcache lookup fails, we drop
the packet with a DROP_NO_TUNNEL_ENDPOINT event.

Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit 4c7cce1 ]

Remove all code pertaining to IP_POOLS in the datapath, including the
CB_ENCRYPT_DST skb->cb slot and the code to rewrite the outer IP
header.

As a result of this commit, IPsec will be broken on AKS, EKS, and
anywhere using IP_POOLS (that is, whenever we don't have a single pod
allocation CIDR per node). A subsequent commit in this series will fix
that. I couldn't find a way to keep a clean commit history without
breaking this temporarily.
Compilation still works so most git bisect use cases should still work.

Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit 4de20fc ]

We need to free the packet mark between bpf_lxc and bpf_host so that it
can carry the node ID when IPsec is enabled, to match against XFRM
states. Instead of using the packet mark to pass the source security ID,
we thus use a skb->cb field.

Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit dadf041 ]

Previous commits (1) populated the ipcache with the ID of the node
hosting each remote endpoint and (2) updated XFRM marks to expect a node
ID such that:

    0xXXXXXE00
      ^   ^^
      |   |+-- Set to 0xE to require encryption
      |   +-- Set to the IPsec SPI to use
      +----- Set to the node ID

This commit updates our BPF code to set the packet mark expected by the
XFRM policies & states, using information from the ipcache.

Signed-off-by: Paul Chaignon <paul@cilium.io>
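The mark layout from the diagram above can be encoded with straightforward bit shifts: node ID in the 16 MSBs, the SPI nibble at bits 12-15, and 0xE00 as the encryption magic. A minimal sketch; the constant and function names are illustrative:

```go
package main

import "fmt"

// markMagicEncrypt is the 0xE00 nibble pattern that requires encryption,
// per the 0xXXXXXE00 diagram in the commit message.
const markMagicEncrypt uint32 = 0x0E00

// ipsecMark builds the packet mark expected by the XFRM policies and
// states: node ID in the 16 MSBs, SPI in the next nibble, encrypt magic
// in the low bits.
func ipsecMark(nodeID uint16, spi uint8) uint32 {
	return uint32(nodeID)<<16 | uint32(spi&0xF)<<12 | markMagicEncrypt
}

func main() {
	fmt.Printf("%#x\n", ipsecMark(0x1234, 3)) // 0x12343e00
}
```

Because the node ID occupies the 16 MSBs, this is also where the 64k-node limit documented earlier in this series comes from.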
@pchaigno pchaigno marked this pull request as ready for review March 1, 2023 23:06
@sayboras
Member Author

sayboras commented Mar 2, 2023

It seems that #23091 is not included

Sorry about this, the concern we had before was resolved, so it will be part of the next round.

[ upstream commit e625cde ]

IPsec-encrypted packets coming from the overlay are recirculated to the
overlay device and bpf_overlay after decryption. On the first pass, we
can retrieve the source security identity from the tunnel metadata, but
on the second pass, we don't have that metadata anymore. Therefore,
today, we pass the security identity from the first bpf_overlay
traversal to the second via the packet mark.

Unfortunately, in the next commit, we need the packet mark's bits to
match against XFRM IN states. This commit therefore frees those mark
bits.

To that end, we perform an ipcache lookup on the second bpf_overlay
traversal to retrieve the security identity for the inner source IP
address. That has obviously a higher cost than reading the packet mark,
but given we're on the decryption path, it's probably negligible
anyway.

As a result of this change, we will also be missing the source identity
in the very first TRACE_FROM_STACK packet trace in bpf_overlay. I think
that's fine as it's usually the case for TRACE_FROM_* traces.

Signed-off-by: Paul Chaignon <paul@cilium.io>
[ upstream commit 3e59b68 ]

On EKS & AKS, we have only one XFRM state and policy per direction. We
install those at startup (vs. when receiving k8s node objects on e.g.
GKE). This is a consequence of their IPAM modes (cannot predict node
from pod IP as we don't have per-node pod CIDRs).

To encode the outer IP addresses to use for IPsec encapsulation, this
however needs to change. We need to install one XFRM state and policy
per remote node. We therefore need to install those on reception of the
k8s node objects, as with the cluster-pool IPAM.

This commit implements this change. All XFRM state and policies will be
installed by the same function, enableIPsec, regardless of the
Kubernetes provider. Existing logic in enableIPsec is therefore pushed
behind a !n.subnetEncryption() guard. Existing logic for EKS and AKS'
IPAMs is moved from enableSubnetIPsec to enableIPsec.

This commit fixes IPsec support on EKS and AKS that was broken by a
previous commit ("bpf: Remove IP_POOLS IPsec code") in this series.

A lot of code moves around, so here's a summary of code changes:
1. The four helper functions at the beginning of the changes are removed
   because they won't be used anymore.
2. All code to install the kernel's XFRM config for EKS & AKS is moved
   from enableSubnetIPsec to enableIPsec. This is because this code will
   now be executed in reaction to nodes being created, as on GKE, and
   not once and for all.
3. In enableIPsec, we differentiate between the AKS/EKS and the GKE
   cases with the n.subnetEncryption() condition (true if AKS or EKS).

Signed-off-by: Paul Chaignon <paul@cilium.io>
@pchaigno pchaigno force-pushed the pr/v1.13-backport-2023-03-01 branch from 6b0aae8 to fafaa64 Compare March 2, 2023 11:59
@oblazek
Contributor

oblazek commented Mar 2, 2023

my commits lgtm

@pchaigno
Member

pchaigno commented Mar 2, 2023

/test-backport-1.13

Job 'Cilium-PR-K8s-1.21-kernel-4.9' hit: #22578 (95.32% similarity)

@pchaigno
Member

pchaigno commented Mar 2, 2023

I've tested the IPsec backports on 3-nodes GKE and EKS clusters with the connectivity tests.

@sayboras
Member Author

sayboras commented Mar 3, 2023

/test-1.21-4.9

@YutaroHayakawa YutaroHayakawa merged commit ca33ba6 into cilium:v1.13 Mar 3, 2023
@sayboras sayboras deleted the pr/v1.13-backport-2023-03-01 branch March 3, 2023 09:11