v1.13 backports 2022-12-21 #22822
Conversation
[ upstream commit bdc23f5 ] These tests deploy the standalone agent in a funny way; using the existing helm chart, but not configuring it fully. This breaks the config-resolver, which expects to be able to reach a functioning in-cluster apiserver. So, just pass the correct apiserver to the config resolver. Signed-off-by: Casey Callendrello <cdc@isovalent.com> Signed-off-by: Joe Stringer <joe@cilium.io>
[ upstream commit 0f2f3fe ] Add some missing helper functions to implement v3 backend map. 1. A new AddrCluster constructor AddrClusterFrom 2. A new method of AddrCluster to extract ClusterID Signed-off-by: Yutaro Hayakawa <yutaro.hayakawa@isovalent.com> Signed-off-by: Joe Stringer <joe@cilium.io>
[ upstream commit c6e53ea ] Extend current backend map values (struct lb4_backend and struct lb6_backend) to contain a new cluster_id field. With this field, we can distinguish two backends which have the same IP address, but belong to different clusters. Since we don't have available padding bit for both structs anymore, we need to extend the structs and bump the map version as well. This commit also contains migration code from v2 backend map to v3 backend map. Signed-off-by: Yutaro Hayakawa <yutaro.hayakawa@isovalent.com> Signed-off-by: Joe Stringer <joe@cilium.io>
[ upstream commit 4d7acac ] Currently, when a cilium-agent in eni/alibabacloud mode initializes router info, it might encounter the following fatal error:

```
level=fatal msg="Error while creating daemon" error="failed to create router info invalid ip: " subsys=daemon
```

The gateway IP of the routing info is derived from the CIDR of the subnet, but eni.Subnet.CIDR in InstanceMap is left empty after ENI creation. In normal cases it will be filled by a later resyncTrigger after maintainIPPool. But if another goroutine (periodic resync or the pool maintainer of another node) happens to sync the local InstanceMap cache to k8s, cilium-agent would be informed of that ENI IP pool with an empty CIDR, and router IP allocation would fail fatally due to the empty gateway IP. This patch fixes this by filling the CIDR right after ENI creation. Signed-off-by: Jaff Cheng <jaff.cheng.sh@gmail.com> Signed-off-by: Joe Stringer <joe@cilium.io>
[ upstream commit ced77b3 ] In the L4LB mode, the PERM L2 entries are managed by users. So, in this mode Cilium should not mess with them. Signed-off-by: Martynas Pumputis <m@lambda.lt> Signed-off-by: Joe Stringer <joe@cilium.io>
[ upstream commit 857060b ] Add a new type CiliumClusterConfig which represents a cluster configuration. This will be serialized and stored into kvstore during the clustermesh-apiserver startup time. Later on, cilium-agent on each node reads it when connecting to new clusters. The current use case of this is getting ClusterID at connect time, but by exposing the cluster configuration, we can also do some useful validation such as - Make sure the cluster id is not conflicting with existing clusters. - Make sure the new cluster doesn't have any capability mismatch. Signed-off-by: Yutaro Hayakawa <yutaro.hayakawa@isovalent.com> Signed-off-by: Joe Stringer <joe@cilium.io>
[ upstream commit b24973e ] Add helper functions to set/get cluster information on the kvstore. Signed-off-by: Yutaro Hayakawa <yutaro.hayakawa@isovalent.com> Signed-off-by: Joe Stringer <joe@cilium.io>
[ upstream commit 5e5a26e ] Implement a basic connect-time validation using CiliumClusterConfig. clustermesh-apiserver is modified to set local CiliumClusterConfig on start-up time and cilium-agent is modified to get CiliumClusterConfig of remote clusters and validates it. Signed-off-by: Yutaro Hayakawa <yutaro.hayakawa@isovalent.com> Signed-off-by: Joe Stringer <joe@cilium.io>
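The connect-time flow described by the last three commits can be sketched roughly as below. The key layout, field set, and function names are assumptions for illustration, not the actual Cilium clustermesh code:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// CiliumClusterConfig sketches the per-cluster configuration that
// clustermesh-apiserver serializes into the kvstore at startup.
type CiliumClusterConfig struct {
	ID uint32 `json:"id"`
}

// clusterConfigKey returns a hypothetical kvstore key under which a
// cluster's config would be stored.
func clusterConfigKey(clusterName string) string {
	return "cilium/cluster-config/" + clusterName
}

// validateRemoteConfig shows the connect-time checks described above:
// a missing config is tolerated for compatibility with old remote
// clusters, but a cluster ID that collides with an already-connected
// cluster is rejected.
func validateRemoteConfig(cfg *CiliumClusterConfig, knownIDs map[uint32]string) error {
	if cfg == nil {
		return nil // remote cluster predates the cluster config feature
	}
	if name, ok := knownIDs[cfg.ID]; ok {
		return fmt.Errorf("cluster ID %d already used by cluster %q", cfg.ID, name)
	}
	return nil
}

func main() {
	raw, _ := json.Marshal(CiliumClusterConfig{ID: 3})
	fmt.Println(clusterConfigKey("cluster-2"), string(raw))

	err := validateRemoteConfig(&CiliumClusterConfig{ID: 3}, map[uint32]string{3: "cluster-1"})
	fmt.Println(err) // ID conflict is reported
}
```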
[ upstream commit 398cf5e ] For compatibility with an old Cilium that doesn't support cluster configuration feature, we should be able to connect to the remote cluster even if the configuration is missing. Test it. Signed-off-by: Yutaro Hayakawa <yutaro.hayakawa@isovalent.com> Signed-off-by: Joe Stringer <joe@cilium.io>
[ upstream commit 25fa14c ] This allows us to stress / benchmark the main callback function of the DNS proxy within Cilium. This callback is called on each DNS request and response made by each endpoint managed by Cilium. For this commit, we are only considering the DNS response path because it is the path which contains the most work to be done. This includes policy calculation, identity allocation, and ipcache upsert. It is useful to establish baseline performance numbers so that we can compare against changes to this code path. In the following commits, a mutex will be added to this path to fix a race condition during parallel DNS response handling. Baseline:

```
$ go test -v -tags integration_tests ./daemon/cmd -check.b -check.bmem -check.f DaemonFQDNSuite.Benchmark_notifyOnDNSMsg -test.benchtime 100x
...
PASS: fqdn_test.go:180: DaemonFQDNSuite.Benchmark_notifyOnDNSMsg  20000  96078 ns/op  36567 B/op  482 allocs/op
```

Signed-off-by: Chris Tarazi <chris@isovalent.com> Signed-off-by: Joe Stringer <joe@cilium.io>
[ upstream commit 604eb21 ] It is possible for parallel DNS requests handled by separate goroutines (controlled by --dnsproxy-concurrency-limit) to race if they are for the same DNS name, i.e. both goroutines handling cilium.io. Here is a diagram for this race:

```
         G1                                  G2

T0 --> NotifyOnDNSMsg()                   NotifyOnDNSMsg() <-- T0

T1 --> UpdateGenerateDNS()                UpdateGenerateDNS() <-- T1

T2 ----> mutex.Lock()                     +---------------------------+
                                          |No identities need updating|
T3 ----> mutex.Unlock()                   +---------------------------+

T4 --> UpsertGeneratedIdentities()        UpsertGeneratedIdentities() <-- T4

T5 ----> Upsert()                         DNS released back to pod <-- T5
                                              |
T6 --> DNS released back to pod               |
        |                                     |
        v                                     v
  Traffic flows fine                  Leads to policy drop
```

Note how G2 releases the DNS msg back to the pod at T5 because UpdateGenerateDNS() was a no-op. It's a no-op because G1 had executed UpdateGenerateDNS() first at T1 and performed the necessary identity allocation for the response IPs. Due to G1 performing all the work first, G2 executes T4 also as a no-op and releases the msg back to the pod at T5, before G1 would at T6.

The above results in a policy drop because it is assumed that when the DNS msg is released back to the pod / endpoint, the ipcache has been updated with the proper IP <-> identity mapping. In other words, because G2 releases the DNS msg back to the pod / endpoint, the app attempts to use the IPs returned in the DNS response, but the ipcache lookup fails in the datapath because the IP <-> identity mappings don't exist. In essence, this is a race between DNS requests to the ipcache.

To fix this, we create a critical section when handling the DNS response IPs. Allocating identities and associating them with IPs in the ipcache is now atomic because of the critical section. This is especially important because, as we've already seen, multiple DNS requests can be in-flight for the same name (i.e. cilium.io).

The critical section is implemented using a fixed-size array of mutexes. Why? Because the other options (listed below) all have flaws which may result in drops. Other options considered:

1. A map of mutexes keyed on DNS name
2. A map of mutexes keyed on the set of DNS response IPs
3. A map of mutexes keyed on endpoint ID
4. A slice of mutexes with a hash based on individual IPs from the set of DNS response IPs

(1) could cause a race between the DNS response IPs from `a.com` and the set of DNS response IPs from `b.com`, especially if the sets are the same. (2) could race between a super / subset of DNS response IPs, i.e. only an intersection of the sets would have protection; any IPs that are not in the intersection are vulnerable to the race. (3) could cause a race between different endpoints querying the same name, i.e. two endpoints querying cilium.io. (4) could work, but this approach is not ideal as the slice scales with the number of unique IPs seen by the DNS proxy, which has memory consumption concerns.

Therefore, we opt for a fixed-size array of mutexes where we hash the individual IPs from the set of DNS response IPs. For example, say we attempt to resolve cilium.io and the response IPs are A and B:

```
hash(A) % mod = 8
hash(B) % mod = 3
```

where mod is --dnsproxy-lock-count, i.e. the size of the array. What's returned is a slice containing the above array indices, sorted in ascending order. This is the order in which the mutexes must be locked and unlocked. To be specific, [3, 8] is the sorted result, and it means we acquire & release the mutex at index 3 and then do the same for the mutex at index 8. What this does is essentially serialize parallel DNS requests which involve the same IP <-> identity mappings in the ipcache. It's possible that the hashes of different IPs collide (map to the same mutex), but this is OK because it's overprotecting, rather than underprotecting, which the other options above suffer from.

Overprotecting results in a small performance hit because we are potentially serializing completely different / unrelated DNS requests, i.e. requests with no intersection in their response IP sets. For --dnsproxy-lock-count, a default of 128 was chosen as a reasonable balance between memory usage and hash collisions for most deployments of Cilium. An increased lock count will result in more memory usage, but faster DNS processing as there are fewer IP hash collisions, and therefore fewer mutexes to acquire. A decreased lock count will save memory at the cost of slower DNS processing, as IP hash collisions become more likely and behavior tends towards that of a single, shared mutex. Users who have many parallel DNS requests (1000s) can make the trade-off of using more memory to avoid hash collisions. It's worth mentioning that the memory usage is for storing an instance of a mutex, which is 64 bits as of Go 1.19 [1].

Benchmark run comparing the code before this commit, a variant with the fixed-size array reduced to a single element to simulate one shared mutex, and this commit (multiple-locks):

```
$ go test -v -tags integration_tests ./daemon/cmd -check.b -check.bmem -check.f DaemonFQDNSuite.Benchmark_notifyOnDNSMsg -test.benchtime 100x -test.count 8
```

```
$ benchstat no-lock.txt single-lock.txt multiple-locks.txt
name \ time/op     no-lock.txt   single-lock.txt  multiple-locks.txt
_notifyOnDNSMsg    93.2µs ± 9%   179.6µs ± 3%     143.2µs ±28%

name \ alloc/op    no-lock.txt   single-lock.txt  multiple-locks.txt
_notifyOnDNSMsg    36.5kB ± 2%   35.3kB ± 0%      33.8kB ± 1%

name \ allocs/op   no-lock.txt   single-lock.txt  multiple-locks.txt
_notifyOnDNSMsg    485 ± 1%      474 ± 0%         453 ± 1%
```

Reproducing the bug and performance testing on a real cluster consisted of running the 3 different modes and comparing the average processing time of the DNS proxy code. The cluster was 2 nodes with 32 replicas of the following workload:

```
# websites.txt contains different common URLs, some duplicated.
while [[ true ]]; do
  curl --parallel --parallel-immediate --parallel-max 5 --config websites.txt
  sleep 0.25s
done
```

No lock: 4.61ms, 18.3ms
Single lock: 7.42ms, 18.3ms
Multiple locks: 5.25ms, 17.9ms

[1]: https://cs.opensource.google/go/go/+/refs/tags/go1.19.3:src/sync/mutex.go;l=34

Co-authored-by: Michi Mutsuzaki <michi@isovalent.com> Signed-off-by: Chris Tarazi <chris@isovalent.com> Signed-off-by: Joe Stringer <joe@cilium.io>
[ upstream commit 9eb0bfb ] Optimize policy processing by precomputing redirect types for the policy when ready to avoid multiple scans of the policy filters later. Signed-off-by: Jarno Rajahalme <jarno@isovalent.com> Signed-off-by: Joe Stringer <joe@cilium.io>
[ upstream commit 38589e7 ] Add a sticky parser type for CiliumEnvoyConfig CRDs. This will be used for policy based redirect to custom Envoy listeners. While CRD parser type will be redirected to Envoy, it is generally handled by a custom Listener, which may not perform HTTP policy enforcement. Thus this parser type is incompatible with HTTP rules. Signed-off-by: Jarno Rajahalme <jarno@isovalent.com> Signed-off-by: Joe Stringer <joe@cilium.io>
[ upstream commit 27861ae ] Signed-off-by: Jarno Rajahalme <jarno@isovalent.com> Signed-off-by: Joe Stringer <joe@cilium.io>
[ upstream commit 7b01459 ] Listener config can fail if we don't wait for clusters to be configured first, so wait for clusters to be configured if both clusters and listeners are configured at the same time. While failure like this was once seen on local testing for the custom listener support in a CNP, similar failure could also happen for Cilium Ingress and Gateway API. Signed-off-by: Jarno Rajahalme <jarno@isovalent.com> Signed-off-by: Joe Stringer <joe@cilium.io>
[ upstream commit 0036e24 ] Support injecting Cilium Network filter also to TcpProxy filter chains. This enables SNI and other L3/L4 policy enforcement features also on custom Envoy listeners applied via Cilium Envoy Config resources that contain Envoy filter chains using TcpProxy filter, in addition to Http Connection Manager filter. Signed-off-by: Jarno Rajahalme <jarno@isovalent.com> Signed-off-by: Joe Stringer <joe@cilium.io>
[ upstream commit 41fe8c8 ] Tell envoy.ParseResources() if the resources are intended for a L7 load balancer. If so, the Envoy filter chain is configured to not use the original source address and to use the socket mark as the source endpoint ID. This was the current behavior with all CEC CRDs. When the CEC CRD does not contain a frontend service that is redirected to the defined listener, the new 'isL7LB' flag is passed as 'false'. In this case the Envoy filter chain is configured to allow use of the original source address, and to mark the upstream socket with the numeric source security identity. This is also the way in which Cilium uses Envoy listener filter chains when enforcing policy for HTTP, for example. In this mode L4 LB is assumed to have taken place, if applicable, before traffic is redirected to the Envoy listener. Prior to this change CEC CRDs were mostly only usable for L7 LB uses, such as with Cilium Ingress. After this change CEC CRDs without a Service redirect can be used in place of Cilium policy enforcement filter chains, if desired. Signed-off-by: Jarno Rajahalme <jarno@isovalent.com> Signed-off-by: Joe Stringer <joe@cilium.io>
[ upstream commit a679c40 ] Specify proxy port as egress (or not-ingress) in test cases as the datapath currently supports ProxyTypeCRD only for L7 LB which is always on the egress path. Signed-off-by: Jarno Rajahalme <jarno@isovalent.com> Signed-off-by: Joe Stringer <joe@cilium.io>
[ upstream commit aff0655 ] Add a new no-op CRDRedirect type to be used with Envoy listeners defined in CEC CRDs. In this case the listeners already exist and new Envoy Listener resources do not need to be created for them. This is needed for the forthcoming policy feature where policy can refer to a Listener defined in CEC CRD. Signed-off-by: Jarno Rajahalme <jarno@isovalent.com> Signed-off-by: Joe Stringer <joe@cilium.io>
[ upstream commit 06ed0a4 ] Find CRD proxy port by name instead of type. This is needed for enabling CEC CRD defined listeners to be used in CNPs. Prior to this CRD proxy ports did not use this code path, which is only called from endpoint policy updates, so there was no need to find CRD proxy ports by name. Signed-off-by: Jarno Rajahalme <jarno@isovalent.com> Signed-off-by: Joe Stringer <joe@cilium.io>
[ upstream commit ede26fb ] Move resourceQualifiedName() to policy/api and export it so that it can be used in policy as well as in envoy package. Signed-off-by: Jarno Rajahalme <jarno@isovalent.com> Signed-off-by: Joe Stringer <joe@cilium.io>
[ upstream commit 9159609 ] Add `listener` field to CNP and CCNP, that causes traffic to the specified port(s) to be redirected to the named Envoy listener. If the listener does not exist the traffic is not allowed at all. When `listener.envoyConfig.kind` is left out it defaults to namespaced `CiliumEnvoyConfig` for rules in namespaced policies (CNP) or to cluster-scoped `CiliumClusterwideEnvoyConfig` for rules in cluster-scoped policies (CCNP). Namespaced policies can also refer to cluster-scoped listeners with an explicit `listener.envoyConfig.kind: CiliumClusterwideEnvoyConfig`. Cluster-scoped policies can not refer to namespaced listeners. Endpoint policies are regenerated whenever Envoy listeners change to update potential listener redirections in the bpf policy maps. Signed-off-by: Jarno Rajahalme <jarno@isovalent.com> Signed-off-by: Joe Stringer <joe@cilium.io>
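The kind-defaulting rules above amount to a small decision table. The sketch below shows them in Go; the helper name and signature are hypothetical, for illustration only:

```go
package main

import "fmt"

// defaultListenerKind resolves the effective listener.envoyConfig.kind
// for a policy rule, following the rules described in the commit:
//   - explicit kind wins, except that cluster-scoped policies (CCNP)
//     can not refer to namespaced listeners;
//   - omitted kind defaults to CiliumEnvoyConfig for namespaced
//     policies (CNP) and CiliumClusterwideEnvoyConfig for CCNP.
func defaultListenerKind(explicitKind string, policyIsClusterScoped bool) (string, error) {
	if explicitKind != "" {
		if policyIsClusterScoped && explicitKind == "CiliumEnvoyConfig" {
			return "", fmt.Errorf("cluster-scoped policies can not refer to namespaced listeners")
		}
		return explicitKind, nil
	}
	if policyIsClusterScoped {
		return "CiliumClusterwideEnvoyConfig", nil
	}
	return "CiliumEnvoyConfig", nil
}

func main() {
	k, _ := defaultListenerKind("", false)
	fmt.Println(k) // default kind for a CNP rule
}
```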
[ upstream commit 527f5fd ] This makes it much easier to track down which code path the executor logs are coming from, so that when debugging issues, we can focus on the relevant code path. Signed-off-by: Chris Tarazi <chris@isovalent.com> Signed-off-by: Joe Stringer <joe@cilium.io>
[ upstream commit bf3fd5f ] Signed-off-by: Chris Tarazi <chris@isovalent.com> Signed-off-by: Joe Stringer <joe@cilium.io>
[ upstream commit dab8723 ] Previously, 11cb4d0 assumed that 137 was the exit code for when a process exits due to a SIGKILL. However, upon reading the Go source code as of 1.20rc1, this is not the case: -1 is set as the exit code for all processes terminated by signals [1]. Fixes: 11cb4d0 ("test: Keep trying exec if killed") Fixes: #22570 [1]: https://github.com/golang/go/blob/go1.20rc1/src/os/exec_posix.go#L128-L130 Signed-off-by: Chris Tarazi <chris@isovalent.com> Signed-off-by: Joe Stringer <joe@cilium.io>
[ upstream commit e49ab12 ] When endpoint routes are enabled, we should enforce ingress policies at the destination lxc interface, to which a BPF program will be attached. Nevertheless, today, for packets coming from the overlay, we enforce ingress policies twice, once at the e.g. cilium_vxlan interface and a second time at the lxc device. This is happening for two reasons: 1. bpf_overlay is not aware of the endpoint routes settings so it doesn't even know that it's not responsible for enforcing ingress policies. 2. We have a flag to force the enforcement of ingress policies at the source in this case. This flag exists for historic reasons that are not valid anymore. A separate patch will fix the reason 2 above. This commit fixes reason 1 by telling bpf_overlay to *not* enforce ingress policies when endpoint routes are enabled. Note that we do not support the case where some endpoint have endpoint routes enabled and others don't. If we did, additional logic would be required. Fixes: 3179a47 ("datapath: Support enable-endpoint-routes with encapsulation") Signed-off-by: Paul Chaignon <paul@cilium.io> Signed-off-by: Joe Stringer <joe@cilium.io>
[ upstream commit 3d2ceaf ] The previous commit changed the packet handling on the path overlay->lxc to fix a bug. More precisely, when endpoint routes are enabled, we won't enforce ingress policies on both the overlay and the lxc devices but only on the latter. However, as a consequence of that patch, we don't go through the policy-only program in bpf_lxc and we therefore changed the way the packet is transmitted between overlay and lxc devices in some cases. As a summary of the changes made in the previous patch, consider the following tables for the path overlay -> lxc.

Before the previous patch:

```
| Endpoint routes | Enforcement     | Path                 |
|-----------------|-----------------|----------------------|
| Enabled         | overlay AND lxc | bpf_redirect if KPR; |
|                 |                 | stack otherwise      |
| Disabled        | overlay         | bpf_redirect         |
```

Now:

```
| Endpoint routes | Enforcement | Path         |
|-----------------|-------------|--------------|
| Enabled         | lxc         | bpf_redirect |
| Disabled        | overlay     | bpf_redirect |
```

The previous patch intended to fix the enforcement to avoid the double policy enforcement, but it also changed the packet path in case endpoint routes are enabled. This patch fixes this by adding the same exception we have in bpf_lxc to the l3.h logic. Hence, with the current patch, the table looks like:

```
| Endpoint routes | Enforcement | Path                 |
|-----------------|-------------|----------------------|
| Enabled         | lxc         | bpf_redirect if KPR; |
|                 |             | stack otherwise      |
| Disabled        | overlay     | bpf_redirect         |
```

I've kept this in a separate commit from the previous one in an attempt to split up the logic and more clearly show the deltas. Signed-off-by: Paul Chaignon <paul@cilium.io> Signed-off-by: Joe Stringer <joe@cilium.io>
[ upstream commit 843c072 ] The encryption test details - cilium/cilium-cli#1241. Signed-off-by: Martynas Pumputis <m@lambda.lt> Signed-off-by: Joe Stringer <joe@cilium.io>
[ upstream commit 884ccc1 ] This generates the CiliumNodeConfig type, a new way to set configuration overrides on a set of nodes. Signed-off-by: Casey Callendrello <cdc@isovalent.com> Signed-off-by: Joe Stringer <joe@cilium.io>
/test-backport-1.13

Job 'Cilium-PR-K8s-1.21-kernel-4.9' failed.
Job 'Cilium-PR-K8s-1.16-kernel-4.9' failed.
My commit (and the Hubble changes) looks good, thanks!
Looks good for my commit, thanks 💯
v1.13 backports 2022-12-21

Skipped due to conflicts, seems some datapath testing PRs are not tagged for backport:

* #22654 -- backporting: leave `backport/author` PRs alone (@bimmlerd)
* #22796 -- ci: Replace deprecated `hubble observe -o json` with `-o jsonpb` (@gandro)
Once this PR is merged, you can update the PR labels via: