shared client: fix memory leak and add fail-safe mechanism #10
Merged
Conversation
In case there were ~200 concurrent requests, there was a high chance that some would get the same request ID, which would cause a goroutine leak.
Signed-off-by: Marcel Zieba <marcel.zieba@isovalent.com>
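To make the hazard and the kind of fail-safe the title mentions concrete, here is a minimal Go sketch. It is illustrative only, not the actual cilium/dns change; the `inflight` type, `acquireID`, and the retry bound are assumptions.

```go
package shareddns

import (
	"fmt"
	"math/rand"
	"sync"
)

// inflight tracks the DNS message IDs of queries still waiting for a
// response. If two concurrent queries shared an ID, the response
// demultiplexer could wake the wrong waiter and leave the other goroutine
// blocked forever, i.e. the leak described above.
type inflight struct {
	mu      sync.Mutex
	waiters map[uint16]chan struct{}
}

// acquireID picks a message ID that is not currently in flight. Failing
// after a bounded number of attempts (instead of looping forever) is the
// fail-safe: the caller gets an error rather than a stuck goroutine.
func (f *inflight) acquireID() (uint16, chan struct{}, error) {
	f.mu.Lock()
	defer f.mu.Unlock()
	for attempt := 0; attempt < 64; attempt++ {
		id := uint16(rand.Intn(1 << 16))
		if _, busy := f.waiters[id]; !busy {
			ch := make(chan struct{}, 1)
			f.waiters[id] = ch
			return id, ch, nil
		}
	}
	return 0, nil, fmt.Errorf("no free DNS message ID after 64 attempts")
}

// release frees the ID once the response arrived or the query timed out.
func (f *inflight) release(id uint16) {
	f.mu.Lock()
	defer f.mu.Unlock()
	delete(f.waiters, id)
}
```

With ~200 concurrent requests drawing from a 65536-value ID space, any single collision is unlikely but, birthday-problem style, collisions become near-certain over many requests, so ID allocation has to consult the in-flight set instead of drawing blindly.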
aanm approved these changes on Apr 11, 2024
marseel added a commit to marseel/cilium that referenced this pull request on Apr 15, 2024:
When the number of concurrent DNS requests was moderately high, there was a chance that some of the goroutines would get stuck waiting for a response. Contains fix from cilium/dns#10 Signed-off-by: Marcel Zieba <marcel.zieba@isovalent.com>
marseel added a commit to marseel/cilium that referenced this pull request on Apr 15, 2024:
When the number of concurrent DNS requests was moderately high, there was a chance that some of the goroutines would get stuck waiting for a response. Contains fix from cilium/dns#10 Signed-off-by: Marcel Zieba <marcel.zieba@isovalent.com>
github-merge-queue bot pushed a commit to cilium/cilium that referenced this pull request on Apr 16, 2024:
When the number of concurrent DNS requests was moderately high, there was a chance that some of the goroutines would get stuck waiting for a response. Contains fix from cilium/dns#10 Signed-off-by: Marcel Zieba <marcel.zieba@isovalent.com>
giorio94 pushed a commit to cilium/cilium that referenced this pull request on Apr 18, 2024:
[ upstream commit 804d5f0 ] When the number of concurrent DNS requests was moderately high, there was a chance that some of the goroutines would get stuck waiting for a response. Contains fix from cilium/dns#10 Signed-off-by: Marcel Zieba <marcel.zieba@isovalent.com>
jschwinger233 pushed a commit to cilium/cilium that referenced this pull request on Apr 22, 2024:
[ upstream commit 804d5f0 ] [ backporter's note: minor go.mod conflicts resolved by "go mod tidy" ] When the number of concurrent DNS requests was moderately high, there was a chance that some of the goroutines would get stuck waiting for a response. Contains fix from cilium/dns#10 Signed-off-by: Marcel Zieba <marcel.zieba@isovalent.com> Signed-off-by: Zhichuan Liang <gray.liang@isovalent.com>
jschwinger233 pushed a commit to cilium/cilium that referenced this pull request on Apr 22, 2024:
[ upstream commit 804d5f0 ] [ backporter's note: minor go.mod conflicts resolved by "go mod tidy && go mod vendor" ] When the number of concurrent DNS requests was moderately high, there was a chance that some of the goroutines would get stuck waiting for a response. Contains fix from cilium/dns#10 Signed-off-by: Marcel Zieba <marcel.zieba@isovalent.com> Signed-off-by: Zhichuan Liang <gray.liang@isovalent.com>
jschwinger233 pushed a commit to cilium/cilium that referenced this pull request on Apr 22, 2024:
[ upstream commit 804d5f0 ] [ backporter's note: conflicts in go.mod/go.sum ] When the number of concurrent DNS requests was moderately high, there was a chance that some of the goroutines would get stuck waiting for a response. Contains fix from cilium/dns#10 Signed-off-by: Marcel Zieba <marcel.zieba@isovalent.com> Signed-off-by: Zhichuan Liang <gray.liang@isovalent.com>
jschwinger233 pushed a commit to cilium/cilium that referenced this pull request on Apr 22, 2024:
[ upstream commit 804d5f0 ] When the number of concurrent DNS requests was moderately high, there was a chance that some of the goroutines would get stuck waiting for a response. Contains fix from cilium/dns#10 Signed-off-by: Marcel Zieba <marcel.zieba@isovalent.com> Signed-off-by: Zhichuan Liang <gray.liang@isovalent.com>
jschwinger233 pushed a commit to cilium/cilium that referenced this pull request on Apr 22, 2024:
[ upstream commit 804d5f0 ] When the number of concurrent DNS requests was moderately high, there was a chance that some of the goroutines would get stuck waiting for a response. Contains fix from cilium/dns#10 Signed-off-by: Marcel Zieba <marcel.zieba@isovalent.com> Signed-off-by: Zhichuan Liang <gray.liang@isovalent.com>
aditighag pushed a commit to cilium/cilium that referenced this pull request on Apr 22, 2024:
[ upstream commit 804d5f0 ] When the number of concurrent DNS requests was moderately high, there was a chance that some of the goroutines would get stuck waiting for a response. Contains fix from cilium/dns#10 Signed-off-by: Marcel Zieba <marcel.zieba@isovalent.com>
jschwinger233 pushed a commit to cilium/cilium that referenced this pull request on Apr 23, 2024:
[ upstream commit 804d5f0 ] When the number of concurrent DNS requests was moderately high, there was a chance that some of the goroutines would get stuck waiting for a response. Contains fix from cilium/dns#10 Signed-off-by: Marcel Zieba <marcel.zieba@isovalent.com> Signed-off-by: Zhichuan Liang <gray.liang@isovalent.com>
ti-mo pushed a commit to cilium/cilium that referenced this pull request on Apr 23, 2024:
[ upstream commit 804d5f0 ] When the number of concurrent DNS requests was moderately high, there was a chance that some of the goroutines would get stuck waiting for a response. Contains fix from cilium/dns#10 Signed-off-by: Marcel Zieba <marcel.zieba@isovalent.com> Signed-off-by: Zhichuan Liang <gray.liang@isovalent.com>
tklauser pushed a commit to cilium/cilium that referenced this pull request on Apr 23, 2024:
[ upstream commit 804d5f0 ] When the number of concurrent DNS requests was moderately high, there was a chance that some of the goroutines would get stuck waiting for a response. Contains fix from cilium/dns#10 Signed-off-by: Marcel Zieba <marcel.zieba@isovalent.com> Signed-off-by: Zhichuan Liang <gray.liang@isovalent.com>
ubergesundheit added a commit to giantswarm/cilium-upstream that referenced this pull request on Jun 17, 2024:
* docs: Remove Hubble-OTel from roadmap [ upstream commit 0976a1b6bd085f6c95a8f0ed4c24fac83e244bcd ] The Hubble OTel repo is going to be archived so it should be removed from the roadmap Signed-off-by: Bill Mulligan <billmulligan516@gmail.com> Signed-off-by: Nicolas Busseneau <nicolas@isovalent.com> * bitlpm: Document and Fix Descendants Bug [ upstream commit 9e89397d2d81fe915bfbd74d41bd72e5d0c6ad5b ] Descendants and Ancestors cannot share the same traversal method, because Descendants needs to be able to select at least one in-trie key-prefix match that may not be a full match for the argument key-prefix. The old traversal method worked for the Descendants method if there happened to be an exact match of the argument key-prefix in the trie. These new tests ensure that Descendants will still return a proper list of Descendants even if there is not an exact match in the trie. Signed-off-by: Nate Sweet <nathanjsweet@pm.me> Signed-off-by: Nicolas Busseneau <nicolas@isovalent.com> * Prepare for release v1.15.4 Signed-off-by: Andrew Sauber <2046750+asauber@users.noreply.github.com> * install: Update image digests for v1.15.4 Generated from https://github.com/cilium/cilium/actions/runs/8654016733. `quay.io/cilium/cilium:v1.15.4@sha256:b760a4831f5aab71c711f7537a107b751d0d0ce90dd32d8b358df3c5da385426` `quay.io/cilium/cilium:stable@sha256:b760a4831f5aab71c711f7537a107b751d0d0ce90dd32d8b358df3c5da385426` `quay.io/cilium/clustermesh-apiserver:v1.15.4@sha256:3fadf85d2aa0ecec09152e7e2d57648bda7e35bdc161b25ab54066dd4c3b299c` `quay.io/cilium/clustermesh-apiserver:stable@sha256:3fadf85d2aa0ecec09152e7e2d57648bda7e35bdc161b25ab54066dd4c3b299c` `quay.io/cilium/docker-plugin:v1.15.4@sha256:af22e26e927ec01633526b3d2fd5e15f2c7f3aab9d8c399081eeb746a4e0db47` `quay.io/cilium/docker-plugin:stable@sha256:af22e26e927ec01633526b3d2fd5e15f2c7f3aab9d8c399081eeb746a4e0db47` `quay.io/cilium/hubble-relay:v1.15.4@sha256:03ad857feaf52f1b4774c29614f42a50b370680eb7d0bfbc1ae065df84b1070a` `quay.io/cilium/hubble-relay:stable@sha256:03ad857feaf52f1b4774c29614f42a50b370680eb7d0bfbc1ae065df84b1070a` `quay.io/cilium/operator-alibabacloud:v1.15.4@sha256:7c0e5346483a517e18a8951f4d4399337fb47020f2d9225e2ceaa8c5d9a45a5f` `quay.io/cilium/operator-alibabacloud:stable@sha256:7c0e5346483a517e18a8951f4d4399337fb47020f2d9225e2ceaa8c5d9a45a5f` `quay.io/cilium/operator-aws:v1.15.4@sha256:8675486ce8938333390c37302af162ebd12aaebc08eeeaf383bfb73128143fa9` `quay.io/cilium/operator-aws:stable@sha256:8675486ce8938333390c37302af162ebd12aaebc08eeeaf383bfb73128143fa9` `quay.io/cilium/operator-azure:v1.15.4@sha256:4c1a31502931681fa18a41ead2a3904b97d47172a92b7a7b205026bd1e715207` `quay.io/cilium/operator-azure:stable@sha256:4c1a31502931681fa18a41ead2a3904b97d47172a92b7a7b205026bd1e715207` `quay.io/cilium/operator-generic:v1.15.4@sha256:404890a83cca3f28829eb7e54c1564bb6904708cdb7be04ebe69c2b60f164e9a` `quay.io/cilium/operator-generic:stable@sha256:404890a83cca3f28829eb7e54c1564bb6904708cdb7be04ebe69c2b60f164e9a` `quay.io/cilium/operator:v1.15.4@sha256:4e42b867d816808f10b38f555d6ae50065ebdc6ddc4549635f2fe50ed6dc8d7f` `quay.io/cilium/operator:stable@sha256:4e42b867d816808f10b38f555d6ae50065ebdc6ddc4549635f2fe50ed6dc8d7f` Signed-off-by: Andrew Sauber <2046750+asauber@users.noreply.github.com> * cilium-dbg: remove section with unknown health status. The "unknown" status simply refers to components that accept a health reporter scope, but have not declared their state as being either "ok" or degraded. 
This is a bit confusing, as it does not necessarily indicate a problem with Cilium. In the future we may want to rework this state to distinguish between unreported states and components that are "timing-out" reconciling a desired state. This PR simply stops displaying this information in `cilium-dbg status`. Signed-off-by: Tom Hadlaw <tom.hadlaw@isovalent.com> * chore(deps): update all github action dependencies Signed-off-by: renovate[bot] <bot@renovateapp.com> * chore(deps): update azure/login action to v2.1.0 Signed-off-by: renovate[bot] <bot@renovateapp.com> * fix k8s versions tested in CI - Remove older versions we do not officially support anymore on v1.15. - Make K8s 1.29 the default version on all platforms. Signed-off-by: Nicolas Busseneau <nicolas@isovalent.com> * chore(deps): update docker.io/library/golang:1.21.9 docker digest to 81811f8 Signed-off-by: renovate[bot] <bot@renovateapp.com> * images: update cilium-{runtime,builder} Signed-off-by: Cilium Imagebot <noreply@cilium.io> * golangci: Enable errorlint [ upstream commit d20f15ecab7c157f6246a07c857662bec491f6ee ] Enable errorlint in golangci-lint to catch uses of improper formatters for Go errors. This helps avoid unnecessary error/warning logs that cause CI flakes when benign error cases are not caught because error unwrapping fails when a string or value formatter has been used instead of the dedicated `%w`. Related: #31147 Signed-off-by: Jarno Rajahalme <jarno@isovalent.com> * errors: Precede with colon [ upstream commit fe46958e319cd754696eb2dd71dbc0957fa13368 ] Precede each `%w` formatter with a colon. Signed-off-by: Jarno Rajahalme <jarno@isovalent.com> * Docs: mark Tetragon as Stable [ upstream commit c3108c9fa3e72b6821410e4534d835b7ccdeee25 ] Signed-off-by: Natalia Reka Ivanko <natalia@isovalent.com> Signed-off-by: Marco Iorio <marco.iorio@isovalent.com> * Minor nit according to Liz's comment [ upstream commit aea1ab82cca926d59f3434bfe5fe44ca9ab31e47 ] Signed-off-by: Natalia Reka Ivanko <natalia@isovalent.com> Signed-off-by: Marco Iorio <marco.iorio@isovalent.com> * clustermesh: document global services limitations with KPR=false [ upstream commit b8050e527aa916c97d51b43d809d8d9fb13b54ea ] Global services do not work when KPR is disabled if accessed through a NodePort, or from a host-netns pod, as kube-proxy doesn't know about the remote backends. Let's make these limitations explicit. Signed-off-by: Marco Iorio <marco.iorio@isovalent.com> * alibabacloud/eni: avoid racing node mgr in test [ upstream commit 56ca9238b3cece3d97beaaa6e8837e7340966ad7 ] [ backporter's notes: minor adaptations as the InterfaceCandidates field was not scoped by IP family. ] TestPrepareIPAllocation attempts to check that PrepareIPAllocation produces the expected results. It avoids starting the ipam node manager it constructs, likely trying to avoid starting the background maintenance jobs. However, manager.Upsert _does_ asynchronously trigger pool maintenance, which results in an automated creation of an ENI with different params than the test expects, and hence assertion failures. This patch avoids the race condition by explicitly setting the instance API readiness to false, which causes background pool maintenance to be delayed and hence guarantees that the PrepareIPAllocation call runs in the expected environment.
The following was used to reproduce this flake: go test -c -race ./pkg/alibabacloud/eni && stress -p 50 ./eni.test The likelihood of hitting this flake is approximately 0.02%, hence reproducing it requires a reasonably large number of runs, as well as high load on the system to increase the likelihood of the flake (since it does depend on the test being somewhat starved for CPU). Signed-off-by: David Bimmler <david.bimmler@isovalent.com> Signed-off-by: Marco Iorio <marco.iorio@isovalent.com> * ClusterMesh/helm: support multiple replicas [ upstream commit df3c02f0044393b00a4fe2c3968a53466eb9032a ] [ backporter's notes: dropped the session affinity changes, and backported only the introduction of the unique cluster id which, together with the interceptors backported as part of the next commit, prevents Cilium agents from incorrectly restarting an etcd watch against a different clustermesh-apiserver instance. ] This commit makes changes to the helm templates for clustermesh-apiserver to support deploying multiple replicas. - Use a unique cluster id for etcd: Each replica of the clustermesh-apiserver deploys its own discrete etcd cluster. Utilize the K8s downward API to provide the Pod UUID to the etcd cluster as an initial cluster token, so that each instance has a unique cluster ID. This is necessary to distinguish connections to multiple clustermesh-apiserver Pods using the same K8s Service. - Use session affinity for the clustermesh-apiserver Service: Session affinity ensures that connections from a client are passed to the same service backend each time. This will allow a Cilium Agent or KVStoreMesh instance to maintain a connection to the same backend for both long-living, streaming connections, such as watches on the kv store, and short, single-response connections, such as checking the status of a cluster. However, this can be unreliable if the l3/l4 loadbalancer used does not also implement sticky sessions to direct connections from a particular client to the same cluster node. Signed-off-by: Tim Horner <timothy.horner@isovalent.com> Signed-off-by: Marco Iorio <marco.iorio@isovalent.com> * ClusterMesh: validate etcd cluster ID [ upstream commit 174e721c0e693a4e0697e4abfd1434f9c796859b ] [ backporter's notes: backported a stripped down version of the upstream commit including the introduction of the interceptors only, as fixing a bug occurring in a single clustermesh-apiserver configuration as well (during rollouts), by preventing Cilium agents from incorrectly restarting an etcd watch against a different clustermesh-apiserver instance. ] In a configuration where there are multiple replicas of the clustermesh-apiserver, each Pod runs its own etcd instance with a unique cluster ID. This commit adds a `clusterLock` type, which is a wrapper around a uint64 that can only be set once. `clusterLock` is used to create gRPC unary and stream interceptors that are provided to the etcd client to intercept and validate the cluster ID in the header of all responses from the etcd server. If the client receives a response from a different cluster, the connection is terminated and restarted. This is designed to prevent accepting responses from another cluster and potentially missing events or retaining invalid data. Since the addition of the interceptors allows quick detection of a failover event, we no longer need to rely on endpoint status checks to determine if the connection is healthy.
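A rough illustration of the set-once guard just described (a hypothetical sketch, not the actual Cilium implementation; only the `clusterLock` name comes from the commit message, everything else is assumed):

```go
package clustermesh

import (
	"fmt"
	"sync/atomic"
)

// clusterLock records the first etcd cluster ID observed on a connection.
// Interceptors can then reject any response carrying a different ID, so
// that the client tears down and re-establishes the connection instead of
// silently consuming events from another etcd cluster.
type clusterLock struct {
	id atomic.Uint64 // 0 means "no cluster ID observed yet"
}

func (c *clusterLock) validate(clusterID uint64) error {
	// CompareAndSwap makes the assignment set-once: only the first
	// observed ID is stored; later calls merely compare against it.
	c.id.CompareAndSwap(0, clusterID)
	if observed := c.id.Load(); observed != clusterID {
		return fmt.Errorf("etcd cluster ID mismatch: expected %x, got %x",
			observed, clusterID)
	}
	return nil
}
```

A gRPC unary or stream interceptor would call `validate` with the cluster ID found in each etcd response header and close the connection on error.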
Additionally, since service session affinity can be unreliable, the status checks could trigger a false failover event and cause a connection restart. To allow creating etcd clients for ClusterMesh that do not perform endpoint status checks, the option NoEndpointStatusChecks was added to ExtraOptions. Signed-off-by: Tim Horner <timothy.horner@isovalent.com> Signed-off-by: Marco Iorio <marco.iorio@isovalent.com> * bpf: Update checkpatch image [ upstream commit 7a98d56c101579ceda2ae6f88c187525ba785a38 ] Update checkpatch image to pull the latest changes we've added: namely, remove the wrapping of individual patch results in GitHub's workflows interface, as it's annoying to click many times to find the commits with issues. Signed-off-by: Quentin Monnet <qmo@qmon.net> * images: Update bpftool image [ upstream commit f3e65bccfaee2ae058aac7aabd4d6efe1cf6dd61 ] We want bpftool to be able to dump netkit programs, let's update the image with a version that supports it. Signed-off-by: Quentin Monnet <qmo@qmon.net> * images: update cilium-{runtime,builder} Signed-off-by: Quentin Monnet <qmo@qmon.net> * images: update cilium-{runtime,builder} Signed-off-by: Cilium Imagebot <noreply@cilium.io> * proxy: skip rule removal if address family is not supported Fixes: #31944 Signed-off-by: Robin Gögge <r.goegge@isovalent.com> * chore(deps): update all-dependencies Signed-off-by: renovate[bot] <bot@renovateapp.com> Signed-off-by: Julian Wiedmann <jwi@isovalent.com> * images: update cilium-{runtime,builder} Signed-off-by: Cilium Imagebot <noreply@cilium.io> * chore(deps): update hubble cli to v0.13.3 Signed-off-by: renovate[bot] <bot@renovateapp.com> * chore(deps): update all github action dependencies Signed-off-by: renovate[bot] <bot@renovateapp.com> * envoy: Bump envoy version to v1.27.5 This is mainly to address the below CVE https://github.com/envoyproxy/envoy/security/advisories/GHSA-3mh5-6q8v-25wj Related release: https://github.com/envoyproxy/envoy/releases/tag/v1.27.5 Signed-off-by: Tam Mach <tam.mach@cilium.io> * tables: Sort node addresses also by public vs private IP [ upstream commit 7da651454c788e6b240d253aa9cb94b3f28fc500 ] The firstGlobalAddr in pkg/node tried to pick public IPs over private IPs even after picking by scope. Include this logic in the address sorting and add a test case to check the different sorting predicates. For NodePort pick the first private address if any, otherwise pick the first public address. Fixes: 5342d0104f ("datapath/tables: Add Table[NodeAddress]") Signed-off-by: Jussi Maki <jussi@isovalent.com> Signed-off-by: Gray Liang <gray.liang@isovalent.com> * gha: configure fully-qualified DNS names as external targets [ upstream commit 100e6257952a1466b1b3fb253959efff16b74c92 ] This prevents possible shenanigans caused by search domains possibly configured on the runner, and propagated to the pods. Signed-off-by: Marco Iorio <marco.iorio@isovalent.com> Signed-off-by: Gray Liang <gray.liang@isovalent.com> * docs: Document six-month feature release cadence [ upstream commit 302604da6b82b808bf65e652698514b5fb631744 ] Signed-off-by: Joe Stringer <joe@cilium.io> Signed-off-by: Gray Liang <gray.liang@isovalent.com> * docs: Fix github project link [ upstream commit 6f0a0596a568a0f8040bda03f4ad361bb1424db9 ] GitHub changed the URL for the classic projects that we are currently using to track patch releases. Fix the link. 
Signed-off-by: Joe Stringer <joe@cilium.io> Signed-off-by: Gray Liang <gray.liang@isovalent.com> * fqdn: Fix goroutine leak in transparent-mode [ upstream commit 804d5f0f22a82495c5d12d9a25d0fe7377615e6f ] When the number of concurrent DNS requests was moderately high, there was a chance that some of the goroutines would get stuck waiting for a response. Contains fix from https://github.com/cilium/dns/pull/10 Signed-off-by: Marcel Zieba <marcel.zieba@isovalent.com> Signed-off-by: Zhichuan Liang <gray.liang@isovalent.com> * xds: Return nil error after context cancel [ upstream commit c869e6c057be8eed6d0960a06fbf7d9706b2bc83 ] Do not return an error from the xds server when the context is cancelled, as this is part of normal operation, and we test for this in server_e2e_test. This resolves a test flake: panic: Fail in goroutine after has completed Signed-off-by: Jarno Rajahalme <jarno@isovalent.com> Signed-off-by: Gray Liang <gray.liang@isovalent.com> * test: Wait for stream server checks to complete [ upstream commit 3cde59c6e67b836ef1ea8c47b772d0431254a852 ] The main test goroutine might be completed before checks on the server goroutine are completed, hence causing the panic below. This commit defers the streamDone channel close to make sure the error check on the stream server is done before returning from the test. We keep the time check on the wait at the end of each test to not stall the tests in case the stream server fails to exit. Panic error ``` panic: Fail in goroutine after Test/ServerSuite/TestRequestStaleNonce has completed ``` Testing was done as per below: ``` $ go test -count 500 -run Test/ServerSuite/TestRequestStaleNonce ./pkg/envoy/xds/... ok github.com/cilium/cilium/pkg/envoy/xds 250.866s ``` Fixes: #31855 Signed-off-by: Tam Mach <tam.mach@cilium.io> Signed-off-by: Jarno Rajahalme <jarno@isovalent.com> Signed-off-by: Gray Liang <gray.liang@isovalent.com> * stream_test: Return io.EOF upon channel close [ upstream commit df6afbd306d160179fd07ef6b4910b42967b5fdd ] [ backporter's note: moved changes from pkg/envoy/xds/stream_test.go to pkg/envoy/xds/stream.go as v1.15 doesn't have the former file ] Return io.EOF if the test channel was closed, rather than returning a nil request. This mimics the behavior of generated gRPC code, which never returns a nil request with a nil error. This resolves a test flake with this error log: time="2024-04-16T08:46:23+02:00" level=error msg="received nil request from xDS stream; stopping xDS stream handling" subsys=xds xdsClientNode="node0~10.0.0.0~node0~bar" xdsStreamID=1 Signed-off-by: Jarno Rajahalme <jarno@isovalent.com> Signed-off-by: Zhichuan Liang <gray.liang@isovalent.com> * test: Eliminate XDS CacheUpdateDelay [ upstream commit d3c1fee6f7cd69ea7975193b25dd881c0194889d ] Speed up tests by eliminating CacheUpdateDelay, as it is generally not needed. When needed, replace it with IsCompletedInTimeChecker, which waits for up to MaxCompletionDuration before returning, in contrast with IsCompletedChecker, which only returns the current state without any wait. This change makes the server_e2e_test tests run >1000x faster. Signed-off-by: Jarno Rajahalme <jarno@isovalent.com> Signed-off-by: Gray Liang <gray.liang@isovalent.com> * test: Eliminate duplicate SendRequest [ upstream commit 4efe9ddc35f918d870ff97d148c8d0e2255b4a73 ] TestRequestStaleNonce test code was written with the assumption that no response would be received for a request with a stale nonce, and a second SendRequest was done right after with the correct nonce value.
This caused two responses to be returned, and the first one could have carried the old version of the resources. Remove this duplicate SendRequest. This resolves test flakes like this: --- FAIL: Test/ServerSuite/TestRequestStaleNonce (0.00s) server_e2e_test.go:784: ... response *discoveryv3.DiscoveryResponse = &discoveryv3.DiscoveryResponse{state:impl.MessageState{NoUnkeyedLiterals:pragma.NoUnkeyedLiterals{}, DoNotCompare:pragma.DoNotCompare{}, DoNotCopy:pragma.DoNotCopy{}, atomicMessageInfo:(*impl.MessageInfo)(nil)}, sizeCache:0, unknownFields:[]uint8(nil), VersionInfo:"3", Resources:[]*anypb.Any{(*anypb.Any)(0x40003a63c0), (*anypb.Any)(0x40003a6410)}, Canary:false, TypeUrl:"type.googleapis.com/envoy.config.v3.DummyConfiguration", Nonce:"3", ControlPlane:(*corev3.ControlPlane)(nil)} ("version_info:\"3\" resources:{[type.googleapis.com/envoy.config.route.v3.RouteConfiguration]:{name:\"resource1\"}} resources:{[type.googleapis.com/envoy.config.route.v3.RouteConfiguration]:{name:\"resource0\"}} type_url:\"type.googleapis.com/envoy.config.v3.DummyConfiguration\" nonce:\"3\"") ... VersionInfo string = "4" ... Resources []protoreflect.ProtoMessage = []protoreflect.ProtoMessage{(*routev3.RouteConfiguration)(0xe45380)} ... Canary bool = false ... TypeUrl string = "type.googleapis.com/envoy.config.v3.DummyConfiguration" Signed-off-by: Jarno Rajahalme <jarno@isovalent.com> Signed-off-by: Gray Liang <gray.liang@isovalent.com> * test: increase xds stream timeout to avoid test flakes [ upstream commit 75d144c56579c8e0349b9b1446514c335a8d0909 ] Stream timeout is a duration we use in tests to make sure the stream does not stall for too long. In production we do not have such a timeout at all, and in fact the requests are long-lived and responses are only sent when there is something (new) to send. Test stream timeout was 2 seconds, and it would occasionally cause a test flake, especially if debug logging is enabled. This seems to happen due to goroutine scheduling, and for this reason debug logging should not be on for these tests. Bump the test stream timeout to 4 seconds to further reduce the chance of a test flake due to it. Signed-off-by: Jarno Rajahalme <jarno@isovalent.com> Signed-off-by: Gray Liang <gray.liang@isovalent.com> * gha: configure max unavailable in clustermesh upgrade/downgrade [ upstream commit 350e9d33ed8ae99d0d9bb832c0cf3814284c39ac ] The goal is to slow down the rollout process, to better highlight possible connection disruption occurring in the meantime. At the same time, this also reduces the overall CPU load caused by datapath recompilation, which is a possible additional cause for connection disruption flakiness. Signed-off-by: Marco Iorio <marco.iorio@isovalent.com> Signed-off-by: Gray Liang <gray.liang@isovalent.com> * gha: explicitly configure IPAM mode in clustermesh upgrade/downgrade [ upstream commit 7d2505ee18a0015df500de8a4618f94452e1d034 ] The default IPAM mode is cluster-pool, which gets automatically overwritten by the Cilium CLI to kubernetes when running on kind. However, the default helm value gets restored upon upgrade due to --reset-values, causing confusion and possible issues. Hence, let's explicitly configure it to kubernetes, to prevent changes. Similarly, let's configure a single replica for the operator.
Signed-off-by: Marco Iorio <marco.iorio@isovalent.com> Signed-off-by: Gray Liang <gray.liang@isovalent.com> * gha: fix incorrectly named test in clustermesh upgrade/downgrade [ upstream commit a0d7d3767f26276e3e70e1ec026ab2001ced3572 ] So that it actually gets executed. Signed-off-by: Marco Iorio <marco.iorio@isovalent.com> Signed-off-by: Gray Liang <gray.liang@isovalent.com> * gha: don't wait for hubble relay image in clustermesh upgrade/downgrade [ upstream commit 1e28a102213700d65cf5a48f95257ca47a2afaf5 ] Hubble relay is not deployed in this workflow, hence it doesn't make sense to wait for image availability. Signed-off-by: Marco Iorio <marco.iorio@isovalent.com> Signed-off-by: Gray Liang <gray.liang@isovalent.com> * gha: enable hubble in clustermesh upgrade/downgrade [ upstream commit 0c211e18814a1d22cbb6f9c6eadf7017e8d234f9 ] As it simplifies troubleshooting possible connection disruptions. However, let's configure monitor aggregation to medium (i.e., the maximum, and default value) to avoid the performance penalty due to the relatively high traffic load. Signed-off-by: Marco Iorio <marco.iorio@isovalent.com> Signed-off-by: Gray Liang <gray.liang@isovalent.com> * labels: Add controller-uid into default ignore list [ upstream commit aef6814f7e895d644234c679a956ef86c832a1b8 ] [ backporter's note: minor conflicts in pkg/k8s/apis/cilium.io/const.go ] Having uid in security labels will significantly increase the number of identities, not to mention the high cardinality in metrics. This commit adds *controller-uid-related labels to the default exclusion list. Signed-off-by: Tam Mach <tam.mach@cilium.io> Signed-off-by: Zhichuan Liang <gray.liang@isovalent.com> * gha: drop double installation of Cilium CLI in conformance-eks [ upstream commit 9dc89f793de82743d88448aef4c23095c7fd1130 ] Fixes: 5c06c8e2ea5c ("ci-eks: Add IPsec key rotation tests") Signed-off-by: Marco Iorio <marco.iorio@isovalent.com> Signed-off-by: Gray Liang <gray.liang@isovalent.com> * loader: sanitize bpffs directory strings for netdevs [ upstream commit 9b35bc5e8b42e0a8c9a94a6fd59f49f2aa0f1ab9 ] bpffs directory paths cannot contain the character ".", thus we must sanitize device names that contain any "." characters. Our solution is to replace "." with "-". This introduces a risk of naming collisions (e.g. "eth.0" and "eth-0"), but in practice the probability of this happening should be very small. Fixes: #31813 Signed-off-by: Robin Gögge <r.goegge@isovalent.com> Signed-off-by: Gray Liang <gray.liang@isovalent.com> * clustermesh-apiserver: use distroless/static-debian11 as the base image [ upstream commit 1d5615771b3dab1d5eeb9fda18ae0d98a22de73d ] [ backporter's notes: replaced the nonroot base image with the root one, to avoid requiring the Helm changes to configure the fsGroup, which could cause issues if users only updated the image version, without a full helm upgrade. ] gops needs to write data (e.g., the PID file) to the file-system, which turned out to be tricky when using scratch as base image, in case the container is then run using a non-root UID. Let's use the most basic version of a distroless image instead, which contains: - ca-certificates - A /etc/passwd entry for the root, nonroot and nobody users - A /tmp directory - tzdata This aligns the clustermesh-apiserver image with the Hubble Relay one, and removes the need for manually importing the CA certificates.
The GOPS_CONFIG_DIR is explicitly configured to use a temporary directory, to prevent permission issues depending on the UID configured to run the entrypoint. Finally, we explicitly configure the fsGroup as part of the podSecurityContext, to ensure that mounted files are accessible by the non-root user as well. Signed-off-by: Marco Iorio <marco.iorio@isovalent.com> * helm: same securityContext for etcd-init and etcd cmapiserver containers [ upstream commit 85218802f37ade30d2c14197e797b9ee6b8f832c ] Configure the specified clustermesh-apiserver etcd container security context for the etcd-init container as well, to make sure that they always match, and prevent issues caused by the init container creating files that cannot be read/written by the main instance later on. Signed-off-by: Marco Iorio <marco.iorio@isovalent.com> * envoy: Update envoy 1.27.x to 1.28.3 This is part of regular maintenance, as Envoy v1.27 will be EOL in July. Upstream release: https://github.com/envoyproxy/envoy/releases/tag/v1.28.3 Upstream release schedule: https://github.com/envoyproxy/envoy/blob/main/RELEASES.md#major-release-schedule Signed-off-by: Tam Mach <tam.mach@cilium.io> * ingress translation: extract http and https filterchains functions [ upstream commit 215f5e17c3ca0b55f7534285a64652bb23614ca7 ] This commit extracts the creation of envoy listener filterchains for HTTP and HTTPS into dedicated functions `httpFilterChain` & `httpsFilterChains`. Signed-off-by: Marco Hofstetter <marco.hofstetter@isovalent.com> * ingress translation: extract tls-passthrough filterchains function [ upstream commit db72b8203af794081883cb0e843e0f7ebae9b27d ] This commit extracts a function `tlsPassthroughFilterChains` that contains the logic of building the Envoy listener filterchains for Ingress TLS passthrough and Gateway API `TLSRoute`. Signed-off-by: Marco Hofstetter <marco.hofstetter@isovalent.com> * ingress translation: merge tlspassthrough and http envoy listener [ upstream commit 9797fa1bd71c290e4cab1234d485fd2b31877a6f ] Currently, the translation from Ingress Controller and Gateway API resources into Envoy resources creates two different Envoy listeners for TLS Passthrough (Ingress TLS Passthrough & Gateway API TLSRoute) and HTTP (non-TLS and TLS) handling. This comes with some problems when combining multiple Ingresses of different "types" into one `CiliumEnvoyConfig` in case of using the shared Ingress feature. - The names of the listeners are the same (`listener`): only one listener gets applied - Transparent proxying no longer works because TLS passthrough and HTTP with TLS use the frontend port `443` on the loadbalancer service. The current implementation of transparently forwarding the requests on the service to the corresponding Envoy listener can't distinguish between the two existing Envoy listeners. This leads to situations where requests fail (listener not applied at all) or are forwarded to the wrong backend (e.g. HTTP with TLS matching the existing TLS passthrough listener filter chain). Therefore, this commit combines the two Envoy listeners into one, responsible for HTTP, HTTPS and TLS passthrough. The corresponding filterchains match the requests by their SNI (HTTPS and TLS passthrough). Signed-off-by: Marco Hofstetter <marco.hofstetter@isovalent.com> * ingress passthrough: set hostname in TLSRoute hostnames [ upstream commit b68d086f8432c305c200bea956ccbc9ffde0a3ec ] Currently, ingestion of Ingress doesn't set the hostname of the Ingress route in the internal model.
This way, the resulting Envoy listener filterchain doesn't contain the hostname in its servernames. In combination with other filterchains (HTTP and HTTPS), this is not restrictive enough. Therefore, this commit adds the hostname to the hostnames within TLSRoute (internal model). Signed-off-by: Marco Hofstetter <marco.hofstetter@isovalent.com> * proxy/routes: Rename fromProxyRule to fromIngressProxyRule [ upstream commit: 287dd6313b70de16c65ca235e3d6446687e52211 ] Because we are introducing fromEgressProxyRule soon, it's better to make clear that the fromProxyRule is for ingress proxy only. This commit also changes its mark from MagicMarkIsProxy to MagicMarkIngress. They hold the same value 0xA00 but have different semantics. Signed-off-by: Zhichuan Liang <gray.liang@isovalent.com> Signed-off-by: Zhichuan Liang <gray.liang@isovalent.com> * proxy/routes: Rename RulePriorityFromProxyIngress [ upstream commit: 1dd21795ccbebd89f51db292f3e7a8f52e37eed1 ] No logic changes, just rename it to "RulePriorityFromProxy" without the "Ingress" suffix, because the egress rule uses the same priority. Signed-off-by: Zhichuan Liang <gray.liang@isovalent.com> Signed-off-by: Zhichuan Liang <gray.liang@isovalent.com> * proxy/routes: Introduce fromEgressProxyRule [ upstream commit: 7d278affe093889bbc245303628caf2677b93cac ] Signed-off-by: Zhichuan Liang <gray.liang@isovalent.com> Signed-off-by: Zhichuan Liang <gray.liang@isovalent.com> * proxy/routes: Remove fromEgressProxyRule for cilium downgrade [ upstream commit: 53133ff80ef4408ae673e5aa030f645bcd449afa ] Although we don't install fromEgressProxyRule for now, this commit insists on removing it to make sure further downgrade can go smoothly. Soon we'll have another PR to install fromEgressProxyRule, and cilium downgrade from that PR to branch tip (patch downgrade, 1.X.Y -> 1.X.{Y-1}) will be broken if we don't handle the new ip rule carefully. Without this patch, downgrading from a higher version will leave fromEgressProxyRule on the lower-version cilium, and the cluster will be in a wrong state of "having a stale ip rule + not having other necessary settings (iptables)", breaking connectivity. Signed-off-by: Zhichuan Liang <gray.liang@isovalent.com> Signed-off-by: Zhichuan Liang <gray.liang@isovalent.com> * chore(deps): update all-dependencies Signed-off-by: renovate[bot] <bot@renovateapp.com> * images: update cilium-{runtime,builder} Signed-off-by: Cilium Imagebot <noreply@cilium.io> * route: dedicated net ns for each subtest of runListRules [ upstream commit 6c6c121361e97521b1d2b70f24cfa33338dbf297 ] Currently, there are cases where the test TestListRules/return_all_rules fails with the following error: ``` --- FAIL: TestListRules (0.02s) --- FAIL: TestListRules/returns_all_rules#01 (0.00s) route_linux_test.go:490: expected len: 2, got: 3 []netlink.Rule{ { - Priority: -1, + Priority: 9, Family: 10, - Table: 255, + Table: 2004, - Mark: -1, + Mark: 512, - Mask: -1, + Mask: 3840, Tos: 0, TunID: 0, ... // 11 identical fields IPProto: 0, UIDRange: nil, - Protocol: 2, + Protocol: 0, }, + s"ip rule 100: from all to all table 255", {Priority: 32766, Family: 10, Table: 254, Mark: -1, ...}, } ``` It looks like there's a switch of the network namespace during the test execution. Therefore, this commit locks the OS thread for the execution of the test that runs in a dedicated network namespace. In addition, each sub-test of the table-driven test set runs in its own network namespace, as they run in their own goroutine.
Signed-off-by: Marco Hofstetter <marco.hofstetter@isovalent.com> Signed-off-by: Sebastian Wicki <sebastian@isovalent.com> * CI: bump default FQDN datapath timeout from 100 to 250ms [ upstream commit 34caeb233b0d47a0e4f9553fbd7ac532e0c1a5f8 ] This timeout can be CPU sensitive, and the CI environments can be CPU constrained. Bumping this timeout ensures that performance regressions will still be caught, as those tend to cause delays of 1+ seconds. This will, however, cut down on CI flakes due to noise. Signed-off-by: Casey Callendrello <cdc@isovalent.com> Signed-off-by: Sebastian Wicki <sebastian@isovalent.com> * ci: Set hubble.relay.retryTimeout=5s [ upstream commit 2428b08ad7032db7c2b92744a149a1615cc82144 ] I noticed CI seemed to flake on hubble-relay being ready. Based on the logs it was due to it being unable to connect to its peers. Based on the sysdump the endpoints did eventually come up, so I think it just needs to retry more often, and the default is 30s, so lower it to 5s in CI. Signed-off-by: Chance Zibolski <chance.zibolski@gmail.com> Signed-off-by: Sebastian Wicki <sebastian@isovalent.com> * Fix helm chart incompatible types for comparison [ upstream commit 19f2268dfab23e37ae5a6d4f6e9379c4327007f6 ] Fixes: #32024 Signed-off-by: lou-lan <loulan@loulan.me> Signed-off-by: Sebastian Wicki <sebastian@isovalent.com> * Agent: add kubeconfigPath to initContainers [ upstream commit 284ee43f82ad8230ca013f283bb9ad141f5531df ] This commit adds the missing pass of the Helm value `kubeConfigPath` to the initContainer of the Cilium-agent. Signed-off-by: darox <maderdario@gmail.com> Signed-off-by: Sebastian Wicki <sebastian@isovalent.com> * Remove aks-preview from AKS workflows [ upstream commit a758d21bbae09ab0c4a8cd671e05e043a8c1ea5a ] Signed-off-by: Marcel Zieba <marcel.zieba@isovalent.com> Signed-off-by: Sebastian Wicki <sebastian@isovalent.com> * vendor: Bump cilium/dns to fix bug where timeout was not respected [ upstream commit c76677d81aa58c00698818bc1be55d1f5b4d0b0d ] This pulls in cilium/dns#11, which fixes a bug where the `SharedClient` logic did not respect the `c.Client.Timeout` field. Signed-off-by: Sebastian Wicki <sebastian@isovalent.com> * dnsproxy: Fix bug where DNS request timed out too soon [ upstream commit 931b8167ea29bfd3ae8e6f11f41a8a1c531c33c8 ] This fixes a bug where DNS requests would time out after 2 seconds, instead of the intended 10 seconds. This resulted in a `Timeout waiting for response to forwarded proxied DNS lookup` error message whenever the response took longer than 2 seconds. The `dns.Client` used by the proxy is [already configured][1] to use the `ProxyForwardTimeout` value of 10 seconds, which would apply also to the `dns.Client.DialTimeout`, if it was not for the custom `net.Dialer` we use in Cilium. The logic in [dns.Client.getTimeoutForRequest][2] overwrites the request timeout with the timeout from the custom `Dialer`. Therefore, the intended `ProxyForwardTimeout` 10 second timeout value was overwritten with the much shorter `net.Dialer.Timeout` value of two seconds. This commit fixes that issue by using `ProxyForwardTimeout` for the `net.Dialer` too.
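To make the pitfall concrete, a minimal sketch, assuming a miekg/dns-style `dns.Client` (which the cilium/dns fork follows); the constructor name is hypothetical, while `ProxyForwardTimeout` is taken from the commit message:

```go
package fqdnproxy

import (
	"net"
	"time"

	"github.com/miekg/dns"
)

// ProxyForwardTimeout mirrors the 10s forward timeout described above.
const ProxyForwardTimeout = 10 * time.Second

// newForwardingClient builds a DNS client whose custom Dialer uses the
// same timeout as the intended request timeout. With a custom Dialer set,
// the client derives the per-request deadline from Dialer.Timeout, so a
// short dial timeout (e.g. 2s) would silently cap every forwarded query.
func newForwardingClient() *dns.Client {
	return &dns.Client{
		Net:     "udp",
		Timeout: ProxyForwardTimeout,
		Dialer: &net.Dialer{
			Timeout: ProxyForwardTimeout, // was 2s before the fix
		},
	}
}
```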
Fixes: cf3cc16289b7 ("fqdn: dnsproxy: fix forwarding of the original security identity for TCP") [1]: https://github.com/cilium/cilium/blob/50943dbc02496c42a4375947a988fc233417e163/pkg/fqdn/dnsproxy/proxy.go#L1042 [2]: https://github.com/cilium/cilium/blob/94f6553f5b79383b561e8630bdf40bd824769ede/vendor/github.com/cilium/dns/client.go#L405 Reported-by: Andrii Iuspin <andrii.iuspin@isovalent.com> Signed-off-by: Sebastian Wicki <sebastian@isovalent.com> * ipam: retry netlink.LinkList call when setting up ENI devices [ upstream commit cf9bde54bd6eb6dbebe6c5f3e44500019b33b524 ] LinkList is prone to interrupts which are surfaced by the netlink library. This leads to stability issues when using the ENI datapath. This change makes it part of the retry loop in waitForNetlinkDevices. Fixes: #31974 Signed-off-by: Jason Aliyetti <jaliyetti@gmail.com> Signed-off-by: Sebastian Wicki <sebastian@isovalent.com> * bpf: nodeport: fix double-SNAT for LB'ed requests in from-overlay [ upstream commit e984f657636c76ec9a8ac3c8b44a9b014831cff2 ] The blamed commit attempted to remove all usage of MARK_MAGIC_SNAT_DONE inside to-overlay, so that we can use MARK_MAGIC_OVERLAY instead. But it accidentally also touched the LB's NAT-egress path (which is used by from-overlay to forward LB'ed requests to the egress path, on their way to a remote backend). Here we need to maintain the usage of MARK_MAGIC_SNAT_DONE, so that the egress program (either to-overlay or to-netdev) doesn't apply a second SNAT operation. Otherwise the LB node is unable to properly RevNAT the reply traffic. Fix this by restoring the code that sets the MARK_MAGIC_SNAT_DONE flag for traffic in from-overlay. When such a request is forwarded to the overlay interface, then to-overlay will consume the mark, override it with MARK_MAGIC_OVERLAY and skip the additional SNAT step. Fixes: 5b37dc9d6a4f ("bpf: nodeport: avoid setting SNAT-done mark for to-overlay") Reported-by: Marco Iorio <marco.iorio@isovalent.com> Signed-off-by: Julian Wiedmann <jwi@isovalent.com> Signed-off-by: Sebastian Wicki <sebastian@isovalent.com> * workflows: Fix CI jobs for push events on private forks [ upstream commit 715906adf2388ef238bf189830434324780c927c ] Those workflows are failing to run on push events in private forks. They fail in the "Deduce required tests from code changes" step, in which we compute a diff of changes. To compute that diff, the dorny/paths-filter GitHub action needs to be able to checkout older git references. Unfortunately, we checkout only the latest reference and drop credentials afterwards. This commit fixes it by checking out the full repository. This will take a few seconds longer, so probably not a big issue. Reported-by: Marco Iorio <marco.iorio@isovalent.com> Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> Signed-off-by: Sebastian Wicki <sebastian@isovalent.com> * docs: Fix prometheus port regex [ upstream commit 49334a5b9b79b3804865a084e5b4b2e8909cef6b ] Signed-off-by: James Bodkin <james.bodkin@amphora.net> Signed-off-by: Sebastian Wicki <sebastian@isovalent.com> * endpoint: Skip build queue warning log if context is canceled [ upstream commit 8f0b10613443ffe30bcdc958addcab91416cf316 ] The warning log on failure to queue endpoint build is most likely not meaningful when the context is canceled, as this typically happens when the endpoint is deleted. Skip the warning log if the error is context.Canceled.
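A minimal way to express that skip (a hypothetical helper, not the actual Cilium code; only the log message and the context.Canceled check come from the commit description):

```go
package endpoint

import (
	"context"
	"errors"
	"log"
)

// logQueueError suppresses the "unable to queue endpoint build" warning
// when the error is just a canceled context, which is expected while an
// endpoint is being deleted.
func logQueueError(err error) {
	if errors.Is(err, context.Canceled) {
		return // expected during endpoint deletion; nothing to warn about
	}
	log.Printf("warning: unable to queue endpoint build: %v", err)
}
```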
This fixes CI flakes like this: Found 1 k8s-app=cilium logs matching list of errors that must be investigated: 2024-04-22T07:48:47.779499679Z time="2024-04-22T07:48:47Z" level=warning msg="unable to queue endpoint build" ciliumEndpointName=kube-system/coredns-76f75df574-9k8sp containerID=3791acef13 containerInterface=eth0 datapathPolicyRevision=0 desiredPolicyRevision=0 endpointID=637 error="context canceled" identity=25283 ipv4=10.0.0.151 ipv6="fd02::82" k8sPodName=kube-system/coredns-76f75df574-9k8sp subsys=endpoint Fixes: #31827 Signed-off-by: Jarno Rajahalme <jarno@isovalent.com> Signed-off-by: Sebastian Wicki <sebastian@isovalent.com> * build(deps): bump pydantic from 2.3.0 to 2.7.1 in /Documentation [ upstream commit b971e46f02be77e02195ae7654fa3ad99018e00e ] Bumps [pydantic](https://github.com/pydantic/pydantic) from 2.3.0 to 2.4.0. - [Release notes](https://github.com/pydantic/pydantic/releases) - [Changelog](https://github.com/pydantic/pydantic/blob/main/HISTORY.md) - [Commits](https://github.com/pydantic/pydantic/compare/v2.3.0...v2.4.0) [ Quentin: The pydantic update requires an update of pydantic_core, too. Bump both packages to their latest available version (pydantic 2.7.1 and pydantic_core 2.18.2). ] --- updated-dependencies: - dependency-name: pydantic dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com> Signed-off-by: Quentin Monnet <qmo@qmon.net> Signed-off-by: Sebastian Wicki <sebastian@isovalent.com> * ci: update docs-builder [ upstream commit 6e53ad73238d244c0dac5dccabc375729225fdae ] Signed-off-by: Cilium Imagebot <noreply@cilium.io> * install/kubernetes: update nodeinit image to latest version [ upstream commit a2069654ea618fa80d2175f9f453ff50c183e7bf ] For some reason the renovate configuration added in commit ac804b6980aa ("install/kubernetes: use renovate to update quay.io/cilium/startup-script") did not pick up the update. Bump the image manually for now while we keep investigating. Signed-off-by: Tobias Klauser <tobias@cilium.io> * daemon: don't go ready until CNI configuration has been written [ upstream commit 77b1e6cdaa6c16c9e17cc54c9e3548b43e4de262 ] If the daemon is configured to write a CNI configuration file, we should not go ready until that CNI configuration file has been written. This prevents a race condition where the controller removes the taint from a node too early, meaning pods may be created with a different CNI provider. In #29405, Cilium was configured in chaining mode, but the "primary" CNI provider hadn't written its configuration yet. This caused the not-ready taint to be removed from the node too early, and pods were created in a bad state. By hooking in the CNI cell's status in the daemon's Status type, we prevent the daemon's healthz endpoint from returning a successful response until the CNI cell has been successful. Fixes: #29405 Signed-off-by: Casey Callendrello <cdc@isovalent.com> Signed-off-by: Sebastian Wicki <sebastian@isovalent.com> * enable kube cache mutation detector [ upstream commit de075aaf9b5f77e530f56b74b42c9b78d0b4b1bd ] This file was accidentally removed while refactoring the commits before merging the PR. 
Fixes: 8021e953e630 (".github/actions: enable k8s cache mutation detector in the CI") Signed-off-by: André Martins <andre@cilium.io> Signed-off-by: Sebastian Wicki <sebastian@isovalent.com> * envoy: pass idle timeout configuration option to cilium configmap [ upstream commit 889b688668ec29f18e98a66d2df597006cfa9266 ] Currently, Envoy's idle timeout can be configured via the Helm value `envoy.idleTimeoutDurationSeconds` and the agent/operator Go flag `proxy-idle-timeout-seconds`. Unfortunately, changing the value in the Helm values doesn't have an effect when running in embedded mode, because the helm value isn't passed to the Cilium ConfigMap. This commit fixes this, by setting the value in the configmap. Fixes: #25214 Signed-off-by: Marco Hofstetter <marco.hofstetter@isovalent.com> Signed-off-by: Sebastian Wicki <sebastian@isovalent.com> * ci: Increase timeout for images for l4lb test [ upstream commit 8cea46d58996f248df2a1c8f706b89dcb43048d5 ] Followup for #27706 Signed-off-by: Marcel Zieba <marcel.zieba@isovalent.com> Signed-off-by: Sebastian Wicki <sebastian@isovalent.com> * ci: update docs-builder Signed-off-by: Cilium Imagebot <noreply@cilium.io> * l7 policy: add possibility to configure Envoy proxy xff-num-trusted-hops [ upstream commit 1bc2c75da2d3430b752645854a5693ac05817c4d ] Currently, when L7 policies (egress or ingress) are enforced for traffic between Pods, Envoy might change x-forwarded-for related headers, because the corresponding Envoy listeners don't trust the downstream headers, as `XffNumTrustedHops` is set to `0`. e.g. `x-forwarded-proto` header: > Downstream x-forwarded-proto headers will only be trusted if xff_num_trusted_hops is non-zero. If xff_num_trusted_hops is zero, downstream x-forwarded-proto headers and :scheme headers will be set to http or https based on if the downstream connection is TLS or not. https://www.envoyproxy.io/docs/envoy/latest/configuration/http/http_conn_man/headers#x-forwarded-proto This might be problematic if L7 policies are used for egress traffic for Pods from a non-Cilium ingress controller (e.g. nginx). If the Ingress Controller is terminating TLS traffic and forwards the protocol via `x-forwarded-proto=https`, Cilium Envoy Proxy changes this header to `x-forwarded-proto=http` (if no tls termination itself is used in the policy configuration). This breaks applications that depend on the forwarded protocol. Therefore, this commit introduces two new config flags `proxy-xff-num-trusted-hops-ingress` and `proxy-xff-num-trusted-hops-egress` that configure the property `XffNumTrustedHops` on the respective L7 policy Envoy listeners. For backwards compatibility and security reasons, the values still default to `0`. Note: It's also possible to configure these values via Helm (`envoy.xffNumTrustedHopsL7PolicyIngress` & `envoy.xffNumTrustedHopsL7PolicyEgress`). Signed-off-by: Marco Hofstetter <marco.hofstetter@isovalent.com> * chore(deps): update docker.io/library/golang:1.21.9 docker digest to d83472f Signed-off-by: renovate[bot] <bot@renovateapp.com> * images: update cilium-{runtime,builder} Signed-off-by: Cilium Imagebot <noreply@cilium.io> * loader: implement seamless tcx to legacy tc downgrade [ upstream commit d4a159a2e772f87afdf3ef6a2d3cb56799cc4f5a ] Due to confusion around the semantics of the tcx API, we believed downgrading tcx to legacy tc attachments couldn't be done seamlessly. This doesn't seem to be the case.
As documented in the code, install the legacy tc attachment first, then remove any existing tcx attachment for the program. Signed-off-by: Timo Beckers <timo@isovalent.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> * contrib: bump etcd image version in cilium-etcd systemd service [ upstream commit 26bb9f05677f7e47f6011baa92e90a054e541187 ] As per Mahé's debugging in https://github.com/cilium/cilium/pull/31823#issuecomment-2051601958, we need a more recent etcd image for compatibility with docker v26. Signed-off-by: Julian Wiedmann <jwi@isovalent.com> * chore(deps): update stable lvh-images Signed-off-by: renovate[bot] <bot@renovateapp.com> * chore(deps): update dependency cilium/cilium-cli to v0.16.6 Signed-off-by: renovate[bot] <bot@renovateapp.com> * chore(deps): update all github action dependencies Signed-off-by: renovate[bot] <bot@renovateapp.com> * envoy: Update to use source port in connection pool hash [ upstream commit d5efd282b8bdbe9ee99cf82c02df07555e6fdb9c ] Update Envoy image to a version that includes the source port in upstream connection pool hash, so that each unique downstream connection gets a dedicated upstream connection. Fixes: #27762 Signed-off-by: Jarno Rajahalme <jarno@isovalent.com> * chore(deps): update docker.io/library/ubuntu:22.04 docker digest to a6d2b38 Signed-off-by: renovate[bot] <bot@renovateapp.com> * images: update cilium-{runtime,builder} Signed-off-by: Cilium Imagebot <noreply@cilium.io> * bpf: lb: un-break terminating backends for service without backend [ upstream commit 0de6f0f230be10e084f30fb3128c215edde1611f ] [ backporter's notes: add the ->count checks in slightly different locations, as we're missing a bunch of LB cleanup PRs. ] Continue to forward traffic for established connections, even when a service loses its last active backends. This needs a small adjustment in a BPF test that was relying on this behaviour. Fixes: 183501124869 ("bpf: drop SVC traffic if no backend is available") Signed-off-by: Julian Wiedmann <jwi@isovalent.com> * bpf: test: add LB test for terminating backend [ upstream commit 7ece278d42cca541b2e8e862e717f2536935af11 ] Once a LB connection has been established, we expect to continue using its CT entry to obtain the backend. Even if the backend is in terminating state, and the service has lost all of its backends. Keeping this separate from the fix, in case we can't easily backport. Signed-off-by: Julian Wiedmann <jwi@isovalent.com> * chore(deps): update golangci/golangci-lint-action action to v6 Signed-off-by: renovate[bot] <bot@renovateapp.com> * bpf: host: simplify MARK_MAGIC_PROXY_EGRESS_EPID handling [ upstream commit feaf6c7369d5bb3d0ca615c709979a4731a41454 ] inherit_identity_from_host() already knows how to handle MARK_MAGIC_PROXY_EGRESS_EPID and extract the endpoint ID from the mark. Use it to condense the code path, similar to how to-container looks like. Also fix up the drop notification, which currently uses the endpoint ID in place of the source security identity. Signed-off-by: Julian Wiedmann <jwi@isovalent.com> Signed-off-by: Fabio Falzoi <fabio.falzoi@isovalent.com> * bpf: host: clean up duplicated send_trace_notify() code [ upstream commit 446cf565f6b059196ce7661d8bb91421d8a3cc02 ] Use a single send_trace_notify() statement, with parameters that can be trivially optimized out in the from-netdev path. 
Signed-off-by: Julian Wiedmann <jwi@isovalent.com> Signed-off-by: Fabio Falzoi <fabio.falzoi@isovalent.com> * clustermesh: fix panic if the etcd client cannot be created [ upstream commit 58b74f5cd9902363e11e6d78799be474090ba47c ] The blamed commit anticipated the execution of the watchdog in charge of restarting the etcd connection to a remote cluster in case of errors. However, this can lead to a panic if the etcd client cannot be created (e.g., due to an invalid config file), as in that case the returned backend is nil, and the errors channel cannot be accessed. Let's move this logic back below the error check, to make sure that the backend is always valid at that point. Yet, let's still watch for possible reconnections during the initial connection establishment phase, so that we immediately restart it in case of issues. Otherwise, this phase may hang due to the interceptor preventing the establishment from succeeding, given that it would continue returning an error. Fixes: 174e721c0e69 ("ClusterMesh: validate etcd cluster ID") Signed-off-by: Marco Iorio <marco.iorio@isovalent.com> Signed-off-by: Fabio Falzoi <fabio.falzoi@isovalent.com> * install/kubernetes: add AppArmor profile to Cilium Daemonset [ upstream commit 2136418dab6c2cd799d14a1443e29ca1ab5b5238 ] [ backporter's notes: values.schema.json has been introduced in #30631 and thus is not present in v1.15. Changes to that file have been ignored. ] Starting from k8s 1.30 together with Ubuntu 24.04, Cilium fails to initialize with the error: ``` Error: applying apparmor profile to container 43ed6b4ba299559e8eac46a32f3246d9c54aca71a9b460576828b662147558fa: empty localhost AppArmor profile is forbidden ``` This commit adds "Unconfined" as the default, which users can overwrite with any of the AppArmor profiles available in their environments, to all the pods that have the "container.apparmor.security.beta.kubernetes.io" annotations. Signed-off-by: André Martins <andre@cilium.io> * docs: Add annotation for Ingress endpoint [ upstream commit 2949b16f4a7a58f246622e3ae9d9dbc1cb09efcf ] Relates: #19764 Signed-off-by: Tam Mach <tam.mach@cilium.io> Signed-off-by: Fabio Falzoi <fabio.falzoi@isovalent.com> * pkg: don't cache Host identity rule matches [ upstream commit 8397e45ee46ba65ef803805e4317d09a508862fe ] Unlike every other identity, the set of labels for the reserved:host identity is mutable. That means that rules should not cache matches for this identity. So, clean up the code around determining matches. Signed-off-by: Casey Callendrello <cdc@isovalent.com> Signed-off-by: Fabio Falzoi <fabio.falzoi@isovalent.com> * ipsec: Refactor temporary removal of XFRM state [ upstream commit e7db879706e62a882a5e80696a5d0b5812b515ff ] Context: During IPsec upgrades, we may have to temporarily remove some XFRM states due to conflicts with the new states and because the Linux API doesn't enable us to perform this atomically as we do for XFRM policies. This commit moves this removal logic to its own function. That logic will grow in subsequent commits as I'll add debugging information to the log message. This commit doesn't make any functional changes.
  Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
  Signed-off-by: Fabio Falzoi <fabio.falzoi@isovalent.com>

* ipsec: Log duration of temporary XFRM state removal

  [ upstream commit bba016e37ccc30f8b06b6837a72d184fa55eecd2 ]

  Context: During IPsec upgrades, we may have to temporarily remove some XFRM states due to conflicts with the new states, and because the Linux API doesn't enable us to perform this atomically as we do for XFRM policies. This temporary removal should be very short but can still cause drops under heavy throughput.

  This commit logs the duration of the removal so we can validate that it's actually always short, and estimate the impact on packet drops. Note the log message will now be displayed only once the XFRM state is re-added, instead of when it's removed as before. (A sketch of this pattern follows the commit list below.)

  Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
  Signed-off-by: Fabio Falzoi <fabio.falzoi@isovalent.com>

* ipsec: Log XFRM errors during temporary state removal

  [ upstream commit 76d66709b9a20677010ec3bfa3fef61b2c4665fe ]

  Context: During IPsec upgrades, we may have to temporarily remove some XFRM states due to conflicts with the new states, and because the Linux API doesn't enable us to perform this atomically as we do for XFRM policies. This temporary removal should be very short but can still cause drops under heavy throughput. This commit logs how many such drops happened.

  Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
  Signed-off-by: Fabio Falzoi <fabio.falzoi@isovalent.com>

* ci: Filter supported versions of AKS

  [ upstream commit dbcdd7dbe2e9a72a7c8c9f9231c4349ea7e94c0c ]

  Whenever AKS stopped supporting a particular version, we had to manually remove it from all stable branches. Now, instead of that, we dynamically check whether the version is still supported and only then run the test.

  Signed-off-by: Marcel Zieba <marcel.zieba@isovalent.com>
  Signed-off-by: Fabio Falzoi <fabio.falzoi@isovalent.com>

* cni: Use correct route MTU for various cloud cidrs

  [ upstream commit 29a340ee6c49f48e70d01ba715e016e9bb5a2d13 ]

  This commit corrects the MTU that is used by the cilium-cni plugin when creating routes for CIDRs received from ENI, Azure or Alibaba Cloud.

  The cilium-agent daemon returns two MTUs to the cilium-cni plugin: a "device" MTU, which is used to set the MTU on a Pod's interface in its network namespace, and a "route" MTU, which is used to set the MTU on the routes created inside the Pod's network namespace that handle traffic leaving the Pod. The "route" MTU is adjusted based on the Cilium configuration to account for any configured encapsulation protocols, such as VXLAN or WireGuard.

  Before this commit, when ENI, Azure or Alibaba Cloud IPAM was enabled, the routes created in a Pod's network namespace were using the "device" MTU rather than the "route" MTU, leading to fragmentation issues.

  Signed-off-by: Ryan Drew <ryan.drew@isovalent.com>
  Signed-off-by: Fabio Falzoi <fabio.falzoi@isovalent.com>

* fqdn: Change error log to warning

  [ upstream commit f1925b59247739ab3beb5ed3b395025d887c2825 ]

  There is no reason why the log level of the "Timed out waiting for datapath updates of FQDN IP information" log message should be an error. Change it to a warning instead. Add a reference to the --tofqdns-proxy-response-max-delay parameter to make this warning actionable.
  Signed-off-by: Jarno Rajahalme <jarno@isovalent.com>
  Signed-off-by: Fabio Falzoi <fabio.falzoi@isovalent.com>

* bpf: lxc: don't take RevDNAT tailcall for service backend's ICMP messages

  [ upstream commit b1fae235092158ef642f81a4150db3cb3801d2ac ]

  When tail_nodeport_rev_dnat_ingress_ipv*() is called by from-container to apply RevDNAT for a local backend's reply traffic, it drops all unhandled packets. This would mostly be traffic where the pod-level CT entry was flagged as .node_port = 1, but there is no corresponding nodeport-level CT entry to provide a rev_nat_index for RevDNAT. Essentially an unexpected error situation.

  But we didn't consider that from-container might also see ICMP error messages sent by a service backend, and the CT_RELATED entry for such traffic *also* has the .node_port flag set. As we don't have RevDNAT support for such traffic, there's no good reason to send it down the RevDNAT tailcall. Let it continue in the normal from-container flow instead. This avoids subsequent drops with DROP_NAT_NO_MAPPING.

  Alternative solutions would be to tolerate ICMP traffic in tail_nodeport_rev_dnat_ingress_ipv*(), or to not propagate the .node_port flag into CT_RELATED entries.

  Fixes: 6936db59e3ee ("bpf: nodeport: drop reply by local backend if revDNAT is skipped")

  Signed-off-by: Julian Wiedmann <jwi@isovalent.com>
  Signed-off-by: Fabio Falzoi <fabio.falzoi@isovalent.com>

* docs: add link to sig-policy meeting

  [ upstream commit d74bc1c90fc4edfb1f223d9460b31e0e52885e65 ]

  Signed-off-by: Casey Callendrello <cdc@isovalent.com>
  Signed-off-by: Fabio Falzoi <fabio.falzoi@isovalent.com>

* gha: bump post-upgrade timeout in clustermesh upgrade/downgrade tests

  [ upstream commit 01c3b8376046369571176fe6c58e7502bf3c4fd2 ]
  [ backporter's notes: The workflow looks different in v1.15, so the option to bump the timeout has been applied to the wait for cluster mesh to be ready after rolling out Cilium in cluster2. ]

  The KPR matrix entry recently started failing quite frequently, with cilium status --wait failing to complete within the timeout after upgrading Cilium to the tip of main. Part of the cause seems to be cilium/test-connection-disruption@619be5ab79a6, which made the connection disruption test more aggressive, in turn causing more CPU contention during endpoint regeneration, which now takes longer. Hence, let's bump the timeout, to avoid failing too early because endpoint regeneration has not yet terminated.

  Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>

* fqdn: Fix Upgrade Issue Between PortProto Versions

  [ upstream commit a682a621e4e555b766c0cc1bd66770fa9ffbecc3 ]

  Users of this library need Cilium to check both restored and updated DNS rules for the new PortProto version. Otherwise, upgrade incompatibilities exist between Cilium and programs that utilize this library.

  Signed-off-by: Nate Sweet <nathanjsweet@pm.me>
  Signed-off-by: Fabio Falzoi <fabio.falzoi@isovalent.com>

* docs: Alias "kubectl exec" command in Host Firewall guide

  [ upstream commit 0e07aa5f86d043af722e9c297d817d2889d72ae3 ]

  The documentation for the Host Firewall contains a number of invocations of the cilium-dbg command-line tool, to run on a Cilium pod. For convenience, the full, standalone command starting with "kubectl -n <namespace> exec <podname> -- cilium-dbg ..." is provided every time. But this makes for long commands that are hard to read, and that sometimes require scrolling on the rendered HTML version of the guide.
  Let's shorten them by adding an alias for the recurrent "kubectl exec" portion of the commands.

  Signed-off-by: Quentin Monnet <qmo@qmon.net>
  Signed-off-by: Fabio Falzoi <fabio.falzoi@isovalent.com>

* docs: Clean up Host Firewall documentation

  [ upstream commit 21d5f0157e9fc94856b4fb18342401b5798a0d5f ]

  This commit is a mix of various minor improvements for the host firewall guide and the host policies documentation:

  - Fix some commands in troubleshooting steps:
    - "cilium-dbg policy get" does not show host policies
    - "kubectl get nodes -o wide" does not show labels
  - Add the policy audit mode check
  - (Attempt to) improve style
  - Mark references with the ":ref:" role in a consistent way
  - Improve cross-referencing (in particular, the text displayed for the K8s label selector references was erroneous)
  - Strip trailing whitespaces

  Signed-off-by: Quentin Monnet <qmo@qmon.net>
  Signed-off-by: Fabio Falzoi <fabio.falzoi@isovalent.com>

* docs: Document Host Policies known issues

  [ upstream commit 642921dc760c731b4eac5131ac60faef672878ff ]

  Make sure that users of the Host Firewall are aware of the current limitations, or rather the known issues, for the feature.

  Signed-off-by: Quentin Monnet <qmo@qmon.net>
  Signed-off-by: Fabio Falzoi <fabio.falzoi@isovalent.com>

* docs: Link Host Policies "troubleshooting"/"known issues" from HFW tuto

  [ upstream commit 5dd2f14a732d325ba498045331ab8b8ebba76a6f ]

  Help users reading the Host Firewall tutorial to find the sections related to troubleshooting host policies or listing known issues for the Host Firewall.

  Signed-off-by: Quentin Monnet <qmo@qmon.net>
  Signed-off-by: Fabio Falzoi <fabio.falzoi@isovalent.com>

* chore(deps): update go to v1.21.10

  Signed-off-by: renovate[bot] <bot@renovateapp.com>

* images: update cilium-{runtime,builder}

  Signed-off-by: Cilium Imagebot <noreply@cilium.io>

* envoy: Bump go version to 1.22.3

  This is for security fixes in Go 1.22.

  Related upstream: https://groups.google.com/g/golang-announce/c/wkkO4P9stm0/
  Related build: https://github.com/cilium/proxy/actions/runs/8994624308/job/24708355529

  Signed-off-by: Tam Mach <tam.mach@cilium.io>

* api: Introduce ipLocalReservedPorts field

  [ upstream commit 23a9607e4e88dd9a8e139108bd2ba20b441ae0ff ]

  This field will be used by the CNI plugin to set the `ip_local_reserved_ports` sysctl option.

  Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>

* wireguard: Remove unused constants

  [ upstream commit 02fd4eeeefdde71ddabb6dbcf66e605a9eeb866c ]

  These values were never used in any released version of Cilium. Let's remove them.

  Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>

* wireguard: Export ListenPort constant

  [ upstream commit 117cb062266e61bbb57391e6750aaf88dd0a2328 ]

  This allows other packages to obtain the standard WireGuard port.

  Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>

* cilium-cni: Reserve ports that can conflict with transparent DNS proxy

  [ upstream commit 11fe7cc144aaff1a44b6ed9c733a9025c674e683 ]
  [ backporter notes:
    1. Fixes conflicts due to sysctl reconciler not being present
    2. Also reserve Geneve tunnel port in `auto` configuration ]

  This commit adds the ability for the cilium-cni plugin to reserve IP ports (via the `ip_local_reserved_ports` sysctl knob) during CNI ADD. Reserving these ports prevents the container network namespace from using them as ephemeral source ports, while still allowing a port to be explicitly allocated (see [1]).
  This functionality is added as a workaround for cilium/cilium#31535, where transparent DNS proxy mode causes conflicts when an ephemeral source port is chosen that is already in use in the host network namespace. By reserving ports used by Cilium itself (such as WireGuard and VXLAN), we hopefully reduce the number of such conflicts. (A sketch of the sysctl write follows the commit list below.)

  The set of reserved ports can be configured via a newly introduced agent flag. By default, it will reserve an auto-generated set of ports. The list of ports is configurable so that users running custom UDP services on ports in the ephemeral port range can provide their own set of ports. The flag may be set to an empty string to disable reservations altogether.

  [1] https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt

  Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>

* k8s/watcher: Remove outdated comments about K8sEventHandover

  [ upstream commit 03d8e73068cb422af0aca14c42ee3443c913fbf6 ]

  K8sEventHandover is no longer a flag, so remove any remaining references to it.

  Signed-off-by: Chris Tarazi <chris@isovalent.com>

* daemon, endpoint: Consolidate K8s metadata into struct

  [ upstream commit b17a9c64d3a16041f53870587331fc441a2a6c1b ]

  Refactor the Kubernetes-fetched metadata into a consolidated struct type, for ease of adding a new type without increasing the function signatures of the relevant functions.

  Signed-off-by: Chris Tarazi <chris@isovalent.com>

* watchers: Detect Pod UID changes

  [ upstream commit aa628ad60fb0af5056c500a111e8ddbf703c8c24 ]

  This commit causes the Kubernetes Pod watcher to trigger endpoint identity updates if the Pod UID changes (see the sketch after the commit list below). This covers the third scenario pointed out in [1]; see the issue for more details. To summarize, with StatefulSet restarts it is possible for the apiserver to coalesce Pod events, where the delete and add events are replaced with a single update event that includes, for example, a label change. If Cilium indexes based on namespace/name of the pod, then it will miss this update…
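The terminating-backends fix ("bpf: lb: un-break terminating backends for service without backend") lives in Cilium's BPF C datapath; purely as an illustration of the intended behaviour, here is a rough Go sketch. All names here (Service, Backend, CTEntry, selectBackend) are hypothetical stand-ins, not Cilium's actual code:

```go
package main

import "fmt"

// Hypothetical types, for illustration only; the real logic is BPF C.
type Backend struct {
	ID          uint32
	Terminating bool
}

type Service struct {
	ActiveBackends map[uint32]*Backend // may become empty
	AllBackends    map[uint32]*Backend // still holds terminating entries
}

type CTEntry struct {
	BackendID uint32
}

// selectBackend sketches the fixed behaviour: an established connection
// keeps using the backend recorded in its CT entry, even if that backend
// is terminating and the service has no active backends left. Only new
// flows without a CT entry are dropped when no backend is available.
func selectBackend(svc *Service, ct *CTEntry) (*Backend, error) {
	if ct != nil {
		if be, ok := svc.AllBackends[ct.BackendID]; ok {
			return be, nil // established flow: reuse, even if terminating
		}
	}
	if len(svc.ActiveBackends) == 0 {
		return nil, fmt.Errorf("no backend available") // new flow: drop
	}
	for _, be := range svc.ActiveBackends {
		return be, nil // pick any active backend (selection logic elided)
	}
	return nil, fmt.Errorf("unreachable")
}

func main() {
	be := &Backend{ID: 7, Terminating: true}
	svc := &Service{
		ActiveBackends: map[uint32]*Backend{},
		AllBackends:    map[uint32]*Backend{7: be},
	}
	got, err := selectBackend(svc, &CTEntry{BackendID: 7})
	fmt.Println(got, err) // established flow keeps its terminating backend
}
```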
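The clustermesh panic fix is, at its core, a reordering of a nil check relative to a watchdog start. A minimal sketch of the corrected pattern, with hypothetical names (newEtcdClient, backend, connect) standing in for the real clustermesh code:

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical stand-in for the etcd backend; names are illustrative.
type backend struct {
	Errors chan error
}

func newEtcdClient(configOK bool) (*backend, error) {
	if !configOK {
		return nil, errors.New("invalid etcd config file")
	}
	return &backend{Errors: make(chan error, 1)}, nil
}

func connect(configOK bool) error {
	b, err := newEtcdClient(configOK)
	// The fix: check the error *before* touching b. With the watchdog
	// started first, a nil b meant b.Errors was dereferenced and panicked.
	if err != nil {
		return fmt.Errorf("cannot create etcd client: %w", err)
	}
	// Only now is it safe to watch the errors channel for reconnections,
	// including during the initial connection establishment phase.
	go func() {
		for e := range b.Errors {
			fmt.Println("restarting etcd connection:", e)
		}
	}()
	return nil
}

func main() {
	fmt.Println(connect(false)) // returns an error instead of panicking
}
```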
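The XFRM commits refactor the temporary state removal into one function and log how long the stateless window lasted. A sketch of that shape under assumed names (removeConflictingXFRMState, addNewXFRMState, replaceXFRMState are hypothetical; the real code drives netlink from the ipsec package):

```go
package main

import (
	"log"
	"time"
)

// Hypothetical helpers; the real implementation manipulates XFRM states
// via netlink inside Cilium's ipsec package.
func removeConflictingXFRMState() error { return nil }
func addNewXFRMState() error            { return nil }

// replaceXFRMState sketches the refactor: the temporary removal lives in
// its own function, and the log line reports how long the window without
// a state lasted, so drops under heavy throughput can be estimated.
func replaceXFRMState() error {
	start := time.Now()
	if err := removeConflictingXFRMState(); err != nil {
		return err
	}
	err := addNewXFRMState()
	// Logged once the state is re-added, not when it is removed.
	log.Printf("temporary XFRM state removal lasted %s", time.Since(start))
	return err
}

func main() {
	if err := replaceXFRMState(); err != nil {
		log.Fatal(err)
	}
}
```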
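Conceptually, the port-reservation commit boils down to writing a port list into the ip_local_reserved_ports sysctl inside the Pod's network namespace. A sketch under stated assumptions: reserveLocalPorts and the netnsRoot path are hypothetical, and the port numbers shown are merely illustrative:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// reserveLocalPorts writes a comma-separated port list to the
// ip_local_reserved_ports sysctl, so the kernel never hands these ports
// out as ephemeral source ports (explicit binds still work). netnsRoot
// is assumed to point at the target namespace's /proc/sys.
func reserveLocalPorts(netnsRoot, ports string) error {
	path := filepath.Join(netnsRoot, "net/ipv4/ip_local_reserved_ports")
	return os.WriteFile(path, []byte(ports), 0o644)
}

func main() {
	// Example UDP ports a tunnel or WireGuard setup might own; the real
	// agent derives the set automatically or from its flag.
	if err := reserveLocalPorts("/proc/sys", "8472,51871"); err != nil {
		fmt.Fprintln(os.Stderr, "reserving ports:", err)
	}
}
```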
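Finally, the Pod UID check amounts to comparing the immutable UID across an update event. A sketch using the upstream Kubernetes API types; the handler name podUpdated and the printed reaction are assumptions, not the watcher's actual code:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
)

// podUpdated sketches the watcher change: because the apiserver can
// coalesce a StatefulSet pod's delete+add into a single update event,
// keying on namespace/name alone misses the replacement. Comparing the
// immutable pod UID detects it, so an identity update can be triggered.
func podUpdated(oldPod, newPod *corev1.Pod) {
	if oldPod.UID != newPod.UID {
		// Hypothetical reaction; the real watcher refreshes the
		// endpoint's identity here.
		fmt.Printf("pod %s/%s was replaced (UID %s -> %s), refreshing endpoint\n",
			newPod.Namespace, newPod.Name, oldPod.UID, newPod.UID)
	}
}

func main() {
	mk := func(uid string) *corev1.Pod {
		return &corev1.Pod{ObjectMeta: metav1.ObjectMeta{
			Namespace: "default", Name: "web-0", UID: types.UID(uid),
		}}
	}
	podUpdated(mk("aaa"), mk("bbb")) // same name, new UID: replacement
}
```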
More context in commit descriptions / comments.