Updating ENI prefix delegation fallback to use dedicated error codes #30536

Merged
merged 1 commit into cilium:main on Feb 1, 2024

Conversation

hemanthmalla
Member

The AWS SDK now returns a dedicated error code to indicate that a subnet is out of capacity for /28 prefixes. This commit updates the fallback logic accordingly; the existing fallback no longer works because the error code changed from InvalidParameterValue to InsufficientCidrBlocks.

Reported-by: Benjamin Pineau <benjamin.pineau@datadoghq.com>

Confirmed by the following operator log; note the `api error InsufficientCidrBlocks` part.

Unable to assign additional IPs to interface, will create new interface" error="operation error EC2: AssignPrivateIpAddresses, https response error StatusCode: 400, RequestID: <>, api error InsufficientCidrBlocks: The specified subnet does not have enough free cidr blocks to satisfy the request.

amazon-vpc-cni-k8s's handling of this error code can be found here: https://github.com/aws/amazon-vpc-cni-k8s/blob/master/pkg/ipamd/ipamd.go#L296-L305
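
As a rough illustration only (not the code in this PR): with aws-sdk-go-v2, the dedicated error code can be detected through the smithy error interface. The helper name below is made up for the example.

```
package example

import (
	"errors"

	"github.com/aws/smithy-go"
)

// isSubnetOutOfPrefixes reports whether an AssignPrivateIpAddresses call
// failed because the subnet has no free /28 blocks, i.e. the operator should
// fall back to allocating individual /32 IPs.
func isSubnetOutOfPrefixes(err error) bool {
	var apiErr smithy.APIError
	return errors.As(err, &apiErr) && apiErr.ErrorCode() == "InsufficientCidrBlocks"
}
```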

AWS SDK now returns a dedicated error code to indicate the scenario
where a subnet is out of capacity for /28 prefixes. This commit updates
the fallback logic. The existing fallback logic does not work anymore
since the code changed from InvalidParameterValue to InsufficientCidrBlocks

Reported-by: Benjamin Pineau <benjamin.pineau@datadoghq.com>
Signed-off-by: Hemanth Malla <hemanth.malla@datadoghq.com>
@hemanthmalla hemanthmalla requested a review from a team as a code owner January 30, 2024 22:00
@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Jan 30, 2024
@hemanthmalla hemanthmalla added the area/eni Impacts ENI based IPAM. label Jan 30, 2024
Member

@christarazi christarazi left a comment

LGTM, thanks. A couple of questions:

  1. See comment below
  2. How was this detected? Can we automate the detection in CI? Can we improve our tests to catch this before merging the changes that broke it in the first place?

Since this is a bug, can you write a user-facing release note?

pkg/aws/eni/node.go (review comment, resolved)
@christarazi christarazi added the kind/bug (This is a bug in the Cilium logic.) and release-note/bug (This PR fixes an issue in a previous release of Cilium.) labels, and removed the dont-merge/needs-release-note-label label Jan 31, 2024
@hemanthmalla
Member Author

How was this detected? Can we automate the detection in CI? Can we improve our tests to catch this before merging the changes that broke it in the first place?

@christarazi We realized this while debugging something else: the operator was not actually falling back, and the logs indicated a different error code than expected. The operator's previous logic was correct when the fallback feature was introduced; see https://repost.aws/knowledge-center/vpc-insufficient-ip-errors for example. I'm actually surprised to see that AWS changed this behavior.

I guess the only way to catch this in CI would be to attach a small subnet to an EKS cluster and have an e2e test exercise the fallback path, but that sounds like overkill at the moment, especially now that there's a proper error code.

@hemanthmalla
Member Author

/test

@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Feb 1, 2024
@julianwiedmann
Member

Do we need backports for this issue?

@julianwiedmann julianwiedmann added this pull request to the merge queue Feb 1, 2024
Merged via the queue into cilium:main with commit ed26b07 Feb 1, 2024
63 checks passed
@hemanthmalla
Member Author

@julianwiedmann Yes, I think we should backport this.

@hemanthmalla hemanthmalla added the affects/v1.15, affects/v1.14, affects/v1.13, needs-backport/1.13, needs-backport/1.14, and needs-backport/1.15 labels Feb 1, 2024
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from main in 1.14.7 Feb 1, 2024
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Needs backport from main in 1.13.12 Feb 1, 2024
@nbusseneau nbusseneau mentioned this pull request Feb 8, 2024
@nbusseneau nbusseneau added the backport-pending/1.13 label and removed the needs-backport/1.13 label Feb 8, 2024
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from main to Backport pending to v1.13 in 1.13.12 Feb 8, 2024
@nbusseneau nbusseneau mentioned this pull request Feb 8, 2024
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Needs backport from main to Backport pending to v1.14 in 1.14.7 Feb 8, 2024
@nbusseneau nbusseneau mentioned this pull request Feb 8, 2024
@nbusseneau nbusseneau added the backport-pending/1.15 label and removed the needs-backport/1.15 label Feb 8, 2024
@github-actions github-actions bot added the backport-done/1.15 and backport-done/1.14 labels and removed the backport-pending/1.15 and backport-pending/1.14 labels Feb 9, 2024
@maintainer-s-little-helper maintainer-s-little-helper bot removed this from Backport pending to v1.14 in 1.14.7 Feb 9, 2024
@github-actions github-actions bot added the backport-done/1.13 label and removed the backport-pending/1.13 label Feb 9, 2024
@maintainer-s-little-helper maintainer-s-little-helper bot added this to Backport done to v1.14 in 1.14.7 Feb 9, 2024
@maintainer-s-little-helper maintainer-s-little-helper bot moved this from Backport pending to v1.13 to Backport done to v1.13 in 1.13.12 Feb 9, 2024
@maintainer-s-little-helper maintainer-s-little-helper bot removed this from Backport done to v1.13 in 1.13.12 Feb 9, 2024
hemanthmalla added a commit to DataDog/cilium that referenced this pull request Feb 27, 2024
cilium#30536 prematurely concluded that AWS now uses InsufficientCidrBlocks
to indicate the subnet is out of prefixes. It looks like AWS still uses
InvalidParameterValue together with "There aren't sufficient free Ipv4
addresses or prefixes" to indicate that the subnet is at capacity. In
addition, InsufficientCidrBlocks is returned when the subnet is at capacity,
potentially due to fragmentation. In either case it's worth trying to fall
back, since /32 IPs might still be available compared to /28 prefixes. See
the PR for details from the AWS support ticket.

Signed-off-by: Hemanth Malla <hemanth.malla@datadoghq.com>
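
For illustration, a hedged sketch of checking both signals described above before falling back from /28 prefix delegation to /32 IPs; the helper name and wiring are assumptions, not the actual Cilium change.

```
package example

import (
	"errors"
	"strings"

	"github.com/aws/smithy-go"
)

// shouldFallbackToSecondaryIPs returns true when the AWS error indicates the
// subnet cannot provide another /28 prefix, so individual /32 IPs should be
// tried instead.
func shouldFallbackToSecondaryIPs(err error) bool {
	var apiErr smithy.APIError
	if !errors.As(err, &apiErr) {
		return false
	}
	switch apiErr.ErrorCode() {
	case "InsufficientCidrBlocks":
		// Subnet out of /28 prefixes, potentially due to fragmentation.
		return true
	case "InvalidParameterValue":
		// AWS still uses this code with a capacity-related message.
		return strings.Contains(apiErr.ErrorMessage(),
			"sufficient free Ipv4 addresses or prefixes")
	}
	return false
}
```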
hemanthmalla added a commit to DataDog/cilium that referenced this pull request Feb 28, 2024
github-merge-queue bot pushed a commit that referenced this pull request Mar 1, 2024
gandro pushed a commit that referenced this pull request Mar 19, 2024
[ upstream commit 5a487b5 ]

gandro pushed a commit that referenced this pull request Mar 19, 2024
gandro pushed a commit that referenced this pull request Mar 19, 2024
gandro pushed a commit that referenced this pull request Mar 20, 2024
gandro pushed a commit that referenced this pull request Mar 20, 2024
jrajahalme pushed a commit that referenced this pull request Mar 20, 2024
jrajahalme pushed a commit that referenced this pull request Mar 21, 2024
jrajahalme pushed a commit that referenced this pull request Mar 21, 2024
ubergesundheit added a commit to giantswarm/cilium-upstream that referenced this pull request Apr 25, 2024
* wireguard: Encrypt L7 proxy pkts to remote pods

[ upstream commit 26f83491643b8c9b2921544ad340d8de07e9138c ]

Marco reported that the following L7 proxy traffic is leaked (bypasses
the WireGuard encryption):

    1. WG: tunnel, L7 egress policy: forward traffic is leaked
    2. WG: tunnel, DNS: all DNS traffic is leaked
    3. WG: native routing, DNS: all DNS traffic is leaked

This was reported before the introduction of the --wireguard-encapsulate
[1].

The tunneling leak cases are obvious. The L7 proxy traffic got
encapsulated by Cilium's tunneling device, which made it bypass
the redirection to Cilium's WireGuard device. However, [1] fixed
this behavior. For Cilium v1.15 (upcoming) nothing needs to be
configured. Meanwhile, for v1.14.4 users need to set
--wireguard-encapsulate=true.

The native routing case is more tricky. The L7 proxy traffic got a src IP
of the host instead of the client pod, so the redirection was bypassed.
To fix this, we extended the redirection check to identify L7 proxy
traffic.

[1]: https://github.com/cilium/cilium/pull/28917

Reported-by: Marco Iorio <marco.iorio@isovalent.com>
Signed-off-by: Martynas Pumputis <m@lambda.lt>

* wireguard: Improve L7 proxy traffic detection

[ upstream commit 96e01adeaa51c85a5671d34c4715c07de97e26e1 ]

Use marks set by the proxy instead of assuming that each pkt from
HOST_ID w/o MARK_MAGIC_HOST belongs to the proxy.

In addition, in the tunneling mode the mark might get reset before
entering wg_maybe_redirect_to_encrypt(), as the proxy packets are
instead routed to from_host@cilium_host. The latter calls
inherit_identity_from_host() which resets the mark. In this case, rely
on the TC index.

Suggested-by: Gray Lian <gray.liang@isovalent.com>
Signed-off-by: Martynas Pumputis <m@lambda.lt>

* images: bump cni plugins to v1.4.1

The result of running

```
images/scripts/update-cni-version.sh 1.4.1
```

Signed-off-by: André Martins <andre@cilium.io>

* images: update cilium-{runtime,builder}

Signed-off-by: André Martins <andre@cilium.io>

* bgpv1: Disable PodCIDR Reconciler for unsupported IPAM modes

[ upstream commit 276499405242e03b01da54f2a1f35891db56783a ]

[ backporter's note: Discarded document changes. We'll backport it
  together with other recent document changes. ]

PodCIDR shouldn't take any effect for the unsupported IPAM modes. Modify
ExportPodCIDRReconciler's constructor to not provide ConfigReconciler
for unsupported IPAMs.

Signed-off-by: Yutaro Hayakawa <yutaro.hayakawa@isovalent.com>

* Prepare for release v1.15.2

Signed-off-by: Jarno Rajahalme <jarno@isovalent.com>

* install: Update image digests for v1.15.2

Generated from https://github.com/cilium/cilium/actions/runs/8266651120.

`quay.io/cilium/cilium:v1.15.2@sha256:bfeb3f1034282444ae8c498dca94044df2b9c9c8e7ac678e0b43c849f0b31746`
`quay.io/cilium/cilium:stable@sha256:bfeb3f1034282444ae8c498dca94044df2b9c9c8e7ac678e0b43c849f0b31746`

`quay.io/cilium/clustermesh-apiserver:v1.15.2@sha256:478c77371f34d6fe5251427ff90c3912567c69b2bdc87d72377e42a42054f1c2`
`quay.io/cilium/clustermesh-apiserver:stable@sha256:478c77371f34d6fe5251427ff90c3912567c69b2bdc87d72377e42a42054f1c2`

`quay.io/cilium/docker-plugin:v1.15.2@sha256:ba4df0d63b48ba6181b6f3df3b747e15f5dfba06ff9ee83f34dd0143c1a9a98c`
`quay.io/cilium/docker-plugin:stable@sha256:ba4df0d63b48ba6181b6f3df3b747e15f5dfba06ff9ee83f34dd0143c1a9a98c`

`quay.io/cilium/hubble-relay:v1.15.2@sha256:48480053930e884adaeb4141259ff1893a22eb59707906c6d38de2fe01916cb0`
`quay.io/cilium/hubble-relay:stable@sha256:48480053930e884adaeb4141259ff1893a22eb59707906c6d38de2fe01916cb0`

`quay.io/cilium/operator-alibabacloud:v1.15.2@sha256:e2dafa4c04ab05392a28561ab003c2894ec1fcc3214a4dfe2efd6b7d58a66650`
`quay.io/cilium/operator-alibabacloud:stable@sha256:e2dafa4c04ab05392a28561ab003c2894ec1fcc3214a4dfe2efd6b7d58a66650`

`quay.io/cilium/operator-aws:v1.15.2@sha256:3f459999b753bfd8626f8effdf66720a996b2c15c70f4e418011d00de33552eb`
`quay.io/cilium/operator-aws:stable@sha256:3f459999b753bfd8626f8effdf66720a996b2c15c70f4e418011d00de33552eb`

`quay.io/cilium/operator-azure:v1.15.2@sha256:568293cebc27c01a39a9341b1b2578ebf445228df437f8b318adbbb2c4db842a`
`quay.io/cilium/operator-azure:stable@sha256:568293cebc27c01a39a9341b1b2578ebf445228df437f8b318adbbb2c4db842a`

`quay.io/cilium/operator-generic:v1.15.2@sha256:4dd8f67630f45fcaf58145eb81780b677ef62d57632d7e4442905ad3226a9088`
`quay.io/cilium/operator-generic:stable@sha256:4dd8f67630f45fcaf58145eb81780b677ef62d57632d7e4442905ad3226a9088`

`quay.io/cilium/operator:v1.15.2@sha256:e592ceba377985eb4225b0da9121d0f8c68a564ea38e5732bd6d59005eb87c08`
`quay.io/cilium/operator:stable@sha256:e592ceba377985eb4225b0da9121d0f8c68a564ea38e5732bd6d59005eb87c08`

Signed-off-by: Jarno Rajahalme <jarno@isovalent.com>

* gha: drop unused check_url environment variable

[ upstream commit e17cf21d9720493766c9f1d12c2d75c842f26e86 ]

This variable used to be used in combination with the Sibz/github-status-action
action, which we replaced with myrotvorets/set-commit-status-action when
reworking the workflows to be triggered by Ariane [1]. Given it is now
unused, let's get rid of the leftover environment variable, so that we
also stop copying it to new workflows.

[1]: 9949c5a1891a ("ci: rework workflows to be triggered by Ariane")

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>

* gha: centralize kind version and image definition in set-env-variables

[ upstream commit 394b3de26a4e2235ec25399861e12886b507f335 ]

Let's define kind-related variables (i.e., version, k8s image and k8s
version) inside the set-env-variables action. Once all consumers have
been migrated through the subsequent commit, this will ensure
consistency across workflows, simplify version bumps as well as the
introduction of new workflows depending on them. One extra byproduct
is that renovate updates will also stop requesting reviews from all
the different teams owning each specific workflow.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>

* gha: migrate workflows to use the global kind-related variables

[ upstream commit aabdfa73d3d83edda3935277bd08c4c4c0bf5b68 ]

Let's switch all the workflows over to using the globally defined
kind-related variables, and remove the workflow specific definitions.
This also addresses a few cases which didn't specify any version.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>

* gha: don't wait for kind clusters to become ready

[ upstream commit 39637d6d2385baab556078d844f431522d99f616 ]

They will never become ready, because no CNI is present at that point. Hence, let's
just avoid wasting one minute waiting for the timeout to expire.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>

* gha: checkout target branch instead of the default one

[ upstream commit 6716a9c01b69d88da0c3316fe8b3640180bbafb1 ]

Currently, the GHA workflows running tests triggered on pull_request
and/or push events initially checkout the default branch to configure
the environment variables, before retrieving the PR head. However,
this is problematic on stable branches, as we then end up using the
variables from the default (i.e., main) branch (e.g., Kubernetes
version, Cilium CLI version), which may not be appropriate here.

Hence, let's change the initial checkout to retrieve the target (i.e.,
base) branch, falling back to the commit in case of push events.
This ensures that we retrieve the variables from the correct branch,
and matches the behavior of Ariane triggered workflows.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>

* gha: switch kind images back to kind

Kind has released stable versions for k8s 1.29 so we can use this
image instead of the cilium kindest for ginkgo tests. The same
version has already been configured for the rest of the workflows
in the previous commits.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>

* loader: fix cancelled context during compile logging errors.

[ upstream commit 70b405f32018af84ad8221e4bafb223a70c23736 ]

On Linux/Unix based implementations, exec/cmd.Run will return either context.Canceled or the error "signal: killed" depending on whether the cancellation occurred while the process was running.

There are several places where we check `errors.Is(err, context.Canceled)` to decide whether to emit high level logs about failed program compilations. Because an already running cmd.Run() doesn't return an error that satisfies this, this results in spurious error logs about failed compilation (i.e. "signal: killed").

This meant that in cases where a compilation is legitimately cancelled, we would still log an error such as

msg="BPF template object creation failed" ... error="...: compile bpf_lxc.o: signal: killed"

This can occur occasionally in CI, which requires that no errors are logged in order to pass, causing failures.

example:
```
	ctx, c := context.WithTimeout(context.Background(), time.Second)
	go func() {
		time.Sleep(time.Second)
		c()
	}()
	cmd := exec.CommandContext(ctx, "sleep", "2")
	fmt.Println(cmd.Run())

	ctx, c = context.WithTimeout(context.Background(), time.Second)
	c()
	cmd = exec.CommandContext(ctx, "sleep", "2")
	fmt.Println(cmd.Run())
```

To fix this, ctx.Err() is joined into the returned error when all of the following hold:
* the context was cancelled (context.Canceled)
* the process has not exited by itself
* the process appears to have been SIGKILL'ed

Addresses: #30991

Signed-off-by: Tom Hadlaw <tom.hadlaw@isovalent.com>
Signed-off-by: Gilberto Bertin <jibi@cilium.io>
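
As a minimal sketch of the approach described above (not the actual loader change; the helper name and exact checks are assumptions): ctx.Err() is joined into the cmd.Run() error when the process appears to have been killed by the cancelled context, so callers matching context.Canceled no longer log a spurious "signal: killed" failure.

```
package example

import (
	"context"
	"errors"
	"os/exec"
)

// joinContextErr attaches the context error to a cmd.Run() failure when the
// process was terminated by a signal (ExitCode() == -1) and the context was
// cancelled, so errors.Is(err, context.Canceled) matches on the result.
func joinContextErr(ctx context.Context, err error) error {
	if err == nil || ctx.Err() == nil {
		return err
	}
	var exitErr *exec.ExitError
	if errors.As(err, &exitErr) && exitErr.ExitCode() == -1 {
		return errors.Join(err, ctx.Err())
	}
	return err
}
```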

* cni: use default logger with timestamps.

[ upstream commit a099bf1571f1a090ccfd6ccbba545828a6b3b63c ]

Unlike runtime agent/operator logs, CNI logs are just written to disk so we have no way to attach timestamps to them.
This makes it harder to debug CNI issues as we have no way to correlate when things happened between Agent logs and CNI events.

This switches CNI to use the same default logger, except with timestamps enabled.

Signed-off-by: Tom Hadlaw <tom.hadlaw@isovalent.com>
Signed-off-by: Gilberto Bertin <jibi@cilium.io>

* gateway-api: Bump to the latest version from upstream

[ upstream commit d50c2e4fb057f3ca041a7c346cac8c12e0f4a422 ]

This is mainly to pick up the new GRPC Conformance tests added recently.

Relates: kubernetes-sigs/gateway-api#2745
Signed-off-by: Tam Mach <tam.mach@cilium.io>
Signed-off-by: Gilberto Bertin <jibi@cilium.io>

* gateway: Remove manual GRPCRoute test

[ upstream commit f77f3c385e885eab1e0c58cf8d80c99244802637 ]

This is to remove our manual GRPCRoute test, in favour of more
comprehensive tests added recently in upstream.

Signed-off-by: Tam Mach <tam.mach@cilium.io>
Signed-off-by: Gilberto Bertin <jibi@cilium.io>

* cilium-dbg: Don't fatal on XFRM rule deletion errors

[ upstream commit 927969b247ed8f8b499988274a23e8ca2da42346 ]

This commit slightly changes the behavior of the "encrypt flush"
command in case of errors when trying to delete XFRM rules. The tool
currently lists rules, filters them based on user-given arguments, and
then deletes them. If an XFRM rule is deleted by the agent or the user
while we're filtering, the deletion will fail.

The current behavior in that case is to fatal. On busy clusters, that
might mean that we always fatal because XFRM states and policies are
constently added and removed.

This commit changes the behavior to proceed with subsequent deletions in
case one fails.

Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
Signed-off-by: Gilberto Bertin <jibi@cilium.io>

* cilium-dbg: Refactor confirmation message for encrypt flush

[ upstream commit 5c2a67fcd306329abb8f5be0a7bac753141bfea6 ]

This commit refactors the code a bit to simplify a later commit. No
functional changes.

This may be a bit excessive in commit splitting, but at least I can
claim my last commit is free of any refactoring :sweat_smile:

Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
Signed-off-by: Gilberto Bertin <jibi@cilium.io>

* cilium-dbg: New --stale flag for encrypt flush

[ upstream commit 5eb27e25b38bc6d073e96835f674d0748176d49e ]

This new flag will allow users to clean stale XFRM states and policies
based on the node ID map contents. If XFRM states or policies are found
with a node ID that is not in the BPF map, then we probably have a leak
somewhere.

Such leaks can lead in extreme cases to performance degradation when the
number of XFRM states and policies grows large (and if using ENI or
Azure IPAM). Having a tool to cleanup these XFRM states and policies
until the leak is fixed can therefore be critical.

The new flag is incompatible with the --spi and --node-id filter flags.

We first dump the XFRM rules and then dump the map content. In that way,
if a new node ID is allocated while we're running the tool, we will
simply ignore the corresponding XFRM rules. If a node ID is removed
while running the tool, we will fail to remove the corresponding XFRM
rules and continue with the others.

Tested on a GKE cluster by adding fake XFRM states and policies that the
tool was able to remove.

Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
Signed-off-by: Gilberto Bertin <jibi@cilium.io>

* k8s_install.sh: specify the CNI version

[ upstream commit f92b528abc30a5c66ba80d5922d4e58c48cfe7e1 ]

The CNI version should be specified so that, in case we have to fall back
to installing k8s via binaries, it doesn't fail with the error:

```
10:29:25      k8s1-1.25: gzip: stdin: not in gzip format
10:29:25      k8s1-1.25: tar: Child returned status 1
10:29:25      k8s1-1.25: tar: Error is not recoverable: exiting now
```

Fixes: ce69afdc3ad1 ("add support for k8s 1.25.0")
Signed-off-by: André Martins <andre@cilium.io>
Signed-off-by: Gilberto Bertin <jibi@cilium.io>

* statedb: Add read-after-write tests

[ upstream commit 070e4deddcdbd892bf35be1d6224237d533ef27b ]

The optimization to reuse the last used "indexReadTxn" was broken
when a WriteTxn was first used for a read and then for a write
followed by a read (the last read re-used an old transaction).

Add test to observe this bug.

Signed-off-by: Jussi Maki <jussi@isovalent.com>
Signed-off-by: Gilberto Bertin <jibi@cilium.io>

* statedb: Fix read-after-write bug

[ upstream commit 52944a032150e8040869c20bead7ef05ae32dbe3 ]

[ backporter's note: minor conflicts due to different variables' names]

The optimization in *txn to hold the last used index read transaction
to avoid a hash map lookup was broken as the write txn used for reading
was cloned, leading to not being able to read any of the future writes
in the same transaction.

Benchmarks show that this optimization does not actually help, so solve
the problem by removing the optimization.

Before:
BenchmarkDB_RandomLookup-8       1000000              1471 ns/op
After:
BenchmarkDB_RandomLookup-8       1000000              1485 ns/op

Fixes: d0d4d46992bf ("statedb: Store index unique info in the index tree entry")
Signed-off-by: Jussi Maki <jussi@isovalent.com>
Signed-off-by: Gilberto Bertin <jibi@cilium.io>

* introduce ARM github workflows

[ upstream commit 7a301a48c55c427714261e0686699fc2f63d2d31 ]

This commit adds the GH workflow to run on arm machines. This
effectively means that we can remove our travis integration and only use
GH actions from now on.

Signed-off-by: André Martins <andre@cilium.io>
Signed-off-by: Gilberto Bertin <jibi@cilium.io>

* metrics: Disable prometheus metrics by default

[ upstream commit f31cbef727b3cb252d6117f48fd0bb24106d9876 ]

Prior to commit c49ef45399fc9, the prometheus metrics server was
disabled by default in Cilium. Typically we would expect users to
specify in helm both `prometheus.enabled` and `prometheus.port` to
determine how to configure the prometheus server to ensure that the
prometheus port doesn't also conflict with other services that the user
is running on their nodes in their clusters.

With the refactor in the aforementioned commit, the default was set to
`:9962`. This means that even if the user installed Cilium without
prometheus settings, or explicitly configured helm with
`prometheus.enabled: false`, the prometheus metrics server would be
enabled.

This patch reverts the default back to the pre-v1.14 default.

Fixes: c49ef45399fc ("metrics: Modularize daemon metrics registry")
Signed-off-by: Joe Stringer <joe@cilium.io>
Signed-off-by: Gilberto Bertin <jibi@cilium.io>

* ingress: Update docs with network policy example

[ upstream commit b68cf99c3bb50a6a06dbbb0fd3085e5430297eac ]

This is to extend the existing basic ingress docs with external lockdown
CCNP, while still allows in-cluster traffic to Ingress LB IP.

Relates: https://github.com/cilium/cilium/pull/28126
Signed-off-by: Tam Mach <tam.mach@cilium.io>
Signed-off-by: Gilberto Bertin <jibi@cilium.io>

* Downgrade L2 Neighbor Discovery failure log to Debug

[ upstream commit db7b3efeb4176b7a993816cb095da93ca184e968 ]

Suppress the "Unable to determine next hop address" logs. While it shows
the L2 neighbor resolution failure, it does not always indicate a
datapath connectivity issue.

For example, when "devices=eth+" is specified and the device
naming/purposing is not consistent across the nodes in the cluster, in
some nodes "eth1" is a device reachable to other nodes, but in some
nodes, it is not. As a result, L2 Discovery generates an "Unable to
determine next hop address".

Another example is ENI mode with automatic device detection. When
secondary interfaces are added, they are used for L2 Neighbor Discovery
as well. However, nodes can only be reached via the primary interface
through the default route in the main routing table. Thus, L2 Neighbor
Discovery generates "Unable to determine next hop address" for secondary
interfaces.

In both cases, it does not always mean the datapath has an issue for KPR
functionality. However, the logs appear repeatedly, are noisy, and the
message is error-ish, causing confusion.

This log started to appear for some users who did not see it before from
v1.14.3 (#28858) and v1.15.0 (in theory). For v1.14.3, it affects KPR +
Tunnel users because of f2dcc86. Before the commit, we did not perform
L2 Neighbor Discovery in tunnel mode, so even if users had an interface
unreachable to other nodes, the log did not appear.

For v1.15.0, it affects the users who had the unreachable interface.
2058ed6a made it visible. Before that commit, some kinds of errors like
EHOSTUNREACH and ENETUNREACH were not caught because the FIBMatch option
wasn't specified. After v1.15.0, users started to see the log.

Fixes: #28858

Signed-off-by: Yutaro Hayakawa <yutaro.hayakawa@isovalent.com>
Signed-off-by: Gilberto Bertin <jibi@cilium.io>

* gha: additionally test host firewall + KPR disabled in E2E tests

[ upstream commit 77a0c6b0424b8fe89459ff74787b2e46461731aa ]

Let's additionally enable host firewall on a couple of existing matrix
entries associated with KPR disabled, so that we can additionally cover
this configuration and prevent regressions.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
Signed-off-by: Gilberto Bertin <jibi@cilium.io>

* bgpv1: Adjust ConnectionRetryTimeSeconds to 1 in component tests

[ upstream commit 777b58037491951212ed874165563aac17e61763 ]

Set ConnectionRetryTimeSeconds to 1s in the component tests unless it is
specified explicitly. Otherwise, when the initial connect fails, we need
to wait 120s for the next connection attempt by default, which may be
longer than the timeout of the test itself.

Fixes: #31217

Signed-off-by: Yutaro Hayakawa <yutaro.hayakawa@isovalent.com>
Signed-off-by: Gilberto Bertin <jibi@cilium.io>

* bpf: Enable monitor aggregation for all events in bpf_network.c

[ upstream commit 81f14bbd4ebe898c48a918822ed30fe42ed5620d ]

This commit adjusts the usage of send_trace_notify in bpf_network.c to
enable monitor aggregation for all events emitted at this observation
point in the datapath. This change helps improve resource usage by
reducing the overall number of events that the datapath emits, while
still enabling packet observability with Hubble.

The events in bpf_network.c enable observability into the IPSec
processing of the datapath. Before this commit, multiple other efforts
have been made to increase the aggregation of events related to IPSec to
reduce resource usage, see #29616 and #27168. These efforts were related
to packets that were specifically marked as encrypted or decrypted by
IPSec and did not include events in bpf_network.c that were emitted when
either: (a) a plaintext packet has been received from the network, or
(b) a packet was decrypted and reinserted into the stack by XFRM. Both
of these events are candidates for aggregation because similar to-stack
events will be emitted down the line in the datapath anyways.
Additionally, these events are mainly useful for root-cause
analysis or debugging and are not necessarily helpful from an overall
observability standpoint.

Signed-off-by: Ryan Drew <ryan.drew@isovalent.com>
Signed-off-by: Gilberto Bertin <jibi@cilium.io>

* endpointmanager: Improve health reporter messages when stopped

[ upstream commit 02d04b34005c45d955a6850d3faf561d9eae8f78 ]

The health reporter messages could use a bit more clarity and another
was a copy-paste mistake.

Fixes: bb957f3821 ("endpointmanager: add modular health checks to epm componenets.")

Signed-off-by: Chris Tarazi <chris@isovalent.com>
Signed-off-by: Gilberto Bertin <jibi@cilium.io>

* bpf, tests: Force v6 address into 'rodata' section

[ upstream commit 39e65e99d9397b6a036a049d4830322261de6920 ]

This commit forces v6 addresses in the pkgten.h header into the .rodata
section of compiled ELF binaries. This prevents a possible mismatch
between ELF and BTF for the locations of these v6 symbols.

Signed-off-by: Ryan Drew <ryan.drew@isovalent.com>
Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>

* ingress: don't emit an error message on namespace termination

[ upstream commit 3b8c517c95e6167a531e24e9d0fea5c1416478d4 ]

When a namespace containing an Ingress configured in dedicated mode gets
deleted, the controller may first receive the event corresponding to the
deletion of one of the dependent resources, and only afterwards of the
ingress itself. Hence, it will attempt to recreate these resources (e.g.,
the CiliumEnvoyConfig), but fail as rejected by the k8s APIServer with an
error along the lines of:

  failed to create or update CiliumEnvoyConfig: ciliumenvoyconfigs.cilium.io
    \"cilium-ingress-cilium-test-ingress-service\" is forbidden: unable to
    create new content in namespace cilium-test because it is being terminated

Let's replace this error with an informative message at lower severity,
given that the situation resolves itself, and there's no actionable item
for a user reading the error. This also prevents the error from being
flagged by the Cilium CLI no-errors-in-logs check.

Since we don't return error to controller-runtime, the reconciliation will
not be retried automatically. However, we will be woken up again once the
ingress deletion event is received, so that we can proceed with the cleanup.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>

* ingress: add unit test for namespace termination error

[ upstream commit 00e45d7a19ff46dafcb7cd99a737c6d240e84e81 ]

[ backporter's notes: adapted the newIngressReconciler parameters as
  appropriate for v1.15. ]

This commit adds a unit test for the case where reconciliation shouldn't
emit an error if resource creation is failing due to Namespace termination.

Signed-off-by: Marco Hofstetter <marco.hofstetter@isovalent.com>
Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>

* envoy: enable k8s secret watch even if only CEC is enabled

Currently, the K8s Secret watch (used by Envoy SecretSync (K8s TLS Secret -> Envoy SDS))
is only active if either Ingress Controller or Gateway API is enabled.

Hence Secrets aren't available via SDS in cases where only CiliumEnvoyConfig is
enabled (`--enable-envoy-config`).

This commit fixes this by enabling the K8s Secret watch also in cases where only
CiliumEnvoyConfig is enabled (without Ingress Controller and/or Gateway API
being enabled).

Fixes: #26005

Signed-off-by: Marco Hofstetter <marco.hofstetter@isovalent.com>

* chore(deps): update dependency cilium/cilium-cli to v0.16.3

Signed-off-by: renovate[bot] <bot@renovateapp.com>

* Update --nodes-without-cilium flag

The --nodes-without-cilium flag type changed from string slice to
boolean in cilium/cilium-cli#2427.

Signed-off-by: Michi Mutsuzaki <michi@isovalent.com>

* chore(deps): update gcr.io/distroless/static-debian11:nonroot docker digest to 55c6361

Signed-off-by: renovate[bot] <bot@renovateapp.com>

* k8s/watchers: inline single-use updateEndpointLabels

[ upstream commit bba0ff569e10e902747ae8df0761f96374a10f5f ]

The function updateEndpointLabels is only used in one place. Inline it
to improve readability and simplify changes in successive commits.

Signed-off-by: Tobias Klauser <tobias@cilium.io>

* k8s/watchers: warn when endpoint label update fails on pod update

[ upstream commit 9a26446df2acad3c03c2413aca4d95f1e9bd7406 ]

Currently, failure to update endpoint labels based on pod labels on pod
update is silently ignored by the callers or only reflected in error
count metrics. Report a warning to clearly indicate that pod and
endpoint labels might be out of sync.

Signed-off-by: Tobias Klauser <tobias@cilium.io>

* k8s: move filterPodLabels to k8s/utils package for SanitizePodLabels

[ upstream commit 23098051f1994e5ac69d181680556de28c3e5540 ]

Currently GetPodMetadata is the only caller of SanitizePodLabels but
other callers will be introduced in successive changes. This change
ensures the io.cilium.k8s.* labels are filtered for these callers as
well.

Signed-off-by: Tobias Klauser <tobias@cilium.io>
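
A minimal sketch of the filtering discussed above, assuming a helper shaped roughly like filterPodLabels; Cilium's actual signature and call sites may differ.

```
package example

import "strings"

// filterPodLabels drops cilium-owned io.cilium.k8s.* keys so they are never
// treated as user-provided pod labels, on pod creation and update alike.
func filterPodLabels(labels map[string]string) map[string]string {
	filtered := make(map[string]string, len(labels))
	for k, v := range labels {
		if strings.HasPrefix(k, "io.cilium.k8s.") {
			continue
		}
		filtered[k] = v
	}
	return filtered
}
```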

* k8s/watchers: set unfiltered pod labels on CEP on pod update

[ upstream commit 5508746586091ffa2c72e24362ec62781a4ce2ad ]

The labels on the CEP are set to the unfiltered pod labels on CEP
creation, see [1]. On any label update where labels contain filtered
labels, e.g. io.cilium.k8s.* labels or labels filtered out by the user
by means of the --label and/or --label-prefix-file agent options the
current logic would wrongly remove the filtered labels from the CEP
labels. Fix this by always using the unfiltered pod labels.

[1] https://github.com/cilium/cilium/blob/b58125d885edbb278f11f84303c0e7c934ca7ea4/pkg/endpointmanager/endpointsynchronizer.go#L185-L187

Signed-off-by: Tobias Klauser <tobias@cilium.io>

* k8s/utils: filter out cilium-owned labels on pod update

[ upstream commit ed4e6505c059512446d45b40d5aba6ca5cf15d3c ]

Currently `io.cilium.k8s.*` pod labels are only filtered out on pod
creation. On pod update, they are currently not filtered which leads to
a situation where no pod label update is reflected in the endpoint
anymore in case of a `io.cilium.k8s.*` label set on the pod:

$ cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: foo
  namespace: default
  labels:
    app: foobar
    io.cilium.k8s.something: bazbar
spec:
  containers:
  - name: nginx
    image: nginx:1.25.4
    ports:
    - containerPort: 80
EOF
$ kubectl -n kube-system exec -it cilium-nnnn -- cilium-dbg endpoint list
ENDPOINT   POLICY (ingress)   POLICY (egress)   IDENTITY   LABELS (source:key[=value])                                              IPv6                  IPv4           STATUS
           ENFORCEMENT        ENFORCEMENT
252        Disabled           Disabled          50316      k8s:app=foobar                                                           fd00:10:244:1::8b69   10.244.1.78    ready
                                                           k8s:io.cilium.k8s.namespace.labels.kubernetes.io/metadata.name=default
                                                           k8s:io.cilium.k8s.policy.cluster=kind-kind
                                                           k8s:io.cilium.k8s.policy.serviceaccount=default
                                                           k8s:io.kubernetes.pod.namespace=default
$ kubectl label pods foo app=nothing --overwrite
$ kubectl describe pod foo
[...]
Labels:           app=nothing
                  io.cilium.k8s.something=bazbar
[...]
$ kubectl describe cep foo
[...]
Labels:       app=foobar
              io.cilium.k8s.something=bazbar
[...]
$ kubectl -n kube-system exec -it cilium-nnnn -- cilium-dbg endpoint list
ENDPOINT   POLICY (ingress)   POLICY (egress)   IDENTITY   LABELS (source:key[=value])                                              IPv6                  IPv4           STATUS
           ENFORCEMENT        ENFORCEMENT
252        Disabled           Disabled          50316      k8s:app=foobar                                                           fd00:10:244:1::8b69   10.244.1.78    ready
                                                           k8s:io.cilium.k8s.namespace.labels.kubernetes.io/metadata.name=default
                                                           k8s:io.cilium.k8s.policy.cluster=kind-kind
                                                           k8s:io.cilium.k8s.policy.serviceaccount=default
                                                           k8s:io.kubernetes.pod.namespace=default
1285       Disabled           Disabled          1          reserved:host                                                                                                 ready
1297       Disabled           Disabled          4          reserved:health                                                          fd00:10:244:1::ebfb   10.244.1.222   ready

Note that the `app` label didn't change from `foobar` to `nothing` in
the endpoint and the CiliumEndpoint CRD

This is because the filtered labels are wrongly passed
to `(*Endpoint).ModifyIdentityLabels` which in turn calls
`e.OpLabels.ModifyIdentityLabels` which checks whether all of the
deleted labels (which contains the filtered label on pod update
for the example above) were present before, i.e. on pod creation. This
check fails however because the labels were filtered out on pod
creation.

Fix this issue by also filtering out the labels on pod update and thus
allowing the label update to successfully complete in the presence of
filtered labels.

After this change, the labels are correctly updated on the endpoint and
the CiliumEndpoint CRD:

$ kubectl label pods foo app=nothing --overwrite
$ kubectl describe pod foo
[...]
Labels:           app=nothing
                  io.cilium.k8s.something=bazbar
[...]
$ kubectl describe cep foo
[...]
Labels:       app=nothing
              io.cilium.k8s.something=bazbar
[...]
$ kubectl -n kube-system exec -it cilium-x2x5r -- cilium-dbg endpoint list
ENDPOINT   POLICY (ingress)   POLICY (egress)   IDENTITY   LABELS (source:key[=value])                                              IPv6                  IPv4           STATUS
           ENFORCEMENT        ENFORCEMENT
57         Disabled           Disabled          56486      k8s:app=nothing                                                          fd00:10:244:1::71b7   10.244.1.187   ready
                                                           k8s:io.cilium.k8s.namespace.labels.kubernetes.io/metadata.name=default
                                                           k8s:io.cilium.k8s.policy.cluster=kind-kind
                                                           k8s:io.cilium.k8s.policy.serviceaccount=default
                                                           k8s:io.kubernetes.pod.namespace=default
201        Disabled           Disabled          4          reserved:health                                                          fd00:10:244:1::c8de   10.244.1.221   ready
956        Disabled           Disabled          1          reserved:host                                                                                                 ready

Fixes: 599dde3b91b3 ("k8s: Filter out cilium owned from pod labels")

Signed-off-by: Tobias Klauser <tobias@cilium.io>

* k8s/utils: correctly filter out labels in StripPodSpecialLabels

[ upstream commit 280b69d9c12da757bfa764e3e77b87660f77c5bc ]

Filter the labels before iterating them, otherwise they are added to
sanitizedLabels again. Also remove forbidden keys prefixed with
io.cilium.k8s because they are already filtered out by filterPodLabels
and inline the check for kubernetes namespace labels.

Fixes: ed4e6505c059 ("k8s/utils: filter out cilium-owned labels on pod update")
Signed-off-by: Tobias Klauser <tobias@cilium.io>

* chore(deps): update all github action dependencies

Signed-off-by: renovate[bot] <bot@renovateapp.com>

* cni: Use batch endpoint deletion API in chaining plugin

[ upstream commit 31be7879e3a3a16936b2df71b14bb4cbdfc697a6 ]

This commit is to leverage new batch endpoint deletion API instead of
singular endpoint deletion based on ID. The main reason is to provide
backward compatibility on upgrade path.

The current CNI attachment ID requires a valid containerIfName attribute;
however, the endpoints created by old cilium versions (e.g. <1.15) do not
have such details. Any CNI DEL command for these endpoints will lead to an
invalid lookup (e.g. DeleteEndpointIDNotFoundCode) and prevent cleanup of
related resources such as IP addresses.

The impact is limited to CNI chaining mode, as the batch endpoint
deletion API is already used in cilium-cni/cmd/cmd.go as part of #27351.

Old endpoint details without (or empty) ContainerIfName

```json
{
  "ID": 423,
  "ContainerName": "",
  "dockerID": "415beb119c4b0910f62634510e921a447893195ebedc30ca0e9cd5bf02569645",
  "DockerNetworkID": "",
  "DockerEndpointID": "",
  "IfName": "eni22524e9e591",
  "IfIndex": 13,
  "ContainerIfName": "",
  "DisableLegacyIdentifiers": false,
  ...
}
```

New endpoint details with valid ContainerIfName (e.g. eth0)

```json
{
  "ID": 3627,
  "ContainerName": "",
  "dockerID": "f89ccf654b878248442981d4c56fe3f50fa127f922b46ee6dccc94ae10e94b79",
  "DockerNetworkID": "",
  "DockerEndpointID": "",
  "IfName": "enia67a2d3c27d",
  "IfIndex": 45,
  "ContainerIfName": "eth0",
  "DisableLegacyIdentifiers": false,
  ...
}
```

Relates: #26894, #27351
Suggested-by: Sebastian Wicki <sebastian@isovalent.com>
Signed-off-by: Tam Mach <tam.mach@cilium.io>

* test: Remove duplicate Cilium deployments in some datapath config tests

When disabling IPv6 BPF masquerading in datapath configuration tests
using the host firewall, we accidentally duplicated the command to
deploy Cilium for some of the tests. Let's clean this up to avoid
slowing down the tests.

Fixes: 934e1f2df26c ("daemon: Forbid IPv6 BPF masquerading with the host firewall")
Signed-off-by: Quentin Monnet <qmo@qmon.net>

* chore(deps): update docker.io/library/golang:1.21.8 docker digest to 8560736

Signed-off-by: renovate[bot] <bot@renovateapp.com>

* images: update cilium-{runtime,builder}

Signed-off-by: Jarno Rajahalme <jarno@isovalent.com>

* controlplane: fix panic: send on closed channel

[ upstream commit 7df437a0d03e91ea137836b5fa3436fe60e3cdc8 ]

Rarely, the control plane test panics, due to a send on a closed
channel. This can occur in a narrow race window in the filteringWatcher:

1. Stop is called on the child watcher
2. Child watcher calls stop on parent watcher
3. Concurrently, an event is dequeued from the parent result chan, and
   we enter the filtering logic.
4. The parent result chan is closed, and we close the child event channel
5. The filter is matched, and we attempt to write on the closed channel,
   which causes the panic.

Instead of closing the channel in the Stop method, close the channel
from the writing goroutine (as is commonly considered best practice in
Go.)

Fixes: fa89802ce7 (controlplane: Implement filtering of objects with field selectors)

Signed-off-by: David Bimmler <david.bimmler@isovalent.com>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
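
A generic illustration of the pattern described above (not the actual filteringWatcher code): only the goroutine that produces events closes the result channel, so Stop() can never race a send against a close.

```
package example

type filteringWatcher struct {
	events chan string
	stop   chan struct{}
}

func newFilteringWatcher() *filteringWatcher {
	return &filteringWatcher{
		events: make(chan string),
		stop:   make(chan struct{}),
	}
}

// run forwards matching events from source; it owns w.events and is the only
// place that closes it.
func (w *filteringWatcher) run(source <-chan string, match func(string) bool) {
	defer close(w.events)
	for {
		select {
		case <-w.stop:
			return
		case ev, ok := <-source:
			if !ok {
				return
			}
			if match(ev) {
				select {
				case w.events <- ev:
				case <-w.stop:
					return
				}
			}
		}
	}
}

// Stop signals the writer to finish; it never closes w.events itself.
func (w *filteringWatcher) Stop() { close(w.stop) }
```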

* controlplane: add mechanism to wait for watchers

[ upstream commit ba99d74c444ec3bdcd7d32bd4bbec1d76f85cdd2 ]
[ backporter notes: Whitespace conflict at the new `establishedWatchers`
  field in test/controlplane/suite/testcase.go ]

We've recently learned that the fake k8s client set's object tracker does
not respect the semantics of the real api-server when it comes to
'Watch': since the object tracker does not care for ResourceVersions, it
cannot respect the version from which it ought to replay events. As a
result, the default informer (more precisely, its reflector) is racy: it
uses a ListAndWatch approach, which relies on this resource version to
avoid a race window between the end of list and the beginning of watch.
Therefore, all informers used in cilium have a low chance of hitting
this race when used with a k8s fake object tracker.

This is somewhat known in the k8s community, see for example [1].
However, the upstream response is that one simply shouldn't be using the
fake infrastructure to test real informers. Unfortunately, this pattern
is used somewhat pervasively inside the cilium tests, specifically so in
the controlplane tests.

This patch introduces a mechanism which reduces the likelihood of
hitting the flake, under the assumption that we do not (often) establish
multiple watchers for the same resource. In the following patch, we'll
use the new infrastructure to reduce the flakiness of tests.

[1]: https://github.com/kubernetes/kubernetes/issues/95372

Co-authored-by: Fabian Fischer <fabian.fischer@isovalent.com>
Signed-off-by: David Bimmler <david.bimmler@isovalent.com>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>

* controlplane: wait for watcher establishment

[ upstream commit 955049a8c6339ae8983fa8f041062e850ddcd5a3 ]
[ backporter notes: Also updated status_updates_gc.go ]

Use the infrastructure introduced in the previous commit to deflake
control plane tests which update k8s state after starting the agent.

Co-authored-by: Fabian Fischer <fabian.fischer@isovalent.com>
Signed-off-by: David Bimmler <david.bimmler@isovalent.com>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>

* job: avoid a race condition in ExitOnCloseFnCtx

[ upstream commit 6b2d186f653d8856aefa05b771ae0c7acf16a5b2 ]

The test attempted to avoid closing a channel multiple times by setting
'started' to nil. However, since the outer scope will wait on 'started',
if started is set to nil before the outer scope waits, it will wait
indefinitely - resulting in a time out of the test.

Fixes: daa85a0f4a (jobs,test: Fix TestTimer_ExitOnCloseFnCtx channel close panic)

Signed-off-by: David Bimmler <david.bimmler@isovalent.com>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>

* controlplane: fix mechanism for ensuring watchers

[ upstream commit fe71a4aef3bdd2fdafa7a8cbdf3dc71d29f18cf7 ]
[ backporter notes: Minor conflicts in the fields of `ControlPlaneTest`
  (fakeDatapath package has been renamed) ]

I realized that the fix for controlplane tests isn't complete. There is
still a (small) race window: The current watch reaction records a
watcher as established without "handling" the watch itself, i.e. it lets
the default watch reaction actually call 'Watch' on the tracker. This is
racy, as things can happen in the window between recording and actually
watching.

To fix this, add the recording unconditionally in the existing tracker
augmentation.

Fixes: ba99d74c44 (controlplane: add mechanism to wait for watchers)

Signed-off-by: David Bimmler <david.bimmler@isovalent.com>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>

* Handle InvalidParameterValue as well for PD fallback

[ upstream commit 5a487b59a5b9cbc52dd221885655c6cb8abd1890 ]

cilium#30536 prematurely concluded that AWS now uses InsufficientCidrBlocks
to indicate the subnet is out of prefixes. It looks like AWS still uses
InvalidParameterValue together with "There aren't sufficient free Ipv4
addresses or prefixes" to indicate that the subnet is at capacity. In
addition, InsufficientCidrBlocks is returned when the subnet is at capacity,
potentially due to fragmentation. In either case it's worth trying to fall
back, since /32 IPs might still be available compared to /28 prefixes. See
the PR for details from the AWS support ticket.

Signed-off-by: Hemanth Malla <hemanth.malla@datadoghq.com>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>

* Adding unit test for PD fallback

[ upstream commit 0007e35d0a5de7b570e4750cb12a7965b3301949 ]
[ backporter notes: Minor import conflict in pkg/aws/ec2/mock/mock.go ]

Signed-off-by: Hemanth Malla <hemanth.malla@datadoghq.com>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>

* slices: don't modify missed input slice in test

[ upstream commit 08d898d8011c68a85851629e131900deef91bab3 ]

The commit below correctly identified that modifying the test input
makes the test flaky due to being dependent on the non-deterministic
ordering. However, it missed adding the work-around for the first test,
TestUnique.

Fixes: 32543a40b1 (slices: don't modify input slices in test)

Signed-off-by: David Bimmler <david.bimmler@isovalent.com>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>

* bgpv1: avoid object tracker vs informer race

[ upstream commit cfd17903fb69d1d6153c172f0af6ac200721e065 ]

The fake k8s testing infrastructure has an unfortunate interaction
between real informers and the object tracker used in fake clientsets:
the informer uses ListAndWatch to subscribe to the api server, but
ListAndWatch relies on the ResourceVersion field to bridge the gap
between the List and the Watch. The simple object tracker does not
provide the resource versioning, nor does its Watch implementation
attempt to replay existing state.

The race window, hence, allows for creation of objects _after_ the
initial list, but _before_ the establishment of Watch. These objects are
not observed by the informer and thus tests are likely to fail.

As a workaround, we ensure that the object tracker has registered a
watcher before we start the test in earnest. It is only important that
we do not create/update/delete objects in the window between starting
the hive (and hence running of the informer) and having ensured that the
watcher is in place. Creation prior to starting the hive is okay, as
well as after ensuring the watcher exists.
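
A minimal sketch of the workaround, following the watch-reactor pattern
documented for client-go fake clientsets; the function name and channel
wiring are illustrative.

```go
package testutils

import (
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/kubernetes/fake"
	k8stesting "k8s.io/client-go/testing"
)

// newClientWithWatchSignal returns a fake clientset plus a channel that
// receives one value per established watcher, so a test can block until the
// informer's Watch is actually in place before mutating objects.
func newClientWithWatchSignal() (*fake.Clientset, <-chan struct{}) {
	client := fake.NewSimpleClientset()
	watcherStarted := make(chan struct{}, 16)

	client.PrependWatchReactor("*", func(action k8stesting.Action) (bool, watch.Interface, error) {
		w, err := client.Tracker().Watch(action.GetResource(), action.GetNamespace())
		if err != nil {
			return false, nil, err
		}
		watcherStarted <- struct{}{} // record that this watcher exists
		return true, w, nil
	})
	return client, watcherStarted
}
```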

Signed-off-by: David Bimmler <david.bimmler@isovalent.com>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>

* doc: Clarified GwAPI KPR prerequisites

[ upstream commit 1321e03093db0f7655488c60f91e22e273c65cb0 ]

Previously, the GwAPI doc listed KPR mode as a prerequisite. However,
it's actually only required to enable BPF nodePort support.

Signed-off-by: Philip Schmid <philip.schmid@isovalent.com>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>

* helm: Add pod affinity for cilium-envoy

[ upstream commit 44aeb535b1d328734f8faac07669fbfa4683bc6f ]
[ backporter notes: Minor conflict in values.yaml due to different
  intendation of podAntiAffinity in v1.15 ]

This commit avoids cilium-envoy running on nodes without a cilium
agent. The two main changes are:

- nodeAffinity to make sure that cilium-envoy will not be scheduled on
  nodes without a cilium agent
- podAffinity with requiredDuringSchedulingIgnoredDuringExecution
  towards the cilium agent

Relates: #25081, #30034
Fixes: #31149
Signed-off-by: Tam Mach <tam.mach@cilium.io>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>

* bgpv1: fix Test_PodIPPoolAdvert flakiness

[ upstream commit 696a4fd9620d6efe8f1697a5d4d066a6ab5fbdf8 ]

Since the test updates multiple k8s resources in each test step,
and each step depends on the previous one, we should be careful
about how many k8s resources we are actually changing in each step,
to not trigger multiple reconciliations with different results
(advertisements) in a single step.

This change ensures we change only one value that affects
the advertisement in each test step.

Signed-off-by: Rastislav Szabo <rastislav.szabo@isovalent.com>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>

* bpf, maps: Don't propagate nodeID to bpf map when allocation fails.

[ upstream commit 4f4b0a7460ec759dc025c91df92f29aa7b64cce1 ]

When we ran out of IDs to allocate for nodes, we were propagating a
zero ID to the bpf map.
Now we simply return an error and leave the bpf map unmodified.
Also clean up incorrectly mapped node IDs on startup in case that
previously happened.
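
A hedged sketch of the new behavior; the interfaces are illustrative
stand-ins, not the actual node-ID allocator or BPF map types.

```go
package nodeid

import (
	"errors"
	"fmt"
)

// ErrNoNodeID signals that the node ID space is exhausted.
var ErrNoNodeID = errors.New("no node ID available")

// Allocator and Map are illustrative stand-ins.
type Allocator interface {
	Allocate(nodeIP string) (id uint16, ok bool)
}

type Map interface {
	Update(nodeIP string, id uint16) error
}

// MapNodeID returns an error instead of propagating ID 0 to the BPF map
// when allocation fails, which is the change this commit makes.
func MapNodeID(alloc Allocator, m Map, nodeIP string) error {
	id, ok := alloc.Allocate(nodeIP)
	if !ok || id == 0 {
		return fmt.Errorf("node %s: %w", nodeIP, ErrNoNodeID)
	}
	return m.Update(nodeIP, id)
}
```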

Signed-off-by: Marcel Zieba <marcel.zieba@isovalent.com>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>

* policy: Fix missing labels from SelectorCache selectors

[ upstream commit 65bbf3fbb568bf3b8a76a2ec87ac826e3e036a0a ]

During the refactor of the below commit, it seems the labels were left
out inadvertently, breaking the `cilium policy selectors` command that
displays the labels/name of the policy from which the selectors
originate.

Fixes: 501944c35d ("policy/selectorcache: invert identitySelector interface")

Signed-off-by: Chris Tarazi <chris@isovalent.com>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>

* ci: Bump lvh-kind ssh-startup-wait-retries

[ upstream commit a5eafe025390d77fa370add4883e6b1a00da7365 ]

Recently, we have frequently seen CI failures where lvh-kind startup
fails with exit code 41. This indicates a timeout of the task waiting
for SSH startup. Bump the timeout (retries) to 600 (10min) as a
workaround.

Fixes: #31336

Signed-off-by: Yutaro Hayakawa <yutaro.hayakawa@isovalent.com>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>

* datapath: Remove unnecessary IPsec code

[ upstream commit 4ba7e6a3111f1bb43c54253ac645ed7949709a2e ]

Commit 891fa78474 ("bpf: Delete obsolete do_netdev_encrypt_pools()")
removed the special code we had to rewrite the IPsec outer header. The
code removed in the present commit is therefore not required anymore.

Fixes: 891fa78474 ("bpf: Delete obsolete do_netdev_encrypt_pools()")
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>

* operator: fix errors/warnings metric.

[ upstream commit f61651f3797432892588bfb8086e70991312313e ]

This was broken during the transition of pkg/metrics to integrate with Hive, where relevant operator metrics were never initialized.
This adds an init func specific to the operator and cleans up the "flush" logic used as a workaround for errors/warnings emitted prior to the agent starting (in the case of the operator).

Addresses: #29525

Signed-off-by: Tom Hadlaw <tom.hadlaw@isovalent.com>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>

* loader: add message if error is ENOTSUP

[ upstream commit 64364477fc024e8026e30016257fa124fca4a901 ]

I was trying to install cilium on a machine without a vxlan module
available. Add a helpful message for the next time this happens.

Tested by running:
> PRIVILEGED_TESTS=1 go test  -test.v -test.run TestSetupTunnelDevice

on said machine, and got the following output:
...

```
        Error:          Received unexpected error: setting up vxlan device: creating vxlan device: creating device cilium_vxlan: operation not supported, maybe kernel module for vxlan is not available?
	Test:           TestSetupTunnelDevice/Vxlan
```
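
A hedged sketch of the kind of hint being added; the function name is
illustrative and the exact call site in the loader may differ.

```go
package loader

import (
	"errors"
	"fmt"

	"golang.org/x/sys/unix"
)

// annotateTunnelError appends a hint when device creation fails with
// ENOTSUP, which typically means the required kernel module (e.g. vxlan)
// is not available.
func annotateTunnelError(err error) error {
	if errors.Is(err, unix.ENOTSUP) {
		return fmt.Errorf("%w, maybe kernel module for vxlan is not available?", err)
	}
	return err
}
```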

Signed-off-by: Kornilios Kourtis <kornilios@isovalent.com>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>

* ipam: fix azure ipam test panics due to shared pointers.

[ upstream commit 1ca6141190f455cae85ebe6215f86a74bda9d213 ]

pkg/ipam/types.(*InstanceMap).DeepCopy(...) iterates over all instances/interfaces in order to copy the data.
However, unlike what the name suggests, the underlying pkg/ipam/types.Interface pointers are copied and shared in the returned instance map.
In some cases, this can result in memory corruption issues, causing confusing panics while running tests such as:

```
panic: runtime error: makeslice: cap out of range
goroutine 1366 [running]:
strconv.appendQuotedWith({0xc000576208, 0x0, 0x3f?}, {0x1000000, 0x100000001000000}, 0x22, 0x0, 0x0)
	/opt/hostedtoolcache/go/1.22.0/x64/src/strconv/quote.go:35 +0x85
strconv.AppendQuote(...)

...
```

Capturing such an event in a debugger, you would see an AzureInterface struct such as this, with the IP string memory corrupted (likely due to an interleaved read/write) being passed to logrus and causing a crash.

```
github.com/cilium/cilium/pkg/azure/types.AzureAddress {
	IP: "\x10\x01\x00\x00\x00\x00\x00\x00\x10\x01\x00\x00\x00\x00\x00\x007d�_\x02\b\b\x19\x00\x00\x00\x00\x00\x00\x00\x00�\x1f�\x03\x00\x00\x00\x00W�\b\x00\x00\x00\x00\x00\x00p \x03\x00\x00\x00\x00�qi\x03\x00\x00\x00\x00...+51559946186908214 more",
	Subnet: "subsys",
	State: "instanceID",}
```
This change ensures that the revision interface is correctly deep-copied so that the underlying resource is also safely copied.
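
An illustrative, runnable Go sketch of the bug class (the types are made up,
not the actual Azure IPAM structs): copying a map of pointers shares the
pointed-to data, while a real deep copy isolates it.

```go
package main

import "fmt"

// Iface is a stand-in for an interface object holding addresses.
type Iface struct {
	Addresses []string
}

// shallowCopy shares the *Iface pointers: later writes to the original are
// visible through the "copy" (and race with readers of it).
func shallowCopy(in map[string]*Iface) map[string]*Iface {
	out := make(map[string]*Iface, len(in))
	for k, v := range in {
		out[k] = v
	}
	return out
}

// deepCopy duplicates the Iface values and their slices, so the copy is safe
// to read while the original keeps changing.
func deepCopy(in map[string]*Iface) map[string]*Iface {
	out := make(map[string]*Iface, len(in))
	for k, v := range in {
		out[k] = &Iface{Addresses: append([]string(nil), v.Addresses...)}
	}
	return out
}

func main() {
	orig := map[string]*Iface{"eth0": {Addresses: []string{"10.0.0.1"}}}
	shallow, deep := shallowCopy(orig), deepCopy(orig)

	orig["eth0"].Addresses[0] = "10.0.0.2" // mutation (concurrent, in the real bug)

	fmt.Println(shallow["eth0"].Addresses[0]) // 10.0.0.2 -- shared, racy
	fmt.Println(deep["eth0"].Addresses[0])    // 10.0.0.1 -- isolated
}
```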

Fixed: #31059

Signed-off-by: Tom Hadlaw <tom.hadlaw@isovalent.com>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>

* gateway-api: Retrieve LB service from same namespace

[ upstream commit e8bed8d8343731422dc65b706c3617a40d0a8058 ]
[ backporter notes: conflicts in unit tests. Had to introduce the
  `NoError` check instead of expecting a specific error ]

This commit adds the namespace when listing the generated LB service;
the main reason is to avoid wrong status updates for gateways that have
the same name but belong to different namespaces.
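
A hedged controller-runtime sketch of the namespace-scoped lookup; the label
key used to select the generated service is an assumption for illustration.

```go
package gatewayapi

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// lbServicesForGateway lists only the LoadBalancer services in the Gateway's
// own namespace, so a same-named Gateway in another namespace can no longer
// be picked up by mistake.
func lbServicesForGateway(ctx context.Context, c client.Client, gwNamespace, gwName string) (*corev1.ServiceList, error) {
	svcList := &corev1.ServiceList{}
	err := c.List(ctx, svcList,
		client.InNamespace(gwNamespace), // the namespace scoping this commit adds
		client.MatchingLabels{"io.cilium.gateway/owning-gateway": gwName}, // label key is illustrative
	)
	return svcList, err
}
```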

Testing was done locally as per below:

Before the fix:
```
$ kg gateway -A
NAMESPACE   NAME         CLASS    ADDRESS          PROGRAMMED   AGE
another     my-gateway   cilium   10.110.222.237   True         4s
default     my-gateway   cilium   10.110.222.237   True         56s
```

After the fix:

```
$ kg gateway -A
NAMESPACE   NAME         CLASS    ADDRESS          PROGRAMMED   AGE
another     my-gateway   cilium   10.102.170.180   True         14m
default     my-gateway   cilium   10.110.222.237   True         14m

$ kg services -A
NAMESPACE     NAME                           TYPE           CLUSTER-IP       EXTERNAL-IP      PORT(S)                      AGE
another       cilium-gateway-my-gateway      LoadBalancer   10.102.170.180   10.102.170.180   80:31424/TCP                 15m
default       cilium-gateway-my-gateway      LoadBalancer   10.110.222.237   10.110.222.237   80:31889/TCP                 15m
...
```

Fixes: https://github.com/cilium/cilium/issues/31270
Signed-off-by: Tam Mach <tam.mach@cilium.io>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>

* ci-e2e: Add matrix for bpf.tproxy

[ upstream commit 5884af16935bc7acec4dcbb836dffe7a5e460855 ]

This is to make sure that we have coverage with bpf.tproxy enabled.
The first step is to test it with the Ingress Controller enabled; we
can add more settings later if required.

Relates: #30331, #30404

Suggested-by: Marco Hofstetter <marco.hofstetter@isovalent.com>
Signed-off-by: Tam Mach <tam.mach@cilium.io>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>

* gha: disable fail-fast on integration tests

[ upstream commit 0fb203e8671638bad22438ec31cbe687462c4826 ]

So that the failure of one matrix entry (e.g., caused by a flake) doesn't
cancel the other ongoing tests, if any.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>

* hive/cell/health: don't warn when reporting on stopped reporter.

[ upstream commit 44e90054f29c410608e816b112df8fc615eb0131 ]

Currently, in cases where reporters emit status on stopped reporters, a warning is logged.
This causes failures in CI related to endpoints, as those close their reporters after the endpoint is deleted.

This scenario is probably not a valuable log for users, and such occurrences should generally be harmless.
This moves the log to debug level.

Fixes: #31147

Signed-off-by: Tom Hadlaw <tom.hadlaw@isovalent.com>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>

* docs: Warn on key rotations during upgrades

[ upstream commit b639eab46a1fd85f0bab362b3fdf4cf21df80ace ]

In general, it is not recommended to carry out several administrative
operations on the cluster at the same time, as it can make
troubleshooting a lot more complicated in case of issues. Mixing
operations is also less likely to be covered in CI and therefore more
likely to hit corner cases.

Performing IPsec key rotations during Cilium up/downgrades is one such
case. Let's document it explicitly to discourage users from doing that.

Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>

* chore(deps): update all github action dependencies

Signed-off-by: renovate[bot] <bot@renovateapp.com>

* hubble: fix traffic direction and reply on encrypted trace notifications

[ upstream commit 9939fa2b0848ddd056e81f14a548f179f59027f3 ]

Before this patch, Hubble would wrongly report known traffic direction
and reply status when IPSec was enabled.

Signed-off-by: Alexandre Perrin <alex@isovalent.com>
Signed-off-by: Tam Mach <tam.mach@cilium.io>

* AKS: avoid overlapping pod and service CIDRs

[ upstream commit fbe78c42f44f8e1c04d3886fc51a2b283b668770 ]

The default service CIDR of AKS clusters is 10.0.0.0/16 [1].
Unfortunately, we don't set a pod cidr for clusterpool IPAM, and hence
use cilium's default of 10.0.0.0/8, which overlaps. This can
lead to "fun" situations in which e.g. the kube-dns service ClusterIP is
the same as the hubble-relay pod IP, or similar shenanigans. This
usually breaks the cluster utterly.

The fix is relatively straightforward: set a pod CIDR for cilium which
does not overlap with the defaults of AKS. We chose 192.168.0.0/16 as this
is what is recommended in [2].

[1]: https://learn.microsoft.com/en-us/azure/aks/configure-kubenet#create-an-aks-cluster-with-system-assigned-managed-identities
[2]: https://learn.microsoft.com/en-us/azure/aks/azure-cni-powered-by-cilium#option-1-assign-ip-addresses-from-an-overlay-network

Fixes: fbf3d38a4b (ci: add AKS workflow)

Co-authored-by: Fabian Fischer <fabian.fischer@isovalent.com>
Signed-off-by: David Bimmler <david.bimmler@isovalent.com>
Signed-off-by: Tam Mach <tam.mach@cilium.io>

* hubble/relay: Fix certificate reloading in PeerManager

[ upstream commit 71628929e10005e3620df0f52732ecb57015cbef ]

This change fixes an issue with the PeerManager that can lead to Relay
being unable to reconnect to some or all peers when the client
certificate expires or the Certificate Authority is replaced.

Before this change, when the client certificate changes, we did not
redial or update the existing gRPC ClientConns. When the old certificate
becomes invalid (expired, changed CA, or revoked), the connection will
eventually fail with a certificate error.
However, the gRPC ClientConn is not closed; it treats the certificate
error as a transient failure and will retry connecting with the old
credentials indefinitely.
In most cases this will cause the relay health checks to fail. Relay
will restart and successfully reconnect to all peers. However, if a new
peer joins between the certificate being updated and the connections
failing, Relay may keep on running in a degraded state.

This issue was introduced by #28595. Before that change, Relay
aggressively closed and re-dialed ClientConns on any error, mitigating
this problem.

We fix this issue by wrapping the provided gRPC transport credentials
and updating the TLS configuration whenever a new TLS connection is
established. This means every TLS connection will use up-to-date
certificates and gRPC ClientConns will be able to recover when their
certificate changes.
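
A hedged sketch of the wrapping approach, assuming a callback that returns
the current TLS client config; Relay's actual PeerManager code differs in
its details.

```go
package pool

import (
	"context"
	"crypto/tls"
	"net"

	"google.golang.org/grpc/credentials"
)

// reloadingCreds wraps gRPC transport credentials so every new TLS handshake
// uses the latest certificates, letting long-lived ClientConns recover after
// a certificate or CA rotation.
type reloadingCreds struct {
	credentials.TransportCredentials // base credentials (Info, Clone, ...)

	loadConfig func() (*tls.Config, error) // returns the current TLS config
}

func (c *reloadingCreds) ClientHandshake(ctx context.Context, authority string, conn net.Conn) (net.Conn, credentials.AuthInfo, error) {
	cfg, err := c.loadConfig()
	if err != nil {
		return nil, nil, err
	}
	// Build fresh TLS credentials for this handshake only.
	return credentials.NewTLS(cfg).ClientHandshake(ctx, authority, conn)
}
```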

Fixes: aca4d42ce80d ("hubble/relay: Fix connection leak when reconnecting to peer service")

Signed-off-by: Fabian Fischer <fabian.fischer@isovalent.com>
Signed-off-by: Tam Mach <tam.mach@cilium.io>
Signed-off-by: Alexandre Perrin <alex@isovalent.com>

* cilium-dbg: listing load-balancing configurations displays L7LB proxy port

[ upstream commit d3b19d65d3358c1dc08d55e715219a74567cd26d ]

Currently, listing the load-balancing configuration doesn't display the
L7LB Proxy Port for services of type `l7-load-balancer`.

```
cilium-dbg bpf lb list
SERVICE ADDRESS     BACKEND ADDRESS (REVNAT_ID) (SLOT)
...
10.96.193.7:443     0.0.0.0:0 (30) (0) [ClusterIP, non-routable, l7-load-balancer]
```

The only way of retrieving the L7LB proxy port is to list the frontends
(`cilium-dbg bpf lb list --frontends`) and manually convert the backend id
(union type) to the L7LB proxy port.

Therefore, this commit adds the L7LB proxy port to the output of `cilium-dbg bpf lb list`
if the service is of type L7 LoadBalancer. The `--frontends` subcommand still displays the
unmapped backend id.

```
cilium-dbg bpf lb list
SERVICE ADDRESS     BACKEND ADDRESS (REVNAT_ID) (SLOT)
10.96.0.1:443       172.18.0.3:6443 (1) (1)
                    0.0.0.0:0 (1) (0) [ClusterIP, non-routable]
10.96.252.10:443    172.18.0.2:4244 (22) (1)
                    0.0.0.0:0 (22) (0) [ClusterIP, InternalLocal, non-routable]
10.96.155.44:80     0.0.0.0:0 (14) (0) [ClusterIP, non-routable]
                    10.244.1.211:80 (14) (1)
172.18.0.2:32646    0.0.0.0:0 (33) (0) [NodePort, l7-load-balancer] (L7LB Proxy Port: 15735)
10.96.193.7:443     0.0.0.0:0 (30) (0) [ClusterIP, non-routable, l7-load-balancer] (L7LB Proxy Port: 15735)
10.96.122.45:80     10.244.1.250:80 (26) (1)
                    0.0.0.0:0 (26) (0) [ClusterIP, non-routable]
10.96.102.137:80    0.0.0.0:0 (23) (0) [ClusterIP, non-routable]
                    10.244.1.126:4245 (23) (1)
10.96.108.180:443   0.0.0.0:0 (17) (0) [ClusterIP, non-routable, l7-load-balancer] (L7LB Proxy Port: 17731)
172.18.255.1:80     0.0.0.0:0 (25) (0) [LoadBalancer, l7-load-balancer] (L7LB Proxy Port: 17731)
0.0.0.0:32646       0.0.0.0:0 (34) (0) [NodePort, non-routable, l7-load-balancer] (L7LB Proxy Port: 15735)
0.0.0.0:31012       0.0.0.0:0 (21) (0) [NodePort, non-routable, l7-load-balancer] (L7LB Proxy Port: 17731)
```

Signed-off-by: Marco Hofstetter <marco.hofstetter@isovalent.com>
Signed-off-by: Tam Mach <tam.mach@cilium.io>

* chore: update sw + json-mock image

[ upstream commit f8fb8d1fb3c94776faa63b3c3062837b6c74e188 ]

Replaced docker.io with quay.io, pinned to the current latest tag,
since the Docker source is deprecated.

Signed-off-by: loomkoom <loomkoom@hotmail.com>
Signed-off-by: Tam Mach <tam.mach@cilium.io>

* bgpv1: BGP Control Plane metrics

[ upstream commit 59a01a89a231cab751fe88a028ce4c5c46c4d513 ]

Implement the following initial metrics for the BGP Control Plane.

1. cilium_bgp_control_plane_session_state

Gauge that shows session state per vrouter/neighbor. Established (1) or
Not Established (0).

2. cilium_bgp_control_plane_advertised_routes

Gauge that shows the number of advertised routes per
vrouter/neighbor/afi/safi.

3. cilium_bgp_control_plane_received_routes

Gauge that shows the number of received routes per
vrouter/neighbor/afi/safi.
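
A hedged Prometheus sketch of the first gauge; the exact namespace, label
names, and registration in Cilium may differ slightly.

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// sessionState corresponds to cilium_bgp_control_plane_session_state:
// 1 when the BGP session is Established, 0 otherwise, per vrouter/neighbor.
var sessionState = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Namespace: "cilium",
		Subsystem: "bgp_control_plane",
		Name:      "session_state",
		Help:      "BGP session state (1 = Established, 0 = Not Established)",
	},
	[]string{"vrouter", "neighbor"},
)

// Illustrative update after a session transition:
//   sessionState.WithLabelValues("65001", "10.0.0.2").Set(1)
```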

Signed-off-by: Yutaro Hayakawa <yutaro.hayakawa@isovalent.com>
Signed-off-by: Tam Mach <tam.mach@cilium.io>

* bpf: Remove `declare_tailcall_if`

[ upstream commit 46db41390be3a17872f7274c0ded62ecf7a07bf8 ]

Remove `declare_tailcall_if`, so we always emit the tailcall programs
into the ELF. The followup commit will implement pruning logic based
on the actual usage of the tail calls. This means that we will only
need the `invoke_tailcall_if` without the need to keep both the
declaration and invocation in sync.

Signed-off-by: Dylan Reimerink <dylan.reimerink@isovalent.com>
Signed-off-by: Timo Beckers <timo@isovalent.com>

* bpf: Modify tail_call_static to emit better parseable assembly

[ upstream commit c4cbb38a1819adb08bd9b479b37ab4a7ec1b4e7a ]

Before this change the tail_call_static function would emit the
following instructions to perform a tailcall:

```
Mov R1, Rctx
Mov R2, Rmap_ptr
Mov R3, <slot>
Call TailCall
```

Since the second instruction is always a Register to Register move, we
would have to backtrack to find the actual map which is being used.

These changes make it so that the following instructions are emitted:

```
Mov R1, Rctx
Mov R2, 0 ll <calls_map>
Mov R3, <slot>
Call TailCall
```

By always using a double word immediate with a relocation entry on the
Mov R2 instruction, it is much easier to find the actual map being
used. As a side effect, we usually eliminate an extra instruction
clang was otherwise forced to emit.

Signed-off-by: Dylan Reimerink <dylan.reimerink@isovalent.com>

* pkg/bpf: Implement unreachable tail call pruning

[ upstream commit 16033b98b97573437df991efb2dd1da369a683ba ]

This commit implements unreachable tail call pruning. When loading a
collection we check if a tail call is reachable. If not, we remove the
tail call from the collection. This saves us from having to load the
tail call program into the kernel.

Previously, we would conditionally not include tail calls in the
collection with pre-processor directives. Now that we do it in the
loader, we can remove the pre-processor directives.
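
A minimal sketch of the reachability idea behind the pruning, as a plain
graph walk; the real loader operates on ELF program specs and their map
references rather than on strings.

```go
package bpf

// reachableTailCalls returns the set of tail-call programs reachable from
// the entrypoints; anything outside the set can be dropped before loading.
func reachableTailCalls(entrypoints []string, callees map[string][]string) map[string]bool {
	reachable := make(map[string]bool)
	var visit func(name string)
	visit = func(name string) {
		if reachable[name] {
			return
		}
		reachable[name] = true
		for _, callee := range callees[name] {
			visit(callee)
		}
	}
	for _, e := range entrypoints {
		visit(e)
	}
	return reachable
}
```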

Signed-off-by: Dylan Reimerink <dylan.reimerink@isovalent.com>

* pkg/bpf: Add test for removeUnreachableTailcalls

[ upstream commit 217426aae125fc3cc0257db611b7f8e9e519d3ab ]

[ backporter's notes: regenerated testdata ]

This commit adds a test to verify the behavior of the dead tail call
pruning. It consists of 5 tail calls, of which 2 are unreachable. The
test asserts that only the unreachable tail calls are removed from the
spec.

Signed-off-by: Dylan Reimerink <dylan.reimerink@isovalent.com>
Signed-off-by: Timo Beckers <timo@isovalent.com>

* bpf: Fix missing tail calls

[ upstream commit e1afa06774191d65714576d55f2eadcbe29fc2e6 ]

The changes to the dead tail call elimination revealed 2 cases of
missing tail calls.

The first has to do with the NAT46x64 logic, where there
still existed a call path from the IPv4 logic which would attempt to
tail call into IPv6 to recirculate the packet, even when the IPv6 tail
call wasn't compiled in.

The second was that, when XDP offloaded, the IPv6 logic would tail call
into an ICMP6 tail call which is only compiled in for TC programs.

This commit fixes both possible missing tail calls.

Signed-off-by: Dylan Reimerink <dylan.reimerink@isovalent.com>
Signed-off-by: Timo Beckers <timo@isovalent.com>

* bpf: fix go testdata

[ upstream commit d6666ed76ddf5dc23f8fde07c3af0f28621f1907 ]

Currently, check-go-testdata.sh doesn't work as expected on CI.

It reports the following error, yet the GitHub action (Go Precheck)
still succeeds.

```
contrib/scripts/check-go-testdata.sh
make[1]: Entering directory '/home/runner/work/cilium/cilium/src/github.com/cilium/cilium/pkg/bpf/testdata'
docker run -it --rm -v /home/runner/work/cilium/cilium/src/github.com/cilium/cilium:/cilium quay.io/cilium/cilium-builder:819a4f1e57eaacb6aabfc6a1a39d11d4fd794a88@sha256:24781dc80f2be2d8fd66b0ce1405e1f117a3a0ef388758b1ede7831778e3a4f7 clang -target bpf -Wall -O2 -g -I -I/usr/include -I/cilium/bpf -I/cilium/bpf/include -c /cilium/pkg/bpf/testdata/unreachable-tailcall.c -o /cilium/pkg/bpf/testdata/unreachable-tailcall.o
the input device is not a TTY
make[1]: *** [Makefile:9: build] Error 1
make[1]: Leaving directory '/home/runner/work/cilium/cilium/src/github.com/cilium/cilium/pkg/bpf/testdata'
```

This commit fixes the following issues:

- Don't run docker in interactive mode
- Use 'set -e' in the check script

Signed-off-by: Marco Hofstetter <marco.hofstetter@isovalent.com>

* ipsec: New IPsec secret bit to indicate per-node-pair keys

Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>

* ipsec: Move enableIPsecIPv{4,6} preconditions to caller

This small bit of refactoring will ease a subsequent commit.

Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>

* ipsec: Compute per-node-pair IPsec keys

We need to ensure the (key used, sequence number) tuple for each
encrypted packet is always unique on the network. Today that's not the
case because the key is the same for all nodes and the sequence number
starts at 0 on node reboot.

To enable this, we will derive one key per node pair from …