-
Notifications
You must be signed in to change notification settings - Fork 2.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
v1.15 Backports 2024-01-30 #30529
v1.15 Backports 2024-01-30 #30529
Conversation
[ upstream commit 00b1e5e ] Once a service gets selected we start leader election. However, if we lose the lease for some reason, we don't retry getting it until the service is deselected and reselect, recreated or the agent restarts. This commit surrounds the lease leader election logic with a loop that ends when the context is cancelled. Signed-off-by: Dylan Reimerink <dylan.reimerink@isovalent.com> Signed-off-by: Jussi Maki <jussi@isovalent.com>
[ upstream commit e2760e6 ] When egress_gw_fib_lookup_and_redirect() in to-netdev selects the final egress interface for a packet (based on its desired EgressIP), the FIB lookup potentially returns BPF_FIB_LKUP_RET_NO_NEIGH. For 5.10+ kernels this is gracefully handled in fib_do_redirect() by redirecting to the neigh subsystem. But for older kernels we have no possibility to fall back to the NEIGH map, and the packet would just get dropped with DROP_NO_FIB / BPF_FIB_MAP_NO_NEIGH. Have egress_gw_fib_lookup_and_redirect() catch this case, and just let the packet continue on the current egress interface. Users that strictly require the *correct* egress interface need to run a 5.10+ kernel. We'll update the relevant code in from-overlay with a subsequent patch. Signed-off-by: Julian Wiedmann <jwi@isovalent.com> Signed-off-by: Jussi Maki <jussi@isovalent.com>
[ upstream commit 918e47b ] With a previous patch, egress_gw_fib_lookup_and_redirect() now potentially doesn't redirect the packet, and just returns CTX_ACT_OK instead. Handle this by forwarding the packet to the stack, as was done prior to 9c1d1de ("egressgateway: Redirect from bpf_overlay to egress gw SNAT netdev"). Ideally this happens just once per connection - the pass through the stack should trigger a fresh ARP resolution, and subsequent traffic can obtain a L2 resolution from the FIB lookup. Signed-off-by: Julian Wiedmann <jwi@isovalent.com> Signed-off-by: Jussi Maki <jussi@isovalent.com>
[ upstream commit b09561c ] The only functions left in egress_policies.h are SRv6 related. Let's rename this to 'srv6.h' and update references to the old file name. Signed-off-by: ldelossa <louis.delos@gmail.com> Signed-off-by: Jussi Maki <jussi@isovalent.com>
[ upstream commit 2de0fea ] Include a trace reason for SRv6 encapsulation and decapsulation. This greatly improves the debugging process, indicating whether SRv6 VPN related packets are processed by our datapath. Signed-off-by: ldelossa <louis.delos@gmail.com> Signed-off-by: Jussi Maki <jussi@isovalent.com>
[ upstream commit a6bfb79 ] Consider encap/decap as egress/ingress (respectively) and both as unknown reply ct status. Signed-off-by: Alexandre Perrin <alex@isovalent.com> Signed-off-by: Jussi Maki <jussi@isovalent.com>
[ upstream commit 5eee9de ] The Node{Add,Update,Delete} functions of the linux node handler are already guarded in order not to execute the underlying logic if the node subsystem is not yet fully initialized. Once initialized, all updates are then automatically replayed. Yet, this does not apply to the NodeValidateImplementation and AllNodeValidateImplementation functions, which can also be invoked asynchronously, leading to a panic if not fully initialized (even without panicing, we would be enforcing an incorrect configuration, possibly disrupting existing connections): github.com/cilium/cilium/pkg/datapath/linux.(*linuxNodeHandler).nodeUpdate(0xc0022be1a0, 0x0, 0xc001936480, 0x0) /go/src/github.com/cilium/cilium/pkg/datapath/linux/node.go:1030 +0x142d github.com/cilium/cilium/pkg/datapath/linux.(*linuxNodeHandler).NodeValidateImplementation(_, {{0xc000f10720, 0x1b}, {0xc00068b2d8, 0x13}, {0xc000d6a3c0, 0x4, 0x4}, 0xc0005fc0e8, {0x0, ...}, ...}) /go/src/github.com/cilium/cilium/pkg/datapath/linux/node.go:1337 +0xc8 github.com/cilium/cilium/pkg/node/manager.(*manager).backgroundSync.func1({0x4019e80, 0xc0022be1a0}) /go/src/github.com/cilium/cilium/pkg/node/manager/manager.go:342 +0x9a github.com/cilium/cilium/pkg/node/manager.(*manager).Iter(0x3251f40?, 0xc001f0bdb8) /go/src/github.com/cilium/cilium/pkg/node/manager/manager.go:174 +0xdb github.com/cilium/cilium/pkg/node/manager.(*manager).backgroundSync(0xc00083c460, {0x400c390, 0xc00135b630}) /go/src/github.com/cilium/cilium/pkg/node/manager/manager.go:341 +0x4ab github.com/cilium/workerpool.(*WorkerPool).run.func1() Let's fix this by also checking the initialization status there. Signed-off-by: Marco Iorio <marco.iorio@isovalent.com> Signed-off-by: Jussi Maki <jussi@isovalent.com>
[ upstream commit f299dc1 ] Signed-off-by: viktor-kurchenko <viktor.kurchenko@isovalent.com> Signed-off-by: Jussi Maki <jussi@isovalent.com>
[ upstream commit 7cdadbc ] Global variable `countErrors` converted to the function local. Signed-off-by: viktor-kurchenko <viktor.kurchenko@isovalent.com> Signed-off-by: Jussi Maki <jussi@isovalent.com>
[ upstream commit 000edce ] Signed-off-by: viktor-kurchenko <viktor.kurchenko@isovalent.com> Signed-off-by: Jussi Maki <jussi@isovalent.com>
[ upstream commit 5b04e08 ] Ensure errs are checked for the calls below: - deleteNodeIPSecOutRoute - replaceNodeIPSecOutRoute Signed-off-by: Fernand Galiana <fernand.galiana@isovalent.com> Signed-off-by: Jussi Maki <jussi@isovalent.com>
[ upstream commit d84a650 ] Signed-off-by: Ludovic Ortega <ludovic.ortega@adminafk.fr> Signed-off-by: Jussi Maki <jussi@isovalent.com>
[ upstream commit d721b8b ] Previously, the restoreIPCache() method would return an error on new installs because it was checking for the presence of the "file or dir missing" error but this error was being wrapped by another method in the call tree. This PR updates the restoreIPCache() method to use errors.Is() that reports whether any error in err's tree matches the target and thus reports a nil error on new installs when the "cilium_ipcache" file does not exist. Fixes: #29328 Signed-off-by: Daneyon Hansen <daneyon.hansen@solo.io> Signed-off-by: Jussi Maki <jussi@isovalent.com>
[ upstream commit 945ad0c ] Currently we try to remove IPv6 proxy rules if the IPv6 option is disabled. This is to clean up those rules if a previously running agent has installed them but was restarted with a configuration change. This can fail if the underlying kernel has no IPv6 support. This commit fixes this, by allowing the necessary netlink syscall to fail with EAFNOSUPPORT. Fixes: #29965 Signed-off-by: Robin Gögge <r.goegge@isovalent.com> Signed-off-by: Jussi Maki <jussi@isovalent.com>
[ upstream commit 09f18fd ] Signed-off-by: Filip Nikolic <oss.filipn@gmail.com> Signed-off-by: Jussi Maki <jussi@isovalent.com>
[ upstream commit 3932a4b ] Fix up some ctx_load_bytes() usage to return a drop reason, and not the raw kernel errno. Signed-off-by: Julian Wiedmann <jwi@isovalent.com> Signed-off-by: Jussi Maki <jussi@isovalent.com>
[ upstream commit 87d948e ] The Cilium nodeinit startup script lays down a temporary CNI config in order to be able to restart a version of containerd that doesn't allow a missing CNI config. This commit fixes an issue with missing double quotes in the temporary config which causes an error in containerd and leads to NotReady Kubernetes nodes I also considered heredoc or escaping the quote characters but settled on single quoting as I think its the most readable one line solution without needing to deal with the indentation issue with heredoc Signed-off-by: Tom Cowling <952241+tlcowling@users.noreply.github.com> Signed-off-by: Jussi Maki <jussi@isovalent.com>
[ upstream commit a388c42 ] Bumps [jinja2](https://github.com/pallets/jinja) from 3.1.2 to 3.1.3. - [Release notes](https://github.com/pallets/jinja/releases) - [Changelog](https://github.com/pallets/jinja/blob/main/CHANGES.rst) - [Commits](pallets/jinja@3.1.2...3.1.3) --- updated-dependencies: - dependency-name: jinja2 dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com>
[ upstream commit 068dc47 ] Signed-off-by: Cilium Imagebot <noreply@cilium.io> Signed-off-by: Jussi Maki <jussi@isovalent.com>
[ upstream commit f30fd60 ] This commit adds a warning to the Egress Gateway documentation to help user avoid deploying a known bad configuration. Co-authored-by: Quentin Monnet <quentin@isovalent.com> Signed-off-by: soggiest <nicholas@isovalent.com> Signed-off-by: Jussi Maki <jussi@isovalent.com>
[ upstream commit 746b0f3 ] Signed-off-by: Nico Vibert <nicolas.vibert@isovalent.com> Signed-off-by: Jussi Maki <jussi@isovalent.com>
[ upstream commit 7141c22 ] Remove a misplaces ls alias that caused cilium-dbg blf auth ls to flush the map. Signed-off-by: Maartje Eyskens <maartje@eyskens.me> Signed-off-by: Jussi Maki <jussi@isovalent.com>
[ upstream commit 0983a57 ] This is to update the Gateway API version to align with the conformance tests running in CI. Signed-off-by: Tam Mach <tam.mach@cilium.io> Signed-off-by: Jussi Maki <jussi@isovalent.com>
[ upstream commit e777df1 ] #26215 changed how we do egressGW-specific routing on the gateway node - instead of installing custom IP rules, we rely on the node's routing setup. #30286 then fixed up a corner-case on older kernels. Reflect both parts in the docs. Signed-off-by: Julian Wiedmann <jwi@isovalent.com> Signed-off-by: Jussi Maki <jussi@isovalent.com>
[ upstream commit 3092ed1 ] Signed-off-by: Dmitry Kharitonov <dmitry@isovalent.com> Signed-off-by: Jussi Maki <jussi@isovalent.com>
[ upstream commit 36c7d66 ] On modern kernels, bpf_host currently builds a tail_nodeport_nat_ingress_ipv4() that includes the code for RevSNAT, HostFW and RevDNAT. But at least for the 6.1 kernel we're scratching at the complexity limit and hitting verifier troubles in main [0] and 1.15 [1]. It's currently unclear why we can't reliably(!) reproduce those troubles in CI. Take the pressure off by splitting the tail-call into two parts, whenever the HostFW is enabled - leaving RevSNAT and HostFW in the first part, while RevDNAT is handled in a separate tail-call. This prevents the trace_ctx from the RevSNAT to reach the nodeport_add_tunnel_encap() call in the RevDNAT code, but that's acceptable for now. [0]: #30266 [1]: #30093 Co-developed-by: Tom Hadlaw <tom.hadlaw@isovalent.com> Signed-off-by: Julian Wiedmann <jwi@isovalent.com> Signed-off-by: Jussi Maki <jussi@isovalent.com>
[ upstream commit 9bf7a8e ] In cases where the endpoint regen fails, such as in a complexity issue, the endpoints reporter should be put into a degraded state. However, currently it will degrade the endpoint hr, and the immediately clear the degraded state with hr.Ok(...). This fixes that to only clear if there is no error while regenerating. Signed-off-by: Tom Hadlaw <tom.hadlaw@isovalent.com> Signed-off-by: Jussi Maki <jussi@isovalent.com>
[ upstream commit 5581963 ] This commit brings two fixes to the script that we use to determine to which version we should upgrade/downgrade in some CI workflows. The first fix is the most important one. When looking for the closest patch version, make the script return the value in VERSION instead of decrementing it. The rationale is that for stable branches, VERSION already points to the latest patch release, there is no need to decrease it further! This fix does not affect the output for the calculation of the previous minor version number. The second fix is simply the addition of an error message in case the minor version number is 0, to get some explicit error instead of a silent failure if we ever reach Cilium 2.0.0. Updated samples of numbers from VERSION and the corresponding values returned: VERSION Previous minor Previous patch release 1.14.3 v1.13 v1.14.3 1.14.1 v1.13 v1.14.1 1.14.0 v1.13 <error> 1.14.1-dev v1.13 v1.14.1 1.15.0-dev v1.14 <error> 1.13.90 v1.12 <error> 2.0.1 <error> v2.0.1 Fixes: 56dfec2 ("contrib/scripts: Support patch releases in print-downgrade-version.sh") Signed-off-by: Quentin Monnet <quentin@isovalent.com> Signed-off-by: Jussi Maki <jussi@isovalent.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, Thank you!
LGTM ! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@joamaki LGTM Thanks for these updates Jussi!
Signed-off-by: Cilium Imagebot <noreply@cilium.io>
(All tests passed. docs-builder triggered an update on top) |
/test-backport-1.15 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks and looks good for my commits.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changes LGTM
@@ -221,6 +221,7 @@ | |||
# after the less specific paths, otherwise the ownership for the specific paths | |||
# is not properly picked up in Github. | |||
* @cilium/tophat | |||
/.github/workflows/*perf*.yaml @cilium/sig-scalability @cilium/github-sec @cilium/ci-structure |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need to block the PR on this but I wanted to point out to @marseel / @cilium/sig-scalability that in general we expect most groups to apply their review feedback during the primary PR against main
then the backport is a sanity check, so for this case @cilium/sig-scalability does not need to feel obliged to review all changes to stable branch modifications of this workflow. Perhaps as a followup we can just drop this change and defer to the existing rule.
/.github/workflows/*perf*.yaml @cilium/sig-scalability @cilium/github-sec @cilium/ci-structure |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that is totally fine with me - modification of CODEOWNERS was part of the commit where I introduced a new test, which seems reasonable in general, but I guess for backports we should get rid of it. I will open followup PR.
@joamaki Looks like the first three PRs listed in the description are not included in this backport? |
Looks like they were in v1.15 already. Fixed the labels. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've reviewed the remaining backported PRs.
PRs skipped due to conflicts:
Once this PR is merged, a GitHub action will update the labels of these PRs: