v1.14 Backports 2024-03-12 #31335
Conversation
[ upstream commit 70b405f ]

On Linux/Unix-based implementations, exec/cmd.Run will return either context.Canceled or the error "signal: killed", depending on whether the cancellation occurred while the process was running. In several places we check `errors.Is(err, context.Canceled)` to decide whether to emit high-level logs about failed program compilations. Because a cmd.Run() that was already running doesn't return an error satisfying this check, we get spurious error logs about failed compilation (i.e. "signal: killed").

This meant that in cases where a compilation was legitimately cancelled, we would still log an error such as:

```
msg="BPF template object creation failed" ... error="...: compile bpf_lxc.o: signal: killed"
```

This can occur occasionally in CI, which enforces that no errors pass, causing failures. Example:

```go
ctx, c := context.WithTimeout(context.Background(), time.Second)
go func() {
	time.Sleep(time.Second)
	c()
}()
cmd := exec.CommandContext(ctx, "sleep", "2")
fmt.Println(cmd.Run())

ctx, c = context.WithTimeout(context.Background(), time.Second)
c()
cmd = exec.CommandContext(ctx, "sleep", "2")
fmt.Println(cmd.Run())
```

To fix this, we join in ctx.Err() if:

* it is context.Canceled,
* the process has not exited by itself, and
* the process appears to have been SIGKILL'ed.

Addresses: #30991

Signed-off-by: Tom Hadlaw <tom.hadlaw@isovalent.com>
Signed-off-by: Gilberto Bertin <jibi@cilium.io>
[ upstream commit a099bf1 ]
[ backporter's note: changes applied to plugins/cilium-cni/main.go rather than plugins/cilium-cni/cmd/cmd.go ]

Unlike runtime agent/operator logs, CNI logs are just written to disk, so nothing attaches timestamps to them. This makes it harder to debug CNI issues, as we have no way to correlate when things happened between agent logs and CNI events.

This switches CNI to use the same default logger, except with timestamps enabled.

Signed-off-by: Tom Hadlaw <tom.hadlaw@isovalent.com>
Signed-off-by: Gilberto Bertin <jibi@cilium.io>
[ upstream commit 927969b ]
[ backporter's note: changes applied to cilium/cmd/encrypt_flush.go rather than cilium-dbg/cmd/encrypt_flush ]

This commit slightly changes the behavior of the "encrypt flush" command in case of errors when trying to delete XFRM rules. The tool currently lists rules, filters them based on user-given arguments, and then deletes them. If an XFRM rule is deleted by the agent or the user while we're filtering, the deletion will fail.

The current behavior in that case is to fatal. On busy clusters, that might mean we always fatal, because XFRM states and policies are constantly added and removed. This commit changes the behavior to proceed with subsequent deletions in case one fails.

Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
Signed-off-by: Gilberto Bertin <jibi@cilium.io>
[ upstream commit 5c2a67f ]
[ backporter's note: changes applied to cilium/cmd/encrypt_flush.go rather than cilium-dbg/cmd/encrypt_flush ]

This commit refactors the code a bit to simplify a later commit. No functional changes. This may be a bit excessive in commit splitting, but at least I can claim my last commit is free of any refactoring 😅

Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
Signed-off-by: Gilberto Bertin <jibi@cilium.io>
[ upstream commit 5eb27e2 ]
[ backporter's note: changes applied to cilium/cmd/encrypt_flush.go rather than cilium-dbg/cmd/encrypt_flush ]

This new flag will allow users to clean stale XFRM states and policies based on the node ID map contents. If XFRM states or policies are found with a node ID that is not in the BPF map, then we probably have a leak somewhere. Such leaks can, in extreme cases, lead to performance degradation when the number of XFRM states and policies grows large (and if using ENI or Azure IPAM). Having a tool to clean up these XFRM states and policies until the leak is fixed can therefore be critical.

The new flag is incompatible with the --spi and --node-id filter flags.

We first dump the XFRM rules and then dump the map contents. That way, if a new node ID is allocated while we're running the tool, we simply ignore the corresponding XFRM rules. If a node ID is removed while running the tool, we fail to remove the corresponding XFRM rules and continue with the others.

Tested on a GKE cluster by adding fake XFRM states and policies that the tool was able to remove.

Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
Signed-off-by: Gilberto Bertin <jibi@cilium.io>
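At its core, the --stale selection is a set difference between the node IDs referenced by XFRM rules and those present in the BPF node ID map. A sketch under assumed types (the real tool reads netlink and BPF map data; the function name is made up):

```go
package main

import "fmt"

// staleNodeIDs returns node IDs seen in XFRM states/policies that are
// absent from the node ID map; these are candidates for cleanup, since
// no live node should reference them.
func staleNodeIDs(xfrmIDs []uint16, mapIDs map[uint16]struct{}) []uint16 {
	var stale []uint16
	for _, id := range xfrmIDs {
		if _, ok := mapIDs[id]; !ok {
			stale = append(stale, id)
		}
	}
	return stale
}

func main() {
	// Node IDs 1 and 2 are known to the BPF map; 7 and 9 are leaked.
	inMap := map[uint16]struct{}{1: {}, 2: {}}
	fmt.Println(staleNodeIDs([]uint16{1, 2, 7, 9}, inMap)) // [7 9]
}
```

Dumping the XFRM rules before the map, as the commit explains, biases the race toward safety: a node ID allocated mid-run is simply not flagged as stale.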
[ upstream commit f92b528 ]

The CNI version should be specified so that, in case we have to fall back to installing k8s via binaries, it doesn't fail with the error:

```
10:29:25 k8s1-1.25: gzip: stdin: not in gzip format
10:29:25 k8s1-1.25: tar: Child returned status 1
10:29:25 k8s1-1.25: tar: Error is not recoverable: exiting now
```

Fixes: ce69afd ("add support for k8s 1.25.0")

Signed-off-by: André Martins <andre@cilium.io>
Signed-off-by: Gilberto Bertin <jibi@cilium.io>
Force-pushed ea8de7a to f9c2032
[ upstream commit 7a301a4 ]

This commit adds the GH workflow to run on ARM machines. This effectively means that we can remove our Travis integration and only use GH Actions from now on.

Signed-off-by: André Martins <andre@cilium.io>
Signed-off-by: Gilberto Bertin <jibi@cilium.io>
[ upstream commit db7b3ef ]

Suppress the "Unable to determine next hop address" logs. While this log indicates an L2 neighbor resolution failure, it does not always indicate a datapath connectivity issue.

For example, when "devices=eth+" is specified and the device naming/purposing is not consistent across the nodes in the cluster, on some nodes "eth1" is a device reachable from other nodes, but on other nodes it is not. As a result, L2 Discovery generates an "Unable to determine next hop address" log.

Another example is ENI mode with automatic device detection. When secondary interfaces are added, they are used for L2 Neighbor Discovery as well. However, nodes can only be reached via the primary interface through the default route in the main routing table. Thus, L2 Neighbor Discovery generates "Unable to determine next hop address" for secondary interfaces.

In both cases, this does not necessarily mean the datapath has an issue with KPR functionality. However, the logs appear repeatedly, are noisy, and the message is error-ish, causing confusion.

This log started to appear from v1.14.3 (#28858) and v1.15.0 (in theory) for some users who did not see it before. For v1.14.3, it affects KPR + tunnel users because of f2dcc86. Before that commit, we did not perform L2 Neighbor Discovery in tunnel mode, so even if users had an interface unreachable from other nodes, the log did not appear. For v1.15.0, it affects users who already had an unreachable interface. 2058ed6 made it visible: before that commit, some kinds of errors, like EHOSTUNREACH and ENETUNREACH, were not caught because the FIBMatch option wasn't specified. After v1.15.0, users started to see the log.

Fixes: #28858

Signed-off-by: Yutaro Hayakawa <yutaro.hayakawa@isovalent.com>
Signed-off-by: Gilberto Bertin <jibi@cilium.io>
[ upstream commit 777b580 ]

Set ConnectionRetryTimeSeconds to 1s in the component tests unless it is specified explicitly. Otherwise, when the initial connect fails, we need to wait 120s for the next connection attempt by default, which may be longer than the timeout of the test itself.

Fixes: #31217

Signed-off-by: Yutaro Hayakawa <yutaro.hayakawa@isovalent.com>
Signed-off-by: Gilberto Bertin <jibi@cilium.io>
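The defaulting pattern can be sketched as follows. The field name comes from the commit; the struct and helper around it are hypothetical:

```go
package main

import "fmt"

type config struct {
	ConnectionRetryTimeSeconds int
}

// withTestDefaults lowers the retry interval to 1s for component tests
// unless the test set it explicitly, so a failed initial connect is
// retried well within the test timeout instead of after 120s.
func withTestDefaults(c config) config {
	if c.ConnectionRetryTimeSeconds == 0 {
		c.ConnectionRetryTimeSeconds = 1
	}
	return c
}

func main() {
	fmt.Println(withTestDefaults(config{}).ConnectionRetryTimeSeconds)                              // 1
	fmt.Println(withTestDefaults(config{ConnectionRetryTimeSeconds: 5}).ConnectionRetryTimeSeconds) // 5
}
```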
[ upstream commit 81f14bb ]

This commit adjusts the usage of send_trace_notify in bpf_network.c to enable monitor aggregation for all events emitted at this observation point in the datapath. This change helps improve resource usage by reducing the overall number of events that the datapath emits, while still enabling packet observability with Hubble.

The events in bpf_network.c provide observability into the IPSec processing of the datapath. Before this commit, multiple other efforts were made to increase the aggregation of IPSec-related events to reduce resource usage; see #29616 and #27168. Those efforts covered packets specifically marked as encrypted or decrypted by IPSec and did not include the events in bpf_network.c that are emitted when either: (a) a plaintext packet has been received from the network, or (b) a packet was decrypted and reinserted into the stack by XFRM.

Both of these events are candidates for aggregation because similar to-stack events will be emitted further down the line in the datapath anyway. Additionally, these events are mainly useful for root-cause analysis or debugging and are not necessarily helpful from an overall observability standpoint.

Signed-off-by: Ryan Drew <ryan.drew@isovalent.com>
Signed-off-by: Gilberto Bertin <jibi@cilium.io>
Force-pushed f9c2032 to de1f8c1
Noted and thanks, I will take care of these two backports for 1.14.
Thanks and looks good for my commit.
/test-backport-1.14
My PR looks good. Thanks!
Thanks!
Thank you!
PRs skipped due to conflicts:
* `encrypt flush --stale` flag #31159 (@pchaigno)
Once this PR is merged, a GitHub action will update the labels of these PRs: