Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

v1.14 Backports 2024-03-12 #31335

Merged
merged 11 commits into from Mar 16, 2024
Merged

v1.14 Backports 2024-03-12 #31335

merged 11 commits into from Mar 16, 2024

Conversation

tommyp1ckles and others added 6 commits March 12, 2024 09:44
[ upstream commit 70b405f ]

On Linux/Unix based implementations, exec/cmd.Run will return either context.ContextCancelled or the error "signal: killed" depending on whether the cancellation occurred while the process was running.

There's several places we check on ```is.Errors(err, context.Cancelled)``` on whether to emit high level logs about failed program compilations.  Because already running cmd.Run() doesn't return an error that satisfies this, this will result in spurious error logs about failed compilation (i.e. "signal: killed")

This meant that in cases where a compilation is legitimately cancelled, we would still log an error such as

msg="BPF template object creation failed" ... error="...: compile bpf_lxc.o: signal: killed"

This can occur occasionally in CI, which enforces no error to pass, causing failures.

example:
```
	ctx, c := context.WithTimeout(context.Background(), time.Second)
	go func() {
		time.Sleep(time.Second)
		c()
	}()
	cmd := exec.CommandContext(ctx, "sleep", "2")
	fmt.Println(cmd.Run())

	ctx, c = context.WithTimeout(context.Background(), time.Second)
	c()
	cmd = exec.CommandContext(ctx, "sleep", "2")
	fmt.Println(cmd.Run())
```

To fix this, this will join in the ctx.Err() if it is:
* context.Cancelled
* The process has not exited itself.
* The process appeared to be SIGKILL'ed.

Addresses: #30991

Signed-off-by: Tom Hadlaw <tom.hadlaw@isovalent.com>
Signed-off-by: Gilberto Bertin <jibi@cilium.io>
[ upstream commit a099bf1 ]

[ backporter's node: changes applied to plugins/cilium-cni/main.go rather than
plugins/cilium-cni/cmd/cmd.go ]

Unlike runtime agent/operator logs, CNI logs are just written to disk so we have no way to attach timestamps to them.
This makes it harder to debug CNI issues as we have no way to correlate when things happened between Agent logs and CNI events.

This switches CNI to use the same default logger, except with timestamps enabled.

Signed-off-by: Tom Hadlaw <tom.hadlaw@isovalent.com>
Signed-off-by: Gilberto Bertin <jibi@cilium.io>
[ upstream commit 927969b ]

[ backporter's note: changes applied to cilium/cmd/encrypt_flush.go
rather than cilium-dbg/cmd/encrypt_flush ]

This commit slightly changes the behavior of the "encrypt flush"
command in case of errors when trying to delete XFRM rules. The tool
currently lists rules, filters them based on user-given arguments, and
then deletes them. If an XFRM rule is deleted by the agent or the user
while we're filtering, the deletion will fail.

The current behavior in that case is to fatal. On busy clusters, that
might mean that we always fatal because XFRM states and policies are
constently added and removed.

This commit changes the behavior to proceed with subsequent deletions in
case one fails.

Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
Signed-off-by: Gilberto Bertin <jibi@cilium.io>
[ upstream commit 5c2a67f ]

[ backporter's note: changes applied to cilium/cmd/encrypt_flush.go
rather than cilium-dbg/cmd/encrypt_flush ]

This commit refactors the code a bit simplify a latter commit. No
functional changes.

This may be a bit excessive in commit splitting, but at least I can
claim my last commit is free of any refactoring 😅

Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
Signed-off-by: Gilberto Bertin <jibi@cilium.io>
[ upstream commit 5eb27e2 ]

[ backporter's note: changes applied to cilium/cmd/encrypt_flush.go
rather than cilium-dbg/cmd/encrypt_flush ]

This new flag will allow users to clean stale XFRM states and policies
based on the node ID map contents. If XFRM states or policies are found
with a node ID that is not in the BPF map, then we probably have a leak
somewhere.

Such leaks can lead in extreme cases to performance degradation when the
number of XFRM states and policies grows large (and if using ENI or
Azure IPAM). Having a tool to cleanup these XFRM states and policies
until the leak is fixed can therefore be critical.

The new flag is incompatible with the --spi and --node-id filter flags.

We first dump the XFRM rules and then dump the map content. In that way,
if a new node ID is allocated while we're running the tool, we will
simply ignore the corresponding XFRM rules. If a node ID is removed
while running the tool, we will fail to remove the corresponding XFRM
rules and continue with the others.

Tested on a GKE cluster by adding fake XFRM states and policies that the
tool was able to remove.

Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
Signed-off-by: Gilberto Bertin <jibi@cilium.io>
[ upstream commit f92b528 ]

The CNI version should be specify so that in case we have to fallback
the installation of k8s via binaries it doesn't fail with the error:

```
10:29:25      k8s1-1.25: gzip: stdin: not in gzip format
10:29:25      k8s1-1.25: tar: Child returned status 1
10:29:25      k8s1-1.25: tar: Error is not recoverable: exiting now
```

Fixes: ce69afd ("add support for k8s 1.25.0")
Signed-off-by: André Martins <andre@cilium.io>
Signed-off-by: Gilberto Bertin <jibi@cilium.io>
@jibi jibi added kind/backports This PR provides functionality previously merged into master. backport/1.14 This PR represents a backport for Cilium 1.14.x of a PR that was merged to main. labels Mar 12, 2024
@jibi jibi force-pushed the pr/v1.14-backport-2024-03-12 branch from ea8de7a to f9c2032 Compare March 12, 2024 08:58
CODEOWNERS Outdated Show resolved Hide resolved
aanm and others added 5 commits March 12, 2024 10:51
[ upstream commit 7a301a4 ]

This commit adds the GH workflow to run on arm machines. This
effectively means that we can remove our travis integration and only use
GH actions from now on.

Signed-off-by: André Martins <andre@cilium.io>
Signed-off-by: Gilberto Bertin <jibi@cilium.io>
[ upstream commit b68cf99 ]

This is to extend the existing basic ingress docs with external lockdown
CCNP, while still allows in-cluster traffic to Ingress LB IP.

Relates: #28126
Signed-off-by: Tam Mach <tam.mach@cilium.io>
Signed-off-by: Gilberto Bertin <jibi@cilium.io>
[ upstream commit db7b3ef ]

Suppress the "Unable to determine next hop address" logs. While it shows
the L2 neighbor resolution failure, it does not always indicate a
datapath connectivity issue.

For example, when "devices=eth+" is specified and the device
naming/purposing is not consistent across the nodes in the cluster, in
some nodes "eth1" is a device reachable to other nodes, but in some
nodes, it is not. As a result, L2 Discovery generates an "Unable to
determine next hop address".

Another example is ENI mode with automatic device detection. When
secondary interfaces are added, they are used for L2 Neighbor Discovery
as well. However, nodes can only be reached via the primary interface
through the default route in the main routing table. Thus, L2 Neighbor
Discovery generates "Unable to determine next hop address" for secondary
interfaces.

In both cases, it does not always mean the datapath has an issue for KPR
functionality. However, the logs appear repeatedly, are noisy, and the
message is error-ish, causing confusion.

This log started to appear for some users who did not see it before from
v1.14.3 (#28858) and v1.15.0 (in theory). For v1.14.3, it affects KPR +
Tunnel users because of f2dcc86. Before the commit, we did not perform
L2 Neighbor Discovery in tunnel mode, so even if users had an interface
unreachable to other nodes, the log did not appear.

For v1.15.0, it affects to the users who used to have the unreachable
interface. 2058ed6 made it visible. Before the commit, some kind of the
errors like EHOSTUNREACH and ENETUNREACH were not caught because
FIBMatch option didn't specified. After v1.15.0, users started to see
the log.

Fixes: #28858

Signed-off-by: Yutaro Hayakawa <yutaro.hayakawa@isovalent.com>
Signed-off-by: Gilberto Bertin <jibi@cilium.io>
[ upstream commit 777b580 ]

Set ConnectionRetryTimeSeconds in the component tests to 1s in component
tests unless it is specified explicitly. Otherwise, when the initial
connect fails, we need to 120s for the next connection by default, which
may longer than the timeout of the test itself.

Fixes: #31217

Signed-off-by: Yutaro Hayakawa <yutaro.hayakawa@isovalent.com>
Signed-off-by: Gilberto Bertin <jibi@cilium.io>
[ upstream commit 81f14bb ]

This commit adjusts the usage of send_trace_notify in bpf_network.c to
enable monitor aggregation for all events emitted at this observation
point in the datapath. This change helps improve resource usage by
reducing the overall number of events that the datapath emits, while
still enabling packet observability with Hubble.

The events in bpf_network.c enable observability into the IPSec
processing of the datapath. Before this commit, multiple other efforts
have been made to increase the aggregation of events related to IPSec to
reduce resource usage, see #29616 and #27168. These efforts were related
to packets that were specifically marked as encrypted or decrypted by
IPSec and did not include events in bpf_network.c that were emitted when
either: (a) a plaintext packet has been received from the network, or
(b) a packet was decrypted and reinserted into the stack by XFRM. Both
of these events are candidates for aggregation because similar to-stack
events will be emitted down the line in the datapath anyways.
Additionally, these events are mainly useful for root-cause
analysis or debugging and are not necessarily helpful from an overall
observability standpoint.

Signed-off-by: Ryan Drew <ryan.drew@isovalent.com>
Signed-off-by: Gilberto Bertin <jibi@cilium.io>
@jibi jibi force-pushed the pr/v1.14-backport-2024-03-12 branch from f9c2032 to de1f8c1 Compare March 12, 2024 09:51
@sayboras
Copy link
Member

Copy link
Member

@sayboras sayboras left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks and looks good for my commit.

@jibi
Copy link
Member Author

jibi commented Mar 12, 2024

/test-backport-1.14

@jibi jibi marked this pull request as ready for review March 12, 2024 18:28
@jibi jibi requested review from a team as code owners March 12, 2024 18:28
@jibi jibi requested a review from aanm March 12, 2024 18:28
Copy link
Member

@pchaigno pchaigno left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My PR looks good. Thanks!

Copy link
Member

@YutaroHayakawa YutaroHayakawa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

Copy link
Contributor

@learnitall learnitall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

@jibi jibi added the dont-merge/wait-until-release Freeze window for current release is blocking non-bugfix PRs label Mar 13, 2024
@julianwiedmann julianwiedmann removed the dont-merge/wait-until-release Freeze window for current release is blocking non-bugfix PRs label Mar 16, 2024
@maintainer-s-little-helper maintainer-s-little-helper bot added ready-to-merge This PR has passed all tests and received consensus from code owners to merge. labels Mar 16, 2024
@julianwiedmann julianwiedmann merged commit 0de4ac9 into v1.14 Mar 16, 2024
230 checks passed
@julianwiedmann julianwiedmann deleted the pr/v1.14-backport-2024-03-12 branch March 16, 2024 12:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport/1.14 This PR represents a backport for Cilium 1.14.x of a PR that was merged to main. kind/backports This PR provides functionality previously merged into master. ready-to-merge This PR has passed all tests and received consensus from code owners to merge.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

9 participants