Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

v1.15 Backports 2024-03-19 #31490

Merged
merged 24 commits into from Mar 21, 2024
Merged

Conversation

gandro
Copy link
Member

@gandro gandro commented Mar 19, 2024

PRs skipped due to conflicts:

Once this PR is merged, a GitHub action will update the labels of these PRs:

 30906 30929 31030 31016 31119 31010 31366 31150 31365 31380 31358 31387 31344 31214 31413 26998 31271 31272 31420 31262 31437

@gandro gandro added kind/backports This PR provides functionality previously merged into master. backport/1.15 This PR represents a backport for Cilium 1.15.x of a PR that was merged to main. labels Mar 19, 2024
Copy link
Member

@sayboras sayboras left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good for my PRs, thanks 🎖️

Copy link
Member

@giorio94 giorio94 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My commit looks good, thanks!

@gandro
Copy link
Member Author

gandro commented Mar 19, 2024

/test-backport-1.15

@gandro gandro marked this pull request as ready for review March 19, 2024 11:14
@gandro gandro requested review from a team as code owners March 19, 2024 11:14
Copy link
Member

@pchaigno pchaigno left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My PRs look good. Thanks Sebastian!

Copy link
Member

@hemanthmalla hemanthmalla left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks.

Copy link
Contributor

@kkourt kkourt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For my patch.

@gandro gandro force-pushed the pr/v1.15-backport-2024-03-19-11-05 branch from 4cf0b29 to 021d1da Compare March 20, 2024 09:56
[ upstream commit 7df437a ]

Rarely, the control plane test panics, due to a send on a closed
channel. This can occur in a narrow race window in the filteringWatcher:

1. Stop is called on the child watcher
2. Child watcher calls stop on parent watcher
3. Concurrently, an event is dequeued from the parent result chan, and
   we enter the filtering logic.
4. The parent result chan is closed, and we close the child event channel
5. The filter is matched, and we attempt to write on the closed channel,
   which causes the panic.

Instead of closing the channel in the Stop method, close the channel
from the writing goroutine (as is commonly considered best practice in
Go.)

Fixes: fa89802 (controlplane: Implement filtering of objects with field selectors)

Signed-off-by: David Bimmler <david.bimmler@isovalent.com>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
bimmlerd and others added 17 commits March 20, 2024 13:58
[ upstream commit 08d898d ]

The commit below correctly identified that modifying the test input
makes the test flaky due to being dependent on the non-deterministic
ordering. However, it missed adding the work-around for the first test,
TestUnique.

Fixes: 32543a4 (slices: don't modify input slices in test)

Signed-off-by: David Bimmler <david.bimmler@isovalent.com>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
[ upstream commit cfd1790 ]

The fake k8s testing infrastructure has an unfortunate interaction
between real informers and the object tracker used in fake clientsets:
the informer uses ListAndWatch to subscribe to the api server, but
ListAndWatch relies on the ResourceVersion field to bridge the gap
between the List and the Watch. The simple object tracker does not
provide the resource versioning, nor does its Watch implementation
attempt to replay existing state.

The race window, hence, allows for creation of objects _after_ the
initial list, but _before_ the establishment of Watch. These objects are
not observed by the informer and thus tests are likely to fail.

As a workaround, we ensure that the object tracker has registered a
watcher before we start the test in earnest. It is only important that
we do not create/update/delete objects in the window between starting
the hive (and hence running of the informer) and having ensured that the
watcher is in place. Creation prior to starting the hive is okay, as
well as after ensuring the watcher exists.

Signed-off-by: David Bimmler <david.bimmler@isovalent.com>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
[ upstream commit 1321e03 ]

Before the GwAPI doc listed KPR mode as a prerequisite. However,
it's actually only required to enable BPF nodePort support.

Signed-off-by: Philip Schmid <philip.schmid@isovalent.com>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
[ upstream commit 44aeb53 ]
[ backporter notes: Minor conflict in values.yaml due to different
  intendation of podAntiAffinity in v1.15 ]

This commit is to avoid cilium-envoy running on the node without cilium
agent. Two main changes are:

- nodeAffinity to make sure that cilium-envoy will not be scheduled on
  node without cilium agent
- podAffinity with requiredDuringSchedulingIgnoredDuringExecution to
  cilium agent

Relates: #25081, #30034
Fixes: #31149
Signed-off-by: Tam Mach <tam.mach@cilium.io>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
[ upstream commit 696a4fd ]

Since the test updates multiple k8s resources in each test step,
and each step depends on the previous one, we should be careful
about how many k8s resources are we actually changing in each step,
to not trigger multiple reconciliations with different results
(advertisements) in a single step.

This change ensures we change only one value that affects
the advertisement in each test step.

Signed-off-by: Rastislav Szabo <rastislav.szabo@isovalent.com>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
[ upstream commit 4f4b0a7 ]

When we run out of IDs to allocate for nodes, we were propagating zero
ID to bpf map.
Now we just simply return error and not modify bpf map instead.
Also clean up incorrectly mapped nodeids on startup in case that
happened.

Signed-off-by: Marcel Zieba <marcel.zieba@isovalent.com>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
[ upstream commit 65bbf3f ]

During the refactor of the below commit, it seems the labels were left
out inadvertently, breaking the `cilium policy selectors` command that
displays the labels/name of the policy from which the selectors
originate from.

Fixes: 501944c ("policy/selectorcache: invert identitySelector interface")

Signed-off-by: Chris Tarazi <chris@isovalent.com>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
[ upstream commit a5eafe0 ]

Recently, we frequently see the CI failure with lvh-kind startup
failure with exit code 41. This indicates the timeout of the task
waiting for the SSH startup. Bump the timeout (retry) to 600 (10min) as
a workaround.

Fixes: #31336

Signed-off-by: Yutaro Hayakawa <yutaro.hayakawa@isovalent.com>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
[ upstream commit 4ba7e6a ]

Commit 891fa78 ("bpf: Delete obsolete do_netdev_encrypt_pools()")
removed the special code we had to rewrite the IPsec outer header. The
code removed in the present commit is therefore not required anymore.

Fixes: 891fa78 ("bpf: Delete obsolete do_netdev_encrypt_pools()")
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
[ upstream commit f61651f ]

This was broken during transition of pkg/metrics to integrate with Hive where relevant operator metrics where never initialized.
This adds a init func specific for operator and cleans up the "flush" logic used as a work around for errors/warnings emitted prior to agent starting (in the case of the operator).

Addresses: #29525

Signed-off-by: Tom Hadlaw <tom.hadlaw@isovalent.com>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
[ upstream commit 6436447 ]

I was trying to install cilium in a machine without a vxlan module
available. Add a helpful message for the next time this happens.

Tested by running:
> PRIVILEGED_TESTS=1 go test  -test.v -test.run TestSetupTunnelDevice

on said machine, and got the following output:
...

```
        Error:          Received unexpected error: setting up vxlan device: creating vxlan device: creating device cilium_vxlan: operation not supported, maybe kernel module for vxlan is not available?
	Test:           TestSetupTunnelDevice/Vxlan
```

Signed-off-by: Kornilios Kourtis <kornilios@isovalent.com>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
[ upstream commit 1ca6141 ]

pkg/ipam/types.(*InstanceMap).DeepCopy(...) will iterate for all instances/interfaces in order to copy the data.
However, unlike what the name suggests, underlying instance pkg/ipam/types.Interface pointers are copied and shared in the returned instance map.
In some cases, this case result in memory corruption issues resulting in confusing panics while running tests such as:

```
panic: runtime error: makeslice: cap out of range
goroutine 1366 [running]:
strconv.appendQuotedWith({0xc000576208, 0x0, 0x3f?}, {0x1000000, 0x100000001000000}, 0x22, 0x0, 0x0)
	/opt/hostedtoolcache/go/1.22.0/x64/src/strconv/quote.go:35 +0x85
strconv.AppendQuote(...)

...
```

Capturing such an event in a debugger you would see a AzureInterface struct such as this with the IP string memory being corrupt (likely due to an interleaved read/write) being passed to logrus causing a crash.

```
github.com/cilium/cilium/pkg/azure/types.AzureAddress {
	IP: "\x10\x01\x00\x00\x00\x00\x00\x00\x10\x01\x00\x00\x00\x00\x00\x007d�_\x02\b\b\x19\x00\x00\x00\x00\x00\x00\x00\x00�\x1f�\x03\x00\x00\x00\x00W�\b\x00\x00\x00\x00\x00\x00p \x03\x00\x00\x00\x00�qi\x03\x00\x00\x00\x00...+51559946186908214 more",
	Subnet: "subsys",
	State: "instanceID",}
```
This ensures that the revision interface is correctly deepcopied such that the underlying resource is also safely copied.

Fixed: #31059

Signed-off-by: Tom Hadlaw <tom.hadlaw@isovalent.com>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
[ upstream commit e8bed8d ]
[ backporter notes: conflicts in unit tests. Had to change introduce
  the `NoError` check instead of expecting a specific error ]

This commit is to add the same namespace while listing generated LB
service, the main reason is to avoid wrong status update for gateways
having the same name, but belonged to different namespace.

Testing was done locally as per below:

Before the fix:
```
$ kg gateway -A
NAMESPACE   NAME         CLASS    ADDRESS          PROGRAMMED   AGE
another     my-gateway   cilium   10.110.222.237   True         4s
default     my-gateway   cilium   10.110.222.237   True         56s
```

After the fix:

```
$ kg gateway -A
NAMESPACE   NAME         CLASS    ADDRESS          PROGRAMMED   AGE
another     my-gateway   cilium   10.102.170.180   True         14m
default     my-gateway   cilium   10.110.222.237   True         14m

$ kg services -A
NAMESPACE     NAME                           TYPE           CLUSTER-IP       EXTERNAL-IP      PORT(S)                      AGE
another       cilium-gateway-my-gateway      LoadBalancer   10.102.170.180   10.102.170.180   80:31424/TCP                 15m
default       cilium-gateway-my-gateway      LoadBalancer   10.110.222.237   10.110.222.237   80:31889/TCP                 15m
...
```

Fixes: #31270
Signed-off-by: Tam Mach <tam.mach@cilium.io>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
[ upstream commit 5884af1 ]

This is to make sure that we have the coverage for bpf.tproxy enabled.
The first step is to test it with Ingress Controller enabled, we can add
more settings if required later.

Relates: #30331, #30404

Suggested-by: Marco Hofstetter <marco.hofstetter@isovalent.com>
Signed-off-by: Tam Mach <tam.mach@cilium.io>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
[ upstream commit 0fb203e ]

So that the failure of one matrix entry (e.g., caused by a flake) doesn't
cancel the other ongoing tests, if any.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
[ upstream commit 44e9005 ]

Currently in cases where reporters emit status on stopped reporters, a warning is logged.
This causes failures in CI related to endpoints, as those close their reporters after endpoint is deleted.

This scenario is probably not a valuable log for users and generally such occurrences should be harmless.
This moves the log to debug level.

Fixes: #31147

Signed-off-by: Tom Hadlaw <tom.hadlaw@isovalent.com>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
[ upstream commit b639eab ]

In general, it is not recommended to carry several admin. operations on
the cluster at the same time, as it can make troubleshooting in case of
issues a lot more complicated. Mixing operations is also less likely to
be covered in CI so more likely to hit corner cases.

Performing IPsec key rotations during Cilium up/downgrades is one such
case. Let's document it explicitly to discourage users from doing that.

Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
@gandro gandro force-pushed the pr/v1.15-backport-2024-03-19-11-05 branch from 021d1da to 0ef35be Compare March 20, 2024 12:58
@gandro gandro requested a review from bimmlerd March 20, 2024 12:59
@gandro
Copy link
Member Author

gandro commented Mar 20, 2024

/test

@jrajahalme
Copy link
Member

/test-backport-1.15

Copy link
Member

@YutaroHayakawa YutaroHayakawa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mine looks good. Thanks!

@gandro gandro added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Mar 21, 2024
@gandro
Copy link
Member Author

gandro commented Mar 21, 2024

Non-trivial reviews are in, marking ready to merge.

Copy link
Member

@jrajahalme jrajahalme left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Checked the remaining commits

@maintainer-s-little-helper maintainer-s-little-helper bot removed the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Mar 21, 2024
@jrajahalme jrajahalme merged commit b92965b into v1.15 Mar 21, 2024
218 of 219 checks passed
@jrajahalme jrajahalme deleted the pr/v1.15-backport-2024-03-19-11-05 branch March 21, 2024 13:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport/1.15 This PR represents a backport for Cilium 1.15.x of a PR that was merged to main. kind/backports This PR provides functionality previously merged into master.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet