Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

v1.14 Backports 2024-04-22 #32104

Merged
merged 14 commits into from
Apr 23, 2024
Merged

v1.14 Backports 2024-04-22 #32104

merged 14 commits into from
Apr 23, 2024

Conversation

@jschwinger233 jschwinger233 added kind/backports This PR provides functionality previously merged into master. backport/1.14 This PR represents a backport for Cilium 1.14.x of a PR that was merged to main. labels Apr 22, 2024
Copy link
Member

@giorio94 giorio94 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My PRs look good, thanks!

@jschwinger233 jschwinger233 force-pushed the pr/v1.14-backport-2024-04-22-01-00 branch from 2f000df to 435188c Compare April 22, 2024 07:38
@jschwinger233
Copy link
Member Author

/test-backport-1.14

Copy link
Member

@jrajahalme jrajahalme left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My commits LGTM

giorio94 and others added 14 commits April 23, 2024 11:55
[ upstream commit 100e625 ]

This prevents possible shenanigans caused by search domains possibly
configured on the runner, and propagated to the pods.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
Signed-off-by: Gray Liang <gray.liang@isovalent.com>
[ upstream commit 804d5f0 ]

When number of concurrent dns requests was moderately high, there was a
chance that some of the gorutines would get stuck waiting for response.
Contains fix from cilium/dns#10

Signed-off-by: Marcel Zieba <marcel.zieba@isovalent.com>
Signed-off-by: Zhichuan Liang <gray.liang@isovalent.com>
[ upstream commit c869e6c ]

Do not return an error from xds server when the context is cancelled, as
this is part of normal operation, and we test for this in
server_e2e_test.

This resolves a test flake:

panic: Fail in goroutine after  has completed

Signed-off-by: Jarno Rajahalme <jarno@isovalent.com>
Signed-off-by: Gray Liang <gray.liang@isovalent.com>
[ upstream commit 3cde59c ]

The main test goroutine might be completed before checks on the server
goroutine are completed, hence cause the below panic issue. This commit
is to defer the streamDone channel close to make sure the error check on
the stream server is done before returning from the test. We keep the
time check on the wait in the end of each test to not stall the tests in
case the stream server fails to exit.

Panic error
```
panic: Fail in goroutine after Test/ServerSuite/TestRequestStaleNonce has completed
```

Testing was done as per below:
```
$ go test -count 500 -run Test/ServerSuite/TestRequestStaleNonce ./pkg/envoy/xds/...
ok      github.com/cilium/cilium/pkg/envoy/xds  250.866s
```

Fixes: #31855
Signed-off-by: Tam Mach <tam.mach@cilium.io>
Signed-off-by: Jarno Rajahalme <jarno@isovalent.com>
Signed-off-by: Gray Liang <gray.liang@isovalent.com>
[ upstream commit df6afbd ]

[ backporter's note: moved changes from pkg/envoy/xds/stream_test.go to
pkg/envoy/xds/stream.go as v1.14 doesn't have the former file ]

Return io.EOF if test channel was closed, rather than returning a nil
request. This mimics the behavior of generated gRPC code, which never
returns nil request with a nil error.

This resolves a test flake with this error logs:

time="2024-04-16T08:46:23+02:00" level=error msg="received nil request from xDS stream; stopping xDS stream handling" subsys=xds xdsClientNode="node0~10.0.0.0~node0~bar" xdsStreamID=1

Signed-off-by: Jarno Rajahalme <jarno@isovalent.com>
Signed-off-by: Zhichuan Liang <gray.liang@isovalent.com>
[ upstream commit d3c1fee ]

Speed up tests by eliminating CacheUpdateDelay, as it is generally not
needed.

When needed, replace with IsCompletedInTimeChecker that waits for upto
MaxCompletionDuration before returning, in contrast with
IsCompletedChecker that only returns the current state without any wait.

This change makes the server_e2e_test tests run >1000x faster.

Signed-off-by: Jarno Rajahalme <jarno@isovalent.com>
Signed-off-by: Gray Liang <gray.liang@isovalent.com>
[ upstream commit 4efe9dd ]

TestRequestStaleNonce test code was written with the assumption that no
response would be reveived for a request with a stale nonce, and a second
SendRequest was done right after with the correct nonce value. This
caused two responses to be returned, and the first one could have been
with the old version of the resources.

Remove this duplicate SendRequest.

This resolves test flakes like this:

        --- FAIL: Test/ServerSuite/TestRequestStaleNonce (0.00s)
            server_e2e_test.go:784:
                ... response *discoveryv3.DiscoveryResponse = &discoveryv3.DiscoveryResponse{state:impl.MessageState{NoUnkeyedLiterals:pragma.NoUnkeyedLiterals{}, DoNotCompare:pragma.DoNotCompare{}, DoNotCopy:pragma.DoNotCopy{}, atomicMessageInfo:(*impl.MessageInfo)(nil)}, sizeCache:0, unknownFields:[]uint8(nil), VersionInfo:"3", Resources:[]*anypb.Any{(*anypb.Any)(0x40003a63c0), (*anypb.Any)(0x40003a6410)}, Canary:false, TypeUrl:"type.googleapis.com/envoy.config.v3.DummyConfiguration", Nonce:"3", ControlPlane:(*corev3.ControlPlane)(nil)} ("version_info:\"3\" resources:{[type.googleapis.com/envoy.config.route.v3.RouteConfiguration]:{name:\"resource1\"}} resources:{[type.googleapis.com/envoy.config.route.v3.RouteConfiguration]:{name:\"resource0\"}} type_url:\"type.googleapis.com/envoy.config.v3.DummyConfiguration\" nonce:\"3\"")
                ... VersionInfo string = "4"
                ... Resources []protoreflect.ProtoMessage = []protoreflect.ProtoMessage{(*routev3.RouteConfiguration)(0xe45380)}
                ... Canary bool = false
                ... TypeUrl string = "type.googleapis.com/envoy.config.v3.DummyConfiguration"

Signed-off-by: Jarno Rajahalme <jarno@isovalent.com>
Signed-off-by: Gray Liang <gray.liang@isovalent.com>
[ upstream commit 75d144c ]

Stream timeout is a duration we use in tests to make sure the stream does
not stall for too long. In production we do not have such a timeout at
all, and in fact the requests are long-lived and responses are only sent
when there is something (new) to send.

Test stream timeout was 2 seconds, and it would occasionally cause a test
flake, especially if debug logging is enabled. This seems to happen due
to goroutine scheduling, and for this reason debug logging should not be
on for these tests.

Bump the test stream timeout to 4 seconds to further reduce the chance of
a test flake due to it.

Signed-off-by: Jarno Rajahalme <jarno@isovalent.com>
Signed-off-by: Gray Liang <gray.liang@isovalent.com>
[ upstream commit 350e9d3 ]

[ backporter's note: minor conflicts in
.github/workflows/tests-clustermesh-upgrade.yaml ]

The goal being to slow down the rollout process, to better highlight
possible connection disruption occurring in the meanwhile. At the same
time, this also reduces the overall CPU load caused by datapath
recompilation, which is a possible additional cause for connection
disruption flakiness.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
Signed-off-by: Zhichuan Liang <gray.liang@isovalent.com>
[ upstream commit 7d2505e ]

[ backporter's note: minor conflicts in
.github/workflows/tests-clustermesh-upgrade.yaml ]

The default IPAM mode is cluster-pool, which gets automatically
overwritten by the Cilium CLI to kubernetes when running on kind.
However, the default helm value gets restored upon upgrade due to
--reset-values, causing confusion and possible issues. Hence, let's
explicitly configure it to kubernetes, to prevent changes.

Similarly, let's configure a single replica for the operator.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
Signed-off-by: Zhichuan Liang <gray.liang@isovalent.com>
[ upstream commit a0d7d37 ]

So that it gets actually executed.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
Signed-off-by: Gray Liang <gray.liang@isovalent.com>
[ upstream commit 1e28a10 ]

[ backporter's note: minor conflicts in
.github/workflows/tests-clustermesh-upgrade.yaml ]

Hubble relay is not deployed in this workflow, hence it doesn't make
sense to wait for the image availability.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
Signed-off-by: Zhichuan Liang <gray.liang@isovalent.com>
[ upstream commit 0c211e1 ]

[ backporter's note: minor conflicts in
.github/workflows/tests-clustermesh-upgrade.yaml ]

As it simplifies troubleshooting possible connection disruptions.
However, let's configure monitor aggregation to medium (i.e., the
maximum, and default value) to avoid the performance penalty due
to the relatively high traffic load.

Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
Signed-off-by: Zhichuan Liang <gray.liang@isovalent.com>
[ upstream commit 9dc89f7 ]

Fixes: 5c06c8e ("ci-eks: Add IPsec key rotation tests")
Signed-off-by: Marco Iorio <marco.iorio@isovalent.com>
Signed-off-by: Zhichuan Liang <gray.liang@isovalent.com>
@jschwinger233 jschwinger233 force-pushed the pr/v1.14-backport-2024-04-22-01-00 branch from 435188c to 285a502 Compare April 23, 2024 03:57
@jschwinger233
Copy link
Member Author

/test-backport-1.14

@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Apr 23, 2024
@jschwinger233 jschwinger233 marked this pull request as ready for review April 23, 2024 08:19
@jschwinger233 jschwinger233 requested review from a team as code owners April 23, 2024 08:19
@jschwinger233 jschwinger233 requested a review from a team as a code owner April 23, 2024 08:19
@maintainer-s-little-helper maintainer-s-little-helper bot removed the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Apr 23, 2024
@ti-mo ti-mo merged commit 7bfaa6a into v1.14 Apr 23, 2024
234 checks passed
@ti-mo ti-mo deleted the pr/v1.14-backport-2024-04-22-01-00 branch April 23, 2024 10:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport/1.14 This PR represents a backport for Cilium 1.14.x of a PR that was merged to main. kind/backports This PR provides functionality previously merged into master.
Projects
No open projects
Status: Released
Development

Successfully merging this pull request may close these issues.

None yet

6 participants