Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test: De-flake xds server_e2e_test #32004

Merged
merged 6 commits into from
Apr 17, 2024
Merged

Conversation

jrajahalme
Copy link
Member

@jrajahalme jrajahalme commented Apr 16, 2024

De-flake xds server_e2e_test by:

  • return nil error from xDS server after context cancel
  • wait for stream server error check to be done before completing test
  • return io.EOF upon channel close from stream_test mock
  • eliminate xDS CacheUpdateDelay
  • eliminate duplicate SendRequest in TestRequestStaleNonce
  • increase xDS test stream timeout

See commit messages for rationale for each step.

Fixes: #31855

Do not return an error from xds server when the context is cancelled, as
this is part of normal operation, and we test for this in
server_e2e_test.

This resolves a test flake:

panic: Fail in goroutine after  has completed

Signed-off-by: Jarno Rajahalme <jarno@isovalent.com>
The main test goroutine might be completed before checks on the server
goroutine are completed, hence cause the below panic issue. This commit
is to defer the streamDone channel close to make sure the error check on
the stream server is done before returning from the test. We keep the
time check on the wait in the end of each test to not stall the tests in
case the stream server fails to exit.

Panic error
```
panic: Fail in goroutine after Test/ServerSuite/TestRequestStaleNonce has completed
```

Testing was done as per below:
```
$ go test -count 500 -run Test/ServerSuite/TestRequestStaleNonce ./pkg/envoy/xds/...
ok      github.com/cilium/cilium/pkg/envoy/xds  250.866s
```

Fixes: cilium#31855
Signed-off-by: Tam Mach <tam.mach@cilium.io>
Signed-off-by: Jarno Rajahalme <jarno@isovalent.com>
Return io.EOF if test channel was closed, rather than returning a nil
request. This mimics the behavior of generated gRPC code, which never
returns nil request with a nil error.

This resolves a test flake with this error logs:

time="2024-04-16T08:46:23+02:00" level=error msg="received nil request from xDS stream; stopping xDS stream handling" subsys=xds xdsClientNode="node0~10.0.0.0~node0~bar" xdsStreamID=1

Signed-off-by: Jarno Rajahalme <jarno@isovalent.com>
Speed up tests by eliminating CacheUpdateDelay, as it is generally not
needed.

When needed, replace with IsCompletedInTimeChecker that waits for upto
MaxCompletionDuration before returning, in contrast with
IsCompletedChecker that only returns the current state without any wait.

This change makes the server_e2e_test tests run >1000x faster.

Signed-off-by: Jarno Rajahalme <jarno@isovalent.com>
TestRequestStaleNonce test code was written with the assumption that no
response would be reveived for a request with a stale nonce, and a second
SendRequest was done right after with the correct nonce value. This
caused two responses to be returned, and the first one could have been
with the old version of the resources.

Remove this duplicate SendRequest.

This resolves test flakes like this:

        --- FAIL: Test/ServerSuite/TestRequestStaleNonce (0.00s)
            server_e2e_test.go:784:
                ... response *discoveryv3.DiscoveryResponse = &discoveryv3.DiscoveryResponse{state:impl.MessageState{NoUnkeyedLiterals:pragma.NoUnkeyedLiterals{}, DoNotCompare:pragma.DoNotCompare{}, DoNotCopy:pragma.DoNotCopy{}, atomicMessageInfo:(*impl.MessageInfo)(nil)}, sizeCache:0, unknownFields:[]uint8(nil), VersionInfo:"3", Resources:[]*anypb.Any{(*anypb.Any)(0x40003a63c0), (*anypb.Any)(0x40003a6410)}, Canary:false, TypeUrl:"type.googleapis.com/envoy.config.v3.DummyConfiguration", Nonce:"3", ControlPlane:(*corev3.ControlPlane)(nil)} ("version_info:\"3\" resources:{[type.googleapis.com/envoy.config.route.v3.RouteConfiguration]:{name:\"resource1\"}} resources:{[type.googleapis.com/envoy.config.route.v3.RouteConfiguration]:{name:\"resource0\"}} type_url:\"type.googleapis.com/envoy.config.v3.DummyConfiguration\" nonce:\"3\"")
                ... VersionInfo string = "4"
                ... Resources []protoreflect.ProtoMessage = []protoreflect.ProtoMessage{(*routev3.RouteConfiguration)(0xe45380)}
                ... Canary bool = false
                ... TypeUrl string = "type.googleapis.com/envoy.config.v3.DummyConfiguration"

Signed-off-by: Jarno Rajahalme <jarno@isovalent.com>
Stream timeout is a duration we use in tests to make sure the stream does
not stall for too long. In production we do not have such a timeout at
all, and in fact the requests are long-lived and responses are only sent
when there is something (new) to send.

Test stream timeout was 2 seconds, and it would occasionally cause a test
flake, especially if debug logging is enabled. This seems to happen due
to goroutine scheduling, and for this reason debug logging should not be
on for these tests.

Bump the test stream timeout to 4 seconds to further reduce the chance of
a test flake due to it.

Signed-off-by: Jarno Rajahalme <jarno@isovalent.com>
@jrajahalme jrajahalme added area/CI Continuous Integration testing issue or flake area/proxy Impacts proxy components, including DNS, Kafka, Envoy and/or XDS servers. ci/flake This is a known failure that occurs in the tree. Please investigate me! labels Apr 16, 2024
@jrajahalme jrajahalme requested a review from a team as a code owner April 16, 2024 13:50
@maintainer-s-little-helper maintainer-s-little-helper bot added dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. labels Apr 16, 2024
@jrajahalme jrajahalme added release-note/ci This PR makes changes to the CI. and removed dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. labels Apr 16, 2024
Copy link
Member

@sayboras sayboras left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, the tests are running much faster than previous iteration ✔️

$ go test -count 1000 ./pkg/envoy/xds/...           
ok      github.com/cilium/cilium/pkg/envoy/xds  4.371s

@jrajahalme
Copy link
Member Author

/test

@jrajahalme
Copy link
Member Author

@sayboras How far you think we should backport this? The code touched here has barely changed, so it should be easy to backport, and recuding flakes in release builds may be valuable as well.

@sayboras
Copy link
Member

How far you think we should backport this? The code touched here has barely changed, so it should be easy to backport, and recuding flakes in release builds may be valuable as well.

Yes, xds code is not changed much for a long time, we can backport this till 1.13 imo.

@jrajahalme jrajahalme added needs-backport/1.13 This PR / issue needs backporting to the v1.13 branch needs-backport/1.14 This PR / issue needs backporting to the v1.14 branch needs-backport/1.15 This PR / issue needs backporting to the v1.15 branch labels Apr 16, 2024
@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Apr 16, 2024
@sayboras sayboras added this pull request to the merge queue Apr 17, 2024
Merged via the queue into cilium:main with commit 75d144c Apr 17, 2024
62 of 63 checks passed
@giorio94 giorio94 mentioned this pull request Apr 18, 2024
6 tasks
@giorio94 giorio94 added backport-pending/1.13 The backport for Cilium 1.13.x for this PR is in progress. and removed needs-backport/1.13 This PR / issue needs backporting to the v1.13 branch labels Apr 18, 2024
@jschwinger233 jschwinger233 mentioned this pull request Apr 22, 2024
9 tasks
@jschwinger233 jschwinger233 added backport-pending/1.15 The backport for Cilium 1.15.x for this PR is in progress. and removed needs-backport/1.15 This PR / issue needs backporting to the v1.15 branch labels Apr 22, 2024
@jschwinger233 jschwinger233 mentioned this pull request Apr 22, 2024
5 tasks
@jschwinger233 jschwinger233 added backport-pending/1.14 The backport for Cilium 1.14.x for this PR is in progress. and removed needs-backport/1.14 This PR / issue needs backporting to the v1.14 branch labels Apr 22, 2024
@github-actions github-actions bot added backport-done/1.13 The backport for Cilium 1.13.x for this PR is done. backport-done/1.14 The backport for Cilium 1.14.x for this PR is done. backport-done/1.15 The backport for Cilium 1.15.x for this PR is done. and removed backport-pending/1.13 The backport for Cilium 1.13.x for this PR is in progress. backport-pending/1.14 The backport for Cilium 1.14.x for this PR is in progress. backport-pending/1.15 The backport for Cilium 1.15.x for this PR is in progress. labels Apr 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/CI Continuous Integration testing issue or flake area/proxy Impacts proxy components, including DNS, Kafka, Envoy and/or XDS servers. backport-done/1.13 The backport for Cilium 1.13.x for this PR is done. backport-done/1.14 The backport for Cilium 1.14.x for this PR is done. backport-done/1.15 The backport for Cilium 1.15.x for this PR is done. ci/flake This is a known failure that occurs in the tree. Please investigate me! ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/ci This PR makes changes to the CI.
Projects
None yet
4 participants