kv/kvserver: TestStoreRangeMergeConcurrentRequests failed #68703

Closed
cockroach-teamcity opened this issue Aug 11, 2021 · 5 comments · Fixed by #68894
Labels
branch-master: Failures on the master branch.
C-test-failure: Broken test (automatically or manually discovered).
O-robot: Originated from a bot.

Comments

@cockroach-teamcity
Member

kv/kvserver.TestStoreRangeMergeConcurrentRequests failed with artifacts on master @ 57173569d3892064bde592fee1fb119d64bf7d8b:

=== RUN   TestStoreRangeMergeConcurrentRequests
    test_log_scope.go:73: test logs captured to: /go/src/github.com/cockroachdb/cockroach/artifacts/logTestStoreRangeMergeConcurrentRequests734036705
    test_log_scope.go:74: use -show-logs to present logs inline
    client_merge_test.go:2374: /System/"aaa-testing": merge unexpected error: kv/kvserver/replica_command.go:791: merge failed: waiting for all left-hand replicas to initialize: operation "wait for replicas init" timed out after 5s: could not dial n1: context deadline exceeded
E210811 07:14:41.784119 157273 kv/kvserver/consistency_queue.go:188  [n1,consistencyChecker,s1,r12/1:/Table/1{6-7}] 1  computing own checksum: could not dial node ID 1: failed to connect to n1 at 127.0.0.1:34963: stopped
E210811 07:14:41.784221 157273 kv/kvserver/queue.go:1098  [n1,consistencyChecker,s1,r12/1:/Table/1{6-7}] 2  computing own checksum: could not dial node ID 1: failed to connect to n1 at 127.0.0.1:34963: stopped
    panic.go:613: -- test log scope end --
test logs left over in: /go/src/github.com/cockroachdb/cockroach/artifacts/logTestStoreRangeMergeConcurrentRequests734036705
--- FAIL: TestStoreRangeMergeConcurrentRequests (5.28s)
Reproduce

To reproduce, try:

make stressrace TESTS=TestStoreRangeMergeConcurrentRequests PKG=./pkg/kv/kvserver TESTTIMEOUT=5m STRESSFLAGS='-timeout 5m' 2>&1

Parameters in this failure:

  • GOFLAGS=-parallel=4

Internal log

benesch marked as alumn{us/a}; resolving to nvanbenschoten instead

/cc @cockroachdb/kv nvanbenschoten


cockroach-teamcity added the branch-master, C-test-failure, and O-robot labels Aug 11, 2021
cockroach-teamcity added this to roachtest/unit test backlog in KV Aug 11, 2021
@knz
Contributor

knz commented Aug 12, 2021

I am seeing this test fail immediately under stress:

E210812 11:49:33.443303 1152 kv/kvserver/consistency_queue.go:188  [n1,consistencyChecker,s1,r19/1:/Table/2{3-4}] 1  computing own checksum: could not dial node ID 1: failed to connect to n1 at 127.0.0.1:56209: initial connection heartbeat failed: rpc error: code = Unavailable desc = connection error: desc = "transport: authentication handshake failed: context deadline exceeded"
E210812 11:49:33.443409 1152 kv/kvserver/queue.go:1098  [n1,consistencyChecker,s1,r19/1:/Table/2{3-4}] 2  computing own checksum: could not dial node ID 1: failed to connect to n1 at 127.0.0.1:56209: initial connection heartbeat failed: rpc error: code = Unavailable desc = connection error: desc = "transport: authentication handshake failed: context deadline exceeded"
    client_merge_test.go:2374: /System/"aaa-testing": merge unexpected error: kv/kvserver/replica_command.go:791: merge failed: waiting for all left-hand replicas to initialize: could not dial n1: failed to connect to n1 at 127.0.0.1:56209: initial connection heartbeat failed: rpc error: code = Unavailable desc = connection error: desc = "transport: authentication handshake failed: context deadline exceeded"

@knz
Contributor

knz commented Aug 12, 2021

@nvanbenschoten suggested taking a quick look to find a possible cause.

However, we want to discuss this further:

  • for this particular issue, we'll want to discuss with the team where it falls within the team scopes
  • Nathan will try to have a look on a best-effort basis, falling back on a discussion next week

@erikgrinaker
Contributor

I think it's just that these sorts of tests don't work very well under stress or stressrace with high concurrency, as they get too slow. stress defaults to one concurrent process per CPU (24 on my system), which fails every time, but running with e.g. STRESSFLAGS='-p 8' gets me hundreds of runs without failures.

@nvanbenschoten
Member

I'm not seeing this fail normally or under stress on my machine, but I think Erik's analysis is probably correct. We could skip.UnderStress this test, though it's not doing anything particularly out of the ordinary besides spinning up 16 workers. Are we stressing too hard in CI by assuming we can run a concurrent test process per CPU without badly overloading the system?

@erikgrinaker
Contributor

> Are we stressing too hard in CI by assuming we can run a concurrent test process per CPU without badly overloading the system?

It sort of depends on the test. I've always struggled with integration tests that have >= 3 servers under stress, both in CI and locally. However, regular unit tests that don't spin up larger components can run just fine with that amount of concurrency.
