testcluster: default to queues and other moving parts disabled #107528
Labels: C-enhancement (Solution expected to add code/behavior + preserve backward-compat; pg compat issues are the exception), db-cy-23, quality-friday (A good issue to work on on Quality Friday), T-testeng (TestEng Team)
cc @cockroachdb/test-eng
tbg added a commit to tbg/cockroach that referenced this issue on Jul 25, 2023:

> For a more holistic suggestion on how to fix this for the likely many other tests susceptible to similar issues, see: cockroachdb#107528
>
> Fixes cockroachdb#101824.
> Release note: None
> Epic: none
craig bot pushed a commit that referenced this issue on Jul 26, 2023:

107265: liveness: allow registering callbacks after start r=erikgrinaker a=tbg

I discovered[^1] a deadlock scenario when multiple nodes in the cluster restart with additional stores that need to be bootstrapped. In that case, liveness must be running when the StoreIDs are allocated, but it is not. Trying to address this problem, I realized that when an auxiliary Store is bootstrapped, it will create a new replicateQueue, which will register a new callback into NodeLiveness. But if liveness must be started at this point to fix #106706, we'll run into the assertion that checks that we don't register callbacks on a started node liveness. Something's got to give: we will allow registering callbacks at any given point in time, and they'll get an initial set of notifications synchronously. I audited the few users of RegisterCallback and this seems OK with all of them.

[^1]: #106706 (comment)

Epic: None
Release note: None

107417: kvserver: ignore RPC conn when deciding to campaign/vote r=erikgrinaker a=erikgrinaker

**kvserver: remove stale mayCampaignOnWake comment**

The comment is about a parameter that no longer exists.

**kvserver: revamp shouldCampaign/Forget tests**

**kvserver: ignore RPC conn in `shouldCampaignOnWake`**

Previously, `shouldCampaignOnWake()` used `IsLiveMapEntry.IsLive` to determine whether the leader was dead. However, this depends not only on the node's liveness but also on its RPC connectivity. This can prevent an unquiescing replica from acquiring Raft leadership if the leader is still alive but unable to heartbeat liveness, in which case the leader will be unable to acquire epoch leases. This patch ignores the RPC connection state when deciding whether to campaign, using only the liveness state.

**kvserver: ignore RPC conn in `shouldForgetLeaderOnVoteRequest`**

Previously, `shouldForgetLeaderOnVoteRequest()` used `IsLiveMapEntry.IsLive` to determine whether the leader was dead. However, this depends not only on the node's liveness but also on its RPC connectivity. This can prevent granting votes to a new leader that may be attempting to acquire an epoch lease (which the current leader can't). This patch ignores the RPC connection state when deciding whether to forget the leader, using only the liveness state.

Resolves #107060.
Epic: none
Release note: None

**kvserver: remove `StoreTestingKnobs.DisableLivenessMapConnHealth`**

107424: kvserver: scale Raft entry cache size with system memory r=erikgrinaker a=erikgrinaker

The Raft entry cache size defaulted to 16 MB, which is rather small. This has been seen to cause tail latency and throughput degradation with high write volume on large nodes, correlating with a reduction in the entry cache hit rate. This patch linearly scales the Raft entry cache size as 1/256 of total system/cgroup memory, shared evenly between all stores, with a minimum of 32 MB. For example, a 32 GB 8-vCPU node will have a 128 MB entry cache. This is a conservative default, since this memory is not accounted for in existing memory budgets nor by the `--cache` flag. We rarely see cache misses in production clusters anyway, and have seen significantly improved hit rates with this scaling (e.g. a 64 KB kv0 workload on 8-vCPU nodes increased from an 87% to a 99% hit rate).

Resolves #98666.
Epic: none
Release note (performance improvement): The default Raft entry cache size has been increased from 16 MB to 1/256 of system memory, with a minimum of 32 MB, divided evenly between all stores. This can be configured via `COCKROACH_RAFT_ENTRY_CACHE_SIZE`.

107442: kvserver: deflake TestRequestsOnFollowerWithNonLiveLeaseholder r=erikgrinaker a=tbg

The test previously relied on aggressive liveness heartbeat expirations to avoid running for too long. As a result, it was flaky, since liveness wasn't reliably pinned in the way the test wanted. The hybrid manual clock allows time to jump forward at an opportune moment. Use it here to avoid running with a tight lease interval. On my gceworker, this previously flaked within a few minutes. As of this commit, I ran it for double-digit minutes without issue.

Fixes #107200.
Epic: None
Release note: None

107526: kvserver: fail gracefully in TestLeaseTransferRejectedIfTargetNeedsSnapshot r=erikgrinaker a=tbg

We saw this test hang in CI. What likely happened (according to the stacks) is that a lease transfer that was supposed to be caught by an interceptor never showed up in the interceptor. The most likely explanation is that it errored out before it got to evaluation. It then signaled a channel the test was only prepared to check later, so the test hung (waiting for a channel that was now never to be touched).

This test is hard to maintain. It would be great (though, for now, out of reach) to write tests like it in a deterministic framework[^2]. For now, fix the test so that when the (so far unknown) error rears its head again, it will fail properly, so we get to see the error and can take another pass at fixing the test (separately). Stressing this commit[^3], we get:

> transferErrC unexpectedly signaled: /Table/Max: transfer lease unexpected error: refusing to transfer lease to (n3,s3):3 because target may need a Raft snapshot: replica in StateProbe

This makes sense. The test wants to exercise the below-raft mechanism, but the above-raft mechanism also exists, and while we didn't want to interact with it, we sometimes do[^4]. The second commit introduces a testing knob that disables the above-raft mechanism selectively. I've stressed the test for 15 minutes without issues after this change.

[^2]: see #105177
[^3]: `./dev test --filter TestLeaseTransferRejectedIfTargetNeedsSnapshot --stress ./pkg/kv/kvserver/` on gceworker, 285s
[^4]: somewhat related to #107524

Fixes #106383.
Epic: None
Release note: None

107531: kvserver: disable replicate queue and lease transfers in closedts tests r=erikgrinaker a=tbg

For a more holistic suggestion on how to fix this for the likely many other tests susceptible to similar issues, see: #107528

> 1171 runs so far, 0 failures, over 15m55s

Fixes #101824.
Release note: None
Epic: none

Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
Co-authored-by: Erik Grinaker <grinaker@cockroachlabs.com>
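The sizing rule described in 107424 (1/256 of system memory, split evenly across a node's stores, with a 32 MB floor) can be sketched as follows. The function name and the exact order of applying the floor versus the per-store split are assumptions for illustration, not the actual CockroachDB implementation:

```go
package main

import "fmt"

const minEntryCacheSize = 32 << 20 // 32 MB floor, per the release note

// raftEntryCacheSize is a hypothetical helper mirroring the described
// policy: the total cache budget is 1/256 of system memory (but at
// least 32 MB), divided evenly between the node's stores.
func raftEntryCacheSize(systemMemBytes uint64, numStores int) uint64 {
	total := systemMemBytes / 256
	if total < minEntryCacheSize {
		total = minEntryCacheSize
	}
	return total / uint64(numStores)
}

func main() {
	// The example from the commit message: a 32 GB node with one store
	// gets a 128 MB entry cache.
	fmt.Println(raftEntryCacheSize(32<<30, 1) >> 20) // prints 128 (MB)
}
```

With two stores on the same 32 GB node, each store's share drops to 64 MB, while a small 1 GB machine still gets the 32 MB floor.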
craig bot pushed a commit that referenced this issue on Jul 27, 2023:

107613: kvserver: avoid moving parts in closedts tests r=erikgrinaker a=tbg

Many tests (two just this last week[^1][^2]) are flaky because the replicate queue can make rebalancing decisions that undermine the state the test is trying to set up. Often, this happens "accidentally" because ReplicationAuto is our default ReplicationMode. This PR improves the situation, at least for closed timestamp integration tests, by switching them all over to `ReplicationManual` (and preventing any new ones from accidentally using `ReplicationAuto` in the future). This can be seen as a small step towards #107528, which I am increasingly convinced is an ocean worth boiling.

[^1]: #107179
[^2]: #101824

Epic: None
Release note: None

Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
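One lightweight way to "prevent any new ones from accidentally using ReplicationAuto", as 107613 describes, is a guard that scans test sources for the identifier and fails when it reappears. This is an illustrative sketch only; it is not the mechanism the PR actually uses:

```go
package main

import (
	"fmt"
	"strings"
)

// linesMentioning returns the 1-based line numbers of src that contain
// the given identifier, so a package-local test could fail when new
// closedts tests reintroduce ReplicationAuto. Purely illustrative.
func linesMentioning(src, ident string) []int {
	var hits []int
	for i, line := range strings.Split(src, "\n") {
		if strings.Contains(line, ident) {
			hits = append(hits, i+1)
		}
	}
	return hits
}

func main() {
	src := "args := base.TestClusterArgs{\n\tReplicationMode: base.ReplicationManual,\n}"
	// No hits: this hypothetical snippet uses ReplicationManual.
	fmt.Println(linesMentioning(src, "ReplicationAuto"))
}
```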
andrewbaptist added a commit to andrewbaptist/cockroach that referenced this issue on Sep 29, 2023:

> This change reduces the long startup time spent waiting for SQL services to start in tests that exercise the KV layer and below. Additionally, it disables the call to set `kv.range_merge.queue.enabled` as part of testcluster, as it is no longer necessary.
>
> Informs: cockroachdb#107528
> Epic: none
> Release note: None
THardy98 pushed a commit to THardy98/cockroach that referenced this issue on Oct 6, 2023:

> This change reduces the long startup time spent waiting for SQL services to start in tests that exercise the KV layer and below. Additionally, it disables the call to set `kv.range_merge.queue.enabled` as part of testcluster, as it is no longer necessary.
>
> Informs: cockroachdb#107528
> Epic: none
> Release note: None
Is your feature request related to a problem? Please describe.
I've been spending a lot of time on test flakes lately, and as is well known, we rely (some would say over-rely) on TestCluster tests fairly heavily to test very precise interactions, often in the transaction and lease protocols.
TestCluster has many moving parts, and this brings an amount of nondeterminism that produces a steady stream of test flakes. These are often not very interesting (the test expects the lease to be in one place, but the store rebalancer may move it, oops), and they are not very frequent for any individual test. However, they are very annoying to reproduce and hard to sort out, and since they affect many tests, they amount to a tax we are paying without anything too concrete in return.
Generally speaking, our testing should be more deterministic[^1] when we're in the context of a unit test, and less deterministic when we're end-to-end testing (i.e. roachtest). This issue is about the former.
Let's default to turning off all moving parts in TestCluster that aren't necessary for the correct functionality of a purely in-memory cluster.
Today, TestCluster has two "replication" modes: `ReplicationAuto` and `ReplicationManual`. We should evolve this into a general "Mode". For example, `StartTestCluster` in mode `ReplicationAuto` should turn the replicate queue off after it has waited for full replication. Similarly, we want to disable the other queues (plus the store rebalancer), perhaps with the exception of the Raft snapshot queue, though that one too should be disabled once we are convinced there aren't any more errant Raft snapshots[^2].

To migrate over, we'd replace the semantics of `ReplicationAuto` with that of `StaticAfterUpreplication` and see which tests fail, migrating them into one of the other options as appropriate. We'd do something similar with all explicit uses of `ReplicationManual`.

Something similar goes for the few direct users of `TestServer` (which should likely migrate to `TestCluster` instead).

Describe the solution you'd like
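The proposed semantics can be sketched as a small decision function. The mode name `StaticAfterUpreplication` comes from the issue text, but the enum and the function below are hypothetical, not TestCluster's actual API:

```go
package main

import "fmt"

// Mode is a hypothetical generalization of TestCluster's replication modes.
type Mode int

const (
	ReplicationAuto          Mode = iota // legacy: queues stay on
	ReplicationManual                    // queues never run
	StaticAfterUpreplication             // queues run only until full replication
)

// replicateQueueActive sketches the proposal: under
// StaticAfterUpreplication, the replicate queue (and, by extension, the
// other queues and the store rebalancer) is switched off once the
// cluster has finished waiting for full replication.
func replicateQueueActive(m Mode, fullyReplicated bool) bool {
	switch m {
	case ReplicationManual:
		return false
	case StaticAfterUpreplication:
		return !fullyReplicated
	default: // ReplicationAuto, today's behavior
		return true
	}
}

func main() {
	fmt.Println(replicateQueueActive(StaticAfterUpreplication, true)) // prints false
	fmt.Println(replicateQueueActive(ReplicationAuto, true))          // prints true
}
```

Migrating would then mean pointing existing `ReplicationAuto` users at the static-after-upreplication behavior and triaging whichever tests start failing.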
Describe alternatives you've considered
Additional context
Adding quality-friday in case someone wants to prototype this and get an idea of how many tests fail loudly with the new semantics.
Jira issue: CRDB-30097
Footnotes

[^1]: https://github.com/cockroachdb/cockroach/issues/105177
[^2]: https://github.com/cockroachdb/cockroach/issues/87553