kvserver: cpu spike during import due to split stats computation #122309
Comments
@kvoli and I looked at this together and it does seem likely that the new split behavior, introduced in #119894, is causing instability in two ways (see below). The precondition here is that the import is issuing tens (we see 60 in the logs for n11) of manual splits on the same non-empty range. In the new split behavior, each split first re-computes the MVCC stats for the range (to prevent stats from drifting across successive splits), and then computes the stats for the user data on the left hand side (to pass to the split trigger).
To confirm the above, we can disable estimated-stats splits ( |
Manual splits issued via AdminSplit are used in bulk operations, like import, and tests to split many ranges out of the same original range. Pre-computing the LHS user stats for each of these ranges concurrently causes CPU spikes and split slowness; issuing repeated RecomputeStats requests for the same range contributes even more and can cause contention on the range descriptor. Estimating MVCC stats during a split is an improvement targeted at size-based splits, so in this patch we revert the manual-split behavior back to computing accurate MVCC stats.

Fixes: cockroachdb#122309
Release note: None
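The opt-out described in this commit message can be sketched as a single decision. This is only an illustration in Go: `useEstimatedStats` and its parameters are hypothetical names and do not reflect the actual kvserver code or cluster settings.

```go
package main

import "fmt"

// useEstimatedStats is a hypothetical helper, not the kvserver implementation.
// It models the rule from the commit message: estimated MVCC stats are only
// used for splits that are not manual (for example size-based splits), and
// only when the estimates feature is enabled at all.
func useEstimatedStats(isManualSplit, estimatesEnabled bool) bool {
	return estimatesEnabled && !isManualSplit
}

func main() {
	// Manual split (e.g. AdminSplit issued by an import): accurate stats.
	fmt.Println("manual split uses estimates:", useEstimatedStats(true, true))
	// Non-manual split with the feature enabled: estimated stats.
	fmt.Println("size-based split uses estimates:", useEstimatedStats(false, true))
}
```

With this rule, manual splits skip the extra stats recomputation and LHS pre-computation entirely and fall back to the earlier accurate-stats path.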
122824: kvserver: opt manual splits out of estimated MVCC stats r=miraradeva a=miraradeva

Manual splits issued via AdminSplit are used in bulk operations, like import, and tests to split many ranges out of the same original range. Pre-computing the LHS user stats for each of these ranges concurrently causes CPU spikes and split slowness; issuing repeated RecomputeStats requests for the same range contributes even more and can cause contention on the range descriptor. Estimating MVCC stats during a split is an improvement targeted at size-based splits, so in this patch we revert the manual-split behavior back to computing accurate MVCC stats.

Fixes: #122309
Release note: None

122908: roachtest: use time.Duration instead of int64 for latency measurement r=rafiss a=rafiss

This makes the time printed in a more readable way in logs.

Epic: None
Release note: None

122926: roachtest: wait for 3x replication before chaos in multiregion/system-database r=rafiss a=rafiss

The test could previously kill nodes before replication was complete, so we could lose quorum. Now we make sure replication is completed first.

fixes #122742
Release note: None

Co-authored-by: Mira Radeva <mira@cockroachlabs.com>
Co-authored-by: Rafi Shamim <rafi@cockroachlabs.com>
Describe the problem
High CPU utilization during an import, while ranges are being split and scattered. The CPU usage is attributed to AdminSplitRequest computing the user MVCC stats of the (to-be) LHS range:
cockroach/pkg/kv/kvserver/replica_command.go
Line 537 in 610688b
To Reproduce
See (internal) thread.
Expected behavior
CPU doesn't spike beyond a nominal 10-15%.
Additional data / screenshots
There are profiles and cluster information linked in the above (internal) thread.
Environment:
v24.1.0-alpha.5-dev-c43f54cdde5b7578f4a0ca61de41463f0d690993
Jira issue: CRDB-37797