
kvserver: compute MVCCStats in AdminSplit instead of splitTrigger #119894

Merged: 4 commits, Mar 21, 2024

Conversation

@miraradeva (Contributor) commented Mar 4, 2024

This PR includes four commits:

  • c3edbe5 adds two new stats computation functions: one computes the stats of only the user-data keys in the range, and the other computes the stats of all non-user-data keys. These functions are used in the next commit to estimate the stats of the LHS of the split.
  • 23bf51a includes the main logic for moving some of the stats computation from splitTrigger (where we hold latches) to AdminSplit (where we don't), reducing the duration of the split's critical section; see the sketch after this list. The resulting MVCC stats are not always guaranteed to be 100% correct because they may not account for writes concurrent with the split.
  • 37598e1 is a followup to 23bf51a that adds some extra protection against creating stats that differ too much from the range's data. It does so by ensuring the pre- and post-split stats for the entire range are close.
  • b88c331 adds logging and observability around the new stats computation.
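
To make the combination concrete, here is a minimal sketch of the stats arithmetic described above. This is not the PR's code: the function name is hypothetical, and it only assumes enginepb.MVCCStats with its Add/Subtract helpers and the ContainsEstimates field.

// estimatedSplitStats is a hypothetical illustration: total is the
// range's on-disk stats, preSplitLeftUser is the user-only LHS stats
// computed in AdminSplit (no latches held), and nonUserLeft is the
// non-user LHS stats scanned in splitTrigger (under latches).
func estimatedSplitStats(
    total, preSplitLeftUser, nonUserLeft enginepb.MVCCStats,
) (left, right enginepb.MVCCStats) {
    left = preSplitLeftUser
    left.Add(nonUserLeft) // non-user spans are small, so this scan is cheap
    right = total
    right.Subtract(left) // RHS = total on-disk stats minus LHS
    // Writes concurrent with the split may be attributed to the wrong
    // side, so both sides are flagged as containing estimates.
    left.ContainsEstimates++
    right.ContainsEstimates++
    return left, right
}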

Results:

We ran the following workloads to confirm the effect of the change on p99.99 latency. In the following screenshots, the green cluster is running with accurate stats computation (i.e. kv.split.estimated_mvcc_stats.enabled = false), and the yellow cluster is running with estimated stats computation, which is the default.

  1. KV workload and default cluster settings.
    ./cockroach workload run kv --drop --insert-count=5000000 --max-block-bytes=1024 --min-block-bytes=1024 --concurrency=1024 --max-rate=10000

We see significant spikes in p99.99 latency on the accurate-stats cluster, and none on the estimated-stats cluster.

[screenshot: non-sequential workload]
  2. KV workload with sequential writes and load-based splitting/rebalancing turned off.
    ./cockroach workload run kv --drop --sequential --insert-count=5000000 --max-block-bytes=1024 --min-block-bytes=1024 --concurrency=1024 --max-rate=10000

Next, we tested a sequential-write load to see an even more pronounced effect on p99.99 latency. With the default cluster settings, the p99.99 latency is dominated by latch waits caused by load-based lease transfers (specifically LeaseTransfer requests holding latches).

[screenshot: nolh-error]

To get a more meaningful test, we turned off load-based splitting and rebalancing (kv.range_split.by_load_enabled, kv.allocator.load_based_rebalancing).

[screenshot: sequential-no-load]

The main trade-off of this change is the increased split duration due to re-computing MVCC stats at the start of a split.

[screenshot: split-processing]

This hasn’t been an issue so far, but if it becomes one, turning stats re-computation off (kv.split.mvcc_stats_recomputation.enabled) brings the split time back to normal. Stats re-computation itself is not cheap: on a cluster close to overload, multiple splits that each re-compute stats can add significant load.
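
For reference, kicking off the re-computation from AdminSplit looks roughly like the snippet quoted later in this review; the sketch below assumes the kvpb.RecomputeStatsRequest and kv.Batch plumbing and is not the PR's exact code.

// Hypothetical sketch: recompute the range's stats before the split's
// critical section, so the split starts from estimate-free stats.
var ba kv.Batch
ba.AddRawRequest(&kvpb.RecomputeStatsRequest{
    RequestHeader: kvpb.RequestHeader{Key: desc.StartKey.AsRawKey()},
})
if err := r.store.DB().Run(ctx, &ba); err != nil {
    return errors.Wrap(err, "failed to re-compute MVCCStats pre split")
}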

@cockroach-teamcity (Member)

This change is Reviewable

@miraradeva (Contributor, Author) commented Mar 7, 2024

I ran a sequential KV workload with this PR, comparing to master. In the steady state, we can definitely see the shorter latch wait durations, and slightly lower p99.99 latency. But there are some significant spikes that I need to understand.

[screenshot]

The spikes seem to somewhat correlate with admission delay.

[screenshot]

Also, there is a significant increase in the total split duration after a certain point. I wonder if we're seeing the effects of RecomputeStats not being rate-limited. Rate-limiting it would probably make the split duration even higher, but it might help with the overload?

[screenshot]

@miraradeva (Contributor, Author)

Disabling stats re-computation in replica_command seems to have helped a little in that there are no latch contention spikes anymore, but there are still spikes in p99.99 latency that are not there on master. The big one around 9:07 corresponded to some NotLeaseHolderErrors after some rebalancing, so it might be unrelated but at the same time it doesn't seem to be happening on master.

The split duration was similar and stable across the master and estimated stats clusters.

[screenshot]

@kvoli (Collaborator) left a comment

Took a first pass, looking good. I'll wait for @nvanbenschoten to take a look before doing another round.

The spikes seem to somewhat correlate with admission delay.

That would suggest the spikes are unrelated to the split latches.

I ran a sequential KV workload with this PR, comparing to master.

Would it be possible to run a uniform, write-only workload without pre-splitting? We should see splits occur at the same time across ranges, which would give a better indication of latency. I'm guessing the splits in the sequential workload were load-based, so we might not get as much data scanned under latches as with size-based splits.

Reviewed 3 of 5 files at r1, 11 of 11 files at r3, 11 of 11 files at r4, 7 of 7 files at r5, 6 of 6 files at r6, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @miraradeva, @nvanbenschoten, and @sumeerbhola)


-- commits line 61 at r5:
Consider mentioning the default diff thresholds for bytes/count in the commit message; also how they were selected.


-- commits line 83 at r6:
nit: add a release note as an ops change for these new metrics.


pkg/kv/kvserver/metrics.go line 3444 at r6 (raw file):

		ReplicaReadBatchWithoutInterleavingIter:  metric.NewCounter(metaReplicaReadBatchWithoutInterleavingIter),

		// Estimated MVCC stats in split

nit: period.


pkg/kv/kvserver/replica_command.go line 452 at r4 (raw file):

		ba.AddRawRequest(&req)
		if err = r.store.DB().Run(ctx, &ba); err != nil {
			log.Event(ctx, "failed to re-compute MVCCStats pre split")

nit: does it make sense to include the error message here?


pkg/kv/kvserver/batcheval/cmd_end_transaction.go line 61 at r5 (raw file):

	settings.SystemVisible,
	"kv.split.max_mvcc_stat_count_diff",
	"defines the max number of units that are acceptable for an individual MVCC stat to diverge; needs kv.split.estimated_mvcc_stats.enabled to be true",

nit: wrap to 80 chars.


pkg/kv/kvserver/batcheval/cmd_end_transaction.go line 72 at r5 (raw file):

	settings.SystemVisible,
	"kv.split.max_mvcc_stat_bytes_diff",
	"defines the max number of bytes that are acceptable for an individual MVCC stat to diverge; needs kv.split.estimated_mvcc_stats.enabled to be true",

Same as above.


pkg/kv/kvserver/batcheval/split_stats_helper.go line 136 at r4 (raw file):

	// be scanned.
	ScanRightFirst bool
	// LeftIsEmpty denotes that the LHS is empty. If this is the case, the entire

nit: the brevity of LHS vs RHS is good but inconsistent with the field comments above, consider spelling them out.


pkg/kv/kvserver/batcheval/split_stats_helper.go line 221 at r4 (raw file):

// to be 100% accurate, so they are marked with ContainsEstimates. However,
// if there are no writes concurrent with the split (i.e. the pre-computed
// PreSplitLeftUser  still correspond to the correct LHS user stats), then the

nit: double space.


pkg/kv/kvserver/batcheval/split_stats_helper.go line 260 at r5 (raw file):

	}

	// If the user pre-split stats differ significantly form the current stats

nit: from.


pkg/kv/kvserver/batcheval/split_stats_helper.go line 276 at r6 (raw file):

		h.in.PreSplitStats, h.in.MaxCountDiff, h.in.MaxBytesDiff) {
		log.VEventf(context.Background(), 2,
			"split falling back to accurate stats computation because of large difference in pre- and post-split MVCC stats; pre: %v, post: %v",

nit: wrap to 80 chars.


pkg/storage/enginepb/mvcc.go line 207 at r5 (raw file):

// HasUserDataCloseTo compares the fields corresponding to user data and returns
// whether their absolute difference is within a certain limit. Separate limits
// are passed in for stats measures in count, bytes, and nanos.

Should the comment be updated or method changed? The nano fields (age) aren't being compared here.


pkg/kv/kvserver/rditer/stats_test.go line 1 at r3 (raw file):

// Copyright 2023 The Cockroach Authors.

s/2023/2024/

@miraradeva (Contributor, Author) commented Mar 11, 2024

I re-ran the kv sequential workload on this PR with re-computation turned off, and with the count and byte thresholds disabled (just to make sure we don't fall back to accurate splits). It looks pretty good, and I confirmed any latency spikes (e.g. the green p99.99 bump just before 17:20) are due to latch conflicts with TransferLease.

In this graph, the first half is with the default cluster settings, and the second half is with load-based splitting turned off and the range size bumped up to 4GB to better see the individual splits and their impact on latency.

[screenshot]

Additionally, I re-ran kv sequential with re-computation turned back on; it ran for over an hour and I didn't see the degraded behavior from the first run above.

@miraradeva (Contributor, Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @kvoli, @nvanbenschoten, and @sumeerbhola)


-- commits line 61 at r5:

Previously, kvoli (Austen) wrote…

Consider mentioning the default diff thresholds for bytes/count in the commit message; also how they were selected.

Done.


pkg/kv/kvserver/replica_command.go line 452 at r4 (raw file):

Previously, kvoli (Austen) wrote…

nit: does it makes sense to include the error message here?

I returned the error instead to be more consistent with the rest of the error handling in this function.


pkg/kv/kvserver/batcheval/cmd_end_transaction.go line 72 at r5 (raw file):

Previously, kvoli (Austen) wrote…

Same as above.

Done.


pkg/storage/enginepb/mvcc.go line 207 at r5 (raw file):

Previously, kvoli (Austen) wrote…

Should the comment be updated or method changed? The nano fields (age) aren't being compared here.

Done.


pkg/kv/kvserver/rditer/stats_test.go line 1 at r3 (raw file):

Previously, kvoli (Austen) wrote…

s/2023/2024/

Done.

@nvanbenschoten (Member) left a comment

Reviewed 2 of 5 files at r1, 2 of 11 files at r3, 18 of 18 files at r7, 6 of 11 files at r8.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @kvoli, @miraradeva, and @sumeerbhola)


pkg/kv/kvserver/replica_command.go line 467 at r8 (raw file):

	// post-split LHS stats by combining these stats with the non-user stats
	// computed in splitTrigger. More details in makeEstimatedSplitStatsHelper.
	userOnlyLeftStats, err := rditer.ComputeStatsForRangeUserOnly(

Should this scan be subject to the batcheval.EnableEstimatedMVCCStatsInSplit cluster setting?

Also, should it all be behind a cluster version check?


pkg/kv/kvserver/batcheval/cmd_end_transaction.go line 65 at r8 (raw file):

	"if enabled, MVCC stats will be recomputed at the beginning of a split "+
		"to prevent stats estimates from drifting",
	util.ConstantWithMetamorphicTestBool("kv.split.mvcc_stats_recomputation.enabled", true))

I had thought we were going to default this to false, given your recent experiments.


pkg/kv/kvserver/batcheval/cmd_end_transaction.go line 1064 at r8 (raw file):

	startKey := split.LeftDesc.StartKey.AsRawKey()
	endKey := split.LeftDesc.EndKey.AsRawKey()
	if startKey.Compare(keys.LocalMax) <= 0 && endKey.Compare(keys.LocalMax) > 0 {

I think split.LeftDesc.KeySpan().AsRawSpanWithNoLocals() will give you what you want.
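
For illustration, assuming KeySpan and AsRawSpanWithNoLocals behave as in roachpb, the rewrite might look like:

// Hypothetical rewrite: AsRawSpanWithNoLocals clamps the span's start
// above keys.LocalMax, replacing the explicit comparisons above.
userSpan := split.LeftDesc.KeySpan().AsRawSpanWithNoLocals()
startKey, endKey := userSpan.Key, userSpan.EndKey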


pkg/kv/kvserver/batcheval/cmd_end_transaction.go line 1185 at r8 (raw file):

	// makeEstimatedSplitStatsHelper.
	if rec.ClusterSettings().Version.IsActive(ctx, clusterversion.V24_1_EstimatedMVCCStatsInSplit) &&
		EnableEstimatedMVCCStatsInSplit.Get(&rec.ClusterSettings().SV) {

I think we should be checking statsInput.PreSplitLeftUser (statsInput.PreSplitLeftUser == enginepb.MVCCStats{}) to decide whether to enter this branch, and then lift these cluster setting and version checks to AdminSplit. Doing the checks here:

  • prevents you from skipping the scan in AdminSplit when the cluster setting is disabled
  • could race with a cluster version switch. The PreSplitLeftUser might not be supplied from an old node, but the cluster version may change before this request is processed.
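
Lifting the checks into AdminSplit might look roughly like this (setting, version, and function names are from this discussion; the argument plumbing is hypothetical):

// Only pre-compute the user-only LHS stats when all replicas understand
// them and the setting is on; otherwise leave PreSplitLeftUser empty so
// splitTrigger falls back to accurate stats computation.
var preSplitLeftUser enginepb.MVCCStats
if st.Version.IsActive(ctx, clusterversion.V24_1_EstimatedMVCCStatsInSplit) &&
    batcheval.EnableEstimatedMVCCStatsInSplit.Get(&st.SV) {
    var err error
    preSplitLeftUser, err = rditer.ComputeStatsForRangeUserOnly(leftDesc, reader, now)
    if err != nil {
        return err
    }
}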

pkg/kv/kvserver/batcheval/split_stats_helper.go line 136 at r8 (raw file):

	// be scanned.
	ScanRightFirst bool
	// LeftIsEmpty denotes that the left hand side is empty. If this is the case,

For both of these fields, the flag only says that the user keyspace is empty, right? So they're not an indication that the stats are empty, just that an accurate stats computation will be cheap. Consider commenting that, and also renaming them to something like LeftUserIsEmpty and RightUserIsEmpty.


pkg/kv/kvserver/batcheval/split_stats_helper.go line 144 at r8 (raw file):

	// accurately. This is cheap because the range is empty.
	RightIsEmpty bool
	// PreSplitLeftUser contains the pre-split user-only stats of the left hand

nit: feel free to move around these fields to where they make sense. For example, move this up below the other MVCCStats fields, and move PostSplitScanLocalLeftFn up below the other splitStatsScanFn fields.


pkg/kv/kvserver/batcheval/split_stats_helper.go line 233 at r8 (raw file):

	}

	// Fast path: if both ranges are guaranteed to be empty, just scan them and

Would it be simpler to handle all of the h.in.LeftIsEmpty || h.in.RightIsEmpty cases in makeSplitStatsHelper and then only use makeEstimatedSplitStatsHelper if !h.in.LeftIsEmpty && !h.in.RightIsEmpty? Then this function will always leave the stats on both sides estimated.


pkg/roachpb/data.proto line 133 at r8 (raw file):

  reserved 3, 4;
  // Pre-computed stats for the LHS spans corresponding to user data.

nit: move above reserved.

@miraradeva (Contributor, Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @kvoli, @nvanbenschoten, and @sumeerbhola)


pkg/kv/kvserver/replica_command.go line 467 at r8 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Should this scan be subject to the batcheval.EnableEstimatedMVCCStatsInSplit cluster setting?

Also, should it all be behind a cluster version check?

Yes, it should check EnableEstimatedMVCCStatsInSplit.

But if we have the statsInput.PreSplitLeftUser == enginepb.MVCCStats{} check in cmd_end_transaction, we don't need the version gate anymore, right?

If the leaseholder is running an old version, it won't populate PreSplitLeftUser and we fall back to accurate stats. If the follower is running an old version, it only knows about accurate stats computation (and ignores PreSplitLeftUser). The only discrepancy we can get is if some replicas have the old version, and some have the new version; they would end up with different stats. But that should be ok? I can keep the version gate to prevent this but let me know if you think it's simpler to remove it.


pkg/kv/kvserver/batcheval/cmd_end_transaction.go line 65 at r8 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I had thought we were going to default this to false, given your recent experiments.

I re-ran the experiment with stats re-computation on (ran it for over 2 hours), and couldn't repro the performance degradation I saw in that first experiment. Maybe a longer-term run will help decide but I don't have any concrete evidence that re-computation is causing issues.


pkg/kv/kvserver/batcheval/cmd_end_transaction.go line 1185 at r8 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I think we should be checking statsInput.PreSplitLeftUser (statsInput.PreSplitLeftUser == enginepb.MVCCStats{}) to decide whether to enter this branch, and then lift these cluster setting and version checks to AdminSplit. Doing the checks here:

  • prevents you from skipping the scan in AdminSplit when the cluster setting is disabled
  • could race with a cluster version switch. The PreSplitLeftUser might not be supplied from an old node, but the cluster version may change before this request is processed.

Yes, that's a great point. I hesitated to compare PreSplitLeftUser to enginepb.MVCCStats{} because the cases of (1) the field being unset, and (2) the left user stats being empty, are indistinguishable. But it's ok to fall back to accurate stats computation in both cases: in the former we have no choice, and in the latter the LHS is empty so it's not expensive to scan.
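
A sketch of the resulting check (illustrative; helper and field names follow this discussion, not the PR's exact code):

// An unset PreSplitLeftUserStats is indistinguishable from a genuinely
// empty LHS, but both cases can fall back to accurate computation: the
// former has no choice, and the latter is cheap to scan.
if split.PreSplitLeftUserStats == (enginepb.MVCCStats{}) {
    return makeSplitStatsHelper(statsInput)
}
return makeEstimatedSplitStatsHelper(statsInput)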


pkg/kv/kvserver/batcheval/split_stats_helper.go line 233 at r8 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Would it be simpler to handle all of the h.in.LeftIsEmpty || h.in.RightIsEmpty cases in makeSplitStatsHelper and then only use makeEstimatedSplitStatsHelper if !h.in.LeftIsEmpty && !h.in.RightIsEmpty? Then this function will always leave the stats on both sides estimated.

Yes, so much simpler!

@nvanbenschoten (Member) left a comment

Reviewed 2 of 11 files at r8, 12 of 12 files at r12, 7 of 7 files at r13, 6 of 6 files at r14, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @kvoli, @miraradeva, and @sumeerbhola)


-- commits line 77 at r13:
Have we confirmed that the customer case which motivated this work will fall below these thresholds? We could look at the QPS on the range and the time the stats computation took to perform (the latch delay) and determine whether the thresholds would have been tripped.


pkg/kv/kvserver/replica_command.go line 467 at r8 (raw file):

Previously, miraradeva (Mira Radeva) wrote…

Yes, it should check EnableEstimatedMVCCStatsInSplit.

But if we have the statsInput.PreSplitLeftUser == enginepb.MVCCStats{} check in cmd_end_transaction, we don't need the version gate anymore, right?

If the leaseholder is running an old version, it won't populate PreSplitLeftUser and we fall back to accurate stats. If the follower is running an old version, it only knows about accurate stats computation (and ignores PreSplitLeftUser). The only discrepancy we can get is if some replicas have the old version, and some have the new version; they would end up with different stats. But that should be ok? I can keep the version gate to prevent this but let me know if you think it's simpler to remove it.

I think it's probably good to keep the version gate, if only to avoid double-computing the stats in mixed-version clusters where the PreSplitLeftUser is computed and then ignored.


pkg/kv/kvserver/replica_command.go line 500 at r13 (raw file):

	if err := r.store.DB().Txn(ctx, func(ctx context.Context, txn *kv.Txn) error {
		return splitTxnAttempt(ctx, r.store, txn, leftDesc, rightDesc, desc, reason, userOnlyLeftStats, r.GetMVCCStats())

Should we grab r.GetMVCCStats at the same time that we're grabbing userOnlyLeftStats? Do we want to grab a new value on each txn retry?


pkg/kv/kvserver/batcheval/cmd_end_transaction.go line 1159 at r12 (raw file):

	// to accurate stats computation because scanning the empty LHS is not
	// expensive.
	if split.PreSplitLeftUserStats.Equal(enginepb.MVCCStats{}) ||

nit: does split.PreSplitLeftUserStats == (enginepb.MVCCStats{}) not work? The Equal method is auto-generated and operates on an interface{}, so it will be less performant.


pkg/storage/enginepb/mvcc.go line 212 at r13 (raw file):

) bool {
	countWithinLimit := func(v1 int64, v2 int64) bool {
		return math.Abs(float64(v1)-float64(v2)) <= float64(maxCountDiff)

nit: Abs is easy enough to define for int64. Want to just pull out a helper right above countWithinLimit?

@miraradeva (Contributor, Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @kvoli, @nvanbenschoten, and @sumeerbhola)


-- commits line 77 at r13:

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Have we confirmed that the customer case which motivated this work will fall below these thresholds? We could look at the QPS on the range and the time the stats computation took to perform (the latch delay) and determine whether the thresholds would have been tripped.

I bumped these up to 5k count and 5.12MB respectively, which ensures no splits fall back to accurate stats for the kv sequential workload (which was doing ~15k QPS). While tuning them, I saw count diffs of ~2k and byte diffs of ~2MB for individual stats in the logs, so these thresholds handle that comfortably.

I looked through some old escalations where the split latency was an issue. There isn't a consistent latch wait duration (it varies from a few hundred ms to 1.5-2 seconds); the QPS in these escalations seems to be around 8k on average, though in some it ramps up to 10k, and one escalation mentions 30k for some other high-qps workloads.

Since we won't be holding latches for 2s anymore, it's probably too extreme to expect (30k QPS) * (2s) = 60k concurrent writes. If we want to handle 30k QPS, and extrapolate from the kv sequential workload, we'd need about 4-5k count diff and 4-5M bytes diff. The bumped-up values should handle just about that much but not more. I'm happy to increase these more, especially if we decide to keep the re-computation cluster setting on.


pkg/kv/kvserver/replica_command.go line 467 at r8 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I think it's probably good to keep the version gate, if only to avoid double-computing the stats in mixed-version clusters where the PreSplitLeftUser is computed and then ignored.

Good point, I'll keep it.


pkg/kv/kvserver/replica_command.go line 500 at r13 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Should we grab r.GetMVCCStats at the same time that we're grabbing userOnlyLeftStats? Do we want to grab a new value on each txn retry?

Yes, we should keep them consistent. In the next commit, I use the total pre-split stats (r.GetMVCCStats) to export a metric on the total bytes of estimates produced. But the actual estimates being produced come from the discrepancy between userOnlyLeftStats and the actual user left stats.

@kvoli (Collaborator) left a comment

:lgtm_strong:

Reviewed 19 of 19 files at r15.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @nvanbenschoten and @sumeerbhola)

@nvanbenschoten (Member) left a comment

:lgtm:

Reviewed 2 of 19 files at r15, 6 of 11 files at r16, 5 of 7 files at r17, 6 of 6 files at r18, all commit messages.
Reviewable status: :shipit: complete! 2 of 0 LGTMs obtained (waiting on @miraradeva and @sumeerbhola)


pkg/kv/kvserver/batcheval/cmd_end_transaction.go line 1212 at r18 (raw file):

				statsInput.PreSplitStats, statsInput.AbsPreSplitBothStored)
		}
		log.VEventf(ctx, 2, "falling back to accurate stats computation because %v", reason)

We should probably log.Info if preComputedStatsDiff. I imagine this will be a common suspect for tail latency in the near future, especially as we try to find the right values for the kv.split.max_mvcc_stat_XXX_diff settings.
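
Something like the following, as a sketch (log.Infof/log.VEventf as used elsewhere in kvserver):

// Promote the fallback message to INFO when it was triggered by a large
// pre-/post-split stats diff, the interesting case for tail latency;
// keep the verbose event otherwise.
if preComputedStatsDiff {
    log.Infof(ctx, "falling back to accurate stats computation because %v", reason)
} else {
    log.VEventf(ctx, 2, "falling back to accurate stats computation because %v", reason)
}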


pkg/storage/enginepb/mvcc.go line 212 at r13 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: Abs is easy enough to define for int64. Want to just pull out a helper right above countWithinLimit?

nit: I was thinking you could do it without math.Abs or the cast through a float64 (with the corresponding loss of precision). Just

abs := func(x int64) int64 {
    if x < 0 {
        return -x
    }
    return x
}

@miraradeva (Contributor, Author) left a comment

TYFTR!

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @nvanbenschoten and @sumeerbhola)


pkg/storage/enginepb/mvcc.go line 212 at r13 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: I was thinking you could do it without math.Abs or the cast through a float64 (with the corresponding loss of precision). Just

abs := func(x int64) int64 {
    if x < 0 {
        return -x
    }
    return x
}

Lol yes, easy enough.
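
For completeness, a sketch of how the final comparison might read with the int64 helper folded in; the exact set of fields compared is illustrative, not copied from the PR:

// HasUserDataCloseTo (sketch): compare user-data count and byte fields
// of two MVCCStats, tolerating a bounded absolute difference per stat.
func (ms MVCCStats) HasUserDataCloseTo(
    other MVCCStats, maxCountDiff, maxBytesDiff int64,
) bool {
    abs := func(x int64) int64 {
        if x < 0 {
            return -x
        }
        return x
    }
    countWithinLimit := func(a, b int64) bool { return abs(a-b) <= maxCountDiff }
    bytesWithinLimit := func(a, b int64) bool { return abs(a-b) <= maxBytesDiff }
    return countWithinLimit(ms.KeyCount, other.KeyCount) &&
        countWithinLimit(ms.ValCount, other.ValCount) &&
        bytesWithinLimit(ms.KeyBytes, other.KeyBytes) &&
        bytesWithinLimit(ms.ValBytes, other.ValBytes)
}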

…-user spans

ComputeStatsForRange iterates over an entire range and computes its MVCC
stats.

This patch adds ComputeStatsForRangeUserOnly and
ComputeStatsForRangeExcludingUser, which compute the MVCC stats
corresponding to the user-only and non-user spans in the range,
respectively. These functions are useful for computing the estimated
stats proposed in cockroachdb#119499: they make it possible to iterate
quickly over only the non-user key spans in the range, and separately
over the user-only key spans.

Part of: cockroachdb#119499
Release note: None
Previously, we computed MVCC stats for the LHS and RHS of a split in
splitTrigger, while holding latches. This resulted in high tail
latencies due to foreground work not being able to proceed as long as
the split holds these latches. More details in cockroachdb#119499.

This change moves some of the stats computation to AdminSplit, where no
latches are held. In particular, in AdminSplit, we pre-compute the part
of the LHS stats corresponding to user-data spans, and pass that to the
splitTrigger. In the splitTrigger, we iterate only over the non-user
spans in the range (which are small), and combine the resulting stats
with the passed-in user stats to arrive at the new LHS stats. The RHS
stats are derived by subtracting the LHS stats from the total on-disk
stats. This computation does not produce 100% correct stats because any
writes concurrent with the split may not be attributed correctly to the
LHS or RHS. Therefore, both LHS and RHS stats are marked with
ContainsEstimates.

To prevent the stats from drifting after successive splits, we kick off
a stats re-computation in AdminSplit (only if the on-disk stats contain
estimates). Thus, each split may introduce some stats estimates, but
before doing so it will correct any previous estimates.

There are two invariants that the stats computation guarantees:
1. Non-user stats (SysBytes, SysCount, etc.) are always correct because
we scan those spans while holding latches.
2. If there are no writes concurrent with the split, the stats are
always correct.

Fixes: cockroachdb#119499

Release note (performance improvement): Splits no longer
hold latches for time proportional to the range size while computing
MVCC stats. Instead, MVCC stats are pre-computed before the critical
section of the split. As a side effect, the resulting stats are no
longer 100% accurate because they may not correctly attribute writes
concurrent with the split. To mitigate this, and to prevent the stats
from drifting after successive splits, the existing stored stats are
re-computed and corrected (if needed) as part of the non-critical
section of the split.
…ccurate-stats split

This patch is a followup to cockroachdb#119499; it adds thresholds on
the number of keys and bytes by which the pre-split MVCC stats
(retrieved in AdminSplit) may differ from the stats while the split
holds latches (retrieved in splitTrigger). This difference corresponds to
writes concurrent with the split. If the difference is too large, the
split falls back to computing LHS and RHS stats accurately. The
difference is computed only for stats corresponding to user-data;
system stats are always kept accurate.

These thresholds are tunable by two new cluster settings,
MaxMVCCStatCountDiff and MaxMVCCStatBytesDiff, which denote the maximum
number of entities (e.g. keys, intents) and maximum number of bytes
(e.g. value bytes, range value bytes), respectively, that are
acceptable as the difference between the pre- and post-split values of
an individual stat. The former defaults to 5000, and the latter to
5.12MB (1% of the maximum range size).

Fixes: cockroachdb#119503

Release note: None
This patch adds logging and two new metrics to track estimated MVCC
stats computed during splits.

`kv.split.estimated_stats`: the number of splits that computed estimated
MVCC stats, as opposed to 100% accurate ones.

`kv.split.total_bytes_estimates`: the total number of bytes of estimates
introduced by splits. These are calculated as the difference between
the pre-computed stats before the split and the stored stats during the
split (while holding latches).

Fixes: cockroachdb#119516

Release note (ops change): Two new metrics (kv.split.estimated_stats and
kv.split.total_bytes_estimates) added to track the number of splits
that produce MVCC stats estimates, and the total bytes of estimates
produced.
@miraradeva (Contributor, Author)

bors r=nvanbenschoten,kvoli

@craig (craig bot) commented Mar 21, 2024
