
kvserver: compute MVCCStats in AdminSplit instead of splitTrigger #119894

Merged: 4 commits, Mar 21, 2024

Conversation

@miraradeva (Contributor) commented Mar 4, 2024

This PR includes four commits:

  • c3edbe5 adds two new stats computation functions: one computes the stats of only the user-data keys in the range, and the other computes the stats of all non-user-data keys. These functions are used in the next commit to estimate the stats of the LHS of the split.
  • 23bf51a includes the main logic for moving some of the stats computation from splitTrigger (where we hold latches) to AdminSplit (where we don't), reducing the duration of the split's critical section; see the sketch after this list. The resulting MVCC stats are not always guaranteed to be 100% correct because they may not account for writes concurrent with the split.
  • 37598e1 is a followup to 23bf51a that adds some extra protection against creating stats that differ too much from the range's data. It does so by ensuring the pre- and post-split stats for the entire range are close.
  • b88c331 adds logging and observability around the new stats computation.
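
To make the combination concrete, here is a minimal sketch of the stats arithmetic described above. This is not the PR's code: the function name is hypothetical, and it only assumes enginepb.MVCCStats with its Add/Subtract helpers and the ContainsEstimates field.

// estimatedSplitStats is a hypothetical illustration: total is the
// range's on-disk stats, preSplitLeftUser is the user-only LHS stats
// computed in AdminSplit (no latches held), and nonUserLeft is the
// non-user LHS stats scanned in splitTrigger (under latches).
func estimatedSplitStats(
    total, preSplitLeftUser, nonUserLeft enginepb.MVCCStats,
) (left, right enginepb.MVCCStats) {
    left = preSplitLeftUser
    left.Add(nonUserLeft) // non-user spans are small, so this scan is cheap
    right = total
    right.Subtract(left) // RHS = total on-disk stats minus LHS
    // Writes concurrent with the split may be attributed to the wrong
    // side, so both sides are flagged as containing estimates.
    left.ContainsEstimates++
    right.ContainsEstimates++
    return left, right
}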

Results:

We ran the following workloads to confirm the effect of the change on p99.99 latency. In the following screenshots, the green cluster is running with accurate stats computation (i.e. kv.split.estimated_mvcc_stats.enabled = false), and the yellow cluster is running with estimated stats computation, which is the default.

  1. KV workload and default cluster settings.
    ./cockroach workload run kv --drop --insert-count=5000000 --max-block-bytes=1024 --min-block-bytes=1024 --concurrency=1024 --max-rate=10000

We see significant spikes in p99.99 latency on the accurate-stats cluster, and none on the estimated-stats cluster.

[screenshot: non-sequential workload]
  2. KV workload with sequential writes and load-based splitting/rebalancing turned off.
    ./cockroach workload run kv --drop --sequential --insert-count=5000000 --max-block-bytes=1024 --min-block-bytes=1024 --concurrency=1024 --max-rate=10000

Next, we tested a sequential-write load to see an even more pronounced effect on p99.99 latency. With the default cluster settings, the p99.99 latency is dominated by latch waits caused by load-based lease transfers (specifically LeaseTransfer requests holding latches).

[screenshot: nolh-error]

To get a more meaningful test, we turned off load-based splitting and rebalancing (kv.range_split.by_load_enabled, kv.allocator.load_based_rebalancing).

[screenshot: sequential-no-load]

The main trade-off of this change is the increased split duration due to re-computing MVCC stats at the start of a split.

[screenshot: split-processing]

This hasn’t been an issue so far, but if it becomes one, turning stats re-computation off (kv.split.mvcc_stats_recomputation.enabled) brings the split time back to normal. Stats re-computation itself is not cheap: on a cluster close to overload, multiple splits that each re-compute stats can add significant load.
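
For reference, kicking off the re-computation from AdminSplit looks roughly like the snippet quoted later in this review; the sketch below assumes the kvpb.RecomputeStatsRequest and kv.Batch plumbing and is not the PR's exact code.

// Hypothetical sketch: recompute the range's stats before the split's
// critical section, so the split starts from estimate-free stats.
var ba kv.Batch
ba.AddRawRequest(&kvpb.RecomputeStatsRequest{
    RequestHeader: kvpb.RequestHeader{Key: desc.StartKey.AsRawKey()},
})
if err := r.store.DB().Run(ctx, &ba); err != nil {
    return errors.Wrap(err, "failed to re-compute MVCCStats pre split")
}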

@cockroach-teamcity (Member)

This change is Reviewable

@miraradeva (Contributor, Author) commented Mar 7, 2024

I ran a sequential KV workload with this PR, comparing to master. In the steady state, we can definitely see the shorter latch wait durations, and slightly lower p99.99 latency. But there are some significant spikes that I need to understand.

[screenshot]

The spikes seem to somewhat correlate with admission delay.

[screenshot]

Also, there is a significant increase in the total split duration after a certain point. I wonder if we're seeing the effects of RecomputeStats not being rate-limited. Rate-limiting it would probably make the split duration even higher, but it might help with the overload?

[screenshot]

@miraradeva (Contributor, Author)

Disabling stats re-computation in replica_command seems to have helped a little in that there are no latch contention spikes anymore, but there are still spikes in p99.99 latency that are not there on master. The big one around 9:07 corresponded to some NotLeaseHolderErrors after some rebalancing, so it might be unrelated but at the same time it doesn't seem to be happening on master.

The split duration was similar and stable across the master and estimated stats clusters.

[screenshot]

@kvoli (Collaborator) left a comment

Took a first pass, looking good. I'll wait for @nvanbenschoten to take a look before doing another round.

The spikes seem to somewhat correlate with admission delay.

That would suggest the spikes are unrelated to the split latches.

I ran a sequential KV workload with this PR, comparing to master.

Would it be possible to run a uniform, write-only workload without pre-splitting? We should see splits occur at the same time across ranges, which would give a better indication of latency. I'm guessing the splits in the sequential workload were load-based, so we might not get as much data scanned under latches as with size-based splits.

Reviewed 3 of 5 files at r1, 11 of 11 files at r3, 11 of 11 files at r4, 7 of 7 files at r5, 6 of 6 files at r6, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @miraradeva, @nvanbenschoten, and @sumeerbhola)


-- commits line 61 at r5:
Consider mentioning the default diff thresholds for bytes/count in the commit message; also how they were selected.


-- commits line 83 at r6:
nit: add a release note as an ops change for these new metrics.


pkg/kv/kvserver/metrics.go line 3444 at r6 (raw file):

		ReplicaReadBatchWithoutInterleavingIter:  metric.NewCounter(metaReplicaReadBatchWithoutInterleavingIter),

		// Estimated MVCC stats in split

nit: period.


pkg/kv/kvserver/replica_command.go line 452 at r4 (raw file):

		ba.AddRawRequest(&req)
		if err = r.store.DB().Run(ctx, &ba); err != nil {
			log.Event(ctx, "failed to re-compute MVCCStats pre split")

nit: does it make sense to include the error message here?


pkg/kv/kvserver/batcheval/cmd_end_transaction.go line 61 at r5 (raw file):

	settings.SystemVisible,
	"kv.split.max_mvcc_stat_count_diff",
	"defines the max number of units that are acceptable for an individual MVCC stat to diverge; needs kv.split.estimated_mvcc_stats.enabled to be true",

nit: wrap to 80 chars.


pkg/kv/kvserver/batcheval/cmd_end_transaction.go line 72 at r5 (raw file):

	settings.SystemVisible,
	"kv.split.max_mvcc_stat_bytes_diff",
	"defines the max number of bytes that are acceptable for an individual MVCC stat to diverge; needs kv.split.estimated_mvcc_stats.enabled to be true",

Same as above.


pkg/kv/kvserver/batcheval/split_stats_helper.go line 136 at r4 (raw file):

	// be scanned.
	ScanRightFirst bool
	// LeftIsEmpty denotes that the LHS is empty. If this is the case, the entire

nit: the brevity of LHS vs RHS is good but inconsistent with the field comments above, consider spelling them out.


pkg/kv/kvserver/batcheval/split_stats_helper.go line 221 at r4 (raw file):

// to be 100% accurate, so they are marked with ContainsEstimates. However,
// if there are no writes concurrent with the split (i.e. the pre-computed
// PreSplitLeftUser  still correspond to the correct LHS user stats), then the

nit: double space.


pkg/kv/kvserver/batcheval/split_stats_helper.go line 260 at r5 (raw file):

	}

	// If the user pre-split stats differ significantly form the current stats

nit: from.


pkg/kv/kvserver/batcheval/split_stats_helper.go line 276 at r6 (raw file):

		h.in.PreSplitStats, h.in.MaxCountDiff, h.in.MaxBytesDiff) {
		log.VEventf(context.Background(), 2,
			"split falling back to accurate stats computation because of large difference in pre- and post-split MVCC stats; pre: %v, post: %v",

nit: wrap to 80 chars.


pkg/storage/enginepb/mvcc.go line 207 at r5 (raw file):

// HasUserDataCloseTo compares the fields corresponding to user data and returns
// whether their absolute difference is within a certain limit. Separate limits
// are passed in for stats measures in count, bytes, and nanos.

Should the comment be updated or method changed? The nano fields (age) aren't being compared here.


pkg/kv/kvserver/rditer/stats_test.go line 1 at r3 (raw file):

// Copyright 2023 The Cockroach Authors.

s/2023/2024/

@miraradeva (Contributor, Author) commented Mar 11, 2024

I re-ran the kv sequential workload on this PR with re-computation turned off, and with the count and byte thresholds disabled (just to make sure we don't fall back to accurate splits). It looks pretty good, and I confirmed any latency spikes (e.g. the green p99.99 bump just before 17:20) are due to latch conflicts with TransferLease.

In this graph, the first half is with the default cluster settings, and the second half is with load-based splitting turned off and the range size bumped up to 4GB to better see the individual splits and their impact on latency.

[screenshot]

Additionally, I re-ran kv sequential with re-computation turned back on; it ran for over an hour and I didn't see the degraded behavior from the first run above.

@miraradeva (Contributor, Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @kvoli, @nvanbenschoten, and @sumeerbhola)


-- commits line 61 at r5:

Previously, kvoli (Austen) wrote…

Consider mentioning the default diff thresholds for bytes/count in the commit message; also how they were selected.

Done.


pkg/kv/kvserver/replica_command.go line 452 at r4 (raw file):

Previously, kvoli (Austen) wrote…

nit: does it makes sense to include the error message here?

I returned the error instead to be more consistent with the rest of the error handling in this function.


pkg/kv/kvserver/batcheval/cmd_end_transaction.go line 72 at r5 (raw file):

Previously, kvoli (Austen) wrote…

Same as above.

Done.


pkg/storage/enginepb/mvcc.go line 207 at r5 (raw file):

Previously, kvoli (Austen) wrote…

Should the comment be updated or method changed? The nano fields (age) aren't being compared here.

Done.


pkg/kv/kvserver/rditer/stats_test.go line 1 at r3 (raw file):

Previously, kvoli (Austen) wrote…

s/2023/2024/

Done.

@nvanbenschoten (Member) left a comment

Reviewed 2 of 5 files at r1, 2 of 11 files at r3, 18 of 18 files at r7, 6 of 11 files at r8.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @kvoli, @miraradeva, and @sumeerbhola)


pkg/kv/kvserver/replica_command.go line 467 at r8 (raw file):

	// post-split LHS stats by combining these stats with the non-user stats
	// computed in splitTrigger. More details in makeEstimatedSplitStatsHelper.
	userOnlyLeftStats, err := rditer.ComputeStatsForRangeUserOnly(

Should this scan be subject to the batcheval.EnableEstimatedMVCCStatsInSplit cluster setting?

Also, should it all be behind a cluster version check?


pkg/kv/kvserver/batcheval/cmd_end_transaction.go line 65 at r8 (raw file):

	"if enabled, MVCC stats will be recomputed at the beginning of a split "+
		"to prevent stats estimates from drifting",
	util.ConstantWithMetamorphicTestBool("kv.split.mvcc_stats_recomputation.enabled", true))

I had thought we were going to default this to false, given your recent experiments.


pkg/kv/kvserver/batcheval/cmd_end_transaction.go line 1064 at r8 (raw file):

	startKey := split.LeftDesc.StartKey.AsRawKey()
	endKey := split.LeftDesc.EndKey.AsRawKey()
	if startKey.Compare(keys.LocalMax) <= 0 && endKey.Compare(keys.LocalMax) > 0 {

I think split.LeftDesc.KeySpan().AsRawSpanWithNoLocals() will give you what you want.
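
For illustration, assuming KeySpan and AsRawSpanWithNoLocals behave as in roachpb, the rewrite might look like:

// Hypothetical rewrite: AsRawSpanWithNoLocals clamps the span's start
// above keys.LocalMax, replacing the explicit comparisons above.
userSpan := split.LeftDesc.KeySpan().AsRawSpanWithNoLocals()
startKey, endKey := userSpan.Key, userSpan.EndKey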


pkg/kv/kvserver/batcheval/cmd_end_transaction.go line 1185 at r8 (raw file):

	// makeEstimatedSplitStatsHelper.
	if rec.ClusterSettings().Version.IsActive(ctx, clusterversion.V24_1_EstimatedMVCCStatsInSplit) &&
		EnableEstimatedMVCCStatsInSplit.Get(&rec.ClusterSettings().SV) {

I think we should be checking statsInput.PreSplitLeftUser (statsInput.PreSplitLeftUser == enginepb.MVCCStats{}) to decide whether to enter this branch, and then lift these cluster setting and version checks to AdminSplit. Doing the checks here:

  • prevents you from skipping the scan in AdminSplit when the cluster setting is disabled
  • could race with a cluster version switch. The PreSplitLeftUser might not be supplied from an old node, but the cluster version may change before this request is processed.
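
Lifting the checks into AdminSplit might look roughly like this (setting, version, and function names are from this discussion; the argument plumbing is hypothetical):

// Only pre-compute the user-only LHS stats when all replicas understand
// them and the setting is on; otherwise leave PreSplitLeftUser empty so
// splitTrigger falls back to accurate stats computation.
var preSplitLeftUser enginepb.MVCCStats
if st.Version.IsActive(ctx, clusterversion.V24_1_EstimatedMVCCStatsInSplit) &&
    batcheval.EnableEstimatedMVCCStatsInSplit.Get(&st.SV) {
    var err error
    preSplitLeftUser, err = rditer.ComputeStatsForRangeUserOnly(leftDesc, reader, now)
    if err != nil {
        return err
    }
}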

pkg/kv/kvserver/batcheval/split_stats_helper.go line 136 at r8 (raw file):

	// be scanned.
	ScanRightFirst bool
	// LeftIsEmpty denotes that the left hand side is empty. If this is the case,

For both of these fields, the flag only says that the user keyspace is empty, right? So they're not an indication that the stats are empty, just that an accurate stats computation will be cheap. Consider commenting that, and also renaming them to something like LeftUserIsEmpty and RightUserIsEmpty.


pkg/kv/kvserver/batcheval/split_stats_helper.go line 144 at r8 (raw file):

	// accurately. This is cheap because the range is empty.
	RightIsEmpty bool
	// PreSplitLeftUser contains the pre-split user-only stats of the left hand

nit: feel free to move around these fields to where they make sense. For example, move this up below the other MVCCStats fields, and move PostSplitScanLocalLeftFn up below the other splitStatsScanFn fields.


pkg/kv/kvserver/batcheval/split_stats_helper.go line 233 at r8 (raw file):

	}

	// Fast path: if both ranges are guaranteed to be empty, just scan them and

Would it be simpler to handle all of the h.in.LeftIsEmpty || h.in.RightIsEmpty cases in makeSplitStatsHelper and then only use makeEstimatedSplitStatsHelper if !h.in.LeftIsEmpty && !h.in.RightIsEmpty? Then this function will always leave the stats on both sides estimated.


pkg/roachpb/data.proto line 133 at r8 (raw file):

  reserved 3, 4;
  // Pre-computed stats for the LHS spans corresponding to user data.

nit: move above reserved.

@miraradeva (Contributor, Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @kvoli, @nvanbenschoten, and @sumeerbhola)


pkg/kv/kvserver/replica_command.go line 467 at r8 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Should this scan be subject to the batcheval.EnableEstimatedMVCCStatsInSplit cluster setting?

Also, should it all be behind a cluster version check?

Yes, it should check EnableEstimatedMVCCStatsInSplit.

But if we have the statsInput.PreSplitLeftUser == enginepb.MVCCStats{} check in cmd_end_transaction, we don't need the version gate anymore, right?

If the leaseholder is running an old version, it won't populate PreSplitLeftUser and we fall back to accurate stats. If the follower is running an old version, it only knows about accurate stats computation (and ignores PreSplitLeftUser). The only discrepancy we can get is if some replicas have the old version, and some have the new version; they would end up with different stats. But that should be ok? I can keep the version gate to prevent this but let me know if you think it's simpler to remove it.


pkg/kv/kvserver/batcheval/cmd_end_transaction.go line 65 at r8 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I had thought we were going to default this to false, given your recent experiments.

I re-ran the experiment with stats re-computation on (ran it for over 2 hours), and couldn't repro the performance degradation I saw in that first experiment. Maybe a longer-term run will help decide but I don't have any concrete evidence that re-computation is causing issues.


pkg/kv/kvserver/batcheval/cmd_end_transaction.go line 1185 at r8 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I think we should be checking statsInput.PreSplitLeftUser (statsInput.PreSplitLeftUser == enginepb.MVCCStats{}) to decide whether to enter this branch, and then lift these cluster setting and version checks to AdminSplit. Doing the checks here:

  • prevents you from skipping the scan in AdminSplit when the cluster setting is disabled
  • could race with a cluster version switch. The PreSplitLeftUser might not be supplied from an old node, but the cluster version may change before this request is processed.

Yes, that's a great point. I hesitated to compare PreSplitLeftUser to enginepb.MVCCStats{} because the cases of (1) the field being unset, and (2) the left user stats being empty, are indistinguishable. But it's ok to fall back to accurate stats computation in both cases: in the former we have no choice, and in the latter the LHS is empty so it's not expensive to scan.
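
A sketch of the resulting check (illustrative; helper and field names follow this discussion, not the PR's exact code):

// An unset PreSplitLeftUserStats is indistinguishable from a genuinely
// empty LHS, but both cases can fall back to accurate computation: the
// former has no choice, and the latter is cheap to scan.
if split.PreSplitLeftUserStats == (enginepb.MVCCStats{}) {
    return makeSplitStatsHelper(statsInput)
}
return makeEstimatedSplitStatsHelper(statsInput)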


pkg/kv/kvserver/batcheval/split_stats_helper.go line 233 at r8 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Would it be simpler to handle all of the h.in.LeftIsEmpty || h.in.RightIsEmpty cases in makeSplitStatsHelper and then only use makeEstimatedSplitStatsHelper if !h.in.LeftIsEmpty && !h.in.RightIsEmpty? Then this function will always leave the stats on both sides estimated.

Yes, so much simpler!

@nvanbenschoten (Member) left a comment

Reviewed 2 of 11 files at r8, 12 of 12 files at r12, 7 of 7 files at r13, 6 of 6 files at r14, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @kvoli, @miraradeva, and @sumeerbhola)


-- commits line 77 at r13:
Have we confirmed that the customer case which motivated this work will fall below these thresholds? We could look at the QPS on the range and the time the stats computation took to perform (the latch delay) and determine whether the thresholds would have been tripped.


pkg/kv/kvserver/replica_command.go line 467 at r8 (raw file):

Previously, miraradeva (Mira Radeva) wrote…

Yes, it should check EnableEstimatedMVCCStatsInSplit.

But if we have the statsInput.PreSplitLeftUser == enginepb.MVCCStats{} check in cmd_end_transaction, we don't need the version gate anymore, right?

If the leaseholder is running an old version, it won't populate PreSplitLeftUser and we fall back to accurate stats. If the follower is running an old version, it only knows about accurate stats computation (and ignores PreSplitLeftUser). The only discrepancy we can get is if some replicas have the old version, and some have the new version; they would end up with different stats. But that should be ok? I can keep the version gate to prevent this but let me know if you think it's simpler to remove it.

I think it's probably good to keep the version gate, if only to avoid double-computing the stats in mixed-version clusters where the PreSplitLeftUser is computed and then ignored.


pkg/kv/kvserver/replica_command.go line 500 at r13 (raw file):

	if err := r.store.DB().Txn(ctx, func(ctx context.Context, txn *kv.Txn) error {
		return splitTxnAttempt(ctx, r.store, txn, leftDesc, rightDesc, desc, reason, userOnlyLeftStats, r.GetMVCCStats())

Should we grab r.GetMVCCStats at the same time that we're grabbing userOnlyLeftStats? Do we want to grab a new value on each txn retry?


pkg/kv/kvserver/batcheval/cmd_end_transaction.go line 1159 at r12 (raw file):

	// to accurate stats computation because scanning the empty LHS is not
	// expensive.
	if split.PreSplitLeftUserStats.Equal(enginepb.MVCCStats{}) ||

nit: does split.PreSplitLeftUserStats == (enginepb.MVCCStats{}) not work? The Equal method is auto-generated and operates on an interface{}, so it will be less performant.


pkg/storage/enginepb/mvcc.go line 212 at r13 (raw file):

) bool {
	countWithinLimit := func(v1 int64, v2 int64) bool {
		return math.Abs(float64(v1)-float64(v2)) <= float64(maxCountDiff)

nit: Abs is easy enough to define for int64. Want to just pull out a helper right above countWithinLimit?

@miraradeva (Contributor, Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @kvoli, @nvanbenschoten, and @sumeerbhola)


-- commits line 77 at r13:

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Have we confirmed that the customer case which motivated this work will fall below these thresholds? We could look at the QPS on the range and the time the stats computation took to perform (the latch delay) and determine whether the thresholds would have been tripped.

I bumped these up to 5k count and 5.12MB respectively, which ensures no splits fall back to accurate stats for the kv sequential workload (which was doing ~15k QPS). While tuning them, I saw count diffs of ~2k and byte diffs of ~2MB for individual stats in the logs, so these thresholds handle that comfortably.

I looked through some old escalations where the split latency was an issue. There isn't a consistent latch wait duration (it varies from a few hundred ms to 1.5-2 seconds); the QPS in these escalations seems to be around 8k on average, though in some it ramps up to 10k, and one escalation mentions 30k for some other high-qps workloads.

Since we won't be holding latches for 2s anymore, it's probably too extreme to expect (30k QPS) * (2s) = 60k concurrent writes. If we want to handle 30k QPS, and extrapolate from the kv sequential workload, we'd need about 4-5k count diff and 4-5M bytes diff. The bumped-up values should handle just about that much but not more. I'm happy to increase these more, especially if we decide to keep the re-computation cluster setting on.


pkg/kv/kvserver/replica_command.go line 467 at r8 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I think it's probably good to keep the version gate, if only to avoid double-computing the stats in mixed-version clusters where the PreSplitLeftUser is computed and then ignored.

Good point, I'll keep it.


pkg/kv/kvserver/replica_command.go line 500 at r13 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Should we grab r.GetMVCCStats at the same time that we're grabbing userOnlyLeftStats? Do we want to grab a new value on each txn retry?

Yes, we should keep them consistent. In the next commit, I use the total pre-split stats (r.GetMVCCStats) to export a metric on the total bytes of estimates produced. But the actual estimates being produced come from the discrepancy between userOnlyLeftStats and the actual user left stats.

@kvoli (Collaborator) left a comment

:lgtm_strong:

Reviewed 19 of 19 files at r15.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @nvanbenschoten and @sumeerbhola)

@nvanbenschoten (Member) left a comment

:lgtm:

Reviewed 2 of 19 files at r15, 6 of 11 files at r16, 5 of 7 files at r17, 6 of 6 files at r18, all commit messages.
Reviewable status: :shipit: complete! 2 of 0 LGTMs obtained (waiting on @miraradeva and @sumeerbhola)


pkg/kv/kvserver/batcheval/cmd_end_transaction.go line 1212 at r18 (raw file):

				statsInput.PreSplitStats, statsInput.AbsPreSplitBothStored)
		}
		log.VEventf(ctx, 2, "falling back to accurate stats computation because %v", reason)

We should probably log.Info if preComputedStatsDiff. I imagine this will be a common suspect for tail latency in the near future, especially as we try to find the right values for the kv.split.max_mvcc_stat_XXX_diff settings.
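
Something like the following, as a sketch (log.Infof/log.VEventf as used elsewhere in kvserver):

// Promote the fallback message to INFO when it was triggered by a large
// pre-/post-split stats diff, the interesting case for tail latency;
// keep the verbose event otherwise.
if preComputedStatsDiff {
    log.Infof(ctx, "falling back to accurate stats computation because %v", reason)
} else {
    log.VEventf(ctx, 2, "falling back to accurate stats computation because %v", reason)
}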


pkg/storage/enginepb/mvcc.go line 212 at r13 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: Abs is easy enough to define for int64. Want to just pull out a helper right above countWithinLimit?

nit: I was thinking you could do it without math.Abs or the cast through a float64 (with the corresponding loss of precision). Just

abs := func(x int64) int64 {
    if x < 0 {
        return -x
    }
    return x
}

@miraradeva (Contributor, Author) left a comment

TYFTR!

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @nvanbenschoten and @sumeerbhola)


pkg/storage/enginepb/mvcc.go line 212 at r13 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: I was thinking you could do it without math.Abs or the cast through a float64 (with the corresponding loss of precision). Just

abs := func(x int64) int64 {
    if x < 0 {
        return -x
    }
    return x
}

Lol yes, easy enough.
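
For completeness, a sketch of how the final comparison might read with the int64 helper folded in; the exact set of fields compared is illustrative, not copied from the PR:

// HasUserDataCloseTo (sketch): compare user-data count and byte fields
// of two MVCCStats, tolerating a bounded absolute difference per stat.
func (ms MVCCStats) HasUserDataCloseTo(
    other MVCCStats, maxCountDiff, maxBytesDiff int64,
) bool {
    abs := func(x int64) int64 {
        if x < 0 {
            return -x
        }
        return x
    }
    countWithinLimit := func(a, b int64) bool { return abs(a-b) <= maxCountDiff }
    bytesWithinLimit := func(a, b int64) bool { return abs(a-b) <= maxBytesDiff }
    return countWithinLimit(ms.KeyCount, other.KeyCount) &&
        countWithinLimit(ms.ValCount, other.ValCount) &&
        bytesWithinLimit(ms.KeyBytes, other.KeyBytes) &&
        bytesWithinLimit(ms.ValBytes, other.ValBytes)
}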

…-user spans

ComputeStatsForRange iterates over an entire range and computes its MVCC
stats.

This patch adds ComputeStatsForRangeUserOnly and
ComputeStatsForRangeExcludingUser, which compute the MVCC stats
corresponding to the user-only and non-user spans in the range,
respectively. These functions are useful for computing the estimated
stats proposed in cockroachdb#119499: they make it possible to iterate
quickly over only the non-user key spans in the range, and separately
over the user-only key spans.

Part of: cockroachdb#119499
Release note: None
Previously, we computed MVCC stats for the LHS and RHS of a split in
splitTrigger, while holding latches. This resulted in high tail
latencies due to foreground work not being able to proceed as long as
the split holds these latches. More details in cockroachdb#119499.

This change moves some of the stats computation to AdminSplit, where no
latches are held. In particular, in AdminSplit, we pre-compute the part
of the LHS stats corresponding to user-data spans, and pass that to the
splitTrigger. In the splitTrigger, we iterate only over the non-user
spans in the range (which are small), and combine the resulting stats
with the passed-in user stats to arrive at the new LHS stats. The RHS
stats are derived by subtracting the LHS stats from the total on-disk
stats. This computation does not produce 100% correct stats because any
writes concurrent with the split may not be attributed correctly to the
LHS or RHS. Therefore, both LHS and RHS stats are marked with
ContainsEstimates.

To prevent the stats from drifting after successive splits, we kick off
a stats re-computation in AdminSplit (only if the on-disk stats contain
estimates). Thus, each split may introduce some stats estimates, but
before doing so it will correct any previous estimates.

There are two invariants that the stats computation guarantees:
1. Non-user stats (SysBytes, SysCount, etc.) are always correct because
we scan those spans while holding latches.
2. If there are no writes concurrent with the split, the stats are
always correct.

Fixes: cockroachdb#119499

Release note (performance improvement): Splits no longer
hold latches for time proportional to the range size while computing
MVCC stats. Instead, MVCC stats are pre-computed before the critical
section of the split. As a side effect, the resulting stats are no
longer 100% accurate because they may not correctly attribute writes
concurrent with the split. To mitigate this, and to prevent the stats
from drifting after successive splits, the existing stored stats are
re-computed and corrected (if needed) as part of the non-critical
section of the split.
…ccurate-stats split

This patch is a followup to cockroachdb#119499; it adds thresholds on
the number of keys and bytes by which the pre-split MVCC stats
(retrieved in AdminSplit) may differ from the stats while the split
holds latches (retrieved in splitTrigger). This difference corresponds to
writes concurrent with the split. If the difference is too large, the
split falls back to computing LHS and RHS stats accurately. The
difference is computed only for stats corresponding to user-data;
system stats are always kept accurate.

These thresholds are tunable by two new cluster settings,
MaxMVCCStatCountDiff and MaxMVCCStatBytesDiff, which denote the maximum
number of entities (e.g. keys, intents) and maximum number of bytes
(e.g. value bytes, range value bytes), respectively, that are
acceptable as the difference between the pre- and post-split values of
an individual stat. The former defaults to 5000, and the latter to
5.12MB (1% of the maximum range size).

Fixes: cockroachdb#119503

Release note: None
This patch adds logging and two new metrics to track estimated MVCC
stats computed during splits.

`kv.split.estimated_stats`: the number of splits that computed estimated
MVCC stats, as opposed to 100% accurate ones.

`kv.split.total_bytes_estimates`: the total number of bytes of estimates
introduced by splits. These are calculated as the difference between
the pre-computed stats before the split and the stored stats during the
split (while holding latches).

Fixes: cockroachdb#119516

Release note (ops change): Two new metrics (kv.split.estimated_stats and
kv.split.total_bytes_estimates) added to track the number of splits
that produce MVCC stats estimates, and the total bytes of estimates
produced.
@miraradeva (Contributor, Author)

bors r=nvanbenschoten,kvoli

@craig (craig bot) commented Mar 21, 2024
