storage: automatically split ranges based on load #31413

Merged
2 commits merged into cockroachdb:master from ridwanmsharif:new-auto-split on Oct 24, 2018

Conversation

@ridwanmsharif
Collaborator

ridwanmsharif commented Oct 16, 2018

This PR works off of #21406 and implements load-based splitting.
Issue: #19154

Currently, you have to wait for a decent amount of data to be inserted into
CockroachDB before you'll have enough ranges to make use of a big cluster.
This is inconvenient for POC testing and although the ALTER TABLE SPLIT AT
SQL command is available to manually pre-split ranges, its use is difficult
because it requires users to understand some of the architecture and the
nature of their data model.

The algorithm is pretty simple: once a range exceeds the load threshold, we
begin tracking spans which it's servicing in requests. We use a reservoir
sample, inserting span.Key. If a sample doesn't replace an entry in the
reservoir, we take the opportunity instead to increment one or both of two
counters each reservoir sample maintains: left & right. The left counter is
incremented if the span falls to the left of the reservoir entry's key,
right, to the right. If the span overlaps the entry's key, then both counters
are incremented.

When a range has seen sustained load over the threshold rate for a threshold
duration, the key from the reservoir sample containing the most even balance
of left and right values is chosen for the split. Another heuristic used is
the count of requests that span over the split point. These requests
would be strictly worse than before if the split is initiated and so we
try to prevent this from happening. There is again a threshold
to prevent poor choices in the event that load is sequentially writing keys
or otherwise making a good choice for the split key impossible.

Release note: None

Co-authored-by: Ridwan Sharif ridwan@cockroachlabs.com
Co-authored-by: Spencer Kimball spencer@cockroachlabs.com

Release note: None
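
To make the mechanism described in the PR body concrete, here is a minimal, self-contained Go sketch of a reservoir-based split finder. The names mirror the `pkg/storage/split` package under review, but the reservoir size, signatures, and scoring details are illustrative assumptions rather than the PR's actual code; it also uses a separate `contained` counter for straddling spans, as the review discussion below does.

```go
package split

import (
	"bytes"
	"math"
	"math/rand"
)

// reservoirSize is an illustrative sample count, not the value used in the PR.
const reservoirSize = 20

type sample struct {
	key                    []byte
	left, right, contained int
}

// Finder keeps a reservoir sample of request start keys plus, for each
// sampled key, counts of requests seen to its left, to its right, and
// straddling it.
type Finder struct {
	count   int
	samples [reservoirSize]sample
}

// Record feeds one request span [start, end) into the finder. For point
// requests, pass end equal to start.
func (f *Finder) Record(start, end []byte) {
	f.count++
	if f.count <= reservoirSize {
		// Fill the reservoir first.
		f.samples[f.count-1] = sample{key: start}
		return
	}
	if i := rand.Intn(f.count); i < reservoirSize {
		// Standard reservoir sampling: replace a random existing entry.
		f.samples[i] = sample{key: start}
		return
	}
	// The span was not sampled; instead, update every sample's counters.
	for i := range f.samples {
		s := &f.samples[i]
		switch {
		case bytes.Compare(start, s.key) >= 0:
			s.right++ // span starts at or to the right of the sample key
		case bytes.Compare(end, s.key) <= 0:
			s.left++ // span ends at or before the sample key
		default:
			s.contained++ // span straddles the sample key
		}
	}
}

// Key returns the sampled key with the most even left/right balance,
// penalizing keys that many requests straddle (splitting there would make
// those requests cross-range).
func (f *Finder) Key() []byte {
	var best []byte
	bestScore := math.MaxFloat64
	for _, s := range f.samples {
		total := s.left + s.right
		if total == 0 {
			continue
		}
		balance := math.Abs(float64(s.left-s.right)) / float64(total)
		straddle := float64(s.contained) / float64(total+s.contained)
		if score := balance + straddle; score < bestScore {
			best, bestScore = s.key, score
		}
	}
	return best
}
```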

@ridwanmsharif ridwanmsharif requested a review from cockroachdb/core-prs as a code owner Oct 16, 2018

@cockroach-teamcity

Member

cockroach-teamcity commented Oct 16, 2018

This change is Reviewable

@ridwanmsharif ridwanmsharif removed the request for review from cockroachdb/core-prs Oct 16, 2018

@ridwanmsharif ridwanmsharif force-pushed the ridwanmsharif:new-auto-split branch 5 times, most recently from d723d07 to c4c3a2c Oct 16, 2018

@ridwanmsharif

Collaborator

ridwanmsharif commented Oct 16, 2018

Working on tests and running this against different workloads now. We may want to tweak the scoring heuristic we're using (it tries to account for keys that have requests on both sides by giving them lower priority). For now I've decided to continue with the reservoir sampling approach: it doesn't add much complexity to the hot code path during requests, it's fairly simple, and it can be tweaked intuitively as our needs change. This is still a WIP and test coverage is almost nonexistent, but I hoped to gather some early thoughts on direction now.

cc @a-robinson @nvanbenschoten @petermattis

@a-robinson

Just sending out partial comments for now. Hopefully I can finish reviewing in my one or two small gaps between meetings today.

Reviewed 4 of 11 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/roachpb/api.proto, line 1345 at r1 (raw file):

  ];

  // RequestRate is the rate of request/ns for the range.

Requests per nanosecond seems odd. Why not per second?


pkg/roachpb/api.proto, line 1346 at r1 (raw file):

  // RequestRate is the rate of request/ns for the range.
  double request_rate = 3;

We certainly don't have to name it similarly if there's good reason not to, but queries_per_second would be consistent with similar fields elsewhere. And in the absence of good reason one way or the other, consistency is as good a way as any of picking a name.


pkg/storage/merge_queue.go, line 229 at r1 (raw file):

	// found for over 10 seconds, the merge queue should feel free to merge the ranges.
	if lhsRepl.SplitByLoadEnabled() && timeSinceLastReq <= (10*time.Second) {
		mergedReqRate = lhsReqRate + rhsReqRate

It's worth thinking through how things will behave with a mixed-version cluster where some nodes are running 2.1 and others are running 2.2. In such cases, the 2.1 nodes won't return any qps info in their RangeStatsResponse, so you could conceivably have a situation like:

  1. n2 (running v2.2) splits r100 based on load, creating r101.
  2. The lease for r101 transfers to n1 (running v2.1) to better spread load
  3. n2 decides to merge r100 with r101 because r100's qps isn't enough to avoid splitting and r101 shows up as having 0 qps
  4. Return to step 1

It's not a devastating scenario, but is worth considering carefully. The easy way to avoid this, of course, is to just add a new cluster version that gates load-based splitting. There may be a less restrictive way as well.

@ridwanmsharif ridwanmsharif force-pushed the ridwanmsharif:new-auto-split branch 2 times, most recently from 0794213 to 3d3d393 Oct 17, 2018

@ridwanmsharif

Testing with evenly distributed workloads showed great signs. Increased throughput, more queries per second, more CPU utilization (the CPU utilization converged for all nodes instead of having some nodes use a lot more than others). Sequential workloads showed almost no splits initiated (I say almost because we still see like one or two splits created after an hour because of the randomness of the algorithm but nothing noticeable). There was no noticeable hit in performance. Testing with YCSB's zipfian distributions showed a lot of splits happening, but no improvement - I believe because the contention is on single keys and that's the bottleneck. To test how this works with that kind of distribution, I believe I have to build a workload like that into kv. As far as tests, I have some locally but there are a few bugs to whack out and this should be ready to go. You should be able to keep on with the review.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/roachpb/api.proto, line 1345 at r1 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

Requests per nanosecond seems odd. Why not per second?

Done.


pkg/roachpb/api.proto, line 1346 at r1 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

We certainly don't have to name it similarly if there's good reason not to, but queries_per_second would be consistent with similar fields elsewhere. And in the absence of good reason one way or the other, consistency is as good a way as any of picking a name.

Done.


pkg/storage/merge_queue.go, line 229 at r1 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

It's worth thinking through how things will behave with a mixed-version cluster where some nodes are running 2.1 and others are running 2.2. In such cases, the 2.1 nodes won't return any qps info in their RangeStatsResponse, so you could conceivably have a situation like:

  1. n2 (running v2.2) splits r100 based on load, creating r101.
  2. The lease for r101 transfers to n1 (running v2.1) to better spread load
  3. n2 decides to merge r100 with r101 because r100's qps isn't enough to avoid splitting and r101 shows up as having 0 qps
  4. Return to step 1

It's not a devastating scenario, but is worth considering carefully. The easy way to avoid this, of course, is to just add a new cluster version that gates load-based splitting. There may be a less restrictive way as well.

I do need to think a little harder on this. I just figured a cluster version would suffice, but if you think it's too restrictive we can discuss that for sure. I don't see a way around it yet though: if a range splits, it has to do so knowing the change won't be undone immediately afterwards by the merge queue, and that must mean having a version-check gate when splitting. When merging, I figure we can add a check for the scenario where the rhsQPS is 0 and the lhsQPS is roughly over half the SplitByLoadRateThreshold QPS - we could then ignore that case, but I don't know how much it buys us. I'll think about it some more. I may be missing something obvious.

@ridwanmsharif ridwanmsharif requested a review from a-robinson Oct 18, 2018

@a-robinson

Finished reviewing everything, sorry for the delay. I'd suggest also addressing @bdarnell and @tschottdorf's comments from #21406.

Reviewed 2 of 11 files at r1, 8 of 8 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/storage/merge_queue.go, line 229 at r1 (raw file):

Previously, ridwanmsharif (Ridwan Sharif) wrote…

I do need to think a little harder on this. I just figured a cluster version would suffice, but if you think it's too restrictive we can discuss that for sure. I don't see a way around it yet though: if a range splits, it has to do so knowing the change won't be undone immediately afterwards by the merge queue, and that must mean having a version-check gate when splitting. When merging, I figure we can add a check for the scenario where the rhsQPS is 0 and the lhsQPS is roughly over half the SplitByLoadRateThreshold QPS - we could then ignore that case, but I don't know how much it buys us. I'll think about it some more. I may be missing something obvious.

I think a cluster version is fine, I just want to make sure that we at least end up with something. Leave a TODO somewhere, or open an issue, or whatever you think will help you remember to add a cluster version for this once the dust settles on it.


pkg/storage/merge_queue.go, line 234 at r2 (raw file):

	}
	if ok, _ := shouldSplitRange(mergedDesc, mergedStats, mergedQPS,
		lhsRepl.SplitByLoadRateThreshold(), lhsRepl.GetMaxBytes(), sysCfg); ok {

There needs to be a reasonable difference between the threshold used when deciding when to split and the threshold used when deciding whether to merge. Otherwise a range can hover right around the threshold and get into a cycle of getting split and merged over and over. This is much like how our default max range size is 64MB but the min range size is much less than 32MB.

The threshold for merging should be some percentage lower than the threshold for splitting to avoid this. There's no exact science behind picking a value for this, but multiplying the threshold by something like .5 would be a reasonable place to start.
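
A tiny sketch of that hysteresis (illustrative constant values and names, not the PR's settings):

```go
// splitQPSThreshold and mergeQPSFraction are illustrative values; the point
// is that the merge threshold sits well below the split threshold so a range
// hovering near the boundary doesn't split and merge in a loop.
const (
	splitQPSThreshold = 250.0
	mergeQPSFraction  = 0.5
	mergeQPSThreshold = splitQPSThreshold * mergeQPSFraction
)

// shouldMergeByLoad reports whether the combined QPS of two adjacent ranges
// is low enough that merging them won't immediately trigger a load-based split.
func shouldMergeByLoad(mergedQPS float64) bool {
	return mergedQPS < mergeQPSThreshold
}
```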


pkg/storage/replica.go, line 1817 at r2 (raw file):

}

// GetQPS returns the Replica's queries/s request rate.

Given that this is exported, the comment should include some idea of how the qps is measured so that it can be interpreted properly. QPS isn't a fully-defined concept without info about the window over which it's measured.

It's especially important in this case given the particular implementation involved here. If a replica hasn't received a request for a while, it will still return the qps from the last time it received a request, meaning that a replica that hasn't processed a request for hours could still return a very high qps if its qps was high but then it completely stopped receiving load.

In fact, I think this accessor would be better off if it just recalculated the qps whenever it's been more than a second since r.splitMu.nanos.
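
A self-contained sketch of that suggestion, using a hypothetical standalone tracker rather than the PR's `splitMu` fields: once more than a second has passed since the last window, the estimate is recomputed over the elapsed time, so an idle replica's reported QPS decays instead of staying at its last hot value.

```go
package storage

import (
	"sync"
	"time"
)

// qpsTracker is an illustrative QPS estimator, not the PR's code.
type qpsTracker struct {
	mu         sync.Mutex
	lastSample time.Time
	count      int64
	qps        float64
}

// record adds n requests to the current window.
func (t *qpsTracker) record(n int64) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.count += n
}

// get returns the current QPS estimate, recomputing it whenever more than a
// second has passed since the last window closed.
func (t *qpsTracker) get(now time.Time) float64 {
	t.mu.Lock()
	defer t.mu.Unlock()
	if elapsed := now.Sub(t.lastSample); elapsed > time.Second {
		t.qps = float64(t.count) / elapsed.Seconds()
		t.count = 0
		t.lastSample = now
	}
	return t.qps
}
```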


pkg/storage/replica.go, line 1826 at r2 (raw file):

// GetLastRequestNanos returns the most recent time in nanos
// when the last rate was recorded.
func (r *Replica) GetLastRequestNanos() int64 {

Is there any reason for this to be an exported method?


pkg/storage/replica.go, line 2544 at r2 (raw file):

Quoted 6 lines of code…
			for _, s := range spans.GetSpans(spanset.SpanReadOnly, spanset.SpanGlobal) {
				r.splitMu.splitFinder.Record(s, rand.Intn)
			}
			for _, s := range spans.GetSpans(spanset.SpanReadWrite, spanset.SpanGlobal) {
				r.splitMu.splitFinder.Record(s, rand.Intn)
			}

I believe we may need to rework this to pass all the spans into splitFinder.Record together such that it can properly record whether all the spans were part of the same batch. For example, if a batch wants to write to two keys, one on the left side of the range and one on the right, would this properly account for the fact that they were written together?

A workload that tries writing multiple keys per operation might be interesting for catching things like this. The simplest case of this would be a table with a secondary index and the RangeMaxBytes set really high so that the range won't split due to size. In such cases the table and index will be in the same range, and we'd want it to never split based on load if given a workload that just does a lot of point writes, since each write will touch both the primary index (on the left side of the range) and the secondary index (on the right side).


pkg/storage/replica.go, line 6915 at r2 (raw file):

// SplitByLoadEnabled wraps "kv.load_split.load_based_splitting_enabled".
var SplitByLoadEnabled = settings.RegisterBoolSetting(
	"kv.load_split.load_based_splitting_enabled",

Alternative suggestion: kv.range_split.by_load_enabled to somewhat parallel the existing kv.range_merge.queue_enabled setting


pkg/storage/replica.go, line 6922 at r2 (raw file):

// SplitByLoadQPSThreshold wraps "kv.load_split.load_qps_threshold".
var SplitByLoadQPSThreshold = settings.RegisterIntSetting(
	"kv.load_split.load_qps_threshold",

If we go with something like the above suggestion, this would then become kv.range_split.load_qps_threshold


pkg/storage/replica.go, line 6928 at r2 (raw file):

// SplitByLoadRateThreshold returns the request/s rate for a given replica.
func (r *Replica) SplitByLoadRateThreshold() float64 {

Not that the name is bad, but it looks like you may have forgotten to rename this when you switched everything from rate to qps?


pkg/storage/replica.go, line 6949 at r2 (raw file):

Quoted 8 lines of code…
	nowNanos := r.store.Clock().PhysicalNow()
	duration := nowNanos - r.splitMu.nanos
	if duration < int64(time.Second) {
		return r.splitMu.splitFinder != nil, false
	}
	r.splitMu.qps = (float64(r.splitMu.count) / float64(duration)) * 1e9
	r.splitMu.nanos = nowNanos
	r.splitMu.count = 0

This code and its interaction with the creation/maintenance/deletion of the splitFinder is an impressively subtle way of measuring qps and doing the sampling. If we're going to stick with it, it could use a comment explaining how/why it works.

And while it works well for this purpose, I think the GetQPS method definitely deserves a disclaimer about how instantaneous the measurement it returns is.


pkg/storage/split_queue.go, line 111 at r2 (raw file):

	// Check if the request rate is higher than the QPS threshold.
	if reqRate >= splitByLoadRateThreshold {

What priority do you think this should have?


pkg/storage/split_queue.go, line 128 at r2 (raw file):

	// If there's a split-by-load for this range, split if the range
	// exceeds the minimum size in bytes.

Why only if it exceeds the minimum size?


pkg/storage/split_queue.go, line 132 at r2 (raw file):

		repl.splitMu.Lock()
		defer repl.splitMu.Unlock()
		// Only queue if the total number of ranges is less than the max.

Is this comment applicable anymore?


pkg/storage/split_queue.go, line 179 at r2 (raw file):

	desc := r.Desc()

	// First, handle the case of splitting by load.

Why split by load first? I'd think we should split by zone config first, since it's possible it'll achieve the goal of splitting by load for us, and we're definitely going to have to split there no matter what.


pkg/storage/split_queue.go, line 191 at r2 (raw file):

	if splitByLoad {
		log.Eventf(ctx, "splitting based on load")
		log.Warningf(ctx, "splitting based on load")

I get that this is just here for debugging now, but I do think we'll want a log.Infof for when a load-based split happens, at least during the 2.2 development cycle.


pkg/storage/split/split.go, line 79 at r2 (raw file):

// sample duration.
func (f *Finder) Ready(nowNanos int64) bool {
	return time.Duration(nowNanos-f.startNanos) > durationThreshold

Is there a reason why we're using int64 nano counts everywhere rather than time.Time objects? Getting to take advantage of the built-in monotonic clock in time.Time for free would be nice.


pkg/storage/split/split.go, line 100 at r2 (raw file):

				sa.contained++
			} else {
				if comp := bytes.Compare(sa.key, span.Key); comp < 0 {

Given the way that this is used to choose split keys and the way split keys work, any spans equal to the key should be counted as part of the right side.


pkg/storage/split/split.go, line 106 at r2 (raw file):

				}
			}
			f.samples[i] = sa

Copying samples in and out of the array seems kind of silly from a style perspective. Just switch the loop to

for i := range f.samples {

and replace sa with f.samples[i] in the code.


pkg/storage/split/split.go, line 134 at r2 (raw file):

	for i, s := range f.samples {
		if s.left+s.right < splitKeyMinCounter {

Why aren't s.contained samples counted here?


pkg/storage/split/split.go, line 138 at r2 (raw file):

		}
		balanceScore := math.Abs(float64(s.left-s.right)) / float64(s.left+s.right)
		finalScore := balanceScore + (float64(s.contained) / float64(mostContainedKeyCount))

What's the benefit of or intuition behind considering the number of contained spans as a fraction of the sample with the most contained spans? Wouldn't it make more sense to consider it as a fraction of all the spans encountered during sampling?

@ridwanmsharif ridwanmsharif force-pushed the ridwanmsharif:new-auto-split branch 2 times, most recently from 63c77f2 to 76f9c88 Oct 19, 2018

@ridwanmsharif

With some of the new changes, I did notice some erratic behavior with some workloads (cc @a-robinson)
This may just have been an issue with the cluster. I'll investigate some more.

Some of the new changes are:

  • The default zone config is now updated to have a minimum range size of 16MB
  • Ranges that are smaller than the minimum size are also considered for splits if they have heavy load

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/storage/merge_queue.go, line 229 at r1 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

I think a cluster version is fine, I just want to make sure that we at least end up with something. Leave a TODO somewhere, or open an issue, or whatever you think will help you remember to add a cluster version for this once the dust settles on it.

I was thinking of doing that as part of this PR, actually - that wouldn't hurt anything, would it?


pkg/storage/merge_queue.go, line 234 at r2 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

There needs to be a reasonable difference between the threshold used when deciding when to split and the threshold used when deciding whether to merge. Otherwise a range can hover right around the threshold and get into a cycle of getting split and merged over and over. This is much like how our default max range size is 64MB but the min range size is much less than 32MB.

The threshold for merging should be some percentage lower than the threshold for splitting to avoid this. There's no exact science behind picking a value for this, but multiplying the threshold by something like .5 would be a reasonable place to start.

Done.


pkg/storage/replica.go, line 1817 at r2 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

Given that this is exported, the comment should include some idea of how the qps is measured so that it can be interpreted properly. QPS isn't a fully-defined concept without info about the window over which it's measured.

It's especially important in this case given the particular implementation involved here. If a replica hasn't received a request for a while, it will still return the qps from the last time it received a request, meaning that a replica that hasn't processed a request for hours could still return a very high qps if its qps was high but then it completely stopped receiving load.

In fact, I think this accessor would be better off if it just recalculated the qps whenever it's been more than a second since r.splitMu.nanos.

I don't think it can recalculate it here, but that's why I have GetLastRequestTime, which the caller can use to check for staleness. We use it in the merge queue for this very reason. I've added comments for it now.


pkg/storage/replica.go, line 1826 at r2 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

Is there any reason for this to be an exported method?

Figured it should be, since GetSplitQPS is exported as well - it's the only way to check for staleness, I think.


pkg/storage/replica.go, line 2544 at r2 (raw file):

Previously, a-robinson (Alex Robinson) wrote…
			for _, s := range spans.GetSpans(spanset.SpanReadOnly, spanset.SpanGlobal) {
				r.splitMu.splitFinder.Record(s, rand.Intn)
			}
			for _, s := range spans.GetSpans(spanset.SpanReadWrite, spanset.SpanGlobal) {
				r.splitMu.splitFinder.Record(s, rand.Intn)
			}

I believe we may need to rework this to pass all the spans into splitFinder.Record together such that it can properly record whether all the spans were part of the same batch. For example, if a batch wants to write to two keys, one on the left side of the range and one on the right, would this properly account for the fact that they were written together?

A workload that tries writing multiple keys per operation might be interesting for catching things like this. The simplest case of this would be a table with a secondary index and the RangeMaxBytes set really high so that the range won't split due to size. In such cases the table and index will be in the same range, and we'd want it to never split based on load if given a workload that just does a lot of point writes, since each write will touch both the primary index (on the left side of the range) and the secondary index (on the right side).

Done. The new approach counts the number of requests in the batch toward the QPS, but records a single new boundary span in the splitFinder. The boundary span is just a span that tightly encapsulates all the spans returned by GetSpans. Does this make more sense now?
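
For illustration, a hypothetical helper showing what such a boundary span computation could look like (`span` stands in for `roachpb.Span`; this is a sketch, not the PR's code):

```go
package storage

import "bytes"

// span is a stand-in for roachpb.Span.
type span struct {
	Key, EndKey []byte
}

// boundarySpan returns the smallest span covering every span in the batch,
// so the whole batch can be recorded as a single request.
func boundarySpan(spans []span) span {
	var out span
	for i, s := range spans {
		end := s.EndKey
		if len(end) == 0 {
			end = s.Key // point request: treat its key as the end
		}
		if i == 0 {
			out = span{Key: s.Key, EndKey: end}
			continue
		}
		if bytes.Compare(s.Key, out.Key) < 0 {
			out.Key = s.Key
		}
		if bytes.Compare(end, out.EndKey) > 0 {
			out.EndKey = end
		}
	}
	return out
}
```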


pkg/storage/replica.go, line 6915 at r2 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

Alternative suggestion: kv.range_split.by_load_enabled to somewhat parallel the existing kv.range_merge.queue_enabled setting

Done.


pkg/storage/replica.go, line 6922 at r2 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

If we go with something like the above suggestion, this would then become kv.range_split.load_qps_threshold

Done.


pkg/storage/replica.go, line 6928 at r2 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

Not that the name is bad, but it looks like you may have forgotten to rename this when you switched everything from rate to qps?

Done.


pkg/storage/replica.go, line 6949 at r2 (raw file):

Previously, a-robinson (Alex Robinson) wrote…
	nowNanos := r.store.Clock().PhysicalNow()
	duration := nowNanos - r.splitMu.nanos
	if duration < int64(time.Second) {
		return r.splitMu.splitFinder != nil, false
	}
	r.splitMu.qps = (float64(r.splitMu.count) / float64(duration)) * 1e9
	r.splitMu.nanos = nowNanos
	r.splitMu.count = 0

This code and its interaction with the creation/maintenance/deletion of the splitFinder is an impressively subtle way of measuring qps and doing the sampling. If we're going to stick with it, it could use a comment explaining how/why it works.

And while it works well for this purpose, I think the GetQPS method definitely deserves a disclaimer about how instantaneous the measurement it returns is.

Done.


pkg/storage/split_queue.go, line 111 at r2 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

What priority do you think this should have?

I think it should have a priority proportional to how much higher it is compared to the threshold.
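
For example, something along these lines (an illustrative helper, not the PR's code):

```go
// loadSplitPriority scales the split-queue priority by how far the observed
// QPS exceeds the threshold.
func loadSplitPriority(qps, threshold float64) float64 {
	if qps <= threshold {
		return 0
	}
	return qps / threshold // e.g. twice the threshold => priority 2
}
```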


pkg/storage/split_queue.go, line 128 at r2 (raw file):

Previously, a-robinson (Alex Robinson) wrote…
	// If there's a split-by-load for this range, split if the range
	// exceeds the minimum size in bytes.

Why only if it exceeds the minimum size?

We don't care about minimum size anymore. Done.


pkg/storage/split_queue.go, line 132 at r2 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

Is this comment applicable anymore?

Done.


pkg/storage/split_queue.go, line 179 at r2 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

Why split by load first? I'd think we should split by zone config first, since it's possible it'll achieve the goal of splitting by load for us, and we're definitely going to have to split there no matter what.

Done.


pkg/storage/split_queue.go, line 191 at r2 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

I get that this is just here for debugging now, but I do think we'll want a log.Infof for when a load-based split happens, at least during the 2.2 development cycle.

Done.


pkg/storage/split/split.go, line 79 at r2 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

Is there a reason why we're using int64 nano counts everywhere rather than time.Time objects? Getting to take advantage of the built-in monotonic clock in time.Time for free would be nice.

Done.


pkg/storage/split/split.go, line 100 at r2 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

Given the way that this is used to choose split keys and the way split keys work, any spans equal to the key should be counted as part of the right side.

Done.


pkg/storage/split/split.go, line 106 at r2 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

Copying samples in and out of the array seems kind of silly from a style perspective. Just switch the loop to

for i := range f.samples {

and replace sa with f.samples[i] in the code.

Done.


pkg/storage/split/split.go, line 134 at r2 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

Why aren't s.contained samples counted here?

Done.


pkg/storage/split/split.go, line 138 at r2 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

What's the benefit of or intuition behind considering the number of contained spans as a fraction of the sample with the most contained spans? Wouldn't it make more sense to consider it as a fraction of all the spans encountered during sampling?

That's what I had originally; I just felt it was too small a number to affect the score significantly. This way it's more potent. Another suggestion I'm in favor of:

finalScore := balanceScore + (float64(s.contained) / float64(s.contained + s.left + s.right))

@ridwanmsharif ridwanmsharif force-pushed the ridwanmsharif:new-auto-split branch 4 times, most recently from 5a31553 to 49d9a60 Oct 19, 2018

@ridwanmsharif ridwanmsharif changed the title from [WIP, DNM] storage: automatically split ranges based on load to storage: automatically split ranges based on load Oct 22, 2018

@ridwanmsharif ridwanmsharif force-pushed the ridwanmsharif:new-auto-split branch 3 times, most recently from 110f08c to eea9468 Oct 22, 2018

@ridwanmsharif ridwanmsharif requested a review from a-robinson Oct 22, 2018

@ridwanmsharif ridwanmsharif force-pushed the ridwanmsharif:new-auto-split branch from eea9468 to eae70f8 Oct 22, 2018

@a-robinson

With some of the new changes, I did notice some erratic behavior with some workloads (cc @a-robinson)
This may just have been an issue with the cluster. I'll investigate some more.

Any updates on the erratic behavior?

Also, this is looking really good. Almost there. Sorry for the minor delay.

Reviewed 10 of 10 files at r3, 2 of 5 files at r4.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/storage/merge_queue.go, line 229 at r1 (raw file):

Previously, ridwanmsharif (Ridwan Sharif) wrote…

I was thinking of doing that as part of this PR, actually - that wouldn't hurt anything, would it?

It could make for weirdness in clusters with some nodes running one alpha version and others running another alpha version if the algorithm changes in between those alphas, but that's not something we need to worry about. Go ahead and add it.


pkg/storage/replica.go, line 2544 at r2 (raw file):

Previously, ridwanmsharif (Ridwan Sharif) wrote…

Done. The new way considers the number of requests in the batch for the count towards the QPS but creates a new boundary span that the splitFinder uses to record. The boundary span is just a span that tightly encapsulates all the spans returned by GetSpans. Does this make more sense now?

Why are we recording reads and writes separately? If I read key 1 and write key 4 as part of the same operation, why shouldn't they be recorded together?


pkg/storage/replica.go, line 500 at r3 (raw file):

	splitMu struct {
		syncutil.Mutex
		nowNanos    time.Time // most recent time recorded by requests.

The name nanos doesn't make much sense anymore.


pkg/storage/split_queue.go, line 128 at r2 (raw file):

Previously, ridwanmsharif (Ridwan Sharif) wrote…

We don't care about minimum size anymore. Done.

The comment is still here, though.


pkg/storage/split_queue.go, line 214 at r3 (raw file):

	r.splitMu.Unlock()
	if splitByLoad {
		log.Eventf(ctx, "splitting based on load")

Just leaving a reminder here to get rid of the event/warning before merging.


pkg/storage/split/split.go, line 100 at r2 (raw file):

Previously, ridwanmsharif (Ridwan Sharif) wrote…

Done.

Minor, but that's non-obvious enough that a comment would be nice for future readers.


pkg/storage/split/split.go, line 138 at r2 (raw file):

Previously, ridwanmsharif (Ridwan Sharif) wrote…

That's what I had originally; I just felt it was too small a number to affect the score significantly. This way it's more potent. Another suggestion I'm in favor of:

finalScore := balanceScore + (float64(s.contained) / float64(s.contained + s.left + s.right))

It's more potent, but is it meaningful? It means that the split key with the most spans containing it is seriously penalized even if it only happens a couple times. While there could be another range where one of the split keys contains all spans, and so other split keys may look pretty good in comparison even if they're actually contained in a large number of spans as well.

That suggestion looks a lot more intuitive if it can be made to actually influence things when s.contained is sufficiently large. Do we have any idea from experimentation what "sufficiently large" means for this? Some testing of a workload where a controllable fraction of transactions touch multiple keys some distance apart from each other would be useful to quantify whether 5% of requests spanning the split key is too much or only something like 50% is too much.

@ridwanmsharif ridwanmsharif requested review from a-robinson and removed request for a-robinson Oct 22, 2018

@ridwanmsharif ridwanmsharif force-pushed the ridwanmsharif:new-auto-split branch 3 times, most recently from ec28d97 to e54e574 Oct 22, 2018

@ridwanmsharif

With some of the new changes, I did notice some erratic behavior with some workloads (cc @a-robinson)
This may just have been an issue with the cluster. I'll investigate some more.

Any updates on the erratic behavior?

I haven't been able to repro it in any of the experiments since. I chalked it up to an issue with the cluster at that particular time, independent of these changes.
I'm also creating a new issue to keep track of the experiments and results of load-based splitting, so that after this merges we can continue to run experiments with new workloads before we turn it on by default. The workloads we have available are not exhaustive enough to test all parts of this change, IMO. If this looks clear, I'll add another commit to the PR changing the default zone config to use 16MB as the minimum range size and fix all the tests that break.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/storage/replica.go, line 2544 at r2 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

Why are we recording reads and writes separately? If I read key 1 and write key 4 as part of the same operation, why shouldn't they be recorded together?

You're right. Done.


pkg/storage/replica.go, line 500 at r3 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

The name nanos doesn't make much sense anymore.

Done.


pkg/storage/split_queue.go, line 128 at r2 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

The comment is still here, though.

Done.


pkg/storage/split_queue.go, line 214 at r3 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

Just leaving a reminder here to get rid of the event/warning before merging.

Done.


pkg/storage/split/split.go, line 138 at r2 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

It's more potent, but is it meaningful? It means that the split key with the most spans containing it is seriously penalized even if it only happens a couple times. While there could be another range where one of the split keys contains all spans, and so other split keys may look pretty good in comparison even if they're actually contained in a large number of spans as well.

That suggestion looks a lot more intuitive if it can be made to actually influence things when s.contained is sufficiently large. Do we have any idea from experimentation what "sufficiently large" means for this? Some testing of a workload where a controllable fraction of transactions touch multiple keys some distance apart from each other would be useful to quantify whether 5% of requests spanning the split key is too much or only something like 50% is too much.

Intuitively I went with 50%, but this will change as we experiment some more. In most of my experiments this case didn't come up often, if at all. Are there existing workloads I can run that I can count on to query multiple keys in the same range? If not, we can add one when we add more workloads to kv (we need a non-uniform-distribution workload as well).

@ridwanmsharif ridwanmsharif force-pushed the ridwanmsharif:new-auto-split branch from e54e574 to 49a55b9 Oct 23, 2018

@ridwanmsharif ridwanmsharif requested a review from cockroachdb/sql-rest-prs as a code owner Oct 23, 2018

@a-robinson

:lgtm:

Reviewed 1 of 5 files at r4, 1 of 2 files at r5, 6 of 6 files at r6, 2 of 2 files at r7.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)


pkg/storage/split/split.go, line 138 at r2 (raw file):

Are there existing workloads that I can run that I can count on querying keys in the same range

Not that I'm aware of.

50% sounds a little high at first glance, but let's leave it there and see how things go.


pkg/storage/split/split.go, line 106 at r7 (raw file):

				// Similarly, if the key is greater than the start key (and given that it is not
				// properly contained by the span) it must mean that the request would be on
				// the left.

s/left/right

storage: automatically split ranges based on load
Currently, you have to wait for a decent amount of data to be inserted into
CockroachDB before you'll have enough ranges to make use of a big cluster.
This is inconvenient for POC testing and although the `ALTER TABLE SPLIT AT`
SQL command is available to manually pre-split ranges, its use is difficult
because it requires users to understand some of the architecture and the
nature of their data model.

The algorithm is pretty simple: once a range exceeds the load threshold, we
begin tracking spans which it's servicing in requests. We use a reservoir
sample, inserting `span.Key`. If a sample doesn't replace an entry in the
reservoir, we take the opportunity instead to increment one or both of two
counters each reservoir sample maintains: left & right. The left counter is
incremented if the span falls to the left of the reservoir entry's key,
right, to the right. If the span overlaps the entry's key, then both counters
are incremented.

When a range has seen sustained load over the threshold rate for a threshold
duration, the key from the reservoir sample containing the most even balance
of left and right values is chosen for the split. Another heuristic used is
the count of requests that span over the split point. These requests
would be strictly worse than before if the split is initiated and so we
try to prevent this from happening. There is again a threshold
to prevent poor choices in the event that load is sequentially writing keys
or otherwise making a good choice for the split key impossible.

Release note: None

Co-authored-by: Ridwan Sharif <ridwan@cockroachlabs.com>
Co-authored-by: Spencer Kimball <spencer@cockroachlabs.com>

Release note: None

@ridwanmsharif ridwanmsharif force-pushed the ridwanmsharif:new-auto-split branch from 49a55b9 to adc7b88 Oct 24, 2018

config: change the default zone config stats
With load-based splitting and range merges implemented, it makes sense
to raise `range_min_bytes` to a higher, more appropriate size so that
merging becomes more aggressive. This helps because load-based splitting
creates a lot of ranges that exceed the previous small minimum range
size of 1MB, and such ranges would otherwise never be considered for merging.

Release note (general change): Default Zone Config stats changes.
Will affect all zone configs created after this point.

@ridwanmsharif ridwanmsharif requested review from cockroachdb/cli-prs as code owners Oct 24, 2018

@ridwanmsharif

Creating a new issue to track the experiments conducted with this change. Leaving load-based splitting disabled by default. Before this is bors'ed, I wanted to confirm with @a-robinson and @benesch that the default size change is okay (see the second commit on the PR).

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)


pkg/storage/split/split.go, line 106 at r7 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

s/left/right

I believe `left` is still correct here. I elaborate a little more in the comment now.

@benesch

Member

benesch commented Oct 24, 2018

A 16MB minimum range size SGTM.

@ridwanmsharif

Collaborator

ridwanmsharif commented Oct 24, 2018

Just to improve confidence in this project, here are some details from basic testing (KV workload):

On master: [dashboard screenshots omitted]

With my changes (I turn load-based splitting on at ~15:03): [dashboard screenshots omitted]

The jump is when I turn load-based splitting on: [screenshot omitted]

More results with different workloads, distributions, and heuristics will be posted in an issue later today. Just thought I should show some here before bors'ing. cc @a-robinson @nvanbenschoten @petermattis

@ridwanmsharif

Collaborator

ridwanmsharif commented Oct 24, 2018

bors r+

craig bot pushed a commit that referenced this pull request Oct 24, 2018

Merge #31413
31413: storage: automatically split ranges based on load r=ridwanmsharif a=ridwanmsharif

This PR works off of #21406 and implements load-based splitting.
Issue: #19154 

Currently, you have to wait for a decent amount of data to be inserted into
CockroachDB before you'll have enough ranges to make use of a big cluster.
This is inconvenient for POC testing and although the `ALTER TABLE SPLIT AT`
SQL command is available to manually pre-split ranges, its use is difficult
because it requires users to understand some of the architecture and the
nature of their data model.

The algorithm is pretty simple: once a range exceeds the load threshold, we
begin tracking spans which it's servicing in requests. We use a reservoir
sample, inserting `span.Key`. If a sample doesn't replace an entry in the
reservoir, we take the opportunity instead to increment one or both of two
counters each reservoir sample maintains: left & right. The left counter is
incremented if the span falls to the left of the reservoir entry's key,
right, to the right. If the span overlaps the entry's key, then both counters
are incremented.

When a range has seen sustained load over the threshold rate for a threshold
duration, the key from the reservoir sample containing the most even balance
of left and right values is chosen for the split. Another heuristic used is
the count of requests that span over the split point. These requests
would be strictly worse than before if the split is initiated and so we
try to prevent this from happening. There is again a threshold
to prevent poor choices in the event that load is sequentially writing keys
or otherwise making a good choice for the split key impossible.

Release note: None

Co-authored-by: Ridwan Sharif <ridwan@cockroachlabs.com>
Co-authored-by: Spencer Kimball <spencer@cockroachlabs.com>

Release note: None

Co-authored-by: Ridwan Sharif <ridwan@cockroachlabs.com>
@craig

craig bot commented Oct 24, 2018

Build succeeded

@craig craig bot merged commit 2fa6d8c into cockroachdb:master Oct 24, 2018

3 checks passed

  • GitHub CI (Cockroach): TeamCity build finished
  • bors: Build succeeded
  • license/cla: Contributor License Agreement is signed

@ridwanmsharif ridwanmsharif deleted the ridwanmsharif:new-auto-split branch Oct 24, 2018

@a-robinson

Member

a-robinson commented Oct 24, 2018

Just to improve confidence in this project, here are some details from basic testing:

For future reference, it's nice to at least mention what the workload is when showing results :)

The jump is when I turn load based splitting on:

I suspect this is creating way more ranges than needed (and thus the qps threshold should be higher), but that's what experimentation is for.

Nice work getting this in!

@ridwanmsharif

Collaborator

ridwanmsharif commented Oct 24, 2018

Tracking experiments with load based splitting on #31819.
