[WIP, DNM] storage: automatically split ranges based on load #31413

Open · wants to merge 1 commit into base: master
@ridwanmsharif
Collaborator

ridwanmsharif commented Oct 16, 2018

This PR works off of #21406 and implements load-based splitting.
Issue: #19154

Currently, you have to wait for a decent amount of data to be inserted into
CockroachDB before you'll have enough ranges to make use of a big cluster.
This is inconvenient for POC testing, and although the `ALTER TABLE SPLIT AT`
SQL command is available to manually pre-split ranges, it is difficult to use
because it requires users to understand some of the architecture and the
nature of their data model.

The algorithm is pretty simple: once a range exceeds the load threshold, we
begin tracking the spans it is servicing in requests. We use a reservoir
sample, inserting `span.Key`. If a sample doesn't replace an entry in the
reservoir, we instead take the opportunity to increment one or both of two
counters each reservoir entry maintains: left and right. The left counter is
incremented if the span falls to the left of the entry's key, and the right
counter if it falls to the right. If the span overlaps the entry's key, both
counters are incremented.
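
For concreteness, here is a minimal Go sketch of that sampling scheme; the type names, the fixed reservoir size, and the exclusive-end span convention are illustrative assumptions, not the PR's actual code.

```go
package split // hypothetical package, for illustration only

import (
	"bytes"
	"math/rand"
)

// sample is one reservoir entry: a candidate split key plus counters for
// requests seen on either side of it.
type sample struct {
	key         []byte
	left, right int
}

// loadSplitter holds a fixed-size reservoir of candidate keys for one range.
type loadSplitter struct {
	reservoir []sample
	seen      int // spans offered so far
	maxSize   int // reservoir capacity, e.g. 20
}

// record offers one request span [start, end) to the reservoir (end is
// assumed non-empty here). If the span is not chosen to replace an existing
// entry, it instead bumps the left/right counters of every entry.
func (ls *loadSplitter) record(start, end []byte) {
	ls.seen++
	if len(ls.reservoir) < ls.maxSize {
		ls.reservoir = append(ls.reservoir, sample{key: start})
		return
	}
	// Classic reservoir sampling: replace a random entry with
	// probability maxSize/seen.
	if idx := rand.Intn(ls.seen); idx < ls.maxSize {
		ls.reservoir[idx] = sample{key: start}
		return
	}
	for i := range ls.reservoir {
		k := ls.reservoir[i].key
		switch {
		case bytes.Compare(end, k) <= 0: // span entirely left of the key
			ls.reservoir[i].left++
		case bytes.Compare(start, k) > 0: // span entirely right of the key
			ls.reservoir[i].right++
		default: // span overlaps the key: bump both counters
			ls.reservoir[i].left++
			ls.reservoir[i].right++
		}
	}
}
```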

When a range has seen sustained load over the threshold rate for a threshold
duration, the key from the reservoir entry with the most even balance of left
and right counts is chosen for the split. Another heuristic is the count of
requests that span over the candidate split point: such requests would be
strictly worse off after the split, so we try to avoid choosing a key that
many of them straddle. There is again a threshold to prevent poor choices
when the load is writing keys sequentially or otherwise makes a good split
key impossible.
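
Continuing the sketch above, split-key selection could look roughly like the following once the load has stayed over the threshold long enough. The 0.25 balance cutoff is a made-up number, and the straddling-request check is folded into the counters rather than modeled separately.

```go
// splitKey picks the most balanced reservoir entry, or nil if no entry is
// balanced enough (e.g. under a purely sequential write load).
func (ls *loadSplitter) splitKey() []byte {
	const minBalance = 0.25 // illustrative cutoff, not the PR's value
	var best []byte
	bestBalance := minBalance
	for _, s := range ls.reservoir {
		total := float64(s.left + s.right)
		if total == 0 {
			continue
		}
		// balance is 1.0 when left == right and approaches 0.0 as the
		// counts become lopsided.
		diff := float64(s.left - s.right)
		if diff < 0 {
			diff = -diff
		}
		balance := 1 - diff/total
		if balance >= bestBalance {
			bestBalance, best = balance, s.key
		}
	}
	return best
}
```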

Release note: None

Co-authored-by: Ridwan Sharif <ridwan@cockroachlabs.com>
Co-authored-by: Spencer Kimball <spencer@cockroachlabs.com>

Release note: None

@ridwanmsharif requested a review from cockroachdb/core-prs as a code owner on Oct 16, 2018

@cockroach-teamcity
Member

cockroach-teamcity commented Oct 16, 2018

This change is Reviewable

@ridwanmsharif removed the request for review from cockroachdb/core-prs on Oct 16, 2018

@ridwanmsharif
Collaborator

ridwanmsharif commented Oct 16, 2018

Working on tests and running this against different workloads now. We may want to tweak the scoring heuristic we're using (it tries to take into account that keys with requests on both sides are given lower priority). For now I've decided to continue with the reservoir sampling approach: it doesn't add much complexity to the hot code path during requests, it's fairly simple, and it can be tweaked intuitively as our needs change. This is still a WIP and test coverage is almost nonexistent, but I hoped to gather some early thoughts on direction now.

cc @a-robinson @nvanbenschoten @petermattis

@a-robinson

Just sending out partial comments for now. Hopefully I can finish reviewing in the one or two small gaps between meetings today.

Reviewed 4 of 11 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/roachpb/api.proto, line 1345 at r1 (raw file):

  ];

  // RequestRate is the rate of request/ns for the range.

Requests per nanosecond seems odd. Why not per second?
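
(For reference, a per-second rate is usually derived from a request count over a measurement window; a hypothetical helper, not code from this PR:)

```go
package stats // illustrative only

import "time"

// queriesPerSecond converts a request count observed over a measurement
// window into a per-second rate, avoiding the awkward requests/ns unit.
func queriesPerSecond(requests int64, window time.Duration) float64 {
	if window <= 0 {
		return 0
	}
	return float64(requests) / window.Seconds()
}
```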


pkg/roachpb/api.proto, line 1346 at r1 (raw file):

  // RequestRate is the rate of request/ns for the range.
  double request_rate = 3;

We certainly don't have to name it similarly if there's good reason not to, but queries_per_second would be consistent with similar fields elsewhere. And in the absence of good reason one way or the other, consistency is as good a way as any of picking a name.


pkg/storage/merge_queue.go, line 229 at r1 (raw file):

	// found for over 10 seconds, the merge queue should feel free to merge the ranges.
	if lhsRepl.SplitByLoadEnabled() && timeSinceLastReq <= (10*time.Second) {
		mergedReqRate = lhsReqRate + rhsReqRate

It's worth thinking through how things will behave with a mixed-version cluster where some nodes are running 2.1 and others are running 2.2. In such cases, the 2.1 nodes won't return any qps info in their RangeStatsResponse, so you could conceivably have a situation like:

  1. n2 (running v2.2) splits r100 based on load, creating r101.
  2. The lease for r101 transfers to n1 (running v2.1) to better spread load
  3. n2 decides to merge r100 with r101 because r100's qps isn't enough to avoid splitting and r101 shows up as having 0 qps
  4. Return to step 1

It's not a devastating scenario, but is worth considering carefully. The easy way to avoid this, of course, is to just add a new cluster version that gates load-based splitting. There may be a less restrictive way as well.
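
A bare-bones sketch of the cluster-version gate idea, with made-up version values and helper names rather than CockroachDB's actual settings API:

```go
package splitgate // hypothetical, for illustration only

// clusterVersion stands in for the cluster-wide active version.
type clusterVersion struct{ major, minor int }

func (v clusterVersion) atLeast(min clusterVersion) bool {
	return v.major > min.major || (v.major == min.major && v.minor >= min.minor)
}

// versionLoadBasedSplitting represents a new cluster version introduced
// alongside this feature.
var versionLoadBasedSplitting = clusterVersion{major: 2, minor: 2}

// canSplitByLoad only allows load-based splits once every node is running a
// version that reports QPS in RangeStats, so a pre-upgrade leaseholder can't
// show up as 0 QPS and let the merge queue immediately undo the split.
func canSplitByLoad(active clusterVersion, qps, qpsThreshold float64) bool {
	return active.atLeast(versionLoadBasedSplitting) && qps >= qpsThreshold
}
```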

@ridwanmsharif

Testing with evenly distributed workloads showed great signs: increased throughput, more queries per second, and higher CPU utilization (utilization converged across all nodes instead of some nodes doing far more work than others). Sequential workloads showed almost no splits initiated; I say almost because we still see one or two splits after an hour due to the randomness of the algorithm, but nothing noticeable, and there was no noticeable hit in performance. Testing with YCSB's zipfian distribution showed a lot of splits happening but no improvement, I believe because the contention is on single keys and that's the bottleneck. To test how this works with that kind of distribution, I believe I have to build a workload like that into kv.

As for tests, I have some locally, but there are a few bugs to whack out and then this should be ready to go. You should be able to carry on with the review.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/roachpb/api.proto, line 1345 at r1 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

Requests per nanosecond seems odd. Why not per second?

Done.


pkg/roachpb/api.proto, line 1346 at r1 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

We certainly don't have to name it similarly if there's good reason not to, but queries_per_second would be consistent with similar fields elsewhere. And in the absence of good reason one way or the other, consistency is as good a way as any of picking a name.

Done.


pkg/storage/merge_queue.go, line 229 at r1 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

It's worth thinking through how things will behave with a mixed-version cluster where some nodes are running 2.1 and others are running 2.2. In such cases, the 2.1 nodes won't return any qps info in their RangeStatsResponse, so you could conceivably have a situation like:

  1. n2 (running v2.2) splits r100 based on load, creating r101.
  2. The lease for r101 transfers to n1 (running v2.1) to better spread load
  3. n2 decides to merge r100 with r101 because r100's qps isn't enough to avoid splitting and r101 shows up as having 0 qps
  4. Return to step 1

It's not a devastating scenario, but is worth considering carefully. The easy way to avoid this, of course, is to just add a new cluster version that gates load-based splitting. There may be a less restrictive way as well.

I do need to think a little harder on this. I figured a cluster version would suffice, but if you think it's too restrictive we can certainly discuss alternatives; I don't see a way around it yet, though. If a range splits, it has to do so knowing that the change won't be undone immediately afterward by the merge queue, and that must mean having a version-check gate when splitting. On the merge side, I figure we can add a check for the scenario where the rhsQPS is 0 and the lhsQPS is roughly over half the SplitByLoadRateThreshold and skip the merge in that case, but I don't know how much that buys us. I'll think about it some more; I may be missing something obvious.
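
To make the merge-side idea concrete, the guard could be as simple as the following sketch (names mirror the comment above and are not actual code):

```go
// skipSuspiciousMerge returns true when the RHS reports zero QPS (as a
// pre-upgrade leaseholder that doesn't report QPS would) while the LHS alone
// is already over half the load-based split threshold; merging in that state
// would likely be undone by an immediate re-split.
func skipSuspiciousMerge(lhsQPS, rhsQPS, splitByLoadQPSThreshold float64) bool {
	return rhsQPS == 0 && lhsQPS > splitByLoadQPSThreshold/2
}
```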

@ridwanmsharif requested a review from a-robinson on Oct 18, 2018
