[WIP, DNM] storage: automatically split ranges based on load #31413
Conversation
ridwanmsharif added the do-not-merge label Oct 16, 2018
ridwanmsharif requested a review from cockroachdb/core-prs as a code owner Oct 16, 2018
ridwanmsharif removed the request for review from cockroachdb/core-prs Oct 16, 2018
ridwanmsharif (Collaborator) commented Oct 16, 2018
Working on tests and running this against different workloads now. We may want to tweak the scoring heuristic we're using (it tries to give lower priority to keys that have requests spanning both sides). I've decided for now to continue using the reservoir sampling approach: it doesn't add much complexity to the hot code path during requests, it's fairly simple, and it can be tweaked intuitively as our needs change. This is still a WIP and test coverage is almost nonexistent, but I hoped to gather some early thoughts on direction now.
a-robinson requested changes Oct 17, 2018
Just sending out partial comments for now. Hopefully I can finish reviewing in my one or two small gaps between meetings today.
Reviewed 4 of 11 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained
pkg/roachpb/api.proto, line 1345 at r1 (raw file):
];
// RequestRate is the rate of request/ns for the range.
Requests per nanosecond seems odd. Why not per second?
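(A small aside on the unit question: Go's time.Duration is nanosecond-based, which is presumably where request/ns comes from, but converting at the boundary keeps the reported rate human-readable. A minimal illustration, with made-up names and numbers:)

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Counting requests over a wall-clock window and dividing by
	// elapsed.Seconds() yields a per-second rate directly, rather than a
	// per-nanosecond rate that has to be scaled by 1e9 to be readable.
	const requests = 1200
	elapsed := 3 * time.Second
	qps := float64(requests) / elapsed.Seconds()
	fmt.Printf("%.0f queries per second\n", qps) // 400
}
```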
pkg/roachpb/api.proto, line 1346 at r1 (raw file):
// RequestRate is the rate of request/ns for the range.
double request_rate = 3;
We certainly don't have to name it similarly if there's good reason not to, but queries_per_second would be consistent with similar fields elsewhere. And in the absence of good reason one way or the other, consistency is as good a way as any of picking a name.
pkg/storage/merge_queue.go, line 229 at r1 (raw file):
// found for over 10 seconds, the merge queue should feel free to merge the ranges.
if lhsRepl.SplitByLoadEnabled() && timeSinceLastReq <= (10*time.Second) {
	mergedReqRate = lhsReqRate + rhsReqRate
It's worth thinking through how things will behave with a mixed-version cluster where some nodes are running 2.1 and others are running 2.2. In such cases, the 2.1 nodes won't return any qps info in their RangeStatsResponse, so you could conceivably have a situation like:
- n2 (running v2.2) splits r100 based on load, creating r101.
- The lease for r101 transfers to n1 (running v2.1) to better spread load
- n2 decides to merge r100 with r101 because r100's qps isn't enough to avoid splitting and r101 shows up as having 0 qps
- Return to step 1
It's not a devastating scenario, but is worth considering carefully. The easy way to avoid this, of course, is to just add a new cluster version that gates load-based splitting. There may be a less restrictive way as well.
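(For illustration only, here is a minimal sketch of the cluster-version-gate idea, using placeholder types and version strings rather than CockroachDB's actual cluster-settings API. The point is simply that the split decision consults a cluster-wide version, so load-based splits only happen once every node can report QPS.)

```go
package main

import "fmt"

// clusterSettings stands in for the real cluster settings/version machinery;
// the version strings and comparison below are simplifications.
type clusterSettings struct {
	activeVersion string
}

// isActive reports whether the cluster has finalized the given version.
func (s clusterSettings) isActive(v string) bool {
	return s.activeVersion >= v
}

// shouldSplitByLoad gates the load-based split decision on a (hypothetical)
// cluster version that introduces the feature.
func shouldSplitByLoad(s clusterSettings, qps, threshold float64) bool {
	if !s.isActive("v2.2") {
		return false
	}
	return qps > threshold
}

func main() {
	mixed := clusterSettings{activeVersion: "v2.1"}    // mixed-version cluster
	fmt.Println(shouldSplitByLoad(mixed, 5000, 2500)) // false: feature still gated
}
```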
ridwanmsharif reviewed Oct 18, 2018
Testing with evenly distributed workloads showed great signs: increased throughput, more queries per second, and higher CPU utilization (the CPU utilization converged across all nodes instead of some nodes doing much more work than others). Sequential workloads showed almost no splits initiated (I say almost because, due to the randomness of the algorithm, we still see one or two splits created after an hour, but nothing noticeable), and there was no noticeable hit in performance. Testing with YCSB's zipfian distributions showed a lot of splits happening but no improvement; I believe that's because the contention is on single keys and that's the bottleneck. To test how this works with that kind of distribution, I believe I have to build a workload like that into kv. As far as tests go, I have some locally, but there are a few bugs to work out before this is ready. You should be able to carry on with the review.
Reviewable status: complete! 0 of 0 LGTMs obtained
pkg/roachpb/api.proto, line 1345 at r1 (raw file):
Previously, a-robinson (Alex Robinson) wrote…
Requests per nanosecond seems odd. Why not per second?
Done.
pkg/roachpb/api.proto, line 1346 at r1 (raw file):
Previously, a-robinson (Alex Robinson) wrote…
We certainly don't have to name it similarly if there's good reason not to, but queries_per_second would be consistent with similar fields elsewhere. And in the absence of good reason one way or the other, consistency is as good a way as any of picking a name.
Done.
pkg/storage/merge_queue.go, line 229 at r1 (raw file):
Previously, a-robinson (Alex Robinson) wrote…
It's worth thinking through how things will behave with a mixed-version cluster where some nodes are running 2.1 and others are running 2.2. In such cases, the 2.1 nodes won't return any qps info in their
RangeStatsResponse, so you could conceivably have a situation like:
- n2 (running v2.2) splits r100 based on load, creating r101.
- The lease for r101 transfers to n1 (running v2.1) to better spread load
- n2 decides to merge r100 with r101 because r100's qps isn't enough to avoid splitting and r101 shows up as having 0 qps
- Return to step 1
It's not a devastating scenario, but is worth considering carefully. The easy way to avoid this, of course, is to just add a new cluster version that gates load-based splitting. There may be a less restrictive way as well.
I do need to think a little harder on this. I just figured a cluster version would suffice, but if you think it's too restrictive we can discuss that for sure. I don't see a way around it yet though: if a range splits, it has to do so knowing the change won't be undone immediately after by the merge queue, and that must mean having a version check gate the split. When merging, I figure we can add a check for the scenario where the rhsQPS is 0 and the lhsQPS is roughly over half the QPS of the SplitByLoadRateThreshold, and skip the merge in that case, but I don't know how much that buys us. I'll think about it some more; I may be missing something obvious.
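(A hypothetical sketch of the guard described above; the identifiers are illustrative, not the PR's actual code. The idea: skip the merge when the RHS reports zero QPS but the LHS alone already carries more than half the split threshold, since the zero may just mean the RHS leaseholder is an older node that doesn't report QPS.)

```go
package main

import "fmt"

// shouldMerge is a toy stand-in for the merge queue's QPS check.
func shouldMerge(lhsQPS, rhsQPS, splitByLoadQPSThreshold float64) bool {
	if rhsQPS == 0 && lhsQPS > splitByLoadQPSThreshold/2 {
		// Be conservative: the RHS's zero may be a missing (pre-2.2) stat,
		// and merging could immediately trigger another load-based split.
		return false
	}
	return lhsQPS+rhsQPS < splitByLoadQPSThreshold
}

func main() {
	fmt.Println(shouldMerge(1500, 0, 2500)) // false: suspicious zero on the RHS
	fmt.Println(shouldMerge(100, 50, 2500)) // true: genuinely cold ranges
}
```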
ridwanmsharif commented Oct 16, 2018 (edited)
This PR works off of #21406 and implements load-based splitting.
Issue: #19154
Currently, you have to wait for a decent amount of data to be inserted into
CockroachDB before you'll have enough ranges to make use of a big cluster.
This is inconvenient for POC testing and although the
ALTER TABLE SPLIT AT SQL command is available to manually pre-split ranges, its use is difficult
because it requires users to understand some of the architecture and the
nature of their data model.
The algorithm is pretty simple: once a range exceeds the load threshold, we
begin tracking the spans it's servicing in requests. We use a reservoir
sample, inserting span.Key. If a sample doesn't replace an entry in the
reservoir, we take the opportunity instead to increment one or both of two
counters each reservoir sample maintains: left and right. The left counter is
incremented if the span falls to the left of the reservoir entry's key, and
the right counter if it falls to the right. If the span overlaps the entry's
key, then both counters are incremented.
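As a rough, self-contained sketch of the sampling scheme described above (the names and reservoir size are illustrative, and an explicit spanning counter is added here, beyond the left/right counters the text describes, to make the selection heuristic in the next paragraph concrete):

```go
package main

import (
	"bytes"
	"fmt" // used by the continuation of this sketch further below
	"math/rand"
)

// sample is one candidate split key plus counters describing where the
// observed request spans fall relative to it.
type sample struct {
	key         []byte
	left, right int // spans falling entirely left / right of key
	spanning    int // spans that straddle key (also counted on both sides)
}

// loadSplitter keeps a fixed-size reservoir of candidate split keys.
type loadSplitter struct {
	reservoir []sample
	count     int // total spans observed so far
	capacity  int
}

func newLoadSplitter(capacity int) *loadSplitter {
	return &loadSplitter{capacity: capacity}
}

// record feeds one request span [start, end) into the reservoir.
func (ls *loadSplitter) record(start, end []byte) {
	ls.count++
	if len(ls.reservoir) < ls.capacity {
		ls.reservoir = append(ls.reservoir, sample{key: start})
		return
	}
	// Classic reservoir sampling: the i-th span replaces a random entry with
	// probability capacity/i; otherwise it updates every entry's counters.
	if j := rand.Intn(ls.count); j < ls.capacity {
		ls.reservoir[j] = sample{key: start}
		return
	}
	for i := range ls.reservoir {
		s := &ls.reservoir[i]
		switch {
		case bytes.Compare(end, s.key) <= 0: // span lies entirely below key
			s.left++
		case bytes.Compare(start, s.key) >= 0: // span lies entirely at/above key
			s.right++
		default: // span straddles key: both sides, and noted as spanning
			s.left++
			s.right++
			s.spanning++
		}
	}
}
```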
When a range has seen sustained load over the threshold rate for a threshold
duration, the key from the reservoir sample with the most even balance of
left and right counts is chosen for the split. Another heuristic used is the
count of requests that span over the split point: those requests would be
strictly worse off after the split, so we try to avoid choosing such a key.
There is again a threshold to prevent poor choices when the load is
sequentially writing keys or otherwise making a good choice of split key
impossible.
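Continuing the sketch from the block above (same hypothetical package), the selection described here could look like the following: reject candidates that too large a fraction of requests would straddle, then pick the candidate with the most balanced left/right counts. The sustained-load/duration check is elided, and the 0.25 cutoff is an arbitrary illustration, not the PR's threshold.

```go
// splitKey returns the most balanced candidate split key, or nil if no
// candidate passes the spanning-requests check.
func (ls *loadSplitter) splitKey(maxSpanningFraction float64) []byte {
	var best []byte
	bestBalance := -1.0
	for _, s := range ls.reservoir {
		total := s.left + s.right
		if total == 0 {
			continue
		}
		// Too many requests would straddle this key and touch both halves
		// after the split, making them strictly worse off.
		if float64(s.spanning)/float64(total) > maxSpanningFraction {
			continue
		}
		lo, hi := float64(s.left), float64(s.right)
		if lo > hi {
			lo, hi = hi, lo
		}
		if balance := lo / hi; balance > bestBalance {
			bestBalance, best = balance, s.key
		}
	}
	return best
}

func main() {
	ls := newLoadSplitter(8)
	// Simulate uniformly distributed point lookups over 1000 keys.
	for i := 0; i < 10000; i++ {
		k := []byte(fmt.Sprintf("key-%04d", rand.Intn(1000)))
		ls.record(k, append(k, 0))
	}
	fmt.Printf("suggested split key: %s\n", ls.splitKey(0.25))
}
```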
Release note: None
Co-authored-by: Ridwan Sharif ridwan@cockroachlabs.com
Co-authored-by: Spencer Kimball spencer@cockroachlabs.com