storage: reconcile manual Range splits and automatic Range merges #37487

nvanbenschoten · 2019-05-12T22:23:41Z

Background reading:

Range Merges RFC (how things were designed)
Range Merges RFC pull request (for related discussion)
Range Merges Tech-note (how things work today)

CockroachDB's distributed keyspace is split into a series of contiguous Ranges, each of which attempts to remain between 32 MB and 64 MB in size. These Ranges are automatically split by the splitQueue based on their size and foreground load. To complement this process, the recent 19.1 release introduced automatic Range merging, where previously split Ranges are merged back together by the mergeQueue if they fall below a size and load threshold.

In addition to these automated processes, CockroachDB has long supported manual hooks into data distribution. The ALTER TABLE ... SPLIT AT allows an operator to manually override Range splitting decisions and inject their own split point into the keyspace. There are a number of reasons why an operator might want to do this, which are discussed here.

The current design of automatic Range merges doesn't play well with these manual Range splits. The mergeQueue is not aware of which Ranges were automatically split and which were manually split and will happily merge away a manual split. For this reason, using manual splits while automatic merges are enabled is currently disallowed. An operator that attempts to perform a manual Range split without disabling automatic Range merging will be greeted with the following error:

splits would be immediately discarded by merge queue; disable the merge queue first by running 'SET CLUSTER SETTING kv.range_merge.queue_enabled = false'

Manual Range splits and automatic Range merges should be fixed to work with one another.

The proposed solution to this was originally discussed in the Range Merges RFC. The RFC mentions:

To preserve these user-created split points, we need to store state. Every range
to the right of a split point created by an ALTER TABLE ... SPLIT AT command
will have a special range local key set that indicates it is not to be merged
into the range to its left. I'll hereafter refer to this key as the "sticky"
bit.

This "sticky" bit is the first part of the solution. A manual split will need to denote somewhere that a Range was split manually. The best discussion about this is in the range merges RFC pull request. It's still not clear to me whether a range local key is a better solution than marking the bit directly on the Range descriptor. Perhaps @bdarnell has an opinion on this.

Once manual splits begin writing this sticky bit, we'll make the mergeQueue search for the sticky bit before deciding on whether to merge a Range into its left neighbor.

Once this is complete, we can allow manual Range splits and automatic Range merges.

Finally, we'll want to introduce a new ALTER TABLE ... UNSPLIT AT command that simply removes a sticky bit if one exists. This again was discussed most thoroughly in the range merges RFC pull request.

Stages:

make manual splits write sticky bit
make merge queue respects sticky bit
remove "splits would be immediately discarded by merge queue" error
expose functionality to check which splits are manually split (crdb_internal.ranges and SHOW EXPERIMENTAL_RANGES)
introduce ALTER TABLE ... UNSPLIT AT syntax to manually remove sticky bit

The text was updated successfully, but these errors were encountered:

bdarnell · 2019-05-13T15:59:44Z

It's still not clear to me whether a range local key is a better solution than marking the bit directly on the Range descriptor. Perhaps @bdarnell has an opinion on this.

You introduced the idea of using a range-local key in #24394 (comment)

Each approach introduces some complexity. Putting it in the range descriptor is probably simpler to start with, but because range descriptors are mirrored into meta ranges and cached, it'll be a little harder to make the UNSPLIT command (it'll need to be a transaction across the local copy of the descriptor and the meta copy. Addressable range-local keys are rare but adding one for this might be simpler in the long run.

One additional stage before the USNPLIT command: we need a way to inspect the current splits and see which ones are sticky. Maybe that's just an extra column for show experimental_ranges, but we need something (ideally non-experimental).

Do we want a MERGE AT command (which would both clear the sticky bit and merge if possible) or just UNSPLIT to clear the sticky bit and leave the rest to the queues?

nvanbenschoten · 2019-05-15T02:58:26Z

One additional stage before the UNSPLIT command: we need a way to inspect the current splits and see which ones are sticky. Maybe that's just an extra column for show experimental_ranges, but we need something (ideally non-experimental).

Good point. This will require an extra lookup per range if we don't add the sticky bit to the range descriptor, so we'll probably want to do something similar to crdb_internal.ranges_no_leases/crdb_internal.ranges.

Do we want a MERGE AT command (which would both clear the sticky bit and merge if possible) or just UNSPLIT to clear the sticky bit and leave the rest to the queues?

I was thinking we would start with just the UNSPLIT at command. That's significantly simpler. I also don't know of any reason why a user would want to manually merge a Range.

nvanbenschoten · 2019-05-15T19:52:16Z

Each approach introduces some complexity. Putting it in the range descriptor is probably simpler to start with, but because range descriptors are mirrored into meta ranges and cached, it'll be a little harder to make the UNSPLIT command (it'll need to be a transaction across the local copy of the descriptor and the meta copy. Addressable range-local keys are rare but adding one for this might be simpler in the long run.

@ajwerner and I discussed this earlier today and we decided that storing the sticky bit in the range descriptor will be a better long-term descriptor. This will simplify things in SQL because it means we won't need to issue a kv lookup per range to search for the sticky bit when populating the virtual table. It will also simplify the mergeQueue's introspection into the sticky bit. Finally, we saw in #37506 a few cases where we had to be careful about touching both the range descriptor and the sticky bit in the same serializable transaction to avoid the local key getting out of sync with split boundaries. This won't be an issue if the bit it stored directly on the range descriptor. Instead, we can just CPut the range descriptor directly to serialize with any conflicting changes.

The complication here is that updating the sticky bit will need to touch both the meta descriptor and the range-local descriptor. To do this, we'll need to use the functions updateRangeAddressing and updateRangeDescriptor.

37506: storage: reconcile manual splitting with automatic merging r=jeffrey-xiao a=jeffrey-xiao Follows the steps outlined in #37487 to reconcile manual splitting with automatic merging. This PR includes the changes needed at the storage layer. The sticky bit indicating that a range is manually split is added to the range descriptor. 37558: docs/tla-plus: add timestamp refreshes to ParallelCommits spec r=nvanbenschoten a=nvanbenschoten This commit adds transaction timestamp refreshes to the PlusCal model for parallel commits. This allows the committing transaction to bump its timestamp without a required epoch bump. This completes the parallel commits formal specification. 37578: opt, sql: fix type inference of TypeCheck for subqueries r=rytaft a=rytaft Prior to this commit, the optimizer was not correctly inferring the types of columns in subqueries for expressions of the form `scalar IN (subquery)`. This was due to two problems which have now been fixed: 1. The subquery was built as a relational expression before the desired types were known. Now the subquery build is delayed until `TypeCheck` is called for the first time. 2. For subqueries on the right side of an `IN` expression, the desired type passed into `TypeCheck` was `AnyTuple`. This resulted in an error later on in `typeCheckSubqueryWithIn`, which checks to make sure the type of the subquery is `tuple{T}` where `T` is the type of the left expression. Now the desired type passed into `TypeCheck` is `tuple{T}`. Note that this commit only fixes type inference for the optimizer. It is still broken in the heuristic planner. Fixes #37263 Fixes #14554 Release note (bug fix): Fixed type inference of columns in subqueries for some expressions of the form `scalar IN (subquery)`. Co-authored-by: Jeffrey Xiao <jeffrey.xiao1998@gmail.com> Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com> Co-authored-by: Rebecca Taft <becca@cockroachlabs.com>

nvanbenschoten · 2019-08-06T18:15:32Z

@jeffrey-xiao we can close this now, right?

nvanbenschoten added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-kv-distribution Relating to rebalancing and leasing. labels May 12, 2019

nvanbenschoten assigned jeffrey-xiao May 12, 2019

jeffrey-xiao mentioned this issue May 14, 2019

storage: reconcile manual splitting with automatic merging #37506

Merged

ajwerner mentioned this issue May 21, 2019

storage: add cluster setting to expire manual split sticky bits #37696

Closed

jeffrey-xiao closed this as completed Aug 7, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

storage: reconcile manual Range splits and automatic Range merges #37487

storage: reconcile manual Range splits and automatic Range merges #37487

nvanbenschoten commented May 12, 2019 •

edited by jeffrey-xiao

Loading

bdarnell commented May 13, 2019

nvanbenschoten commented May 15, 2019

nvanbenschoten commented May 15, 2019

nvanbenschoten commented Aug 6, 2019

storage: reconcile manual Range splits and automatic Range merges #37487

storage: reconcile manual Range splits and automatic Range merges #37487

Comments

nvanbenschoten commented May 12, 2019 • edited by jeffrey-xiao Loading

bdarnell commented May 13, 2019

nvanbenschoten commented May 15, 2019

nvanbenschoten commented May 15, 2019

nvanbenschoten commented Aug 6, 2019

nvanbenschoten commented May 12, 2019 •

edited by jeffrey-xiao

Loading