Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

storage: reconcile manual Range splits and automatic Range merges #37487

Closed
5 tasks done
nvanbenschoten opened this issue May 12, 2019 · 4 comments
Closed
5 tasks done
Assignees
Labels
A-kv-distribution Relating to rebalancing and leasing. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)

Comments

@nvanbenschoten
Copy link
Member

nvanbenschoten commented May 12, 2019

Background reading:

CockroachDB's distributed keyspace is split into a series of contiguous Ranges, each of which attempts to remain between 32 MB and 64 MB in size. These Ranges are automatically split by the splitQueue based on their size and foreground load. To complement this process, the recent 19.1 release introduced automatic Range merging, where previously split Ranges are merged back together by the mergeQueue if they fall below a size and load threshold.

In addition to these automated processes, CockroachDB has long supported manual hooks into data distribution. The ALTER TABLE ... SPLIT AT allows an operator to manually override Range splitting decisions and inject their own split point into the keyspace. There are a number of reasons why an operator might want to do this, which are discussed here.

The current design of automatic Range merges doesn't play well with these manual Range splits. The mergeQueue is not aware of which Ranges were automatically split and which were manually split and will happily merge away a manual split. For this reason, using manual splits while automatic merges are enabled is currently disallowed. An operator that attempts to perform a manual Range split without disabling automatic Range merging will be greeted with the following error:

splits would be immediately discarded by merge queue; disable the merge queue first by running 'SET CLUSTER SETTING kv.range_merge.queue_enabled = false'

Manual Range splits and automatic Range merges should be fixed to work with one another.

The proposed solution to this was originally discussed in the Range Merges RFC. The RFC mentions:

To preserve these user-created split points, we need to store state. Every range
to the right of a split point created by an ALTER TABLE ... SPLIT AT command
will have a special range local key set that indicates it is not to be merged
into the range to its left. I'll hereafter refer to this key as the "sticky"
bit.

This "sticky" bit is the first part of the solution. A manual split will need to denote somewhere that a Range was split manually. The best discussion about this is in the range merges RFC pull request. It's still not clear to me whether a range local key is a better solution than marking the bit directly on the Range descriptor. Perhaps @bdarnell has an opinion on this.

Once manual splits begin writing this sticky bit, we'll make the mergeQueue search for the sticky bit before deciding on whether to merge a Range into its left neighbor.

Once this is complete, we can allow manual Range splits and automatic Range merges.

Finally, we'll want to introduce a new ALTER TABLE ... UNSPLIT AT command that simply removes a sticky bit if one exists. This again was discussed most thoroughly in the range merges RFC pull request.

Stages:

  • make manual splits write sticky bit
  • make merge queue respects sticky bit
  • remove "splits would be immediately discarded by merge queue" error
  • expose functionality to check which splits are manually split (crdb_internal.ranges and SHOW EXPERIMENTAL_RANGES)
  • introduce ALTER TABLE ... UNSPLIT AT syntax to manually remove sticky bit
@nvanbenschoten nvanbenschoten added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-kv-distribution Relating to rebalancing and leasing. labels May 12, 2019
@bdarnell
Copy link
Contributor

It's still not clear to me whether a range local key is a better solution than marking the bit directly on the Range descriptor. Perhaps @bdarnell has an opinion on this.

You introduced the idea of using a range-local key in #24394 (comment)

Each approach introduces some complexity. Putting it in the range descriptor is probably simpler to start with, but because range descriptors are mirrored into meta ranges and cached, it'll be a little harder to make the UNSPLIT command (it'll need to be a transaction across the local copy of the descriptor and the meta copy. Addressable range-local keys are rare but adding one for this might be simpler in the long run.

One additional stage before the USNPLIT command: we need a way to inspect the current splits and see which ones are sticky. Maybe that's just an extra column for show experimental_ranges, but we need something (ideally non-experimental).

Do we want a MERGE AT command (which would both clear the sticky bit and merge if possible) or just UNSPLIT to clear the sticky bit and leave the rest to the queues?

@nvanbenschoten
Copy link
Member Author

One additional stage before the UNSPLIT command: we need a way to inspect the current splits and see which ones are sticky. Maybe that's just an extra column for show experimental_ranges, but we need something (ideally non-experimental).

Good point. This will require an extra lookup per range if we don't add the sticky bit to the range descriptor, so we'll probably want to do something similar to crdb_internal.ranges_no_leases/crdb_internal.ranges.

Do we want a MERGE AT command (which would both clear the sticky bit and merge if possible) or just UNSPLIT to clear the sticky bit and leave the rest to the queues?

I was thinking we would start with just the UNSPLIT at command. That's significantly simpler. I also don't know of any reason why a user would want to manually merge a Range.

@nvanbenschoten
Copy link
Member Author

Each approach introduces some complexity. Putting it in the range descriptor is probably simpler to start with, but because range descriptors are mirrored into meta ranges and cached, it'll be a little harder to make the UNSPLIT command (it'll need to be a transaction across the local copy of the descriptor and the meta copy. Addressable range-local keys are rare but adding one for this might be simpler in the long run.

@ajwerner and I discussed this earlier today and we decided that storing the sticky bit in the range descriptor will be a better long-term descriptor. This will simplify things in SQL because it means we won't need to issue a kv lookup per range to search for the sticky bit when populating the virtual table. It will also simplify the mergeQueue's introspection into the sticky bit. Finally, we saw in #37506 a few cases where we had to be careful about touching both the range descriptor and the sticky bit in the same serializable transaction to avoid the local key getting out of sync with split boundaries. This won't be an issue if the bit it stored directly on the range descriptor. Instead, we can just CPut the range descriptor directly to serialize with any conflicting changes.

The complication here is that updating the sticky bit will need to touch both the meta descriptor and the range-local descriptor. To do this, we'll need to use the functions updateRangeAddressing and updateRangeDescriptor.

craig bot pushed a commit that referenced this issue May 20, 2019
37506: storage: reconcile manual splitting with automatic merging r=jeffrey-xiao a=jeffrey-xiao

Follows the steps outlined in #37487 to reconcile manual splitting with automatic merging. This PR includes the changes needed at the storage layer. The sticky bit indicating that a range is manually split is added to the range descriptor.

37558: docs/tla-plus: add timestamp refreshes to ParallelCommits spec r=nvanbenschoten a=nvanbenschoten

This commit adds transaction timestamp refreshes to the PlusCal model
for parallel commits. This allows the committing transaction to bump
its timestamp without a required epoch bump.

This completes the parallel commits formal specification.

37578: opt, sql: fix type inference of TypeCheck for subqueries r=rytaft a=rytaft

Prior to this commit, the optimizer was not correctly inferring the types of
columns in subqueries for expressions of the form `scalar IN (subquery)`.
This was due to two problems which have now been fixed:

1. The subquery was built as a relational expression before the desired types
   were known. Now the subquery build is delayed until `TypeCheck` is called for
   the first time.

2. For subqueries on the right side of an `IN` expression, the desired type
   passed into `TypeCheck` was `AnyTuple`. This resulted in an error later on in
   `typeCheckSubqueryWithIn`, which checks to make sure the type of the subquery
   is `tuple{T}` where `T` is the type of the left expression. Now the desired
   type passed into `TypeCheck` is `tuple{T}`.

Note that this commit only fixes type inference for the optimizer. It is still
broken in the heuristic planner.

Fixes #37263
Fixes #14554

Release note (bug fix): Fixed type inference of columns in subqueries for
some expressions of the form `scalar IN (subquery)`.

Co-authored-by: Jeffrey Xiao <jeffrey.xiao1998@gmail.com>
Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Co-authored-by: Rebecca Taft <becca@cockroachlabs.com>
@nvanbenschoten
Copy link
Member Author

@jeffrey-xiao we can close this now, right?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-kv-distribution Relating to rebalancing and leasing. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
Projects
None yet
Development

No branches or pull requests

3 participants