
backupccl: prepare RESTORE router for multitenancy #81989

Open
msbutler opened this issue May 27, 2022 · 3 comments
Labels
A-disaster-recovery C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. sync-me sync-me-5 T-disaster-recovery

Comments

@msbutler
Collaborator

msbutler commented May 27, 2022

In a multi-tenant cluster, RESTORE's DistSQL processors are assigned to SQL instances by sqlInstanceID. Currently, the splitAndScatterProcessor routes a scattered range to the SQL instance running the restoreProcessor using the nodeID returned by the adminScatterRequest, which actually identifies a KV instance. In other words, to route ranges for restore ingestion after scatter, we currently assume the list of sqlInstanceIDs from planning is identical to the nodeIDs returned by split and scatter during execution. That is certainly not the case, which implies multitenant restore could be significantly slower. For example, if there are fewer KV instances than planned SQL instances, a subset of SQL instances would never be sent any ranges to ingest!

In a non-multiregion multitenant cluster, we don't know (or even care) which SQL instance is "closest" to a given KV instance; thus, we ought to route ranges for ingestion such that we balance load across all available SQL instances.

  • A simple solution: use a hashRouter as opposed to a rangeRouter. During planning, map each available KV node to a set of SQL instances. If the restore job detects significant churn of SQL instances, the job should be replanned.
  • A better solution: route ranges to sql instances dynamically. I'm not sure if this is possible right now.
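
The "simple solution" above could be sketched as follows. This is a minimal, hypothetical illustration in plain Go: `planRouting` and `routeRange` are invented names, and the ID types stand in for `roachpb.NodeID` and `base.SQLInstanceID`; a real implementation would hook into the DistSQL router interfaces instead.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// Stand-ins for roachpb.NodeID and base.SQLInstanceID.
type kvNodeID int
type sqlInstanceID int

// planRouting assigns every available SQL instance to some KV node in
// round-robin order, so every SQL instance appears in exactly one set and
// receives work even when there are fewer KV nodes than SQL instances.
// Assumes len(sqlInstances) >= len(kvNodes), so every KV node gets a set.
func planRouting(kvNodes []kvNodeID, sqlInstances []sqlInstanceID) map[kvNodeID][]sqlInstanceID {
	m := make(map[kvNodeID][]sqlInstanceID)
	for i, inst := range sqlInstances {
		kv := kvNodes[i%len(kvNodes)]
		m[kv] = append(m[kv], inst)
	}
	return m
}

// routeRange picks a SQL instance for a scattered range by hashing the
// range's start key into the set mapped to the KV node that adminScatter
// reported, spreading that node's ranges across its instance set.
func routeRange(m map[kvNodeID][]sqlInstanceID, kv kvNodeID, startKey []byte) sqlInstanceID {
	set := m[kv]
	h := fnv.New32a()
	h.Write(startKey)
	return set[int(h.Sum32())%len(set)]
}

func main() {
	// 2 KV nodes, 4 SQL instances: with the current nodeID-based routing,
	// instances 3 and 4 would never be sent any ranges.
	m := planRouting([]kvNodeID{1, 2}, []sqlInstanceID{1, 2, 3, 4})
	fmt.Println(m[1], m[2]) // [1 3] [2 4]
	fmt.Println(routeRange(m, 1, []byte("table/5/a")))
}
```

Because the KV-node-to-set mapping is fixed at planning time, this is where the replanning caveat bites: if SQL instances churn significantly, the planned sets go stale and the job should be replanned.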

In a multiregion multitenant cluster, we will likely want to route a range to a sql instance that is "close" to the range's leaseholder (or at least a follower?). Solution: apply the solution above, by region.
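
The regional variant could be sketched like this, again as a hypothetical, self-contained illustration (`regionalSets` and the region names are invented; real code would pull localities from node/instance descriptors):

```go
package main

import (
	"fmt"
	"sort"
)

// Stand-ins for roachpb.NodeID and base.SQLInstanceID.
type kvNodeID int
type sqlInstanceID int

// regionalSets maps each KV node to the SQL instances in the same region,
// so ranges scattered to that node are ingested "close" to the leaseholder.
// A KV node in a region with no SQL instances falls back to all instances.
func regionalSets(
	kvRegion map[kvNodeID]string,
	sqlRegion map[sqlInstanceID]string,
) map[kvNodeID][]sqlInstanceID {
	byRegion := make(map[string][]sqlInstanceID)
	var all []sqlInstanceID
	for inst, region := range sqlRegion {
		byRegion[region] = append(byRegion[region], inst)
		all = append(all, inst)
	}
	out := make(map[kvNodeID][]sqlInstanceID)
	for kv, region := range kvRegion {
		set := byRegion[region]
		if len(set) == 0 {
			set = all
		}
		// Sort for deterministic routing; map iteration order is random.
		sort.Slice(set, func(i, j int) bool { return set[i] < set[j] })
		out[kv] = set
	}
	return out
}

func main() {
	sets := regionalSets(
		map[kvNodeID]string{1: "us-east", 2: "us-west"},
		map[sqlInstanceID]string{10: "us-east", 11: "us-east", 12: "us-west"},
	)
	fmt.Println(sets[1], sets[2]) // [10 11] [12]
}
```

Within each region the same hash-based balancing as the non-multiregion case would then apply; the open question of routing to a follower's region instead would only change how `kvRegion` is derived.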

Jira issue: CRDB-16375

@msbutler msbutler added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-disaster-recovery T-disaster-recovery labels May 27, 2022
@blathers-crl

blathers-crl bot commented May 27, 2022

cc @cockroachdb/bulk-io

@msbutler msbutler changed the title backupccl: prepare restore router for Multitenancy backupccl: prepare RESTORE router for Multitenancy May 27, 2022
@msbutler msbutler changed the title backupccl: prepare RESTORE router for Multitenancy backupccl: prepare RESTORE router for multitenancy May 27, 2022
@mari-crl mari-crl added sync-me and removed sync-me labels Jun 2, 2022
@livlobo livlobo moved this from Triage to Backup/Restore in Disaster Recovery Backlog Jun 7, 2022

github-actions bot commented Dec 4, 2023

We have marked this issue as stale because it has been inactive for 18 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 10 days to keep the issue queue tidy. Thank you for your contribution to CockroachDB!

@msbutler
Collaborator Author

Not working on this, but this is still a problem.

@msbutler msbutler removed their assignment Dec 15, 2023
3 participants