backupccl: prepare RESTORE router for multitenancy #81989
Labels: A-disaster-recovery, C-bug, sync-me, sync-me-5, T-disaster-recovery
In a multi-tenant cluster, RESTORE's distSQL processors are assigned to SQL instances using the `sqlInstanceID`. Currently, the `splitAndScatterProcessor` routes a scattered range to a SQL instance running the `restoreProcessor` using the `nodeID` returned by the `adminScatterRequest`, which actually identifies a KV instance. In other words, to route ranges for restore ingestion after scatter, we currently assume the list of `sqlInstanceID`s from planning is identical to the `nodeID`s returned by split and scatter during execution, which is certainly not the case, implying multitenant restore could be significantly slower. If there are fewer KV instances than planned SQL instances, for example, a subset of SQL instances would never be sent any ranges to ingest!

In a non-multiregion multitenant cluster, we don't know (or even care) which SQL instance is "closest" to a given KV instance; thus, we ought to route ranges for ingestion such that we balance load across all available SQL instances. Solution: use a `hashRouter` as opposed to a `rangeRouter`. During planning, map each available KV node to a set of SQL instances. If the restore job detects significant churn of SQL instances, the job should be replanned.

In a multiregion multitenant cluster, we will likely want to route a range to a SQL instance that is "close" to the range's leaseholder (or at least a follower?). Solution: apply the solution above, by region.
Jira issue: CRDB-16375