
backup: allow restricting backup coordination by region #95791

Merged (3 commits) Feb 8, 2023

Conversation

@dt dt (Member) commented Jan 24, 2023

The coordinator of a backup job needs to access the 'default' locality
to read and write metadata including the backup checkpoint files and
manifest. Prior to this change, this meant that every node in the
cluster needed to be able to access this default locality storage
location, since any node could become the coordinator.

This change introduces a new locality filter option that can be
specified when the backup job is created. If set, this option causes any
node that attempts to execute the backup job to check whether it meets
the locality requirements and, if not, to move execution of the job to a
node that does meet them, or to fail if no such node can be found.

A locality requirement is specified as any number of key=value pairs,
each of which a node must match to be eligible to execute the job. For
example, a job run with BACKUP ... WITH coordinator_locality =
'region=east,cloud=azure' would require a node to have both 'region=east'
and 'cloud=azure'. The order of the pairs is not significant; only that
each specified filter is met.
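
For illustration only, here is a minimal, self-contained Go sketch of the matching semantics described above; matchesFilter is a hypothetical helper written for this explanation, not code in this PR:

package main

import (
	"fmt"
	"strings"
)

// matchesFilter reports whether every key=value pair in the filter appears in
// the node's locality tiers; the order of the pairs does not matter.
// Hypothetical illustration only, not CockroachDB's implementation.
func matchesFilter(nodeLocality, filter string) bool {
	have := map[string]bool{}
	for _, tier := range strings.Split(nodeLocality, ",") {
		have[strings.TrimSpace(tier)] = true
	}
	for _, want := range strings.Split(filter, ",") {
		if !have[strings.TrimSpace(want)] {
			return false
		}
	}
	return true
}

func main() {
	// A node in the east region on azure satisfies the example filter,
	// regardless of tier order.
	fmt.Println(matchesFilter("cloud=azure,region=east,az=east-1", "region=east,cloud=azure")) // true
	fmt.Println(matchesFilter("cloud=gce,region=east", "region=east,cloud=azure"))             // false
}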

Jobs typically begin executing directly on the node on which they are
created, so if that node has matching localities it will execute the
job; if it does not, it will relocate the job.

Relocated executions may take some time to resume on the new node to
which they were relocated, similar to the delay seen when a paused job
is resumed, typically between a few seconds and a minute.

Note that this only restricts the coordination of the backup job --
reading the row data from individual ranges and exporting that data to
the destination storage location or locations is still performed by many
nodes, typically the leaseholders for each range.

Release note (enterprise change): coordination of BACKUP jobs and thus
writing of BACKUP metadata can be restricted to nodes within designated
localities using the new 'coordinator_locality' option.

Epic: CRDB-9547.

@dt dt requested review from stevendanna and rhu713 January 24, 2023 21:25
@dt dt requested review from a team as code owners January 24, 2023 21:25
@dt dt requested a review from michae2 January 24, 2023 21:25
@cockroach-teamcity (Member): This change is Reviewable

stevendanna previously approved these changes Jan 25, 2023
@stevendanna stevendanna (Collaborator) left a comment


Overall, it is pretty pleasing how small the implementation is for this feature.

I left some questions and comments, but most are style related, so take them or leave them.

One question I have is whether the errors produced during a relocation contain enough information to not confuse users into thinking something is wrong.

	id jobspb.JobID,
	destID base.SQLInstanceID,
	destSession sqlliveness.SessionID,
) (sentinel error, failure error) {
Collaborator:

🤷 I can't decide if a function that returns two errors is more or less weird than a function that always returns an error.

@dt (Member Author):

I know, right? I went back and forth on this and decided I liked this best: I want to indicate failure vs success separately since e.g. callers are encouraged to Wrapf() their failures with contextual info or log them or clean up differently. I thought about making the first have a different type that just happened to implement Error() but I eventually decided error made it most obvious that you return this thing from Resume() and don't go make your own.
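
As an illustration of the calling convention being discussed, here is a self-contained Go sketch; errRelocated, relocate, and resume are hypothetical names standing in for the real sentinel, the relocation helper, and a Resume() method, and this is not the code in the PR:

package main

import (
	"errors"
	"fmt"
)

// errRelocated is a hypothetical sentinel meaning the job was successfully
// handed off to another coordinator; it is returned from resume as-is rather
// than being treated as a real failure.
var errRelocated = errors.New("job relocated to another node")

// relocate returns a sentinel on success and a separate failure error if the
// relocation attempt itself failed (mirroring the two-error shape above).
func relocate(destinationFound bool) (sentinel error, failure error) {
	if !destinationFound {
		return nil, errors.New("no eligible node found for locality filter")
	}
	return errRelocated, nil
}

func resume() error {
	sentinel, failure := relocate(true)
	if failure != nil {
		// Callers are encouraged to wrap failures with contextual info.
		return fmt.Errorf("relocating job: %w", failure)
	}
	// Return the sentinel unchanged so the job system sees the hand-off.
	return sentinel
}

func main() {
	fmt.Println(resume()) // job relocated to another node
}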

Comment on lines +1822 to +1831
destID base.SQLInstanceID,
destSession sqlliveness.SessionID,
Collaborator:

Should this function take a sqlinstance.InstanceInfo (or some other type that bundles the instance id and session ID)?

@dt dt (Member Author) commented Jan 30, 2023:

InstanceInfo has a lot more going on than I need/have. I don't know if we have anything that is just the pair of id and session. I'm not sure I feel like they need to be paired, conceptually? I guess I read ID as the destination, and session as a way to make sure it doesn't get dropped in transit but they're not obviously more linked than the other individual args?
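
For reference, the kind of bundled type the question alludes to could look roughly like this hypothetical fragment; it reuses the types from the signature quoted above and is not an existing type in the codebase:

// destination is a hypothetical bundling of the two arguments discussed here:
// the instance to move the job to, and the session used to confirm that the
// destination is still alive when the hand-off lands.
type destination struct {
	id      base.SQLInstanceID
	session sqlliveness.SessionID
}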

Resolved review threads: pkg/jobs/registry.go, pkg/roachpb/metadata_test.go, pkg/sql/parser/sql.y
// Check that a node will currently be able to run this before we create it.
if coordinatorLocality.NonEmpty() {
	if _, err := p.DistSQLPlanner().GetAllInstancesByLocality(ctx, coordinatorLocality); err != nil {
		return err
	}
}
Collaborator:

What does this error end up looking like? I'm wondering if we should wrap it, mentioning the COORDINATOR_LOCALITY option so the user knows what they might need to change.

@dt (Member Author):

It looks like pq: no instances found matching locality filter %s. It includes the filter that didn't match, which seems more useful than naming the option the filter was passed in?
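
If the option name were surfaced as suggested, one possible shape is a wrap at the call site quoted above; this fragment is an assumption for illustration (using errors.Wrap in the cockroachdb/errors style), not what the PR ships:

if _, err := p.DistSQLPlanner().GetAllInstancesByLocality(ctx, coordinatorLocality); err != nil {
	// Hypothetical: mention the option so the user knows what to adjust.
	return errors.Wrap(err, "validating the COORDINATOR_LOCALITY option")
}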

Resolved (outdated) review thread: pkg/ccl/backupccl/backup_job.go
@stevendanna stevendanna dismissed their stale review January 26, 2023 09:31

Remembered another issue.

@michae2 michae2 (Collaborator) commented Jan 26, 2023

SQL changes LGTM.

@michae2 michae2 removed request for a team and michae2 January 26, 2023 22:04
@dt dt force-pushed the move branch 5 times, most recently from ed49d93 to 5951f80 on January 31, 2023 16:50
@stevendanna stevendanna (Collaborator):
Nice work here. I'm good with this as a step. Let's be sure not to forget the alter and create schedule changes before shipping it.

@dt dt (Member Author) commented Feb 8, 2023

TFTRs!

bors r+

@craig craig bot (Contributor) commented Feb 8, 2023

Build failed (retrying...):

@craig craig bot (Contributor) commented Feb 8, 2023

Build succeeded:
