release-20.1: backupccl: cluster restore import and restore jobs as canceled #63773

pbardea · 2021-04-16T02:17:59Z

First commit is a backport of #60458.

IMPORT and RESTORE may write non-transactionally, so their writes cannot
be trusted to be included in every backup. As such, they should be
restored in a reverting state to attempt to undo any of their untrusted
writes.

Release note (bug fix): IMPORT and RESTORE jobs are now restored as
reverting so that they clean up after themselves. Previously, some of the
writes these jobs made while running may have been missed by the
backup.

ajwerner

Reviewed 5 of 5 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @dt, and @pbardea)

pkg/ccl/backupccl/restore_job.go, line 1358 at r1 (raw file):

		}

		if err := db.Txn(ctx, func(ctx context.Context, txn *kv.Txn) error {

in 21.1 you've got a strong claim that no other instance is concurrently running. In 20.1 it's relatively best-effort. It's not likely but it is very possible. It would not hurt to re-read the details in this txn.

pkg/ccl/backupccl/restore_job.go, line 1459 at r2 (raw file):

				}

				var updateStatusQuery strings.Builder

nit: I'd pull out the query string construction into a helper function or closure.
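
A minimal sketch of what such a helper might look like. The function name `buildUpdateStatusQuery`, the table and column names, and the placeholder layout are all assumptions made for illustration; they are not taken from the code under review.

```go
package main

import (
	"fmt"
	"strings"
)

// buildUpdateStatusQuery is a hypothetical helper. It returns an UPDATE
// statement with one placeholder for the new status ($1) followed by one
// placeholder per job ID ($2, $3, ...). The table and column names are
// illustrative only.
func buildUpdateStatusQuery(numJobs int) string {
	var q strings.Builder
	q.WriteString("UPDATE system.jobs SET status = $1 WHERE id IN (")
	for i := 0; i < numJobs; i++ {
		if i > 0 {
			q.WriteString(", ")
		}
		// Placeholder numbering starts at $2 because $1 is the status.
		fmt.Fprintf(&q, "$%d", i+2)
	}
	q.WriteString(")")
	return q.String()
}

func main() {
	fmt.Println(buildUpdateStatusQuery(3))
}
```

With three job IDs this returns `UPDATE system.jobs SET status = $1 WHERE id IN ($2, $3, $4)`. A caller would presumably skip the statement entirely when there are no jobs, since an empty `IN ()` list is not valid SQL.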

pkg/ccl/backupccl/restore_job.go, line 1470 at r2 (raw file):

				fmt.Fprint(&updateStatusQuery, ")")

				fmt.Println(updateStatusQuery.String())

detritus?

pkg/ccl/backupccl/restore_test.go, line 13 at r1 (raw file):

import (
	"context"
	fmt "fmt"

🤔

pkg/jobs/jobspb/jobs.proto, line 99 at r1 (raw file):

  // SystemTablesRestored keeps track of dynamic states that need to happen only
  // once during the lifetime of a job. Note, that this state may be shared
  // between job versions, so updates to this map must be considered carefully.

what do you mean "job versions"?

Also, why is this in details and not progress? I imagine that it's because that ship has sailed but :(

pkg/jobs/jobspb/jobs.proto, line 102 at r1 (raw file):

  // It maps system table names to whether or not they have already been
  // restored.
  map<string, bool> system_tables_restored = 17;

This has been backported to every branch after this with the same definition, right?

Previously, if a cluster restore were to restart (e.g. in the case of the coordinator crashing) after restoring the system tables, but before completing, the job would attempt to restore the system table data again and potentially fail on non-idempotent system table restoration functions (e.g. restoring jobs).

This commit adds a new SystemTablesRestored field to the restore details to record each system table that the job has already restored, so that progress is preserved across restarts. This is required, since not all system tables may need to be restored atomically in the future.

Release note (bug fix): Fix a bug where cluster restore would sometimes (very rarely) fail after retrying.
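
The idea behind the `system_tables_restored` map can be illustrated with a small, self-contained sketch. The type `restoreDetails` and the function `restoreSystemTables` are hypothetical stand-ins, not the actual backupccl code, and the sketch keeps the map in memory, whereas the real job would have to persist it with the job record.

```go
package main

import "fmt"

// restoreDetails is a stand-in for the job's persisted details; only the
// field discussed above is modeled.
type restoreDetails struct {
	SystemTablesRestored map[string]bool
}

// restoreSystemTables calls restoreFn once per table and skips any table
// already marked as restored, so a retried job does not re-run a
// non-idempotent restoration function.
func restoreSystemTables(
	d *restoreDetails, tables []string, restoreFn func(table string) error,
) error {
	if d.SystemTablesRestored == nil {
		d.SystemTablesRestored = make(map[string]bool)
	}
	for _, t := range tables {
		if d.SystemTablesRestored[t] {
			continue // restored by an earlier attempt
		}
		if err := restoreFn(t); err != nil {
			return err // the table stays unmarked and is retried later
		}
		// Assumption: a real job would persist this update before moving on.
		d.SystemTablesRestored[t] = true
	}
	return nil
}

func main() {
	d := &restoreDetails{}
	calls := map[string]int{}
	fn := func(t string) error { calls[t]++; return nil }
	tables := []string{"jobs", "users"}

	_ = restoreSystemTables(d, tables, fn)
	_ = restoreSystemTables(d, tables, fn) // simulated retry after a restart
	fmt.Println(calls["jobs"], calls["users"])
}
```

In this sketch the second call is a no-op for both tables, so each restoration function runs exactly once.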

pbardea

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @dt)

pkg/ccl/backupccl/restore_job.go, line 1358 at r1 (raw file):

Previously, ajwerner wrote…

in 21.1 you've got a strong claim that no other instance is concurrently running. In 20.1 it's relatively best-effort. It's not likely but it is very possible. It would not hurt to re-read the details in this txn.

Moved the reading of the details to be contained within the txn.
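
As a rough illustration of why the read belongs inside the transaction, the sketch below uses a toy in-memory `store` whose `txn` method holds a mutex as a stand-in for transactional isolation. All names here are hypothetical and none of this is the `kv` API; the point is only that the check and the write happen against the same, current view of the details rather than a copy read earlier.

```go
package main

import (
	"fmt"
	"sync"
)

// store is a toy stand-in for a transactional store holding job details.
type store struct {
	mu      sync.Mutex
	details map[string]bool
}

// txn runs fn while holding the lock. The lock only approximates the
// isolation a real transaction would provide.
func (s *store) txn(fn func(details map[string]bool) error) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	return fn(s.details)
}

// markRestored re-reads the details inside the txn instead of trusting a
// copy read beforehand, so a second job instance observes the first
// instance's update and skips the work.
func markRestored(s *store, table string) (alreadyDone bool, err error) {
	err = s.txn(func(details map[string]bool) error {
		if details[table] {
			alreadyDone = true
			return nil
		}
		details[table] = true
		return nil
	})
	return alreadyDone, err
}

func main() {
	s := &store{details: map[string]bool{}}
	first, _ := markRestored(s, "jobs")
	second, _ := markRestored(s, "jobs")
	fmt.Println(first, second) // false true
}
```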

pkg/ccl/backupccl/restore_job.go, line 1459 at r2 (raw file):

Previously, ajwerner wrote…

nit: I'd pull out the query string construction into a helper function or closure.

Done.

pkg/ccl/backupccl/restore_job.go, line 1470 at r2 (raw file):

Previously, ajwerner wrote…

detritus?

Yep, thanks for catching. Removed.

pkg/ccl/backupccl/restore_test.go, line 13 at r1 (raw file):

Previously, ajwerner wrote…

🤔

Something seems to be automatically adding these. Perhaps it's something up with my VSCode setup. Looks like we have quite a few violations throughout the codebase.

Perhaps there's a way to lint against these redundant import aliases.
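
One possible shape for such a check, sketched with the standard `go/parser` package. The function name `redundantAliases` is hypothetical, and the check is a heuristic: it compares the alias with the last element of the import path, which is not always the package's declared name.

```go
package main

import (
	"fmt"
	"go/parser"
	"go/token"
	"path"
	"strconv"
)

// redundantAliases returns the import paths in src whose explicit alias is
// identical to the last element of the path, e.g. `fmt "fmt"`.
func redundantAliases(filename, src string) ([]string, error) {
	fset := token.NewFileSet()
	// ImportsOnly stops parsing after the import declarations.
	f, err := parser.ParseFile(fset, filename, src, parser.ImportsOnly)
	if err != nil {
		return nil, err
	}
	var out []string
	for _, imp := range f.Imports {
		if imp.Name == nil {
			continue // no alias on this import
		}
		p, err := strconv.Unquote(imp.Path.Value)
		if err != nil {
			return nil, err
		}
		if imp.Name.Name == path.Base(p) {
			out = append(out, p)
		}
	}
	return out, nil
}

func main() {
	const src = `package p

import (
	"context"
	fmt "fmt"
	gosql "database/sql"
)
`
	found, err := redundantAliases("example.go", src)
	if err != nil {
		panic(err)
	}
	fmt.Println(found) // [fmt]
}
```

Only `fmt "fmt"` is reported here; `gosql "database/sql"` is left alone because the alias differs from the path's last element.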

pkg/jobs/jobspb/jobs.proto, line 99 at r1 (raw file):

Previously, ajwerner wrote…

what do you mean "job versions"?

Also, why is this in details and not progress? I imagine that it's because that ship has sailed but :(

I think I originally meant "job implementation versions", changed to "cluster versions" for more clarity.

Ya, it is unfortunate. We already have a few of these state-tracking progress metrics in details and didn't break the momentum here. Will try to keep in mind moving forward though.

pkg/jobs/jobspb/jobs.proto, line 102 at r1 (raw file):

Previously, ajwerner wrote…

This has been backported to every branch after this with the same definition, right?

Yep:

20.2 - cockroach/pkg/jobs/jobspb/jobs.proto, line 169 in 0f4525d:

map<string, bool> system_tables_restored = 17;

21.1 - cockroach/pkg/jobs/jobspb/jobs.proto, line 192 in 0e1e28e:

map<string, bool> system_tables_restored = 17;

ajwerner

Reviewed 3 of 5 files at r3, 1 of 2 files at r4.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner and @dt)

pbardea · 2021-04-19T15:34:09Z

Thanks for the review!

pbardea requested review from dt and ajwerner April 16, 2021 02:17

pbardea force-pushed the backport20.1-60458 branch 2 times, most recently from b8fa0ed to 6158bda Compare April 16, 2021 14:10

ajwerner reviewed Apr 16, 2021

pbardea force-pushed the backport20.1-60458 branch from 6158bda to 50585b7 Compare April 16, 2021 19:32

pbardea commented Apr 16, 2021

pbardea force-pushed the backport20.1-60458 branch from 50585b7 to 93c6eff Compare April 16, 2021 19:35

pbardea force-pushed the backport20.1-60458 branch from 93c6eff to 8f06e69 Compare April 16, 2021 19:35

ajwerner approved these changes Apr 19, 2021

pbardea changed the title ~~release-20.1: backupccl: cluster restore import and restore jobs as reverting~~ release-20.1: backupccl: cluster restore import and restore jobs as canceled Apr 19, 2021

pbardea merged commit 9c5af7f into cockroachdb:release-20.1 Apr 19, 2021
