Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

release-20.1: backupccl: cluster restore import and restore jobs as canceled #63773

Merged
merged 2 commits into from
Apr 19, 2021

Conversation

pbardea
Copy link
Contributor

@pbardea pbardea commented Apr 16, 2021

First commit is a backport of #60458.

IMPORT and RESTORE may write non-transactionally, so their writes cannot
be trusted to be included in every backup. As such, they should be
restored in a reverting state to attempt to undo any of their untrusted
writes.

Release note (bug fix): IMPORT and RESTORE jobs are now restored as
reverting so that they cleanup after themselves. Previously, some of the
writes of the jobs while they were running may have been missed by
backup.

@pbardea pbardea requested review from dt and ajwerner April 16, 2021 02:17
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@pbardea pbardea force-pushed the backport20.1-60458 branch 2 times, most recently from b8fa0ed to 6158bda Compare April 16, 2021 14:10
Copy link
Contributor

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 5 of 5 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @dt, and @pbardea)


pkg/ccl/backupccl/restore_job.go, line 1358 at r1 (raw file):

		}

		if err := db.Txn(ctx, func(ctx context.Context, txn *kv.Txn) error {

in 21.1 you've got a strong claim that no other instance is concurrently running. In 20.1 it's relatively best-effort. It's not likely but it is very possible. It would not hurt to re-read the details in this txn.


pkg/ccl/backupccl/restore_job.go, line 1459 at r2 (raw file):

				}

				var updateStatusQuery strings.Builder

nit: I'd pull out the query string construction into a helper function or closure.


pkg/ccl/backupccl/restore_job.go, line 1470 at r2 (raw file):

				fmt.Fprint(&updateStatusQuery, ")")

				fmt.Println(updateStatusQuery.String())

detritus?


pkg/ccl/backupccl/restore_test.go, line 13 at r1 (raw file):

import (
	"context"
	fmt "fmt"

🤔


pkg/jobs/jobspb/jobs.proto, line 99 at r1 (raw file):

  // SystemTablesRestored keeps track of dynamic states that need to happen only
  // once during the lifetime of a job. Note, that this state may be shared
  // between job versions, so updates to this map must be considered carefully.

what do you mean "job versions"?

Also, why is this in details and not progress? I imagine that it's because that ship has sailed but :(


pkg/jobs/jobspb/jobs.proto, line 102 at r1 (raw file):

  // It maps system table names to whether or not they have already been
  // restored.
  map<string, bool> system_tables_restored = 17;

This has been backported to every branch after this with the same definition, right?

Previously, if a cluster restore were to restart (e.g. in the case of
the coordinator crashing) after restoring the system tables, but before
completing, the job would attempt to restore the system table data again
and potentially fail on non-idempotent system table restoration
functions (e.g. restoring jobs).

This commit, adds a new SystemTablesRestored field to the restore
details to keep track of each system table that it has restored and
maintains progress. This is required, since not all system tables may
need to be restored atomically in the future.

Release note (bug fix): Fix a bug where cluster restore would sometimes
(very rarely) fail after retrying.
Copy link
Contributor Author

@pbardea pbardea left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @dt)


pkg/ccl/backupccl/restore_job.go, line 1358 at r1 (raw file):

Previously, ajwerner wrote…

in 21.1 you've got a strong claim that no other instance is concurrently running. In 20.1 it's relatively best-effort. It's not likely but it is very possible. It would not hurt to re-read the details in this txn.

Moved the reading of the details to be contained within the txn.


pkg/ccl/backupccl/restore_job.go, line 1459 at r2 (raw file):

Previously, ajwerner wrote…

nit: I'd pull out the query string construction into a helper function or closure.

Done.


pkg/ccl/backupccl/restore_job.go, line 1470 at r2 (raw file):

Previously, ajwerner wrote…

detritus?

Yep, thanks for catching. Removed.


pkg/ccl/backupccl/restore_test.go, line 13 at r1 (raw file):

Previously, ajwerner wrote…

🤔

Something seems to be automatically adding these. Perhaps it's something up with my VSCode setup. Looks like we have quite a few violations throughout the codebase.

Perhaps there's a way to lint against these redundant import aliases.


pkg/jobs/jobspb/jobs.proto, line 99 at r1 (raw file):

Previously, ajwerner wrote…

what do you mean "job versions"?

Also, why is this in details and not progress? I imagine that it's because that ship has sailed but :(

I think I originally meant "job implementation versions", changed to "cluster versions" for more clarity.

Ya, it is unfortunate. We already have a few of these state-tracking progress metrics in details and didn't break the momentum here. Will try to keep in mind moving forward though.


pkg/jobs/jobspb/jobs.proto, line 102 at r1 (raw file):

Previously, ajwerner wrote…

This has been backported to every branch after this with the same definition, right?

Yep:
20.2 -

map<string, bool> system_tables_restored = 17;

21.1 -
map<string, bool> system_tables_restored = 17;

IMPORT and RESTORE may write non-transactionally, so their writes cannot
be trusted to be included in every backup. As such, they should be
restored in a reverting state to attempt to undo any of their untrusted
writes.

Release note (bug fix): IMPORT and RESTORE jobs are now restored as
reverting so that they cleanup after themselves. Previously, some of the
writes of the jobs while they were running may have been missed by
backup.
Copy link
Contributor

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 3 of 5 files at r3, 1 of 2 files at r4.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner and @dt)

@pbardea pbardea changed the title release-20.1: backupccl: cluster restore import and restore jobs as reverting release-20.1: backupccl: cluster restore import and restore jobs as canceled Apr 19, 2021
@pbardea
Copy link
Contributor Author

pbardea commented Apr 19, 2021

Thanks for the review!

@pbardea pbardea merged commit 9c5af7f into cockroachdb:release-20.1 Apr 19, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants