-
Notifications
You must be signed in to change notification settings - Fork 3.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
release-20.1: backupccl: cluster restore import and restore jobs as canceled #63773
Conversation
b8fa0ed
to
6158bda
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 5 of 5 files at r1.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @dt, and @pbardea)
pkg/ccl/backupccl/restore_job.go, line 1358 at r1 (raw file):
} if err := db.Txn(ctx, func(ctx context.Context, txn *kv.Txn) error {
in 21.1 you've got a strong claim that no other instance is concurrently running. In 20.1 it's relatively best-effort. It's not likely but it is very possible. It would not hurt to re-read the details in this txn.
pkg/ccl/backupccl/restore_job.go, line 1459 at r2 (raw file):
} var updateStatusQuery strings.Builder
nit: I'd pull out the query string construction into a helper function or closure.
pkg/ccl/backupccl/restore_job.go, line 1470 at r2 (raw file):
fmt.Fprint(&updateStatusQuery, ")") fmt.Println(updateStatusQuery.String())
detritus?
pkg/ccl/backupccl/restore_test.go, line 13 at r1 (raw file):
import ( "context" fmt "fmt"
🤔
pkg/jobs/jobspb/jobs.proto, line 99 at r1 (raw file):
// SystemTablesRestored keeps track of dynamic states that need to happen only // once during the lifetime of a job. Note, that this state may be shared // between job versions, so updates to this map must be considered carefully.
what do you mean "job versions"?
Also, why is this in details and not progress? I imagine that it's because that ship has sailed but :(
pkg/jobs/jobspb/jobs.proto, line 102 at r1 (raw file):
// It maps system table names to whether or not they have already been // restored. map<string, bool> system_tables_restored = 17;
This has been backported to every branch after this with the same definition, right?
Previously, if a cluster restore were to restart (e.g. in the case of the coordinator crashing) after restoring the system tables, but before completing, the job would attempt to restore the system table data again and potentially fail on non-idempotent system table restoration functions (e.g. restoring jobs). This commit, adds a new SystemTablesRestored field to the restore details to keep track of each system table that it has restored and maintains progress. This is required, since not all system tables may need to be restored atomically in the future. Release note (bug fix): Fix a bug where cluster restore would sometimes (very rarely) fail after retrying.
6158bda
to
50585b7
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @dt)
pkg/ccl/backupccl/restore_job.go, line 1358 at r1 (raw file):
Previously, ajwerner wrote…
in 21.1 you've got a strong claim that no other instance is concurrently running. In 20.1 it's relatively best-effort. It's not likely but it is very possible. It would not hurt to re-read the details in this txn.
Moved the reading of the details to be contained within the txn.
pkg/ccl/backupccl/restore_job.go, line 1459 at r2 (raw file):
Previously, ajwerner wrote…
nit: I'd pull out the query string construction into a helper function or closure.
Done.
pkg/ccl/backupccl/restore_job.go, line 1470 at r2 (raw file):
Previously, ajwerner wrote…
detritus?
Yep, thanks for catching. Removed.
pkg/ccl/backupccl/restore_test.go, line 13 at r1 (raw file):
Previously, ajwerner wrote…
🤔
Something seems to be automatically adding these. Perhaps it's something up with my VSCode setup. Looks like we have quite a few violations throughout the codebase.
Perhaps there's a way to lint against these redundant import aliases.
pkg/jobs/jobspb/jobs.proto, line 99 at r1 (raw file):
Previously, ajwerner wrote…
what do you mean "job versions"?
Also, why is this in details and not progress? I imagine that it's because that ship has sailed but :(
I think I originally meant "job implementation versions", changed to "cluster versions" for more clarity.
Ya, it is unfortunate. We already have a few of these state-tracking progress metrics in details and didn't break the momentum here. Will try to keep in mind moving forward though.
pkg/jobs/jobspb/jobs.proto, line 102 at r1 (raw file):
Previously, ajwerner wrote…
This has been backported to every branch after this with the same definition, right?
Yep:
20.2 -
cockroach/pkg/jobs/jobspb/jobs.proto
Line 169 in 0f4525d
map<string, bool> system_tables_restored = 17; |
21.1 -
cockroach/pkg/jobs/jobspb/jobs.proto
Line 192 in 0e1e28e
map<string, bool> system_tables_restored = 17; |
50585b7
to
93c6eff
Compare
IMPORT and RESTORE may write non-transactionally, so their writes cannot be trusted to be included in every backup. As such, they should be restored in a reverting state to attempt to undo any of their untrusted writes. Release note (bug fix): IMPORT and RESTORE jobs are now restored as reverting so that they cleanup after themselves. Previously, some of the writes of the jobs while they were running may have been missed by backup.
93c6eff
to
8f06e69
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 3 of 5 files at r3, 1 of 2 files at r4.
Reviewable status:complete! 1 of 0 LGTMs obtained (waiting on @ajwerner and @dt)
Thanks for the review! |
First commit is a backport of #60458.
IMPORT and RESTORE may write non-transactionally, so their writes cannot
be trusted to be included in every backup. As such, they should be
restored in a reverting state to attempt to undo any of their untrusted
writes.
Release note (bug fix): IMPORT and RESTORE jobs are now restored as
reverting so that they cleanup after themselves. Previously, some of the
writes of the jobs while they were running may have been missed by
backup.