New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
backupccl: ensure restore on success is run once #43933
Conversation
d8c50c5
to
56c0c35
Compare
pkg/ccl/backupccl/restore.go
Outdated
@@ -1579,6 +1579,9 @@ type restoreResumer struct { | |||
tables []*sqlbase.TableDescriptor | |||
latestStats []*stats.TableStatisticProto | |||
execCfg *sql.ExecutorConfig | |||
// A collection of stages that should only be run once. | |||
statsInserted bool |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
don't you need to persist these, in the job detail or progress?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep. looks like this was the wrong branch. re-pushed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
I think you might need to actually update the job too (UpdateProgress or UpdatePayload depending on where you put it) to save your change to the in-mem version you're updating here. You could do it in the same txn ideally I think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's how IMPORT does something similar (ensuring that all the descs get their state changed): https://github.com/cockroachdb/cockroach/blob/master/pkg/ccl/importccl/import_stmt.go#L900
56c0c35
to
9dc99cf
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @pbardea and @spaskob)
pkg/ccl/backupccl/restore.go, line 1787 at r2 (raw file):
return nil } defer func() { details.statsInserted = true }()
I am confused why this is deferred. Should you call it only in case InsertNewStats succeeds? And also where do you actually persist the job details? Don't you need a call to job.update(...)
here's an example of how to use update
https://github.com/cockroachdb/cockroach/blob/master/pkg/jobs/jobs.go#L181
…On Mon, Jan 13, 2020 at 4:49 PM David Taylor ***@***.***> wrote:
***@***.**** commented on this pull request.
------------------------------
In pkg/ccl/backupccl/restore.go
<#43933 (comment)>
:
> @@ -1579,6 +1579,9 @@ type restoreResumer struct {
tables []*sqlbase.TableDescriptor
latestStats []*stats.TableStatisticProto
execCfg *sql.ExecutorConfig
+ // A collection of stages that should only be run once.
+ statsInserted bool
👍
I think you might need to actually update the job too to save your change
to the in-mem version you're updating here. You could do it in the same txn
ideally I think?
—
You are receiving this because your review was requested.
Reply to this email directly, view it on GitHub
<#43933?email_source=notifications&email_token=AANL634SGZYEICIKO63LTLDQ5TOX7A5CNFSM4KGIG5EKYY3PNVWWK3TUL52HS4DFWFIHK3DMKJSXC5LFON2FEZLWNFSXPKTDN5WW2ZLOORPWSZGOCRSP4WA#discussion_r366047052>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AANL633HMO6SILEBKAUTE6DQ5TOX7ANCNFSM4KGIG5EA>
.
|
dc5ff1b
to
7f928ef
Compare
RFAL |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @pbardea)
pkg/ccl/backupccl/restore.go, line 1786 at r3 (raw file):
} txn := r.execCfg.DB.NewTxn(ctx, "insert-stats")
Is there a reason to use this over the .Txn(..., fn)
that takes the retry-able closure and does the error handling?
It seems that jobs today do not ensure that the OnSuccess callback is called exactly once. This PR moves the cleanup stages of RESTORE, formerly located in the OnSuccess callback to be the final steps of Resume. This should help ensure that these stages are run once and only once. Release note (bug fix): Ensure that RESTORE cleanup is run exactly once.
7f928ef
to
62a2cde
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @dt and @spaskob)
pkg/ccl/backupccl/restore.go, line 1787 at r2 (raw file):
Previously, spaskob (Spas Bojanov) wrote…
I am confused why this is deferred. Should you call it only in case InsertNewStats succeeds? And also where do you actually persist the job details? Don't you need a call to job.update(...)
Yep, as per our offline discussion I wasn't sure if returning an error from resume would ensure that it would not be run again (and therefore continuously retry to re-insert stats). But it should be safe to just return an error and mark this stage as complete in the same txn that performs the insertion. Same applies for publishing tables below.
pkg/ccl/backupccl/restore.go, line 1786 at r3 (raw file):
Previously, dt (David Taylor) wrote…
Is there a reason to use this over the
.Txn(..., fn)
that takes the retry-able closure and does the error handling?
Nope - updated to use the retry-able closure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 1 files at r4.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @dt and @spaskob)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @dt and @spaskob)
TFTRs! |
Build failed |
|
yes I have seen this test flake
…On Wed, Jan 15, 2020 at 10:15 AM Paul Bardea ***@***.***> wrote:
TestTruncateWhileColumnBackfill failed and TeamCity says it looks flaky.
Trying again.
bors r+
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#43933?email_source=notifications&email_token=AANL632NXY3633ER62F5EO3Q54SBFA5CNFSM4KGIG5EKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEJAU6MQ#issuecomment-574705458>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AANL636TRFCMYZD5AGCIWPTQ54SBFANCNFSM4KGIG5EA>
.
|
Build failed |
Third time's the charm? |
Build failed (retrying...) |
43720: coldata: fix behavior of Vec.Append in some cases when NULLs are present r=yuzefovich a=yuzefovich We would always Get and then Set a value while Append'ing without paying attention to whether the value is actually NULL. This can lead to problems in case of flat bytes if the necessary invariant is unmaintained. Now this is fixed by explicitly enforcing the invariant. Additionally, this commit ensures that the destination slice has the desired capacity before appending one value at a time (in case of a present selection vector). I tried approach with paying attention to whether the value is NULL before appending it and saw a significant performance hit, so I think this approach is the least evil. Fixes: #42774. Release note: None 43933: backupccl: ensure restore on success is run once r=pbardea a=pbardea It seems that jobs today do not ensure that the OnSuccess callback is called exactly once. This PR moves the cleanup stages of RESTORE, formerly located in the OnSuccess callback to be the final steps of Resume. This should help ensure that these stages are run once and only once. Release note (bug fix): Ensure that RESTORE cleanup is run exactly once. 44013: roachtest: skip acceptance/version-upgrade because flaky r=andreimatei a=andreimatei Very flaky, apparently because of some problem with a recent migration. Touches #43957, #44005 Release note: None Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com> Co-authored-by: Paul Bardea <pbardea@gmail.com> Co-authored-by: Andrei Matei <andrei@cockroachlabs.com>
bors r+ |
Already running a review |
Build succeeded |
It seems that jobs today do not ensure that the OnSuccess callback is
called exactly once. This PR moves the cleanup stages of RESTORE,
formerly located in the OnSuccess callback to be the final steps of
Resume. This should help ensure that these stages are run once and only
once.
Release note (bug fix): Ensure that RESTORE cleanup is run exactly once.