backupccl: ensure restore on success is run once #43933

pbardea · 2020-01-13T20:19:50Z

It seems that jobs today do not ensure that the OnSuccess callback is
called exactly once. This PR moves the cleanup stages of RESTORE,
formerly located in the OnSuccess callback to be the final steps of
Resume. This should help ensure that these stages are run once and only
once.

Release note (bug fix): Ensure that RESTORE cleanup is run exactly once.

cockroach-teamcity · 2020-01-13T20:19:57Z

This change is

dt · 2020-01-13T21:45:49Z

pkg/ccl/backupccl/restore.go

@@ -1579,6 +1579,9 @@ type restoreResumer struct {
 	tables      []*sqlbase.TableDescriptor
 	latestStats []*stats.TableStatisticProto
 	execCfg     *sql.ExecutorConfig
+	// A collection of stages that should only be run once.
+	statsInserted   bool


don't you need to persist these, in the job detail or progress?

yep. looks like this was the wrong branch. re-pushed.

👍
I think you might need to actually update the job too (UpdateProgress or UpdatePayload depending on where you put it) to save your change to the in-mem version you're updating here. You could do it in the same txn ideally I think?

Here's how IMPORT does something similar (ensuring that all the descs get their state changed): https://github.com/cockroachdb/cockroach/blob/master/pkg/ccl/importccl/import_stmt.go#L900

spaskob

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @pbardea and @spaskob)

pkg/ccl/backupccl/restore.go, line 1787 at r2 (raw file):

		return nil
	}
	defer func() { details.statsInserted = true }()

I am confused why this is deferred. Should you call it only in case InsertNewStats succeeds? And also where do you actually persist the job details? Don't you need a call to job.update(...)

spaskob · 2020-01-13T21:53:27Z

here's an example of how to use update https://github.com/cockroachdb/cockroach/blob/master/pkg/jobs/jobs.go#L181

…

On Mon, Jan 13, 2020 at 4:49 PM David Taylor ***@***.***> wrote: ***@***.**** commented on this pull request. ------------------------------ In pkg/ccl/backupccl/restore.go <#43933 (comment)> : > @@ -1579,6 +1579,9 @@ type restoreResumer struct { tables []*sqlbase.TableDescriptor latestStats []*stats.TableStatisticProto execCfg *sql.ExecutorConfig + // A collection of stages that should only be run once. + statsInserted bool 👍 I think you might need to actually update the job too to save your change to the in-mem version you're updating here. You could do it in the same txn ideally I think? — You are receiving this because your review was requested. Reply to this email directly, view it on GitHub <#43933?email_source=notifications&email_token=AANL634SGZYEICIKO63LTLDQ5TOX7A5CNFSM4KGIG5EKYY3PNVWWK3TUL52HS4DFWFIHK3DMKJSXC5LFON2FEZLWNFSXPKTDN5WW2ZLOORPWSZGOCRSP4WA#discussion_r366047052>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AANL633HMO6SILEBKAUTE6DQ5TOX7ANCNFSM4KGIG5EA> .

pbardea · 2020-01-14T16:36:13Z

RFAL

dt

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @pbardea)

pkg/ccl/backupccl/restore.go, line 1786 at r3 (raw file):

	}

	txn := r.execCfg.DB.NewTxn(ctx, "insert-stats")

Is there a reason to use this over the .Txn(..., fn) that takes the retry-able closure and does the error handling?

It seems that jobs today do not ensure that the OnSuccess callback is called exactly once. This PR moves the cleanup stages of RESTORE, formerly located in the OnSuccess callback to be the final steps of Resume. This should help ensure that these stages are run once and only once. Release note (bug fix): Ensure that RESTORE cleanup is run exactly once.

pbardea

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @dt and @spaskob)

pkg/ccl/backupccl/restore.go, line 1787 at r2 (raw file):

Previously, spaskob (Spas Bojanov) wrote…

I am confused why this is deferred. Should you call it only in case InsertNewStats succeeds? And also where do you actually persist the job details? Don't you need a call to job.update(...)

Yep, as per our offline discussion I wasn't sure if returning an error from resume would ensure that it would not be run again (and therefore continuously retry to re-insert stats). But it should be safe to just return an error and mark this stage as complete in the same txn that performs the insertion. Same applies for publishing tables below.

pkg/ccl/backupccl/restore.go, line 1786 at r3 (raw file):

Previously, dt (David Taylor) wrote…

Is there a reason to use this over the .Txn(..., fn) that takes the retry-able closure and does the error handling?

Nope - updated to use the retry-able closure.

dt

Reviewed 1 of 1 files at r4.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @dt and @spaskob)

spaskob

LGTM

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @dt and @spaskob)

pbardea · 2020-01-15T15:01:46Z

TFTRs!
bors r+

craig · 2020-01-15T15:13:53Z

Build failed

GitHub CI (Cockroach)

pbardea · 2020-01-15T15:15:29Z

TestTruncateWhileColumnBackfill failed and TeamCity says it looks flaky. Trying again.
bors r+

spaskob · 2020-01-15T15:16:08Z

yes I have seen this test flake

…

On Wed, Jan 15, 2020 at 10:15 AM Paul Bardea ***@***.***> wrote: TestTruncateWhileColumnBackfill failed and TeamCity says it looks flaky. Trying again. bors r+ — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#43933?email_source=notifications&email_token=AANL632NXY3633ER62F5EO3Q54SBFA5CNFSM4KGIG5EKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEJAU6MQ#issuecomment-574705458>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AANL636TRFCMYZD5AGCIWPTQ54SBFANCNFSM4KGIG5EA> .

craig · 2020-01-15T15:35:52Z

Build failed

GitHub CI (Cockroach)

pbardea · 2020-01-15T16:00:04Z

Third time's the charm?
bors r+

craig · 2020-01-15T16:44:50Z

Build failed (retrying...)

GitHub CI (Cockroach)

43720: coldata: fix behavior of Vec.Append in some cases when NULLs are present r=yuzefovich a=yuzefovich We would always Get and then Set a value while Append'ing without paying attention to whether the value is actually NULL. This can lead to problems in case of flat bytes if the necessary invariant is unmaintained. Now this is fixed by explicitly enforcing the invariant. Additionally, this commit ensures that the destination slice has the desired capacity before appending one value at a time (in case of a present selection vector). I tried approach with paying attention to whether the value is NULL before appending it and saw a significant performance hit, so I think this approach is the least evil. Fixes: #42774. Release note: None 43933: backupccl: ensure restore on success is run once r=pbardea a=pbardea It seems that jobs today do not ensure that the OnSuccess callback is called exactly once. This PR moves the cleanup stages of RESTORE, formerly located in the OnSuccess callback to be the final steps of Resume. This should help ensure that these stages are run once and only once. Release note (bug fix): Ensure that RESTORE cleanup is run exactly once. 44013: roachtest: skip acceptance/version-upgrade because flaky r=andreimatei a=andreimatei Very flaky, apparently because of some problem with a recent migration. Touches #43957, #44005 Release note: None Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com> Co-authored-by: Paul Bardea <pbardea@gmail.com> Co-authored-by: Andrei Matei <andrei@cockroachlabs.com>

pbardea · 2020-01-15T17:49:02Z

bors r+

craig · 2020-01-15T17:49:04Z

Already running a review

craig · 2020-01-15T17:57:03Z

Build succeeded

GitHub CI (Cockroach)

pbardea force-pushed the restore-on-success branch from d8c50c5 to 56c0c35 Compare January 13, 2020 20:25

pbardea marked this pull request as ready for review January 13, 2020 21:44

pbardea requested review from dt and spaskob January 13, 2020 21:44

dt reviewed Jan 13, 2020

View reviewed changes

pbardea force-pushed the restore-on-success branch from 56c0c35 to 9dc99cf Compare January 13, 2020 21:47

spaskob suggested changes Jan 13, 2020

View reviewed changes

pbardea force-pushed the restore-on-success branch 5 times, most recently from dc5ff1b to 7f928ef Compare January 14, 2020 16:36

dt reviewed Jan 14, 2020

View reviewed changes

pbardea force-pushed the restore-on-success branch from 7f928ef to 62a2cde Compare January 14, 2020 18:55

pbardea commented Jan 14, 2020

View reviewed changes

dt approved these changes Jan 14, 2020

View reviewed changes

spaskob approved these changes Jan 15, 2020

View reviewed changes

craig bot merged commit 62a2cde into cockroachdb:master Jan 15, 2020

pbardea deleted the restore-on-success branch April 27, 2020 00:51

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

backupccl: ensure restore on success is run once #43933

backupccl: ensure restore on success is run once #43933

pbardea commented Jan 13, 2020

cockroach-teamcity commented Jan 13, 2020

dt Jan 13, 2020

pbardea Jan 13, 2020 •

edited

dt Jan 13, 2020 •

edited

dt Jan 13, 2020

spaskob left a comment

spaskob commented Jan 13, 2020 via email

pbardea commented Jan 14, 2020

dt left a comment

pbardea left a comment

dt left a comment

spaskob left a comment

pbardea commented Jan 15, 2020

craig bot commented Jan 15, 2020

pbardea commented Jan 15, 2020

spaskob commented Jan 15, 2020 via email

craig bot commented Jan 15, 2020

pbardea commented Jan 15, 2020

craig bot commented Jan 15, 2020

pbardea commented Jan 15, 2020

craig bot commented Jan 15, 2020

craig bot commented Jan 15, 2020

backupccl: ensure restore on success is run once #43933

backupccl: ensure restore on success is run once #43933

Conversation

pbardea commented Jan 13, 2020

cockroach-teamcity commented Jan 13, 2020

dt Jan 13, 2020

Choose a reason for hiding this comment

pbardea Jan 13, 2020 • edited

Choose a reason for hiding this comment

dt Jan 13, 2020 • edited

Choose a reason for hiding this comment

dt Jan 13, 2020

Choose a reason for hiding this comment

spaskob left a comment

Choose a reason for hiding this comment

spaskob commented Jan 13, 2020 via email

pbardea commented Jan 14, 2020

dt left a comment

Choose a reason for hiding this comment

pbardea left a comment

Choose a reason for hiding this comment

dt left a comment

Choose a reason for hiding this comment

spaskob left a comment

Choose a reason for hiding this comment

pbardea commented Jan 15, 2020

craig bot commented Jan 15, 2020

Build failed

pbardea commented Jan 15, 2020

spaskob commented Jan 15, 2020 via email

craig bot commented Jan 15, 2020

Build failed

pbardea commented Jan 15, 2020

craig bot commented Jan 15, 2020

Build failed (retrying...)

pbardea commented Jan 15, 2020

craig bot commented Jan 15, 2020

craig bot commented Jan 15, 2020

Build succeeded

pbardea Jan 13, 2020 •

edited

dt Jan 13, 2020 •

edited