backupccl: ensure restore on success is run once #43933

Merged
merged 1 commit into from Jan 15, 2020



@pbardea pbardea commented Jan 13, 2020

It seems that jobs today do not ensure that the OnSuccess callback is
called exactly once. This PR moves the cleanup stages of RESTORE,
formerly located in the OnSuccess callback, to the end of Resume, where
they become its final steps. This should help ensure that these stages
are run once and only once.

Release note (bug fix): Ensure that RESTORE cleanup is run exactly once.


@pbardea pbardea marked this pull request as ready for review January 13, 2020 21:44
@pbardea pbardea requested review from dt and spaskob January 13, 2020 21:44
@@ -1579,6 +1579,9 @@ type restoreResumer struct {
	tables []*sqlbase.TableDescriptor
	latestStats []*stats.TableStatisticProto
	execCfg *sql.ExecutorConfig
	// A collection of stages that should only be run once.
	statsInserted bool

Don't you need to persist these in the job details or progress?

@pbardea pbardea Jan 13, 2020

yep. looks like this was the wrong branch. re-pushed.

@dt dt Jan 13, 2020

👍
I think you might need to actually update the job too (UpdateProgress or UpdatePayload depending on where you put it) to save your change to the in-mem version you're updating here. You could do it in the same txn ideally I think?


Here's how IMPORT does something similar (ensuring that all the descs get their state changed): https://github.com/cockroachdb/cockroach/blob/master/pkg/ccl/importccl/import_stmt.go#L900

@spaskob spaskob left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @pbardea and @spaskob)


pkg/ccl/backupccl/restore.go, line 1787 at r2 (raw file):

		return nil
	}
	defer func() { details.statsInserted = true }()

I am confused why this is deferred. Shouldn't the flag be set only if InsertNewStats succeeds? Also, where do you actually persist the job details? Don't you need a call to job.update(...)?
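The concern about the deferred assignment can be illustrated with a minimal, self-contained sketch. It does not use the CockroachDB code: `stageState`, `insertWithDefer`, `insertOnSuccess`, and `failingInsert` are hypothetical names introduced only for illustration. The point is that a deferred assignment runs on every return path, including the error path, whereas setting the flag after the error check marks the stage done only when the insert succeeded.

```go
package main

import "errors"

// stageState is a hypothetical stand-in for the job details; it is not the
// real CockroachDB type.
type stageState struct {
	StatsInserted bool
}

// insertWithDefer mirrors the pattern questioned above: the deferred
// assignment runs on every return path, including the error path.
func insertWithDefer(s *stageState, insert func() error) error {
	if s.StatsInserted {
		return nil
	}
	defer func() { s.StatsInserted = true }()
	return insert()
}

// insertOnSuccess sets the flag only after the insert has succeeded, so a
// failed attempt is not recorded as done.
func insertOnSuccess(s *stageState, insert func() error) error {
	if s.StatsInserted {
		return nil
	}
	if err := insert(); err != nil {
		return err
	}
	s.StatsInserted = true
	return nil
}

// errInsertFailed and failingInsert simulate an insert that always fails.
var errInsertFailed = errors.New("insert failed")

func failingInsert() error { return errInsertFailed }
```

Calling `insertWithDefer` with `failingInsert` returns the error and still leaves `StatsInserted` set to true; `insertOnSuccess` returns the error and leaves it false. Neither variant persists the flag anywhere, which is the second half of the question above.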

spaskob commented Jan 13, 2020 via email

@pbardea pbardea force-pushed the restore-on-success branch 5 times, most recently from dc5ff1b to 7f928ef Compare January 14, 2020 16:36
pbardea commented Jan 14, 2020

RFAL

@dt dt left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @pbardea)


pkg/ccl/backupccl/restore.go, line 1786 at r3 (raw file):

	}

	txn := r.execCfg.DB.NewTxn(ctx, "insert-stats")

Is there a reason to use this over the .Txn(..., fn) that takes the retry-able closure and does the error handling?

@pbardea pbardea left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dt and @spaskob)


pkg/ccl/backupccl/restore.go, line 1787 at r2 (raw file):

Previously, spaskob (Spas Bojanov) wrote…

I am confused why this is deferred. Shouldn't the flag be set only if InsertNewStats succeeds? Also, where do you actually persist the job details? Don't you need a call to job.update(...)?

Yep. As per our offline discussion, I wasn't sure whether returning an error from Resume would prevent it from being run again (if not, it would keep retrying the stats insertion). But it should be safe to just return an error and mark this stage as complete in the same txn that performs the insertion. The same applies to publishing tables below.
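The idea of marking the stage complete in the same transaction as the insertion can be sketched as follows. This is a toy in-memory model, not the CockroachDB API: `memStore`, `memTxn`, `runTxn`, `insertStatsOnce`, and the key names are all assumptions made for illustration. Writes are buffered and applied only if the closure returns nil, so the stats rows and the "done" marker become visible together or not at all.

```go
package main

import "errors"

// memStore is a hypothetical in-memory key-value store standing in for the
// database; it is not a CockroachDB API.
type memStore struct {
	data map[string]string
}

// memTxn buffers writes until the transaction commits.
type memTxn struct {
	writes map[string]string
}

// Put records a write in the transaction's buffer.
func (t *memTxn) Put(key, value string) { t.writes[key] = value }

// runTxn applies the buffered writes only if fn returns nil, so either every
// write made in fn becomes visible or none does.
func (s *memStore) runTxn(fn func(t *memTxn) error) error {
	t := &memTxn{writes: map[string]string{}}
	if err := fn(t); err != nil {
		return err
	}
	for k, v := range t.writes {
		s.data[k] = v
	}
	return nil
}

// insertStatsOnce writes the stats and the "stage done" marker in one
// transaction. A resumed job that sees the marker skips the stage.
// failMidway simulates an error after the stats writes but before commit.
func insertStatsOnce(s *memStore, stats map[string]string, failMidway bool) error {
	if s.data["job/statsInserted"] == "true" {
		return nil
	}
	return s.runTxn(func(t *memTxn) error {
		for table, stat := range stats {
			t.Put("stats/"+table, stat)
		}
		if failMidway {
			return errors.New("simulated failure before commit")
		}
		t.Put("job/statsInserted", "true")
		return nil
	})
}
```

In this model a failed attempt leaves the store untouched, so a later resume can safely retry, and a successful attempt makes every later call a no-op.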


pkg/ccl/backupccl/restore.go, line 1786 at r3 (raw file):

Previously, dt (David Taylor) wrote…

Is there a reason to use this over the .Txn(..., fn) that takes the retry-able closure and does the error handling?

Nope - updated to use the retry-able closure.
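As a generic illustration of why a helper that takes a retry-able closure is convenient, here is a minimal sketch of the pattern. It is not the CockroachDB `.Txn` implementation: `runWithRetry`, `errRetryable`, and the attempt limit are hypothetical. The caller supplies only the closure, while the retry loop and error handling live in one place.

```go
package main

import (
	"errors"
	"fmt"
)

// errRetryable is a hypothetical marker for errors that are safe to retry.
var errRetryable = errors.New("retryable conflict")

// runWithRetry calls fn until it succeeds, returns a non-retryable error, or
// exhausts maxAttempts. Because fn may run several times, it should not
// carry state over from a previous attempt.
func runWithRetry(maxAttempts int, fn func(attempt int) error) error {
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		err = fn(attempt)
		if err == nil {
			return nil
		}
		if !errors.Is(err, errRetryable) {
			return err
		}
	}
	return fmt.Errorf("gave up after %d attempts: %w", maxAttempts, err)
}
```

With a manually created transaction, each call site would have to reproduce this loop and decide which errors are retryable; with the closure form that logic is written once.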

@dt dt left a comment

Reviewed 1 of 1 files at r4.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dt and @spaskob)

@spaskob spaskob left a comment

LGTM

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dt and @spaskob)

pbardea commented Jan 15, 2020

TFTRs!
bors r+

craig bot commented Jan 15, 2020

Build failed

pbardea commented Jan 15, 2020

TestTruncateWhileColumnBackfill failed and TeamCity says it looks flaky. Trying again.
bors r+

spaskob commented Jan 15, 2020 via email

craig bot commented Jan 15, 2020

Build failed

pbardea commented Jan 15, 2020

Third time's the charm?
bors r+

craig bot commented Jan 15, 2020

Build failed (retrying...)

craig bot pushed a commit that referenced this pull request Jan 15, 2020
43720: coldata: fix behavior of Vec.Append in some cases when NULLs are present r=yuzefovich a=yuzefovich

We would always Get and then Set a value while Append'ing without paying
attention to whether the value is actually NULL. This can lead to
problems in case of flat bytes if the necessary invariant is
unmaintained. Now this is fixed by explicitly enforcing the invariant.
Additionally, this commit ensures that the destination slice has the
desired capacity before appending one value at a time (in case of
a present selection vector).

I tried approach with paying attention to whether the value is NULL
before appending it and saw a significant performance hit, so I think
this approach is the least evil.

Fixes: #42774.

Release note: None

43933: backupccl: ensure restore on success is run once r=pbardea a=pbardea

It seems that jobs today do not ensure that the OnSuccess callback is
called exactly once. This PR moves the cleanup stages of RESTORE,
formerly located in the OnSuccess callback to be the final steps of
Resume. This should help ensure that these stages are run once and only
once.

Release note (bug fix): Ensure that RESTORE cleanup is run exactly once.

44013: roachtest: skip acceptance/version-upgrade because flaky r=andreimatei a=andreimatei

Very flaky, apparently because of some problem with a recent migration.
Touches #43957, #44005

Release note: None

Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
Co-authored-by: Paul Bardea <pbardea@gmail.com>
Co-authored-by: Andrei Matei <andrei@cockroachlabs.com>

pbardea commented Jan 15, 2020

bors r+

craig bot commented Jan 15, 2020

Already running a review

craig bot commented Jan 15, 2020

Build succeeded

@craig craig bot merged commit 62a2cde into cockroachdb:master Jan 15, 2020
@pbardea pbardea deleted the restore-on-success branch April 27, 2020 00:51