Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

backupccl: fix bug where restore would try to start rolled-back jobs #56021

Merged
merged 1 commit into from
Nov 4, 2020

Conversation

pbardea
Copy link
Contributor

@pbardea pbardea commented Oct 27, 2020

Previously, if the transaction which publishes the tables ingested
during a restore is restated, we accumulate jobs which no longer exist
in a slice to start. The work function run inside the transaction should
be idempotent and only expose jobs that have actually been created.

Release note (bug fix): Fix an error that may occur at the end of a
restoration of a backup that had ongoing schema change jobs.

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@pbardea
Copy link
Contributor Author

pbardea commented Oct 28, 2020

This diff was tested by applying the following patch:

diff --git a/pkg/ccl/backupccl/restore_job.go b/pkg/ccl/backupccl/restore_job.go
index 76bf0f2bd2..12afb76897 100644
--- a/pkg/ccl/backupccl/restore_job.go
+++ b/pkg/ccl/backupccl/restore_job.go
@@ -13,6 +13,7 @@ import (
        "context"
        "fmt"
        "math"
+       "math/rand"
        "runtime"
        "sync/atomic"
        "time"
@@ -1133,6 +1134,11 @@ func (r *restoreResumer) publishTables(ctx context.Context) error {
                        return errors.Wrap(err, "updating job details after publishing tables")
                }

+               if rand.Float64() < 0.9 {
+                       return txn.GenerateForcedRetryableError(ctx, "boom")
+               }
+
                return nil
        })
        if err != nil {

and stressing TestRestoreMidSchemaChange before and after the fix (which previously failed instantly). I would like to introduce a more general way to inject txn retries during testing, possibly something like #56074, but for the sake of backporting I think that should be addressed separately.

@pbardea pbardea marked this pull request as ready for review October 28, 2020 15:36
@pbardea
Copy link
Contributor Author

pbardea commented Nov 2, 2020

TFTR!
bors r=miretskiy

@craig
Copy link
Contributor

craig bot commented Nov 2, 2020

Build failed:

Previously, if the transaction which publishes the tables ingested
during a restore is restated, we accumulate jobs which no longer exist
in a slice to start. The work function run inside the transaction should
be idempotent and only expose jobs that have actually been created.

Release note (bug fix): Fix an error that may occur at the end of a
restoration of a backup that had ongoing schema change jobs.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants