backupccl: issue protected timestamps on restore spans #91991
Conversation
force-pushed from 5474f85 to 5f11e14
```
job paused at pausepoint

# ensure foo has adopted the ttl
sleep ms=5000
```
When I manually conduct the test in demo, the TTL was reconciled to the restored table, so I'm unsure why this job succeeds.
Random thoughts:
- You could move the sleep to before the restore and then add a step here which prints out the span configurations, just to make sure the gc.ttl has been reconciled.
- Does the error require that GC has actually run on the relevant ranges? Perhaps that isn't happening in this test?
It could also be the pts cache timing: SET CLUSTER SETTING kv.protectedts.poll_interval = '1s'
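For concreteness, here is a minimal Go sketch combining both suggestions, assuming a running test cluster with a sqlutils.SQLRunner in scope. The system.span_configurations query only confirms that reconciled rows exist (its config column is a raw protobuf); none of this is the PR's actual test code.

```go
// Hypothetical debugging scaffolding, not the PR's actual test code.
sqlDB := sqlutils.MakeSQLRunner(conn)

// Shrink the pts cache refresh interval so KV notices new protected
// timestamp records almost immediately.
sqlDB.Exec(t, `SET CLUSTER SETTING kv.protectedts.poll_interval = '1s'`)

// Before the restore, dump the reconciled spans to confirm the gc.ttl
// change has actually been reconciled down to KV.
for _, row := range sqlDB.QueryStr(t, `SELECT start_key, end_key FROM system.span_configurations`) {
	t.Logf("reconciled span: [%s, %s)", row[0], row[1])
}
```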
Yeah, just to second Steven's comments: I think this test would be better suited as a non-DD test that uses some of the utility methods in utils_test.go, such as waitForReplicaFieldToBeSet, where you can wait for the replica to have the span config you expect it to have. TestExcludeDataFromBackupDoesNotHoldupGC is an example where this util is used.
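For reference, a hedged sketch of that polling pattern; the replica lookup and the span-config accessor below are assumptions standing in for the real helpers in utils_test.go, and store, tableStartKey, and expectedTTLSeconds are presumed to be in scope.

```go
// Poll until the replica covering the restored table reports the
// expected gc.ttl. Accessor names are assumptions, not the exact API.
testutils.SucceedsSoon(t, func() error {
	repl := store.LookupReplica(roachpb.RKey(tableStartKey))
	conf := repl.SpanConfig() // assumed accessor for the applied span config
	if got := conf.GCPolicy.TTLSeconds; got != expectedTTLSeconds {
		return errors.Newf("waiting for gc.ttl=%d, got %d", expectedTTLSeconds, got)
	}
	return nil
})
```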
Ah, yeah, that was my initial strategy, but some of those unfamiliar utils were not quite working on the offline table. I'll take a closer look today and try to modify them to read directly from the system tables.
After thinking more about this, I've realized the original bug is quite challenging to repro in a unit test. If we do indeed trust the PTS system, is it sufficient to merely check that during restore we properly add a PTS record and clean it up after the restore is done?
To directly repro the bug, we would need to induce and complete a GC job on a range while it processes a really slow AddSSTable request. But does this repro really add more coverage? I think we can trust that the PTS system will prevent the GC threshold from advancing on the range.
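A sketch of that reduced check, pausing the restore mid-flight and asserting on system.protected_ts_records; the pausepoint name is hypothetical, and this assumes the restore job writes exactly one record.

```go
// Hypothetical pausepoint name; the real one lives in the restore code.
sqlDB.Exec(t, `SET CLUSTER SETTING jobs.debug.pausepoints = 'restore.before_flow'`)
sqlDB.ExpectErr(t, "pause point", `RESTORE TABLE foo FROM LATEST IN 'nodelocal://1/foo'`)

// While the job is paused, the restore spans should be protected.
var n int
sqlDB.QueryRow(t, `SELECT count(*) FROM system.protected_ts_records`).Scan(&n)
require.Equal(t, 1, n)

// Resume, wait for success, and check that the record was released.
var jobID int64
sqlDB.QueryRow(t, `SELECT job_id FROM [SHOW JOBS] WHERE job_type = 'RESTORE'`).Scan(&jobID)
sqlDB.Exec(t, `SET CLUSTER SETTING jobs.debug.pausepoints = ''`)
sqlDB.Exec(t, `RESUME JOB $1`, jobID)
jobutils.WaitForJobToSucceed(t, sqlDB, jobspb.JobID(jobID))
sqlDB.QueryRow(t, `SELECT count(*) FROM system.protected_ts_records`).Scan(&n)
require.Equal(t, 0, n)
```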
See an example of the alternative approach in TestProtectRestoreSpans.
pkg/ccl/backupccl/testdata/backup-restore/restore-protect-spans (Outdated)
pkg/ccl/backupccl/restore_job.go (Outdated)
```
@@ -1689,7 +1742,9 @@ func (r *restoreResumer) doResume(ctx context.Context, execCtx interface{}) error
			return err
		}
	}

	if err := r.execCfg.ProtectedTimestampManager.Unprotect(ctx, r.job.ID()); err != nil {
		return err
```
What happens to the restore if this fails?
Good point. Instead, we now log a warning, following in backup's footsteps. I believe there's also a background reconciler that will remove the PTS record if this call fails.
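Roughly, the reworked teardown looks like this (a sketch of the described change, not the exact diff):

```go
// Releasing the pts record is now best-effort: on failure, warn and
// rely on the background reconciler to eventually drop the record for
// the finished job, rather than failing an otherwise-complete restore.
if err := r.execCfg.ProtectedTimestampManager.Unprotect(ctx, r.job.ID()); err != nil {
	log.Warningf(ctx, "failed to release protected timestamp of job %d: %v", r.job.ID(), err)
}
```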
force-pushed from 5f11e14 to 945a07b
force-pushed from 945a07b to 1cffdca
force-pushed from d6f4172 to 0b4d567
Latest CI revealed an interesting problem with the jobs protected timestamp manager. Going to rework that now.
force-pushed from 8c3471e to 236e105
I took a long look at the stress race failure picked up in Bazel Extended CI.

Description of failure:

My investigation:

My proposed next steps:
@adityamaru @stevendanna friendly ping on this!
I'm quite confident the stress-race failures are caused by #92848. When I disable …
```
	}
	target = ptpb.MakeSchemaObjectsTarget(tableIDs)
}
protectedTime := hlc.Timestamp{WallTime: timeutil.Now().UnixNano()}
```
We should be protecting at details.EndTime, right?
Hm, good question. I don't think it matters. The whole reason we're adding this PTS infra is to ensure the GC threshold does not creep past the AddSSTable batch timestamp. For all flavors of restore, the batch timestamp will always be greater than both a time.Now() set before ingestion and details.EndTime.
I've added a comment to reflect this strategy. Lemme know if you disagree.
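Roughly, the invariant the added comment captures (wording here is illustrative, not the PR's exact comment):

```go
// Protecting at a timestamp taken before ingestion begins is
// sufficient: every AddSSTable batch is written at or above this
// timestamp, so the GC threshold cannot advance past a batch
// timestamp while the record is live.
protectedTime := hlc.Timestamp{WallTime: timeutil.Now().UnixNano()}
```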
I agree with your reasoning that this will work, but it is easier to reason about subsystems that work with time if they're all running as of a given snapshot, which is what EndTime defines for us. See, for example, #92788 and our existing backup code, both of which work on timestamps we deliberately picked during job record creation.
Hm, I thought of another problem with details.EndTime: it doesn't get set for non-AOST restores. Would you prefer using EndTime for AOST restores and time.Now() otherwise?
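The hybrid would look something like this sketch (ultimately not adopted, per the reply below):

```go
// Protect at the AOST timestamp when the restore has one; otherwise
// fall back to the current time, since details.EndTime is unset for
// non-AOST restores.
protectedTime := details.EndTime
if protectedTime.IsEmpty() {
	protectedTime = hlc.Timestamp{WallTime: timeutil.Now().UnixNano()}
}
```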
Oh, that's a bummer; let's just leave it as is then. Anything else seems like more bother than it's worth, considering we no-op on resumptions.
sounds good. thanks for reviewing!
force-pushed from 236e105 to a6cbcd5
Fixes cockroachdb#91148

Release note: None
force-pushed from a6cbcd5 to 0bfc1b0
bors r=adityamaru

Build failed (retrying...)

This PR was included in a batch that was canceled; it will be automatically retried.

Build succeeded
Fixes #91148
Release note: None