
backupccl: issue protected timestamps on restore spans #91991

Merged
1 commit merged on Dec 6, 2022

Conversation

msbutler
Collaborator

Fixes #91148

Release note: None

@msbutler msbutler self-assigned this Nov 16, 2022
@cockroach-teamcity
Member

This change is Reviewable

job paused at pausepoint

# ensure foo has adopted the ttl
sleep ms=5000
Collaborator Author

When I manually ran the test in demo, the ttl was reconciled to the restored table, so I'm unsure why this job succeeds.

Collaborator

Random thoughts:

  • You could move the sleep to before the restore and then add a step here which prints out the span configurations, just to make sure the gc.ttl has been reconciled.

  • Does the error require that GC has actually run on the relevant ranges? Perhaps that isn't happening in this test?

Collaborator

It could also be the pts cache timing: SET CLUSTER SETTING kv.protectedts.poll_interval = '1s'

Contributor

Yeah, just to second Steven's comments: I think this test would be better suited as a non-DD test that uses some of the utility methods in utils_test.go, such as waitForReplicaFieldToBeSet, which lets you wait for the replica to have the span config you expect it to have. TestExcludeDataFromBackupDoesNotHoldupGC is an example where this util is used.
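
For reference, a minimal sketch of that polling pattern, hedged: this does not reproduce waitForReplicaFieldToBeSet's actual signature, but wraps the same idea in a hypothetical helper that only relies on testutils.SucceedsSoon and a caller-supplied accessor for the gc.ttl applied to the range (it needs the testing, time, pkg/testutils, and cockroachdb/errors packages):

	// waitForGCTTL is a hypothetical stand-in for waitForReplicaFieldToBeSet: it
	// polls a caller-supplied accessor, which in the real helper reads the span
	// config applied to the replica, until the expected gc.ttl is observed.
	func waitForGCTTL(t *testing.T, want time.Duration, readTTL func() (time.Duration, error)) {
		t.Helper()
		testutils.SucceedsSoon(t, func() error {
			got, err := readTTL()
			if err != nil {
				return err
			}
			if got != want {
				return errors.Newf("gc.ttl not yet reconciled: got %s, want %s", got, want)
			}
			return nil
		})
	}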

Collaborator Author

Ah, yeah, that was my initial strategy, but some of those unfamiliar utils were not quite working on the offline table. I'll take a closer look today and try to modify them to read directly from the system tables.

Collaborator Author

After thinking more about this, I’ve realized the original bug is quite challenging to repro in a unit test. If we do indeed trust the PTS system, is it sufficient to merely check that during restore, we properly add a PTS and clean it up after the restore is done?



To directly repro the bug, we need to induce and complete a GC job on a range processing a really slow AddSSTable request. But does this repro really add more coverage? I think we can trust that the PTS system will prevent the GC threshold from advancing on the range.

Collaborator Author

see example of the alternative approach in TestProtectRestoreSpans
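
For readers following along, here is a rough sketch of what such a check could look like. This is not the actual TestProtectRestoreSpans added in this PR: the pause point name, the queries against system.protected_ts_records, and the job-status polling are assumptions layered on standard test utilities.

	import (
		"context"
		"fmt"
		"testing"

		"github.com/cockroachdb/cockroach/pkg/base"
		"github.com/cockroachdb/cockroach/pkg/testutils"
		"github.com/cockroachdb/cockroach/pkg/testutils/serverutils"
		"github.com/cockroachdb/cockroach/pkg/testutils/sqlutils"
		"github.com/cockroachdb/cockroach/pkg/util/leaktest"
		"github.com/cockroachdb/cockroach/pkg/util/log"
		"github.com/cockroachdb/errors"
	)

	func TestRestoreWritesAndReleasesPTS(t *testing.T) {
		defer leaktest.AfterTest(t)()
		defer log.Scope(t).Close(t)

		ctx := context.Background()
		dir, cleanup := testutils.TempDir(t)
		defer cleanup()
		srv, db, _ := serverutils.StartServer(t, base.TestServerArgs{ExternalIODir: dir})
		defer srv.Stopper().Stop(ctx)
		sqlDB := sqlutils.MakeSQLRunner(db)

		// Shrink the PTS cache poll interval so KV picks up new records quickly.
		sqlDB.Exec(t, `SET CLUSTER SETTING kv.protectedts.poll_interval = '1s'`)

		sqlDB.Exec(t, `CREATE DATABASE d`)
		sqlDB.Exec(t, `CREATE TABLE d.foo (k INT PRIMARY KEY)`)
		sqlDB.Exec(t, `BACKUP DATABASE d INTO 'nodelocal://1/foo'`)
		sqlDB.Exec(t, `DROP DATABASE d`)

		// Pause the restore mid-flight; the pause point name here is hypothetical.
		sqlDB.Exec(t, `SET CLUSTER SETTING jobs.debug.pausepoints = 'restore.before_flow'`)
		var jobID int64
		sqlDB.QueryRow(t, `RESTORE DATABASE d FROM LATEST IN 'nodelocal://1/foo' WITH detached`).Scan(&jobID)

		waitForJobStatus := func(want string) {
			testutils.SucceedsSoon(t, func() error {
				var got string
				sqlDB.QueryRow(t, `SELECT status FROM [SHOW JOBS] WHERE job_id = $1`, jobID).Scan(&got)
				if got != want {
					return errors.Newf("job %d has status %s, want %s", jobID, got, want)
				}
				return nil
			})
		}
		waitForJobStatus("paused")

		// While the restore is in flight, a PTS record should cover its spans.
		sqlDB.CheckQueryResults(t, `SELECT count(*) FROM system.protected_ts_records`, [][]string{{"1"}})

		// Resume and let the job finish; the record should then be released.
		sqlDB.Exec(t, `SET CLUSTER SETTING jobs.debug.pausepoints = ''`)
		sqlDB.Exec(t, fmt.Sprintf(`RESUME JOB %d`, jobID))
		waitForJobStatus("succeeded")
		sqlDB.CheckQueryResults(t, `SELECT count(*) FROM system.protected_ts_records`, [][]string{{"0"}})
	}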

@@ -1689,7 +1742,9 @@ func (r *restoreResumer) doResume(ctx context.Context, execCtx interface{}) erro
return err
}
}

if err := r.execCfg.ProtectedTimestampManager.Unprotect(ctx, r.job.ID()); err != nil {
return err
Collaborator

What happens to the restore if this fails?

Collaborator Author

Good point. Instead, we now log a warning, following in backup's footsteps. I believe there's also a background reconciler that will remove the PTS if this call fails.
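
A minimal sketch of that warn-and-continue shape, picking up the Unprotect call from the diff above (the exact log message in the PR may differ):

	// Releasing the PTS record is best-effort: a failure here shouldn't fail
	// the restore, since a background reconciler is expected to clean up
	// records belonging to finished jobs.
	if err := r.execCfg.ProtectedTimestampManager.Unprotect(ctx, r.job.ID()); err != nil {
		log.Warningf(ctx, "failed to release protected timestamp of job %d: %v", r.job.ID(), err)
	}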

@msbutler msbutler marked this pull request as ready for review November 20, 2022 22:54
@msbutler msbutler requested review from a team as code owners November 20, 2022 22:54
@msbutler msbutler requested a review from a team November 20, 2022 22:54
@msbutler msbutler force-pushed the butler-protect-restore branch 2 times, most recently from d6f4172 to 0b4d567 on November 21, 2022 12:58
@msbutler
Collaborator Author

The latest CI run revealed an interesting problem with the jobs protected timestamp manager; going to rework that now.

@msbutler msbutler force-pushed the butler-protect-restore branch 4 times, most recently from 8c3471e to 236e105 on November 30, 2022 16:34
@msbutler
Collaborator Author

msbutler commented Nov 30, 2022

I took a long look at the stress race failure picked up in Bazel Extended CI on my TestProtectRestoreSpans test, and while I can't pinpoint a root cause, I know the failure has nothing to do with protected timestamps.

Description of failure:

  • when we restore system tables in the cluster restore subtest, roleIDSeqRestoreFunc() panics because no IDs exist in the system.users table. This implies that we either didn't back up the users system table (bad) or didn't restore it properly (also bad). This race doesn't happen immediately -- it usually takes 24 runs to observe it on my gceworker. UPDATE: I've confirmed this occurs because we fail to back up the users table properly.

My investigation:

  • to repro the failure, I first had to disable tenants, as there's a different failure related to server startup. I plan to file a different issue.
  • after removing the PTS logic from restore_job.go and my test, I could reproduce the failure, implying this failure is independent of my patch.
  • Update: I can repro this panic on a ./dev stress of TestDataDriven/restore-schema-only-multiregion.
  • if I skip the cluster restore subtest, the test passes with flying colors
  • if I only run the cluster restore subtest, a different stress failure (described in panic: default zone config missing unexpectedly [now has a reliable repro] #43951) bubbles up, though with less frequency, so I'm unsure if they're related.

My proposed next steps:

  • disable stress race on this test and file separate issues for: the tenant server startup race and this mysterious oid race.

@msbutler
Collaborator Author

msbutler commented Dec 1, 2022

@adityamaru @stevendanna friendly ping on this!

@msbutler
Collaborator Author

msbutler commented Dec 2, 2022

I'm quite confident the stress-race failures are caused by #92848. When I disable admission.elastic_cpu.enabled, stress race does not seem to catch any problems.

}
target = ptpb.MakeSchemaObjectsTarget(tableIDs)
}
protectedTime := hlc.Timestamp{WallTime: timeutil.Now().UnixNano()}
Contributor

We should be protecting at details.EndTime right?

Collaborator Author

Hm, good question. I don't think it matters. The whole reason we're adding this PTS infra is to ensure the GC threshold does not creep past the AddSSTable batch timestamp. For all flavors of restore, the batch timestamp will always be greater than either a time.Now() set before ingestion or the details.EndTime.

I've added a comment to reflect this strategy. Lemme know if you disagree.

Contributor

I agree with your reasoning that this will work, but it is easier to reason about subsystems that work with time if they're all running as of a given snapshot, which is what the EndTime defines for us. See, for example, #92788 and our existing backup code, both of which work on timestamps we have deliberately picked during job record creation.

Collaborator Author

Hm, I thought of another problem with details.EndTime: it doesn't get set for non-AOST restores. Would you prefer using EndTime for AOST restores and time.Now() otherwise?

Contributor

Oh, that's a bummer; let's just leave it as is then. Anything else seems more bother than it's worth, considering we no-op on resumptions.

Collaborator Author

sounds good. thanks for reviewing!
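
For context, a sketch of where the thread lands, with the rationale written out as a comment next to the line quoted above (the wording of the actual comment added in the PR may differ):

	// Protect at a timestamp taken before ingestion starts rather than at
	// details.EndTime, which is unset for non-AOST restores. Every AddSSTable
	// the restore issues carries a batch timestamp at or above this point, so
	// protecting "now" is enough to keep the GC threshold from advancing past
	// the ingested data.
	protectedTime := hlc.Timestamp{WallTime: timeutil.Now().UnixNano()}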

@msbutler
Collaborator Author

msbutler commented Dec 5, 2022

bors r=adityamaru

@craig
Contributor

craig bot commented Dec 5, 2022

Build failed (retrying...):

@craig
Contributor

craig bot commented Dec 5, 2022

This PR was included in a batch that was canceled, it will be automatically retried

@craig
Contributor

craig bot commented Dec 6, 2022

Build succeeded:
