
backupccl: issue protected timestamps on restore spans #91991

Merged
1 commit merged on Dec 6, 2022

Conversation

msbutler
Collaborator

Fixes #91148

Release note: None

@msbutler msbutler self-assigned this Nov 16, 2022
@cockroach-teamcity
Member

This change is Reviewable

job paused at pausepoint

# ensure foo has adopted the ttl
sleep ms=5000
Collaborator Author

When I manually ran the test in demo, the ttl was reconciled to the restored table, so I'm unsure why this job succeeds.

Collaborator

Random thoughts:

  • You could move the sleep to before the restore and then add a step here which prints out the span configurations, just to make sure the gc.ttl has been reconciled.

  • Does the error require that GC has actually run on the relevant ranges? Perhaps that isn't happening in this test?

Collaborator

It could also be the pts cache timing: SET CLUSTER SETTING kv.protectedts.poll_interval = '1s'

Contributor

Yeah, just to second Steven's comments: I think this test would be better suited as a non-DD test that uses some of the utility methods in utils_test.go, such as waitForReplicaFieldToBeSet, which lets you wait for the replica to have the span config you expect it to have. TestExcludeDataFromBackupDoesNotHoldupGC is an example where this util is used.
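
For reference, a minimal sketch of that polling pattern, hedged: this does not reproduce waitForReplicaFieldToBeSet's actual signature, but wraps the same idea in a hypothetical helper that only relies on testutils.SucceedsSoon and a caller-supplied accessor for the gc.ttl applied to the range (it needs the testing, time, pkg/testutils, and cockroachdb/errors packages):

	// waitForGCTTL is a hypothetical stand-in for waitForReplicaFieldToBeSet: it
	// polls a caller-supplied accessor, which in the real helper reads the span
	// config applied to the replica, until the expected gc.ttl is observed.
	func waitForGCTTL(t *testing.T, want time.Duration, readTTL func() (time.Duration, error)) {
		t.Helper()
		testutils.SucceedsSoon(t, func() error {
			got, err := readTTL()
			if err != nil {
				return err
			}
			if got != want {
				return errors.Newf("gc.ttl not yet reconciled: got %s, want %s", got, want)
			}
			return nil
		})
	}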

Collaborator Author

Ah, yeah, that was my initial strategy, but some of those unfamiliar utils were not quite working on the offline table. I'll take a closer look today and try to modify them to read directly from the system tables.

Collaborator Author

After thinking more about this, I’ve realized the original bug is quite challenging to repro in a unit test. If we do indeed trust the PTS system, is it sufficient to merely check that during restore, we properly add a PTS and clean it up after the restore is done?



To directly repro the bug, we need to induce and complete a GC job on a range processing a really slow AddSSTable request. But does this repro really add more coverage? I think we can trust that the PTS system will prevent the GC threshold from advancing on the range.

Collaborator Author

see example of the alternative approach in TestProtectRestoreSpans
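
For readers following along, here is a rough sketch of what such a check could look like. This is not the actual TestProtectRestoreSpans added in this PR: the pause point name, the queries against system.protected_ts_records, and the job-status polling are assumptions layered on standard test utilities.

	import (
		"context"
		"fmt"
		"testing"

		"github.com/cockroachdb/cockroach/pkg/base"
		"github.com/cockroachdb/cockroach/pkg/testutils"
		"github.com/cockroachdb/cockroach/pkg/testutils/serverutils"
		"github.com/cockroachdb/cockroach/pkg/testutils/sqlutils"
		"github.com/cockroachdb/cockroach/pkg/util/leaktest"
		"github.com/cockroachdb/cockroach/pkg/util/log"
		"github.com/cockroachdb/errors"
	)

	func TestRestoreWritesAndReleasesPTS(t *testing.T) {
		defer leaktest.AfterTest(t)()
		defer log.Scope(t).Close(t)

		ctx := context.Background()
		dir, cleanup := testutils.TempDir(t)
		defer cleanup()
		srv, db, _ := serverutils.StartServer(t, base.TestServerArgs{ExternalIODir: dir})
		defer srv.Stopper().Stop(ctx)
		sqlDB := sqlutils.MakeSQLRunner(db)

		// Shrink the PTS cache poll interval so KV picks up new records quickly.
		sqlDB.Exec(t, `SET CLUSTER SETTING kv.protectedts.poll_interval = '1s'`)

		sqlDB.Exec(t, `CREATE DATABASE d`)
		sqlDB.Exec(t, `CREATE TABLE d.foo (k INT PRIMARY KEY)`)
		sqlDB.Exec(t, `BACKUP DATABASE d INTO 'nodelocal://1/foo'`)
		sqlDB.Exec(t, `DROP DATABASE d`)

		// Pause the restore mid-flight; the pause point name here is hypothetical.
		sqlDB.Exec(t, `SET CLUSTER SETTING jobs.debug.pausepoints = 'restore.before_flow'`)
		var jobID int64
		sqlDB.QueryRow(t, `RESTORE DATABASE d FROM LATEST IN 'nodelocal://1/foo' WITH detached`).Scan(&jobID)

		waitForJobStatus := func(want string) {
			testutils.SucceedsSoon(t, func() error {
				var got string
				sqlDB.QueryRow(t, `SELECT status FROM [SHOW JOBS] WHERE job_id = $1`, jobID).Scan(&got)
				if got != want {
					return errors.Newf("job %d has status %s, want %s", jobID, got, want)
				}
				return nil
			})
		}
		waitForJobStatus("paused")

		// While the restore is in flight, a PTS record should cover its spans.
		sqlDB.CheckQueryResults(t, `SELECT count(*) FROM system.protected_ts_records`, [][]string{{"1"}})

		// Resume and let the job finish; the record should then be released.
		sqlDB.Exec(t, `SET CLUSTER SETTING jobs.debug.pausepoints = ''`)
		sqlDB.Exec(t, fmt.Sprintf(`RESUME JOB %d`, jobID))
		waitForJobStatus("succeeded")
		sqlDB.CheckQueryResults(t, `SELECT count(*) FROM system.protected_ts_records`, [][]string{{"0"}})
	}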

@@ -1689,7 +1742,9 @@ func (r *restoreResumer) doResume(ctx context.Context, execCtx interface{}) erro
return err
}
}

if err := r.execCfg.ProtectedTimestampManager.Unprotect(ctx, r.job.ID()); err != nil {
return err
Collaborator

What happens to the restore if this fails?

Collaborator Author

Good point. Instead, we now log a warning, following in backup's footsteps. I believe there's also a background reconciler that will remove the PTS if this call fails.
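
A minimal sketch of that warn-and-continue shape, picking up the Unprotect call from the diff above (the exact log message in the PR may differ):

	// Releasing the PTS record is best-effort: a failure here shouldn't fail
	// the restore, since a background reconciler is expected to clean up
	// records belonging to finished jobs.
	if err := r.execCfg.ProtectedTimestampManager.Unprotect(ctx, r.job.ID()); err != nil {
		log.Warningf(ctx, "failed to release protected timestamp of job %d: %v", r.job.ID(), err)
	}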

@msbutler msbutler marked this pull request as ready for review November 20, 2022 22:54
@msbutler msbutler requested review from a team as code owners November 20, 2022 22:54
@msbutler msbutler requested a review from a team November 20, 2022 22:54
@msbutler msbutler force-pushed the butler-protect-restore branch 2 times, most recently from d6f4172 to 0b4d567 on November 21, 2022 12:58
@msbutler
Collaborator Author

The latest CI run revealed an interesting problem with the jobs protected timestamp manager; going to rework that now.

@msbutler msbutler force-pushed the butler-protect-restore branch 4 times, most recently from 8c3471e to 236e105 on November 30, 2022 16:34
@msbutler
Collaborator Author

msbutler commented Nov 30, 2022

I took a long look at the stress race failure picked up in Bazel Extended CI on my TestProtectRestoreSpans test, and while I can't pinpoint a root cause, I know the failure has nothing to do with protected timestamps.

Description of failure:

  • when we restore system tables in the cluster restore subtest, roleIDSeqRestoreFunc() panics because no IDs exist in the system.users table. This implies that we either didn't back up the users system table (bad) or didn't restore it properly (also bad). This race doesn't happen immediately -- it usually takes 24 runs to observe it on my gceworker. UPDATE: I've confirmed this occurs because we fail to back up the users table properly.

My investigation:

  • to repro the failure, I first had to disable tenants, as there's a different failure related to server startup. I plan to file a different issue.
  • after removing the PTS logic from restore_job.go and my test, I could reproduce the failure, implying this failure is independent of my patch.
  • Update: I can repro this panic on a ./dev stress of TestDataDriven/restore-schema-only-multiregion.
  • if I skip the cluster restore subtest, the test passes with flying colors
  • if I only run the cluster restore subtest, a different stress failure (described in panic: default zone config missing unexpectedly [now has a reliable repro] #43951) bubbles up, though with less frequency, so I'm unsure if they're related.

My proposed next steps:

  • disable stress race on this test and file separate issues for: the tenant server startup race and this mysterious oid race.

@msbutler
Collaborator Author

msbutler commented Dec 1, 2022

@adityamaru @stevendanna friendly ping on this!

@msbutler
Collaborator Author

msbutler commented Dec 2, 2022

I'm quite confident the stress-race failures are caused by #92848. When I disable admission.elastic_cpu.enabled, stress race does not seem to catch any problems.

}
target = ptpb.MakeSchemaObjectsTarget(tableIDs)
}
protectedTime := hlc.Timestamp{WallTime: timeutil.Now().UnixNano()}
Contributor

We should be protecting at details.EndTime right?

Collaborator Author

Hm, good question. I don't think it matters. The whole reason we're adding this PTS infra is to ensure the GC threshold does not creep past the AddSSTable batch timestamp. For all flavors of restore, the batch timestamp will always be greater than either a time.Now() set before ingestion or the details.EndTime.

I've added a comment to reflect this strategy. Lemme know if you disagree.

Contributor

I agree with your reasoning that this will work, but it is easier to reason about subsystems that work with time if they're all running as of a given snapshot, which is what the EndTime defines for us. See, for example, #92788 and our existing backup code, both of which work on timestamps we have deliberately picked during job record creation.

Collaborator Author

Hm, I thought of another problem with details.EndTime: it doesn't get set for non-AOST restores. Would you prefer using EndTime for AOST restores and time.Now() otherwise?

Contributor

Oh, that's a bummer; let's just leave it as is then. Anything else seems more bother than it's worth, considering we no-op on resumptions.

Collaborator Author

sounds good. thanks for reviewing!
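
For context, a sketch of where the thread lands, with the rationale written out as a comment next to the line quoted above (the wording of the actual comment added in the PR may differ):

	// Protect at a timestamp taken before ingestion starts rather than at
	// details.EndTime, which is unset for non-AOST restores. Every AddSSTable
	// the restore issues carries a batch timestamp at or above this point, so
	// protecting "now" is enough to keep the GC threshold from advancing past
	// the ingested data.
	protectedTime := hlc.Timestamp{WallTime: timeutil.Now().UnixNano()}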

@msbutler
Collaborator Author

msbutler commented Dec 5, 2022

bors r=adityamaru

@craig
Contributor

craig bot commented Dec 5, 2022

Build failed (retrying...):

@craig
Contributor

craig bot commented Dec 5, 2022

This PR was included in a batch that was canceled, it will be automatically retried

@craig
Contributor

craig bot commented Dec 6, 2022

Build succeeded:
