
sql/gcjob: make index GC robust to descriptors being deleted #86696

Merged · 1 commit · Sep 9, 2022

Conversation

ajwerner (Contributor):

First commit is #86690

If the descriptor was deleted, the GC job should exit gracefully.

Fixes #86340

Release justification: bug fix for backport

Release note (bug fix): In some scenarios, when a DROP INDEX was run around the same time as a DROP TABLE or DROP DATABASE covering the same data, the DROP INDEX GC job could get caught retrying indefinitely. This has been fixed.
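
An illustrative sketch (not the PR's exact diff) of the shape of the fix: treat catalog.ErrDescriptorNotFound from the descriptor lookup as "another GC process already owns this data" rather than as a retryable error. The control flow below is a reconstruction for illustration; only sql.WaitToUpdateLeases and the sentinel error come from the actual change.

	// Wait for old leases on the table descriptor to drain before GC'ing indexes.
	parentDesc, err := sql.WaitToUpdateLeases(ctx, execCfg.LeaseManager, parentID)
	if err != nil {
		if errors.Is(err, catalog.ErrDescriptorNotFound) {
			// The descriptor is gone: a concurrent DROP TABLE / DROP DATABASE
			// GC job owns the data, so exit gracefully instead of retrying.
			return nil
		}
		return err
	}
	_ = parentDesc // proceed with index GC against the live descriptor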

ajwerner requested a review from a team on August 23, 2022 19:39
cockroach-teamcity (Member):

This change is Reviewable

ajwerner force-pushed the ajwerner/gcjob-index-robustness branch 2 times, most recently from 4ec7b76 to f181f82, on August 23, 2022 21:11
// Before deleting any indexes, ensure that old versions of the table descriptor
// are no longer in use. This is necessary in the case of truncate, where we
// schedule a GC Job in the transaction that commits the truncation.
parentDesc, err := sql.WaitToUpdateLeases(ctx, execCfg.LeaseManager, parentID)
if maybeHandleDeletedDescriptor(err) {
Contributor:

I feel like an easier-to-understand implementation is

	if errors.Is(err, catalog.ErrDescriptorNotFound) {
		// If the descriptor has been removed, then we need to assume that the relevant
		// zone configs and data have been cleaned up by another process.
		handleDeletedDescriptor()
		return
	}

and accordingly, change the other function to

	handleDeletedDescriptor := func() {
		log.Infof(ctx, "descriptor %d dropped, assuming another process has handled GC", parentID)
		for _, index := range droppedIndexes {
			markIndexGCed(
				ctx, index.IndexID, progress, jobspb.SchemaChangeGCProgress_CLEARED,
			)
		}
	}

But I don't really have a strong opinion on this; either way is perfectly fine with me.

@@ -84,11 +102,28 @@ func gcIndexes(
if log.V(2) {
log.Infof(ctx, "GC is being considered on table %d for indexes indexes: %+v", parentID, droppedIndexes)
}
maybeHandleDeletedDescriptor := func(err error) (done bool) {
Xiang-Gu (Contributor) commented Aug 29, 2022:

This is reused, so maybe consider making it a standalone function?
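
A sketch of what that standalone form might look like (hypothetical; the PR kept the closure, and markIndexGCed plus the jobspb types are assumed from the surrounding gcjob package):

	// maybeHandleDeletedDescriptor reports whether err indicates the parent
	// descriptor was already deleted; if so, it marks every dropped index as
	// cleared, on the assumption that another GC process handled the data.
	func maybeHandleDeletedDescriptor(
		ctx context.Context,
		err error,
		parentID descpb.ID,
		droppedIndexes []jobspb.SchemaChangeGCDetails_DroppedIndex,
		progress *jobspb.SchemaChangeGCProgress,
	) (done bool) {
		if !errors.Is(err, catalog.ErrDescriptorNotFound) {
			return false
		}
		log.Infof(ctx, "descriptor %d dropped, assuming another process has handled GC", parentID)
		for _, index := range droppedIndexes {
			markIndexGCed(ctx, index.IndexID, progress, jobspb.SchemaChangeGCProgress_CLEARED)
		}
		return true
	}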

Comment on lines 548 to 553
jobID := <-gcJobID
go func() {
k := catalogkeys.MakeDescMetadataKey(codec, tableID)
_, err := kvDB.Del(ctx, k)
errCh <- err
}()
Contributor:

Is the ordering of these two steps reversed? The DROP INDEX statement above will ensure we stall the index-GC job (due to the testing knob). We want to test the scenario where the index-GC job fails to find the table descriptor, so shouldn't we drop the table first, before we allow the index-GC job to proceed?

ajwerner (Author):

There's no guarantee of ordering between the two. I don't think it changes the test to put the goroutine first, but I'll do it to clarify the intention.

Contributor:

I still don't understand this one: gcJobID is a 0-capacity channel, so jobID := <-gcJobID will unblock the DROP INDEX first, and only then do we schedule a goroutine that deletes the descriptor from the system.descriptor table, which means the DROP INDEX GC job may or may not observe the deletion.

Shouldn't we make sure that we delete the descriptor before we "let go" of the DROP INDEX GC job, which would ensure that the GC job observes the deletion?

ajwerner (Author):

The job has a bunch of work to do before it gets to the point of checking whether or not the descriptor is there. I guess what I'd say is if I switched the order, it would probably not affect the test much. Both statements more or less just make a goroutine runnable. I promise this test failed before this patch.
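
To make the scheduling point concrete, here is a standalone sketch (not from the PR's test) showing that receiving from an unbuffered channel only makes the blocked sender runnable again, and that a goroutine spawned right after the receive has no ordering guarantee relative to it:

	package main

	import (
		"fmt"
		"sync"
	)

	func main() {
		ch := make(chan int) // unbuffered, like gcJobID in the test
		var wg sync.WaitGroup
		wg.Add(2)

		go func() { // stands in for the stalled GC job
			defer wg.Done()
			ch <- 42 // blocks until main receives
			fmt.Println("sender resumed")
		}()

		<-ch        // unblocks the sender but does not run it to completion
		go func() { // stands in for the descriptor-deleting goroutine
			defer wg.Done()
			fmt.Println("deleter ran")
		}()

		// Both goroutines are merely runnable after the receive, so the two
		// Println lines can fire in either order.
		wg.Wait()
	}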

ajwerner (Author):

If this test were better, it'd intercept the behavior of the GC job and delete the descriptor at specific moments, but I didn't feel it was worth it, so I just ran the test under stress for a few minutes and called it a day.

ajwerner (Author):

I made the test better.

ajwerner force-pushed the ajwerner/gcjob-index-robustness branch 2 times, most recently from aef7663 to 0c38b19, on September 1, 2022 21:42
Xiang-Gu (Contributor) commented Sep 6, 2022:

@ajwerner I didn't forget about this one; let me know once you've added more commentary on the test case, and I'll take a look again.

Commit message:

If the descriptor was deleted, the GC job should exit gracefully.

Fixes cockroachdb#86340

Release justification: bug fix for backport

Release note (bug fix): In some scenarios, when a DROP INDEX was run around the same time as a DROP TABLE or DROP DATABASE covering the same data, the `DROP INDEX` GC job could get caught retrying indefinitely. This has been fixed.
ajwerner force-pushed the ajwerner/gcjob-index-robustness branch from 0c38b19 to ed2e090 on September 7, 2022 20:51
ajwerner (Author) commented Sep 7, 2022:

@Xiang-Gu I added more commentary, PTAL

ajwerner added the backport-22.1.x and backport-22.2.x labels on Sep 8, 2022
// the DeleteRange operation. To do this, we install the below testing knob.
if !beforeDelRange {
knobs.Store = &kvserver.StoreTestingKnobs{
TestingRequestFilter: func(
Contributor:

When beforeDelRange=false, this knob is installed, but the knob is called before evaluating the DeleteRange request, which means we delete the descriptor before evaluating the DeleteRange. This seems not to achieve what you said in the comment above: "... descriptor being removed both before the initial DelRange, and after, when going to remove the zone config".

ajwerner (Author):

Correct, it's before evaluating DeleteRange, but after the code which looked up the descriptor in order to build the DeleteRange.

Contributor:

Got ya, thanks for explaining, LGTM!
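
For context, a rough sketch of the knob shape under discussion, assuming the 22.x-era kvserver.StoreTestingKnobs.TestingRequestFilter signature; deleteTableDescriptor and the sync.Once guard are hypothetical stand-ins for the test's actual plumbing:

	var once sync.Once
	knobs.Store = &kvserver.StoreTestingKnobs{
		TestingRequestFilter: func(ctx context.Context, ba roachpb.BatchRequest) *roachpb.Error {
			// The filter runs before a batch is evaluated. For the
			// "after DelRange" variant, the descriptor is deleted here:
			// after the GC job has read the descriptor to build the
			// DeleteRange, but before the request itself is evaluated.
			if _, ok := ba.GetArg(roachpb.DeleteRange); ok {
				once.Do(func() { deleteTableDescriptor(ctx) }) // hypothetical helper
			}
			return nil
		},
	}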

ajwerner (Author) commented Sep 8, 2022:

TFTR!

bors r+

craig bot commented Sep 8, 2022:

Build failed (retrying...):

craig bot commented Sep 8, 2022:

Build failed:

ajwerner (Author) commented Sep 9, 2022:

bors r+

craig bot commented Sep 9, 2022:

Build failed (retrying...):

craig bot commented Sep 9, 2022:

Build failed:

ajwerner (Author) commented Sep 9, 2022:

bors r+

craig bot commented Sep 9, 2022:

Build failed (retrying...):

craig bot commented Sep 9, 2022:

Build succeeded:

blathers-crl bot commented Sep 9, 2022:

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from ed2e090 to blathers/backport-release-22.1-86696: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 22.1.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

Labels: backport-22.1.x (22.1 is EOL), backport-22.2.x (flags PRs that need to be backported to 22.2)

Successfully merging this pull request may close this issue:

gcjob: index GC can retry perpetually when racing with table GC

3 participants