gcjob: index GC can retry perpetually when racing with table GC #86340

ajwerner · 2022-08-17T21:34:29Z

Describe the problem

In 22.1, we added exponential backoff and indefinite retries to jobs which cannot be canceled or fail. If you were to say, drop and index or truncate a table and then drop the table, we can hit errors due to the table being dropped or the table which referenced that table being dropped.

The problem is that there's no proper error handling in the index code of the GC job to detect missing descriptors and bail out. The code also ends up validating cross-references when it retrieves the descriptor in question, which means that we can get errors if a referenced descriptor is dropped (however, if a referenced descriptor is dropped, so is this one).

We need to add code to handle missing descriptor and assume that it means that the data and zones are gone.

Jira issue: CRDB-18704

This change removes descriptor cross-validation when waiting for one version. It also means that an `ErrDescriptorNotFound` now definitively means that the descriptor in question does not exist. Lastly, it extends the logic in the descs.Txn loop to fall back to just waiting for leases to drain if a descriptor is not found. This will be used as part of cockroachdb#86340. Release justification: Needed for bug fixes. Release note: None

If the descriptor was deleted, the GC job should exit gracefully. Fixes cockroachdb#86340 Release justification: bug fix for backport Release note (bug fix): In some scenarios, when a DROP INDEX was run around the same time as a DROP TABLE or DROP DATABASE covering the same data, the `DROP INDEX` gc job could get caught retrying indefinitely. This has been fixed.

This change removes descriptor cross-validation when waiting for one version. It also means that an `ErrDescriptorNotFound` now definitively means that the descriptor in question does not exist. Lastly, it extends the logic in the descs.Txn loop to fall back to just waiting for leases to drain if a descriptor is not found. This will be used as part of cockroachdb#86340. Release justification: Needed for bug fixes. Release note: None

86690: sql/catalog/lease,descs: do not cross validate when waiting for leases r=ajwerner a=ajwerner This change removes descriptor cross-validation when waiting for one version. It also means that an `ErrDescriptorNotFound` now definitively means that the descriptor in question does not exist. Lastly, it extends the logic in the descs.Txn loop to fall back to just waiting for leases to drain if a descriptor is not found. This will be used as part of #86340. Release justification: Needed for bug fixes. Release note: None Co-authored-by: Andrew Werner <awerner32@gmail.com>

If the descriptor was deleted, the GC job should exit gracefully. Fixes cockroachdb#86340 Release justification: bug fix for backport Release note (bug fix): In some scenarios, when a DROP INDEX was run around the same time as a DROP TABLE or DROP DATABASE covering the same data, the `DROP INDEX` gc job could get caught retrying indefinitely. This has been fixed.

80474: gcp: make per-chunk retry upload timeout configurable r=dt a=adityamaru This change adds a cluster setting `cloudstorage.gs.chunking.retry_timeout` that can be used to change the default per-chunk retry timeout that GCS imposes when chunking of file upload is enabled. The default value is set to 60 seconds, which is double of the default google sdk value of 30s. This change was motivated by sporadic occurrences of a 503 service unavailable error during backups. On its own this change is not expected to solve the resiliency issues of backup when the upload service is unavailable, but it is nice to have configurable setting nonetheless. Release note (sql change): `cloudstorage.gs.chunking.retry_timeout` is a cluster setting that can be used to configure the per-chunk retry timeout of files to Google Cloud Storage. The default value is 60 seconds. 86696: sql/gcjob: make index GC robust to descriptors being deleted r=ajwerner a=ajwerner First commit is #86690 If the descriptor was deleted, the GC job should exit gracefully. Fixes #86340 Release justification: bug fix for backport Release note (bug fix): In some scenarios, when a DROP INDEX was run around the same time as a DROP TABLE or DROP DATABASE covering the same data, the `DROP INDEX` gc job could get caught retrying indefinitely. This has been fixed. 87378: builtins: stream consistency checker output r=pavelkalinnikov a=tbg Also makes it resilient to per-Range errors, which now no longer tank the entire operation. ```sql -- avoid buffering in cli \set display_format=csv; -- avoid rows getting buffered at server set avoid_buffering=true; -- compute as fast as possible SET CLUSTER SETTING server.consistency_check.max_rate = '1tb'; SELECT * FROM crdb_internal.check_consistency(false, '', ''); ``` Release justification: improvement for a debugging-related feature Release note: None 87657: Makefile: always use `cockroach-short` for file generation r=knz a=rickystewart This defaults to the full `cockroach` executable which requires pulling in all the UI stuff. Use `cockroach-short` to make generation require fewer dependencies. Release note: None 87701: testccl/workload/schemachange: skip random schema test r=ajwerner a=ajwerner This is very flakey. Some of it is due to #87672. Some of it was due to #85677. There are some issues with inserts which need to be fixed. Until this stabilizes, it's causing problems. Along the way, I'm marking #78478 as a GA blocker so we do actually fix it. Release note: None Co-authored-by: Aditya Maru <adityamaru@gmail.com> Co-authored-by: Andrew Werner <awerner32@gmail.com> Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com> Co-authored-by: Ricky Stewart <ricky@cockroachlabs.com>

If the descriptor was deleted, the GC job should exit gracefully. Fixes #86340 Release justification: bug fix for backport Release note (bug fix): In some scenarios, when a DROP INDEX was run around the same time as a DROP TABLE or DROP DATABASE covering the same data, the `DROP INDEX` gc job could get caught retrying indefinitely. This has been fixed.

ajwerner added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Aug 17, 2022

ajwerner added this to Triage in SQL Foundations via automation Aug 23, 2022

ajwerner self-assigned this Aug 23, 2022

blathers-crl bot added the T-sql-schema-deprecated Use T-sql-foundations instead label Aug 23, 2022

ajwerner mentioned this issue Aug 23, 2022

sql/catalog/lease,descs: do not cross validate when waiting for leases #86690

Merged

ajwerner mentioned this issue Aug 23, 2022

sql/gcjob: make index GC robust to descriptors being deleted #86696

Merged

postamar moved this from Triage to General in SQL Foundations Aug 30, 2022

craig bot closed this as completed in ed2e090 Sep 9, 2022

SQL Foundations automation moved this from General to Closed Sep 9, 2022

blathers-crl bot mentioned this issue Sep 9, 2022

release-22.2: sql/gcjob: make index GC robust to descriptors being deleted #87721

Merged

healthy-pod added T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) and removed T-sql-schema-deprecated Use T-sql-foundations instead labels May 17, 2023

blathers-crl bot added this to Done [after migration] in SQL Foundations May 17, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

gcjob: index GC can retry perpetually when racing with table GC #86340

gcjob: index GC can retry perpetually when racing with table GC #86340

ajwerner commented Aug 17, 2022 •

edited by cockroach-jira-scripts

gcjob: index GC can retry perpetually when racing with table GC #86340

gcjob: index GC can retry perpetually when racing with table GC #86340

Comments

ajwerner commented Aug 17, 2022 • edited by cockroach-jira-scripts

ajwerner commented Aug 17, 2022 •

edited by cockroach-jira-scripts