Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gcjob: index GC can retry perpetually when racing with table GC #86340

Closed
ajwerner opened this issue Aug 17, 2022 · 0 comments · Fixed by #86696
Closed

gcjob: index GC can retry perpetually when racing with table GC #86340

ajwerner opened this issue Aug 17, 2022 · 0 comments · Fixed by #86696
Assignees
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions)

Comments

@ajwerner
Copy link
Contributor

ajwerner commented Aug 17, 2022

Describe the problem

In 22.1, we added exponential backoff and indefinite retries to jobs which cannot be canceled or fail. If you were to say, drop and index or truncate a table and then drop the table, we can hit errors due to the table being dropped or the table which referenced that table being dropped.

The problem is that there's no proper error handling in the index code of the GC job to detect missing descriptors and bail out. The code also ends up validating cross-references when it retrieves the descriptor in question, which means that we can get errors if a referenced descriptor is dropped (however, if a referenced descriptor is dropped, so is this one).

We need to add code to handle missing descriptor and assume that it means that the data and zones are gone.

Jira issue: CRDB-18704

@ajwerner ajwerner added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Aug 17, 2022
@ajwerner ajwerner added this to Triage in SQL Foundations via automation Aug 23, 2022
@ajwerner ajwerner self-assigned this Aug 23, 2022
@blathers-crl blathers-crl bot added the T-sql-schema-deprecated Use T-sql-foundations instead label Aug 23, 2022
ajwerner added a commit to ajwerner/cockroach that referenced this issue Aug 23, 2022
This change removes descriptor cross-validation when waiting for one version.
It also means that an `ErrDescriptorNotFound` now definitively means that the
descriptor in question does not exist. Lastly, it extends the logic in the
descs.Txn loop to fall back to just waiting for leases to drain if a descriptor
is not found. This will be used as part of cockroachdb#86340.

Release justification: Needed for bug fixes.

Release note: None
ajwerner added a commit to ajwerner/cockroach that referenced this issue Aug 23, 2022
If the descriptor was deleted, the GC job should exit gracefully.

Fixes cockroachdb#86340

Release justification: bug fix for backport

Release note (bug fix): In some scenarios, when a DROP INDEX was
run around the same time as a DROP TABLE or DROP DATABASE covering the same
data, the `DROP INDEX` gc job could get caught retrying indefinitely. This
has been fixed.
@postamar postamar moved this from Triage to General in SQL Foundations Aug 30, 2022
ajwerner added a commit to ajwerner/cockroach that referenced this issue Aug 31, 2022
This change removes descriptor cross-validation when waiting for one version.
It also means that an `ErrDescriptorNotFound` now definitively means that the
descriptor in question does not exist. Lastly, it extends the logic in the
descs.Txn loop to fall back to just waiting for leases to drain if a descriptor
is not found. This will be used as part of cockroachdb#86340.

Release justification: Needed for bug fixes.

Release note: None
craig bot pushed a commit that referenced this issue Aug 31, 2022
86690: sql/catalog/lease,descs: do not cross validate when waiting for leases r=ajwerner a=ajwerner

This change removes descriptor cross-validation when waiting for one version.
It also means that an `ErrDescriptorNotFound` now definitively means that the
descriptor in question does not exist. Lastly, it extends the logic in the
descs.Txn loop to fall back to just waiting for leases to drain if a descriptor
is not found. This will be used as part of #86340.

Release justification: Needed for bug fixes.

Release note: None

Co-authored-by: Andrew Werner <awerner32@gmail.com>
ajwerner added a commit to ajwerner/cockroach that referenced this issue Sep 1, 2022
If the descriptor was deleted, the GC job should exit gracefully.

Fixes cockroachdb#86340

Release justification: bug fix for backport

Release note (bug fix): In some scenarios, when a DROP INDEX was
run around the same time as a DROP TABLE or DROP DATABASE covering the same
data, the `DROP INDEX` gc job could get caught retrying indefinitely. This
has been fixed.
craig bot pushed a commit that referenced this issue Sep 9, 2022
80474: gcp: make per-chunk retry upload timeout configurable r=dt a=adityamaru

This change adds a cluster setting `cloudstorage.gs.chunking.retry_timeout`
that can be used to change the default per-chunk retry timeout
that GCS imposes when chunking of file upload is enabled. The default
value is set to 60 seconds, which is double of the default google sdk
value of 30s.

This change was motivated by sporadic occurrences of a 503 service unavailable
error during backups. On its own this change is not expected to solve the
resiliency issues of backup when the upload service is unavailable, but it
is nice to have configurable setting nonetheless.

Release note (sql change): `cloudstorage.gs.chunking.retry_timeout`
is a cluster setting that can be used to configure the per-chunk retry
timeout of files to Google Cloud Storage. The default value is 60 seconds.

86696: sql/gcjob: make index GC robust to descriptors being deleted r=ajwerner a=ajwerner

First commit is #86690

If the descriptor was deleted, the GC job should exit gracefully.

Fixes #86340

Release justification: bug fix for backport

Release note (bug fix): In some scenarios, when a DROP INDEX was
run around the same time as a DROP TABLE or DROP DATABASE covering the same
data, the `DROP INDEX` gc job could get caught retrying indefinitely. This
has been fixed.

87378: builtins: stream consistency checker output r=pavelkalinnikov a=tbg

Also makes it resilient to per-Range errors, which now no longer
tank the entire operation.

```sql
-- avoid buffering in cli
\set display_format=csv;
-- avoid rows getting buffered at server
set avoid_buffering=true;
-- compute as fast as possible
SET CLUSTER SETTING server.consistency_check.max_rate = '1tb';

SELECT * FROM crdb_internal.check_consistency(false, '', '');
```

Release justification: improvement for a debugging-related feature
Release note: None


87657: Makefile: always use `cockroach-short` for file generation r=knz a=rickystewart

This defaults to the full `cockroach` executable which requires pulling in all the UI stuff. Use `cockroach-short` to make generation require fewer dependencies.

Release note: None

87701: testccl/workload/schemachange: skip random schema test r=ajwerner a=ajwerner

This is very flakey. Some of it is due to #87672. Some of it was due to #85677. There are some issues with inserts which need to be fixed. Until this stabilizes, it's causing problems.

Along the way, I'm marking #78478 as a GA blocker so we do actually fix it.

Release note: None

Co-authored-by: Aditya Maru <adityamaru@gmail.com>
Co-authored-by: Andrew Werner <awerner32@gmail.com>
Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
Co-authored-by: Ricky Stewart <ricky@cockroachlabs.com>
@craig craig bot closed this as completed in ed2e090 Sep 9, 2022
SQL Foundations automation moved this from General to Closed Sep 9, 2022
blathers-crl bot pushed a commit that referenced this issue Sep 9, 2022
If the descriptor was deleted, the GC job should exit gracefully.

Fixes #86340

Release justification: bug fix for backport

Release note (bug fix): In some scenarios, when a DROP INDEX was
run around the same time as a DROP TABLE or DROP DATABASE covering the same
data, the `DROP INDEX` gc job could get caught retrying indefinitely. This
has been fixed.
@healthy-pod healthy-pod added T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) and removed T-sql-schema-deprecated Use T-sql-foundations instead labels May 17, 2023
@blathers-crl blathers-crl bot added this to Done [after migration] in SQL Foundations May 17, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions)
Projects
SQL Foundations
  
Done [after migration]
Development

Successfully merging a pull request may close this issue.

2 participants