gcjob: index GC can retry perpetually when racing with table GC #86340
Labels:
- C-bug — Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
- T-sql-foundations — SQL Foundations Team (formerly SQL Schema + SQL Sessions)
ajwerner added the C-bug label on Aug 17, 2022
ajwerner added a commit to ajwerner/cockroach that referenced this issue on Aug 23, 2022:

This change removes descriptor cross-validation when waiting for one version. It also means that an `ErrDescriptorNotFound` now definitively means that the descriptor in question does not exist. Lastly, it extends the logic in the descs.Txn loop to fall back to just waiting for leases to drain if a descriptor is not found. This will be used as part of cockroachdb#86340.

Release justification: Needed for bug fixes.
Release note: None
ajwerner added a commit to ajwerner/cockroach that referenced this issue on Aug 23, 2022:

If the descriptor was deleted, the GC job should exit gracefully. Fixes cockroachdb#86340.

Release justification: bug fix for backport
Release note (bug fix): In some scenarios, when a DROP INDEX was run around the same time as a DROP TABLE or DROP DATABASE covering the same data, the `DROP INDEX` GC job could get caught retrying indefinitely. This has been fixed.
ajwerner added a commit to ajwerner/cockroach that referenced this issue on Aug 31, 2022:

This change removes descriptor cross-validation when waiting for one version. It also means that an `ErrDescriptorNotFound` now definitively means that the descriptor in question does not exist. Lastly, it extends the logic in the descs.Txn loop to fall back to just waiting for leases to drain if a descriptor is not found. This will be used as part of cockroachdb#86340.

Release justification: Needed for bug fixes.
Release note: None
craig bot pushed a commit that referenced this issue on Aug 31, 2022:

86690: sql/catalog/lease,descs: do not cross validate when waiting for leases r=ajwerner a=ajwerner

This change removes descriptor cross-validation when waiting for one version. It also means that an `ErrDescriptorNotFound` now definitively means that the descriptor in question does not exist. Lastly, it extends the logic in the descs.Txn loop to fall back to just waiting for leases to drain if a descriptor is not found. This will be used as part of #86340.

Release justification: Needed for bug fixes.
Release note: None

Co-authored-by: Andrew Werner <awerner32@gmail.com>
ajwerner added a commit to ajwerner/cockroach that referenced this issue on Sep 1, 2022:

If the descriptor was deleted, the GC job should exit gracefully. Fixes cockroachdb#86340.

Release justification: bug fix for backport
Release note (bug fix): In some scenarios, when a DROP INDEX was run around the same time as a DROP TABLE or DROP DATABASE covering the same data, the `DROP INDEX` GC job could get caught retrying indefinitely. This has been fixed.
craig bot pushed a commit that referenced this issue on Sep 9, 2022:

80474: gcp: make per-chunk retry upload timeout configurable r=dt a=adityamaru

This change adds a cluster setting `cloudstorage.gs.chunking.retry_timeout` that can be used to change the default per-chunk retry timeout that GCS imposes when chunking of a file upload is enabled. The default value is set to 60 seconds, which is double the default Google SDK value of 30s. This change was motivated by sporadic occurrences of a 503 Service Unavailable error during backups. On its own this change is not expected to solve the resiliency issues of backup when the upload service is unavailable, but it is a nice configurable setting to have nonetheless.

Release note (sql change): `cloudstorage.gs.chunking.retry_timeout` is a cluster setting that can be used to configure the per-chunk retry timeout of file uploads to Google Cloud Storage. The default value is 60 seconds.

86696: sql/gcjob: make index GC robust to descriptors being deleted r=ajwerner a=ajwerner

First commit is #86690.

If the descriptor was deleted, the GC job should exit gracefully. Fixes #86340

Release justification: bug fix for backport
Release note (bug fix): In some scenarios, when a DROP INDEX was run around the same time as a DROP TABLE or DROP DATABASE covering the same data, the `DROP INDEX` GC job could get caught retrying indefinitely. This has been fixed.

87378: builtins: stream consistency checker output r=pavelkalinnikov a=tbg

Also makes it resilient to per-Range errors, which now no longer tank the entire operation.

```sql
-- avoid buffering in cli
\set display_format=csv;
-- avoid rows getting buffered at server
set avoid_buffering=true;
-- compute as fast as possible
SET CLUSTER SETTING server.consistency_check.max_rate = '1tb';
SELECT * FROM crdb_internal.check_consistency(false, '', '');
```

Release justification: improvement for a debugging-related feature
Release note: None

87657: Makefile: always use `cockroach-short` for file generation r=knz a=rickystewart

This defaults to the full `cockroach` executable, which requires pulling in all the UI stuff. Use `cockroach-short` to make generation require fewer dependencies.

Release note: None

87701: testccl/workload/schemachange: skip random schema test r=ajwerner a=ajwerner

This is very flaky. Some of it is due to #87672. Some of it was due to #85677. There are some issues with inserts which need to be fixed. Until this stabilizes, it's causing problems. Along the way, I'm marking #78478 as a GA blocker so we do actually fix it.

Release note: None

Co-authored-by: Aditya Maru <adityamaru@gmail.com>
Co-authored-by: Andrew Werner <awerner32@gmail.com>
Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
Co-authored-by: Ricky Stewart <ricky@cockroachlabs.com>
blathers-crl bot pushed a commit that referenced this issue on Sep 9, 2022:

If the descriptor was deleted, the GC job should exit gracefully. Fixes #86340.

Release justification: bug fix for backport
Release note (bug fix): In some scenarios, when a DROP INDEX was run around the same time as a DROP TABLE or DROP DATABASE covering the same data, the `DROP INDEX` GC job could get caught retrying indefinitely. This has been fixed.
healthy-pod added the T-sql-foundations label and removed the T-sql-schema-deprecated label (use T-sql-foundations instead) on May 17, 2023
Describe the problem
In 22.1, we added exponential backoff and indefinite retries to jobs which cannot be canceled or fail. If you, say, drop an index or truncate a table and then drop the table, the index GC job can hit errors because the table, or a table that referenced it, has been dropped.

The problem is that there is no proper error handling in the index GC code to detect missing descriptors and bail out. The code also validates cross-references when it retrieves the descriptor in question, which means that we can get errors if a referenced descriptor is dropped (however, if a referenced descriptor is dropped, so is this one).

We need to add code that handles missing descriptors and assumes that a missing descriptor means the data and zone configs are already gone.
Jira issue: CRDB-18704