Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql/catalog/lease,descs: do not cross validate when waiting for leases #86690

Merged

Conversation

ajwerner
Copy link
Contributor

This change removes descriptor cross-validation when waiting for one version.
It also means that an ErrDescriptorNotFound now definitively means that the
descriptor in question does not exist. Lastly, it extends the logic in the
descs.Txn loop to fall back to just waiting for leases to drain if a descriptor
is not found. This will be used as part of #86340.

Release justification: Needed for bug fixes.

Release note: None

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@ajwerner ajwerner force-pushed the ajwerner/lease-drain-do-not-cross-validate branch 2 times, most recently from cb17ec8 to 660a17a Compare August 23, 2022 19:36
@ajwerner ajwerner marked this pull request as ready for review August 23, 2022 19:39
@ajwerner ajwerner requested a review from a team August 23, 2022 19:39
Comment on lines +128 to +135
// Use the lower-level GetCrossReferencedDescriptorsForValidation to avoid
// performing cross-descriptor validation while waiting for leases to
// drain. On the one hand, it's somewhat expensive. More importantly,
// there are valid cases where descriptors can be removed or made invalid,
// and we don't particularly care about them when using this method.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can we give a more specific example in the comment of "valid cases where descriptors can be removed or made invalid" so that the reader knows better why we don't want to do cross-descriptor validation when waiting for one lease.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added an example.

Comment on lines -1307 to -1309
var version descpb.DescriptorVersion
if desc != nil {
version = desc.GetVersion()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good that we now check on err!

But is it possible that desc and err are both nil, in which case log.Infof risks "nil pointer dereference" on desc.GetVersion()?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added commentary to indicate that WaitForOneVersion won't return nil, nil

Copy link
Contributor

@Xiang-Gu Xiang-Gu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for doing this. I've left two nits here.

Copy link
Collaborator

@fqazi fqazi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 4 of 4 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner)

@@ -0,0 +1,10 @@
exec db=defaultdb
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the purpose of this test? In other words, how is it going to test the change that we no longer do cross-descriptor validation when "waitForOneVersion"?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TLDR: it will fail with a descriptor not found error without changes in this PR.

It's because we call cleanupTempSystemTables() here in a txn, where we run DDLs using an internal executor. Prior to this PR, if I migrate that to using descs.TxnWithExector() (which is the correct interface to run DDLs with internal executor), it will give a descriptor not found error. This is because, in descs.TxnWithExecutor(), when running WaitForOneVersion, we weren’t waiting for the true descriptor id, rather we were waiting for something which referenced that id, and it was deleted.

See more discussion here (internal link)

Copy link
Contributor

@postamar postamar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a welcome change.

@postamar
Copy link
Contributor

Should #87067 reach a suitable state soon, I suggest rebasing this change onto it, the call to GetCrossReferencedDescriptorsForValidation would instead be one to MaybeGetDescriptorByIDUnvalidated.

This change removes descriptor cross-validation when waiting for one version.
It also means that an `ErrDescriptorNotFound` now definitively means that the
descriptor in question does not exist. Lastly, it extends the logic in the
descs.Txn loop to fall back to just waiting for leases to drain if a descriptor
is not found. This will be used as part of cockroachdb#86340.

Release justification: Needed for bug fixes.

Release note: None
@ajwerner ajwerner force-pushed the ajwerner/lease-drain-do-not-cross-validate branch from 660a17a to d106bd6 Compare August 31, 2022 00:19
Copy link
Contributor Author

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTR!

Comment on lines +128 to +135
// Use the lower-level GetCrossReferencedDescriptorsForValidation to avoid
// performing cross-descriptor validation while waiting for leases to
// drain. On the one hand, it's somewhat expensive. More importantly,
// there are valid cases where descriptors can be removed or made invalid,
// and we don't particularly care about them when using this method.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added an example.

Comment on lines -1307 to -1309
var version descpb.DescriptorVersion
if desc != nil {
version = desc.GetVersion()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added commentary to indicate that WaitForOneVersion won't return nil, nil

@ajwerner
Copy link
Contributor Author

bors r+

@craig
Copy link
Contributor

craig bot commented Aug 31, 2022

Build failed:

@ajwerner
Copy link
Contributor Author

flake seems unrelated

bors r+

@craig
Copy link
Contributor

craig bot commented Aug 31, 2022

Build succeeded:

@craig craig bot merged commit 3f025a3 into cockroachdb:master Aug 31, 2022
craig bot pushed a commit that referenced this pull request Sep 9, 2022
80474: gcp: make per-chunk retry upload timeout configurable r=dt a=adityamaru

This change adds a cluster setting `cloudstorage.gs.chunking.retry_timeout`
that can be used to change the default per-chunk retry timeout
that GCS imposes when chunking of file upload is enabled. The default
value is set to 60 seconds, which is double of the default google sdk
value of 30s.

This change was motivated by sporadic occurrences of a 503 service unavailable
error during backups. On its own this change is not expected to solve the
resiliency issues of backup when the upload service is unavailable, but it
is nice to have configurable setting nonetheless.

Release note (sql change): `cloudstorage.gs.chunking.retry_timeout`
is a cluster setting that can be used to configure the per-chunk retry
timeout of files to Google Cloud Storage. The default value is 60 seconds.

86696: sql/gcjob: make index GC robust to descriptors being deleted r=ajwerner a=ajwerner

First commit is #86690

If the descriptor was deleted, the GC job should exit gracefully.

Fixes #86340

Release justification: bug fix for backport

Release note (bug fix): In some scenarios, when a DROP INDEX was
run around the same time as a DROP TABLE or DROP DATABASE covering the same
data, the `DROP INDEX` gc job could get caught retrying indefinitely. This
has been fixed.

87378: builtins: stream consistency checker output r=pavelkalinnikov a=tbg

Also makes it resilient to per-Range errors, which now no longer
tank the entire operation.

```sql
-- avoid buffering in cli
\set display_format=csv;
-- avoid rows getting buffered at server
set avoid_buffering=true;
-- compute as fast as possible
SET CLUSTER SETTING server.consistency_check.max_rate = '1tb';

SELECT * FROM crdb_internal.check_consistency(false, '', '');
```

Release justification: improvement for a debugging-related feature
Release note: None


87657: Makefile: always use `cockroach-short` for file generation r=knz a=rickystewart

This defaults to the full `cockroach` executable which requires pulling in all the UI stuff. Use `cockroach-short` to make generation require fewer dependencies.

Release note: None

87701: testccl/workload/schemachange: skip random schema test r=ajwerner a=ajwerner

This is very flakey. Some of it is due to #87672. Some of it was due to #85677. There are some issues with inserts which need to be fixed. Until this stabilizes, it's causing problems.

Along the way, I'm marking #78478 as a GA blocker so we do actually fix it.

Release note: None

Co-authored-by: Aditya Maru <adityamaru@gmail.com>
Co-authored-by: Andrew Werner <awerner32@gmail.com>
Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
Co-authored-by: Ricky Stewart <ricky@cockroachlabs.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants