workload: treat UndefinedTable as retryable during op generation #167148
shghasemi wants to merge 1 commit into cockroachdb:master
fqazi
left a comment
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on golgeek, shailendra-patel, and shghasemi).
-- commits line 4 at r1:
I think the problem is the dropped table being returned from pg_depend. Do we know why that's happening?
Queries against pg_depend can fail with error 42P01 (UndefinedTable) when a referenced descriptor is concurrently dropped. We can treat this as a retryable error instead of a fatal workload failure.
Fixes cockroachdb#166377
Release note: None
311bada to bce6eea
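The old commit message was a bit misleading. Here's what I found in the logs. I think a concurrent query on pgCatalogDependTable is failing because the table is being dropped. #166377 (comment)
My guess is that this happens because the internalLookupCtx has no leasing of the descriptors. So in a rare case, we might end up initializing the internalLookupCtx right before the table drop completes. Then populating pgCatalogDependTable runs into an error because it sees the old FK reference:
cockroach/pkg/sql/pg_catalog.go
Lines 1924 to 1929 in 11dc408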
fqazi
left a comment
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on shghasemi).
Previously, shghasemi wrote…
The old commit message was a bit misleading. Here's what I found in the logs. I think a concurrent query on pgCatalogDependTable is failing because the table is being dropped. #166377 (comment)
My guess is that this happens because the internalLookupCtx has no leasing of the descriptors. So in a rare case, we might end up initializing the internalLookupCtx right before the table drop completes. Then populating pgCatalogDependTable runs into an error because it sees the old FK reference:
cockroach/pkg/sql/pg_catalog.go
Lines 1924 to 1929 in 11dc408
The fallback function is what should do the lookup in that case:
cockroach/pkg/sql/information_schema.go
Line 2830 in 844de22
Oh, the backport wasn't in the failing branch: #165274. So I think we are good to close the issue, since it should be fixed in the 26.1.2 release.
spilchen
left a comment
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on fqazi and shghasemi).
Previously, fqazi (Faizan Qazi) wrote…
The fallback function is what should do the lookup in that case:
cockroach/pkg/sql/information_schema.go
Line 2830 in 844de22
So, I'm surprised that didn't save 🤔
Is there a check for dropped descriptors anywhere in pg_depend? I remember we had problems with dropped descriptors in pg_constraint and added a specific skip. A new test was added to TestAlterTableDMLInjection that queried that table during a drop column. We may be able to use the same approach for pg_depend to get a more reliable repro.
@fqazi You are right! This fix wasn't in the failing branch, and it should resolve the problem. @spilchen TestAlterTableDMLInjection can't repro this issue. As I understand it, the "query" is run separately at each stage of the schema change. To reproduce this issue, we need a single query that starts at an early stage and is still running at a later stage.
@shghasemi I think DML injection may not be sufficient to reproduce this, because there is a WaitForOneVersion between stages. If we want an independent repro, you probably want a hook that lets you inject statements before that wait, run under stress.
Queries against pg_depend that use REGCLASS casts can fail with 42P01 (UndefinedTable) when a referenced descriptor is concurrently dropped, producing an invalid relation name. We can treat this as a retryable error instead of a fatal workload failure.
Fixes #166377
Release note: None
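For illustration only, the idea in the description (classify SQLSTATE 42P01 as retryable rather than fatal while generating an operation) could be sketched roughly as below. This is not the code in this PR: `sqlStateError`, `fakePGError`, `isRetryableOpGenError`, and `generateWithRetry` are hypothetical names invented for the sketch, and the real workload has its own error classification and retry machinery.

```go
package main

import (
	"errors"
	"fmt"
)

// pgCodeUndefinedTable is the PostgreSQL SQLSTATE for undefined_table.
const pgCodeUndefinedTable = "42P01"

// sqlStateError is a hypothetical stand-in for a driver error that
// exposes its SQLSTATE code.
type sqlStateError interface {
	error
	SQLState() string
}

// fakePGError is an illustrative error type used only in this sketch.
type fakePGError struct{ code, msg string }

func (e *fakePGError) Error() string    { return e.code + ": " + e.msg }
func (e *fakePGError) SQLState() string { return e.code }

// isRetryableOpGenError reports whether err (or any error it wraps)
// carries SQLSTATE 42P01.
func isRetryableOpGenError(err error) bool {
	var se sqlStateError
	return errors.As(err, &se) && se.SQLState() == pgCodeUndefinedTable
}

// generateWithRetry calls gen until it succeeds, returns a
// non-retryable error, or maxAttempts is exhausted.
func generateWithRetry(maxAttempts int, gen func() (string, error)) (string, error) {
	var lastErr error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		op, err := gen()
		if err == nil {
			return op, nil
		}
		if !isRetryableOpGenError(err) {
			return "", err // anything else stays fatal
		}
		lastErr = err
	}
	return "", fmt.Errorf("giving up after %d attempts: %w", maxAttempts, lastErr)
}

func main() {
	calls := 0
	op, err := generateWithRetry(3, func() (string, error) {
		calls++
		if calls == 1 {
			// Simulate a pg_depend query racing with a concurrent drop.
			return "", fmt.Errorf("querying pg_depend: %w",
				&fakePGError{code: pgCodeUndefinedTable, msg: "relation does not exist"})
		}
		return "ALTER TABLE t ADD COLUMN c INT", nil
	})
	fmt.Println(op, err, calls)
}
```

The retry is bounded so that a persistent UndefinedTable error still surfaces as a failure instead of looping forever; the limit of 3 is an arbitrary choice for the sketch.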