Skip to content

workload: treat UndefinedTable as retryable during op generation#167148

Closed
shghasemi wants to merge 1 commit intocockroachdb:masterfrom
shghasemi:deflake-opgen
Closed

workload: treat UndefinedTable as retryable during op generation#167148
shghasemi wants to merge 1 commit intocockroachdb:masterfrom
shghasemi:deflake-opgen

Conversation

@shghasemi
Copy link
Copy Markdown
Contributor

Queries against pg_depend that use REGCLASS casts can fail with 42P01 (UndefinedTable) when a referenced descriptor is concurrently dropped, producing an invalid relation name. We can treat this as a retryable error instead of a fatal workload failure.

Fixes #166377

Release note: None

@shghasemi shghasemi requested a review from a team March 31, 2026 18:13
@trunk-io
Copy link
Copy Markdown
Contributor

trunk-io bot commented Mar 31, 2026

Merging to master in this repository is managed by Trunk.

  • To merge this pull request, check the box to the left or comment /trunk merge below.

After your PR is submitted to the merge queue, this comment will be automatically updated with its status. If the PR fails, failure details will also be posted here

@shghasemi shghasemi self-assigned this Mar 31, 2026
@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

@shghasemi shghasemi marked this pull request as ready for review March 31, 2026 19:16
@shghasemi shghasemi requested a review from a team as a code owner March 31, 2026 19:16
@shghasemi shghasemi requested review from golgeek and shailendra-patel and removed request for a team March 31, 2026 19:16
Copy link
Copy Markdown
Collaborator

@fqazi fqazi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fqazi made 1 comment.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on golgeek, shailendra-patel, and shghasemi).


-- commits line 4 at r1:
I think the problem is the dropped table being returned from pg_depend, do we know why thats happening?

Queries against pg_depend can fail with error 42P01 (UndefinedTable)
when a referenced descriptor is concurrently dropped. We can treat
this as a retryable error instead of a fatal workload failure.

Fixes cockroachdb#166377

Release note: None
@shghasemi
Copy link
Copy Markdown
Contributor Author

shghasemi commented Apr 1, 2026

@fqazi made 1 comment.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on golgeek, shailendra-patel, and shghasemi).

-- commits line 4 at r1: I think the problem is the dropped table being returned from pg_depend, do we know why thats happening?

The old commit message was a bit misleading. Here's what I found in the logs. I think a concurrent query on pgCatalogDependTable is failing because the table is being dropped. #166377 (comment)

My guess is that this happens because the internalLookupCtx has no leasing of the descriptors. So in a rare case, we might end up initializing the internalLookupCtx right before table drop completes. Then populating pgCatalogDependTable runs into an error because it sees the old fk reference:

for _, fk := range table.OutboundForeignKeys() {
// Foreign keys don't have a single linked index. Pick the first one
// that matches on the referenced table.
referencedTable, err := tableLookup.getTableByID(fk.GetReferencedTableID())
if err != nil {
return err

Copy link
Copy Markdown
Collaborator

@fqazi fqazi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fqazi made 1 comment.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on shghasemi).


-- commits line 4 at r1:

Previously, shghasemi wrote…

The old commit message was a bit misleading. Here's what I found in the logs. I think a concurrent query on pgCatalogDependTable is failing because the table is being dropped. #166377 (comment)

My guess is that this happens because the internalLookupCtx has no leasing of the descriptors. So in a rare case, we might end up initializing the internalLookupCtx right before table drop completes. Then populating pgCatalogDependTable runs into an error because it sees the old fk reference:

for _, fk := range table.OutboundForeignKeys() {
// Foreign keys don't have a single linked index. Pick the first one
// that matches on the referenced table.
referencedTable, err := tableLookup.getTableByID(fk.GetReferencedTableID())
if err != nil {
return err

The fallback function is what should do the lookup in that case:

lCtx := newInternalLookupCtx(c.OrderedDescriptors(), dbContext, p.Descriptors().GetLookupContextFallbackFn(ctx, p.Txn()))
. So, I'm surprised that didn't save 🤔

@fqazi
Copy link
Copy Markdown
Collaborator

fqazi commented Apr 1, 2026

Oh the backport wasn't in the failing branch: #165274. So, I think we are good to close the issue off, since it should be fixed in the 26.1.2 release.

Copy link
Copy Markdown
Contributor

@spilchen spilchen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@spilchen made 1 comment.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on fqazi and shghasemi).


-- commits line 4 at r1:

Previously, fqazi (Faizan Qazi) wrote…

The fallback function is what should do the lookup in that case:

lCtx := newInternalLookupCtx(c.OrderedDescriptors(), dbContext, p.Descriptors().GetLookupContextFallbackFn(ctx, p.Txn()))
. So, I'm surprised that didn't save 🤔

Is there a check for dropped descriptors anywhere in pg_depend? I remember we had problems with dropped descriptors in pg_constraint and added a specific skip. A new test was added to TestAlterTableDMLInjection that queried that table during a drop column. We may be able to use the same approach for pg_depend to get a more reliable repro.

@shghasemi
Copy link
Copy Markdown
Contributor Author

shghasemi commented Apr 1, 2026

@fqazi You are right! This fix wasn't in the failing branch, and it should resolve the problem.

@spilchen TestAlterTableDMLInjection can't repro this issue. What I understand is that "query" will be run at each stage of the schema change. We need the query to start at an early stage and continue at a later stage to reproduce this issue.

		{
			desc: "drop table referenced by fk while querying pg_depend",
			setup: []string{
				"CREATE TABLE tbl_src (val INT PRIMARY KEY)",
				"CREATE TABLE tbl_ref (ref_val INT REFERENCES tbl_src (val))",
			},
			schemaChange: "DROP TABLE tbl_src CASCADE",
			query:        "SELECT * FROM pg_catalog.pg_depend",
		},

@shghasemi shghasemi closed this Apr 1, 2026
@fqazi
Copy link
Copy Markdown
Collaborator

fqazi commented Apr 1, 2026

@shghasemi I think DML injection may not be sufficient to reproduce because there is a WaitForOneVersion between stages. You probably want a hook that lets you inject statements before that under stress if we want an independent repo.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

roachtest: schemachange/random-load failed

4 participants