Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql: DistSQL processors can use a Leaf txn without collecting its metadata #41222

Open
andreimatei opened this issue Sep 30, 2019 · 8 comments
Open
Labels
A-sql-builtins SQL built-in functions and semantics thereof. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-sql-queries SQL Queries Team
Projects

Comments

@andreimatei
Copy link
Contributor

andreimatei commented Sep 30, 2019

When using a Leaf txn, someone needs to collect its metadata and pass it through to the DistSQL receiver which merges it with the Root's metadata. A handful of processors (e.g. the TableReader) do this by collecting the leaf metadata when draining. However, this was only done for processors directly use the transaction.
But I believe any processor can use the transaction through its "render expressions" and such by invoking built-in functions. These functions can use the transaction through the EvalCtx. And so I think it might be possible for a txn to be used without anyone collecting its metadata.

Perhaps at the moment that's not actually happening because processors share their transaction objects (and also share it with the EvalCtx), but I'm moving towards less sharing. So the thing, if not broken already, is very fragile.

Separately, our collecting of txn metadata is very haphazard. It's possible for multiple processors to collect the same piece of metadata from a shared txn. We should figure out a more principled story.

Jira issue: CRDB-5476

@andreimatei andreimatei added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Sep 30, 2019
@knz knz added the A-sql-builtins SQL built-in functions and semantics thereof. label Oct 1, 2019
@knz
Copy link
Contributor

knz commented Oct 1, 2019

@andreimatei can you explain more what the effects of "not collecting the metadata" can be. Does this mean that we can "lose a read" and fail to populate the ts cache, or something similar? Is it possible to observe transaction anomalies, for example a query to system.descriptors via a built-in function fail to mark the txn to serialize properly with concurrent DDL? Are there other risks?

(I am trying to assess severity for the release)

cc @jordanlewis

@andreimatei
Copy link
Contributor Author

Technically the consequences of not collecting that metadata are that the transaction can succeed in refreshing without actually refreshing the reads performed through the leaf transaction whose metadata was not collected. This can result in write skew - so a failure of serializability (albeit not the worst one conceivable).

Now, I'm not completely sure that problems can manifest themselves today. In order for badness to occur, you need to a) run one of these builtin functions that uses the txn in a leaf txn and b) not collect the metadata for that leaf txn. It might be the case that today we plan everything that uses such builtins entirely on the gateway (I'm not sure). Or even if we don't plan them on the gateway, as long as there is a TableReader using the same leaf as the builtin, then we'll end up collecting the metadata. So I'm not sure that you can actually construct a scenario where bad things happen. That's why I haven't assigned a severity. We shouldn't consider this a release blocker. But I do think the situation is precarious at best.

I've filed the issue so I can reference it from new code in #41102. That PR might have been making things worse by introducing more leaves on the gateway. But now the future of that PR seems uncertain.

@knz
Copy link
Contributor

knz commented Oct 1, 2019

The main thing I'm worried about is things like pg_table_is_visible and others that look at descriptors. These need proper isolation and consistency in relation to DDL occuring in unrelated txns, otherwise ORMs will fail in mysterious and hard-to-debug ways.

@github-actions
Copy link

github-actions bot commented Jun 4, 2021

We have marked this issue as stale because it has been inactive for
18 months. If this issue is still relevant, removing the stale label
or adding a comment will keep it active. Otherwise, we'll close it in
5 days to keep the issue queue tidy. Thank you for your contribution
to CockroachDB!

@knz
Copy link
Contributor

knz commented Jun 4, 2021

still current

@knz knz added this to Triage in SQL Queries via automation Jun 4, 2021
@RaduBerinde RaduBerinde moved this from Triage to Backlog in SQL Queries Jun 8, 2021
@RaduBerinde RaduBerinde removed this from [GENERAL BACKLOG] Bugs/Test Failures in BACKLOG, NO NEW ISSUES: SQL Execution Jun 8, 2021
@jlinder jlinder added the T-sql-queries SQL Queries Team label Jun 16, 2021
@mgartner mgartner moved this from Backlog to New Backlog in SQL Queries Mar 2, 2023
@github-actions
Copy link

We have marked this issue as stale because it has been inactive for
18 months. If this issue is still relevant, removing the stale label
or adding a comment will keep it active. Otherwise, we'll close it in
10 days to keep the issue queue tidy. Thank you for your contribution
to CockroachDB!

@knz
Copy link
Contributor

knz commented Sep 19, 2023

@yuzefovich @rharding6373 @DrewKimball I think we fixed this right?

@yuzefovich
Copy link
Member

No, I think it's still current. In particular, this part

But I believe any processor can use the transaction through its "render expressions" and such by invoking built-in functions. These functions can use the transaction through the EvalCtx. And so I think it might be possible for a txn to be used without anyone collecting its metadata.

requires further investigation to see whether there are cases like this.

I believe that

Perhaps at the moment that's not actually happening because processors share their transaction objects (and also share it with the EvalCtx), but I'm moving towards less sharing. So the thing, if not broken already, is very fragile.

saves us in vast majority of cases. Namely, we generally use a single Txn object for the whole flow (modulo the streamer component), and most likely there will be a processor in that flow that will collect the txn's metadata even if the txn is used by another processor. However, it doesn't seem impossible to have a plan where that doesn't happen.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-sql-builtins SQL built-in functions and semantics thereof. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-sql-queries SQL Queries Team
Projects
Status: Backlog
SQL Queries
New Backlog
Development

No branches or pull requests

4 participants