Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql: hazardous use of descriptors in BIND outside of explicit transaction #64140

Closed
ajwerner opened this issue Apr 23, 2021 · 12 comments
Closed
Assignees
Labels
A-sql-executor SQL txn logic C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions)

Comments

@ajwerner
Copy link
Contributor

ajwerner commented Apr 23, 2021

Describe the problem

In SQL you can bind arguments to a prepared statement outside of a transaction. This generally happens in the binary wire protocol if I understand correctly. The client will prepare a statement and then it will bind the arguments to an unnamed portal and then it will execute the portal. The prepare is safe and fine because we'll re-acquire leases inside the implicit transaction when checking if the plan is stale. The bind, however, has a hazardous and unsafe properties.

When the type of argument to a bind is either a user-defined type or refers to an OID type by name, we resolve the type and value during while doing the BIND. This is problematic because it will use the transaction currently sitting on the connExecutor. This is generally the previously committed transaction (I suspect there may be cases where this is nil but I don't know). This happens here (with the regclass/regtype resolution happening underneath tree.ParseDOid):

typ, ok := types.OidToType[t]
if !ok {
var err error
typ, err = ex.planner.ResolveTypeByOID(ctx, t)
if err != nil {
return nil, err
}
}
d, err := pgwirebase.DecodeDatum(
ex.planner.EvalContext(),
typ,
qArgFormatCodes[i],
arg,
)

At that point we'll attach a deadline due to leasing to an already committed transaction. If we attach the following sane invariant to the deadline code we get failures right and left:

--- a/pkg/kv/txn.go
+++ b/pkg/kv/txn.go
@@ -695,6 +695,9 @@ func (txn *Txn) UpdateDeadlineMaybe(ctx context.Context, deadline hlc.Timestamp)
 
        txn.mu.Lock()
        defer txn.mu.Unlock()
+       if txn.status().IsFinalized() {
+               panic(errors.WithContextTags(errors.AssertionFailedf("UpdateDeadlineMaybe() called on finalized txn"), ctx))
+       }

Worse yet are cases where we cannot use a cached, leased descriptor, like a number of system tables. For example, the following test fails rather surprisingly with this error:

func TestRegclassBind(t *testing.T) {
	defer leaktest.AfterTest(t)()
	defer log.Scope(t).Close(t)

	ctx := context.Background()
	tc := testcluster.StartTestCluster(t, 1, base.TestClusterArgs{})
	defer tc.Stopper().Stop(ctx)
	_, err := tc.ServerConn(0).Exec("SELECT $1::REGCLASS::INT", "system.descriptor")
	require.NoError(t, err)
}
    system_table_test.go:48: 
        	Error Trace:	system_table_test.go:48
        	Error:      	Received unexpected error:
        	            	pq: error in argument for $1: TransactionStatusError: client already committed or rolled back the transaction. Trying to execute: 1 Get (REASON_UNKNOWN)
        	Test:       	TestRegclassBind

Possible Solution

It seems to me like we should move the conn executor into an open state when binding (or when creating a portal?). It feels weird to me to have a bound portal outside of a transaction. I may not know enough to know though. We do this already for example when processing an ExecCmd.

Additional context

There are some edge cases where I think we may miss renames or interpret arguments as the wrong type. It's more just a general soundness problem.

Jira issue: CRDB-6916

@ajwerner ajwerner added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-sql-executor SQL txn logic labels Apr 23, 2021
@ajwerner ajwerner added this to Triage in SQL Queries via automation Apr 23, 2021
@rytaft
Copy link
Collaborator

rytaft commented Apr 27, 2021

@ajwerner I'm assigning you since it looks like you have a PR in progress -- feel free to let us know if you'd like someone else to take a look

@rytaft rytaft moved this from Triage to Reactive in SQL Queries Apr 27, 2021
@ajwerner
Copy link
Contributor Author

@rytaft I took a stab on Friday. I don't think that approach has legs. It's a bit of a pickle I don't see an easy way out of. Fortunately it seems to mostly cause problems in edge cases involving regclass or user-defined types concurrent with schema changes. I'll place it back in a triage state if my approach doesn't pan out.

@rytaft rytaft moved this from Reactive to 21.2 Milestone A in SQL Queries Apr 28, 2021
@fqazi
Copy link
Collaborator

fqazi commented May 4, 2021

There are scenarios where the transaction can be nil, which is also problematic @ajwerner. I'm wondering if we should even hold onto leases in that case? It can lead to problems with the read timestamp related failures inside UpdateDeadline

@yuzefovich yuzefovich moved this from 21.2 Milestone A to 21.2 May Milestone in SQL Queries May 11, 2021
@rafiss
Copy link
Collaborator

rafiss commented May 12, 2021

for the case of resolving a name into an OID/regclass, perhaps we could special-case the placeholder typing logic as a follow-up to this PR #60949

to guide our solution for that and the other issues, i will throw in a plug for our datadriven pgwire tests. you can write the exact sequence of pgwire commands you want and run them against either cockroachdb or postgres.

e.g. look at pkg/sql/pgwire/testdata/pgtest/portals and try running make test PKG=./pkg/sql/pgwire TESTS=TestPGTest TESTFLAGS="-addr=localhost:5432 -user=rafiss" to run the test against postgres

@ajwerner
Copy link
Contributor Author

[written in parallel with the above comment]

Discussed offline with @rafiss. In the short term, the main thing we want to fix is the panic when you PREPARE/BIND something referencing UDTs or regclasses as the first thing you do on a connection. We can hack that by literally just putting a committed txn in place if txn is nil.

The next step is likely to revert part of #60949 to not resolve regclass or regtype values to their numeric value. If we did that, then I think we'd fix nearly all of the correctness issues. The remaining correctness issue would be to not convert the enum values from their logical to their physical representation until execution. I think that'd solve ~all of the correctness problems.

It still leaves us in a weird place where we can hit odd type mismatch errors when swapping type names (I don't care all that much about that case). The more glaring issue it leaves us with is a bizarre use of the descs.Collection while there is no open transaction. As we move forward, the descs.Collection is gaining a tighter coupling to the transaction. We'll need to negotiate that somehow.

I'll pick up the panic fix and leave the rest for follow-up.

@rafiss
Copy link
Collaborator

rafiss commented May 12, 2021

i was curious myself about what postgres did for regclass placeholders so i tried it out. it turns out that postgres will handle SELECT $1::REGCLASS::INT by typing $1 as a regclass (oid=2205). this was the test i used

send
Query {"String": "CREATE TABLE t (a INT PRIMARY KEY)"}
----

until
ReadyForQuery
----
{"Type":"CommandComplete","CommandTag":"CREATE TABLE"}
{"Type":"ReadyForQuery","TxStatus":"I"}


# 83 = ASCII 'S' for Statement
# 84 = ASCII 'T'
# ParameterFormatCodes = [0] for text format
send
Parse {"Name": "s7", "Query": "SELECT $1::REGCLASS::INT8"}
Describe {"ObjectType": 83, "Name": "s7"}
Bind {"DestinationPortal": "p7", "PreparedStatement": "s7", "ParameterFormatCodes": [0], "Parameters": [[84]]}
Execute {"Portal": "p7"}
Sync
----

until
ReadyForQuery
----
{"Type":"ParseComplete"}
{"Type":"ParameterDescription","ParameterOIDs":[2205]}
{"Type":"RowDescription","Fields":[{"Name":"int8","TableOID":0,"TableAttributeNumber":0,"DataTypeOID":23,"DataTypeSize":8,"TypeModifier":-1,"Format":0}]}
{"Type":"BindComplete"}
{"Type":"DataRow","Values":[{"text":"26494"}]}
{"Type":"CommandComplete","CommandTag":"SELECT 1"}
{"Type":"ReadyForQuery","TxStatus":"I"}

so i think reverting that part of #60949 could just bring back the incompatibility that caused us to write it in the first place -- a client might get confused that the placeholder is a different type than it expects.

the other thing i noticed is that we should try to do better error handling for cases where the regclass resolution fails. e.g. if we try to bind a non-existent table name for the placeholder, cockroachdb returns error in argument for $1: TransactionStatusError: client already committed or rolled back the transaction. Trying to execute: 1 Get (REASON_UNKNOWN) but postgres returns relation "blah" does not exist

anyway, for the fixes for regclass placeholders, feel free to hand off to sql-experience.

@ajwerner
Copy link
Contributor Author

the other thing i noticed is that we should try to do better error handling for cases where the regclass resolution fails. e.g. if we try to bind a non-existent table name for the placeholder, cockroachdb returns error in argument for $1: TransactionStatusError: client already committed or rolled back the transaction. Trying to execute: 1 Get (REASON_UNKNOWN) but postgres returns relation "blah" does not exist

Heh this is exactly the issue! If the descriptor doesn't exist for leasing we attempt to read it from the store using the bogus finalized transaction :). If we fixed the root cause here, we'd return the right error.

@ajwerner
Copy link
Contributor Author

so i think reverting that part of #60949 could just bring back the incompatibility that caused us to write it in the first place -- a client might get confused that the placeholder is a different type than it expects.

I think what I'd want is to change is that where we call DecodeDatum today we always return a a datum which means that for regclass, we have no choice but to return the integer oid. What I'd prefer is if we could instead return a TypedExpr which is the cast of the text to the right type. Ultimately we need a TypedExpr and not a Datum. That way we could defer the cast until execution.

The above approach doesn't solve the type problems.

@rafiss
Copy link
Collaborator

rafiss commented May 13, 2021

That makes a lot of sense in retrospect!

ajwerner added a commit to ajwerner/cockroach that referenced this issue May 13, 2021
This further paper's over the rough edges of not resetting the planner
state to keep it in sync with the connExecutor state. The test used
to panic before this change.

Touches cockroachdb#64140.

Release note (bug fix): Fixed a bug which could cause a panic when
running a EXECUTE of a previously PREPARE'd statement with a REGCLASS,
REGTYPE parameter or a user-defined type argument after running BEGIN
AS OF SYSTEM TIME with an invalid timestamp.
craig bot pushed a commit that referenced this issue May 13, 2021
65083: kvserver: speed up TestRollbackSyncRangedIntentResolution r=tbg a=erikgrinaker

Release note: None

65103: schemaexpr: remove expr_filter.go r=mgartner a=mgartner

This commit removes the `pkg/sql/catalog/schemaexpr/expr_filter.go`
file. The `schemaexpr` package is intended to contain logic for dealing
with expressions defined in a schema, such as check constraints,
computed columns, and partial index predicates. The two exported
functions in the file, `RemapIVarsInTypedExpr` and `RunFilter` did not
fit this theme. They have been moved elsewhere.

Release note: None

65108: sql: minor fixes to fix panics related to type resolution and bad planner usage r=otan a=ajwerner

See individual commits. The first commit matters but does not leave one with a good feeling. The second one is more copacetic.

Relates to #64140. 

Fixes #64975.

Release note (bug fix): Fixed a bug which could cause a panic when
running a EXECUTE of a previously PREPARE'd statement with a REGCLASS,
REGTYPE parameter or a user-defined type argument after running BEGIN
AS OF SYSTEM TIME with an invalid timestamp.

Release note (bug fix): Fixed a bug which could cause a panic when issuing
a query referencing a user-defined type as a placeholder as the first operation
on a new connection.

65130: cli/flags.go: fix typo in comment r=rauchenstein a=knz

Thanks to @stevendanna for spotting this.

Co-authored-by: Erik Grinaker <grinaker@cockroachlabs.com>
Co-authored-by: Marcus Gartner <marcus@cockroachlabs.com>
Co-authored-by: Andrew Werner <awerner32@gmail.com>
Co-authored-by: Raphael 'kena' Poss <knz@thaumogen.net>
ajwerner added a commit to ajwerner/cockroach that referenced this issue May 13, 2021
This further paper's over the rough edges of not resetting the planner
state to keep it in sync with the connExecutor state. The test used
to panic before this change.

Touches cockroachdb#64140.

Release note (bug fix): Fixed a bug which could cause a panic when
running a EXECUTE of a previously PREPARE'd statement with a REGCLASS,
REGTYPE parameter or a user-defined type argument after running BEGIN
AS OF SYSTEM TIME with an invalid timestamp.
ajwerner added a commit to ajwerner/cockroach that referenced this issue May 14, 2021
This further paper's over the rough edges of not resetting the planner
state to keep it in sync with the connExecutor state. The test used
to panic before this change.

Touches cockroachdb#64140.

Release note (bug fix): Fixed a bug which could cause a panic when
running a EXECUTE of a previously PREPARE'd statement with a REGCLASS,
REGTYPE parameter or a user-defined type argument after running BEGIN
AS OF SYSTEM TIME with an invalid timestamp.
ajwerner added a commit to ajwerner/cockroach that referenced this issue May 24, 2021
The code which deals with resolving placeholder types and OIDs is totally
bogus when it comes to the transaction and leasing life-cycle. This has been
discussed at some length in cockroachdb#64140. The enum test in cockroachdb#64725 surfaced another
problematic case whereby the bad bind is combined with an AS OF SYSTEM TIME.

We deal with this case by dropping any leases which might have been acquired
by the bind operation. There are some edge cases where the types may mismatch,
the values may be bogus, or an OID may be resolved to the wrong value. However,
the first few of these *should* be detected and *should* be hard to hit. We're
going to need to do something about all of this in 21.2 anyway, howver, this
fix here is good for backport to 21.1 and 20.2.

Also, this change converted the fatal to an assertion failure error.

Fixes cockroachdb#65136.
Relates to cockroachdb#64140.

Release note (bug fix): Fixed a bug whereby using an enum value as a
placeholder in an AS OF SYSTEM TIME query which precedes a recent change
to that enum could result in a fatal error.
ajwerner added a commit to ajwerner/cockroach that referenced this issue May 24, 2021
The code which deals with resolving placeholder types and OIDs is totally
bogus when it comes to the transaction and leasing life-cycle. This has been
discussed at some length in cockroachdb#64140. The enum test in cockroachdb#64725 surfaced another
problematic case whereby the bad bind is combined with an AS OF SYSTEM TIME.

We deal with this case by dropping any leases which might have been acquired
by the bind operation. There are some edge cases where the types may mismatch,
the values may be bogus, or an OID may be resolved to the wrong value. However,
the first few of these *should* be detected and *should* be hard to hit. We're
going to need to do something about all of this in 21.2 anyway, howver, this
fix here is good for backport to 21.1 and 20.2.

Also, this change converted the fatal to an assertion failure error.

Fixes cockroachdb#65136.
Relates to cockroachdb#64140.

Release note (bug fix): Fixed a bug whereby using an enum value as a
placeholder in an AS OF SYSTEM TIME query which precedes a recent change
to that enum could result in a fatal error.
craig bot pushed a commit that referenced this issue May 24, 2021
65620: sql: prevent assertion due to bad use of leases in bind r=andreimatei a=ajwerner

The code which deals with resolving placeholder types and OIDs is totally
bogus when it comes to the transaction and leasing life-cycle. This has been
discussed at some length in #64140. The enum test in #64725 surfaced another
problematic case whereby the bad bind is combined with an AS OF SYSTEM TIME.

We deal with this case by dropping any leases which might have been acquired
by the bind operation. There are some edge cases where the types may mismatch,
the values may be bogus, or an OID may be resolved to the wrong value. However,
the first few of these *should* be detected and *should* be hard to hit. We're
going to need to do something about all of this in 21.2 anyway, howver, this
fix here is good for backport to 21.1 and 20.2.

Also, this change converted the fatal to an assertion failure error.

Fixes #65136.
Relates to #64140.

Release note (bug fix): Fixed a bug whereby using an enum value as a
placeholder in an AS OF SYSTEM TIME query which precedes a recent change
to that enum could result in a fatal error.

Co-authored-by: Andrew Werner <awerner32@gmail.com>
@rytaft
Copy link
Collaborator

rytaft commented Jun 8, 2021

Moving this over to SQL Experience since it seems like the remaining issues will fall to that team. Send it back if you need any help from SQL Queries - thanks!

@rytaft rytaft added this to Triage in SQL Sessions - Deprecated via automation Jun 8, 2021
@rytaft rytaft removed this from 21.2 May Milestone in SQL Queries Jun 8, 2021
@jlinder jlinder added the T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) label Jun 16, 2021
@ajwerner
Copy link
Contributor Author

@jameswsj10 your work to avoid the fallback reading from store by using export requests to find historical versions should be able to address the weirdness in #64140 (comment). It's just another band-aid over a bigger problem, but it's a good one.

@rafiss rafiss moved this from Triage to Shorter term backlog in SQL Sessions - Deprecated Sep 20, 2021
craig bot pushed a commit that referenced this issue Oct 22, 2021
71632: sql: make sure pgwire bind always happens in a transaction r=otan a=rafiss

fixes #70378
and maybe #64140

The approach of using a transaction matches what we do for pgwire Parse
messages already. This is important to make sure that user-defined types
are leased correctly.

This also updated the SQL EXECUTE command to resolve user-defined types
so that it gets the latest changes.

Release note (bug fix): Adding new values to a user-defined enum type
will previously would cause a prepared statement using that type to not
work. This now works as expected.

71836: clusterversion: introduce 22.1 development versions r=celiala a=celiala

Fix: #69828

Looking at past Version values for `Start{XX_X}`, I think the `Version` value for `Key: Start22_1` should instead be `Major: 21, Minor: 2, ...`.


Start21_1 (from #70268):
```
// v21.1 versions. Internal versions defined here-on-forth must be even.
{
  Key:     Start21_1,
  Version: roachpb.Version{Major: 20, Minor: 2, Internal: 2},
},
```

Start 21_2:
```
{
  Key:     Start21_2,
  Version: roachpb.Version{Major: 21, Minor: 1, Internal: 1102},
},
```

Release justification: Non-production code change.
Release note: None

71856: bazel: skip recompilation when using dev+bazel r=irfansharif a=irfansharif

Fixes #71835. When switching between using `dev` and `bazel` raw, I kept
seeing our C++ protobuf dependency getting recompiled (slowly).
It appears that the bazel build for protobuf has a dependency on $PATH
(see bazelbuild/intellij#1169 and bazelbuild/bazel#7095). Specifying
[`--incompatible_strict_action_env`](https://docs.bazel.build/versions/main/command-line-reference.html#flag--incompatible_strict_action_env) pins PATH and avoids the build cache
thrashing we were seeing before.

Release note: None

Co-authored-by: Rafi Shamim <rafi@cockroachlabs.com>
Co-authored-by: Celia La <celiala456@gmail.com>
Co-authored-by: irfan sharif <irfanmahmoudsharif@gmail.com>
@rafiss
Copy link
Collaborator

rafiss commented Mar 15, 2022

I think #76792 completely resolves this, but please reopen if there are still concerns.

@rafiss rafiss closed this as completed Mar 15, 2022
SQL Sessions - Deprecated automation moved this from Shorter term backlog to Done Mar 15, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-sql-executor SQL txn logic C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions)
Projects
No open projects
Development

Successfully merging a pull request may close this issue.

5 participants