Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql: enable referring to columns symbolically #10729

Merged
merged 1 commit into from Nov 22, 2016
Merged

Conversation

knz
Copy link
Contributor

@knz knz commented Nov 16, 2016

This patch introduces a new semantic concept that is useful for
pretty-printing the AST back to valid SQL and again back to AST; this
will be useful during development work on optimizations. The
motivation is detailed below.

A side effect is that it also introduces a user-visible feature.
Because it is visible to user it will need to be documented, but some
caveats apply. More details are given below too.

Motivation

During optimizations, column references (IndexedVar) can move around
and be substituted in ways that may lose track of the original name
used in the query to name the column. More specifically, we can move a
reference to a position where there is no name known for the column
yet. For example we want to go from:

SELECT * FROM (SELECT * FROM (VALUES (1), (2))) AS a(x) WHERE x > 2

to:

SELECT * FROM (SELECT * FROM ... WHERE ??? > 2) AS a(x)

Now although we can move the IndexedVar to the position in the ???
it is not so clear which name to give this indexed var if we
pretty-print the resulting tree -- in this example the name "x" does
not exist at that position.

This patch makes it possible to side-step this issue, by introducing a
new SQL syntax that refers to columns at the "current level" by ordinal
position: "@n" where N is an integer literal.

With this patch the example above can then be reliably serialized to:

SELECT * FROM (SELECT * FROM ... WHERE @1 > 2) AS a(x)

User-visible changes

SQL already has a traditional, limited way to do refer to column
numerically in some syntactic positions, specifically ORDER BY and
GROUP BY. For example, in SELECT a + b FROM foo ORDER BY 1, "1"
refers to the first value rendered, i.e. a + b.

These are called "column ordinals"; they are supported in some SQL
engines, sometimes for backward compatibility, sometimes because of
historical reasons.

The feature added in this patch complements and extends this mechanism.

However, the use of column ordinals by client applications is also
customarily strongly discouraged.
The use of the new column ordinal
references added in this patch should be equally discouraged. The
reason why is that they are not robust against schema updates. Say, a
table is initially created with columns a, b, c in this order. Then
a query is designed to refer to column a by position, with number 1.
Then later, independently a DB admin changes the schema and removes
column a, and adds a new version of column a with e.g. a different
type. Now the schema is b, c, a, and all the queries that expect to
refer to a by position 1 are now broken. The new feature in this
patch is also subject to this limitation. It is intended primarily for
use during development when the schema updates are tightly controlled
by the operator manipulating the query.

Meanwhile, since the feature is visible to users it should still be
(minimally) documented. The salient aspects that should be
communicated are:

  1. don't use this feature in client applications unless you 100%
    understand the limitation described above.

  2. the @ notation refers to a column number in the data source, not
    in the rendered columns
    . The data source is the thing named after
    FROM. For example, suppose a table foo has columns a and b
    in this order. Then the query

    SELECT b, a FROM foo WHERE @2 = 123

    is equivalent to SELECT b, a FROM foo WHERE b = 123.

  3. point 2 above means that there is a difference between the new
    column ordinal references and the traditional SQL ordinals, which
    can be illustrated as follows. With SQL ordinals, the query

    SELECT b, a FROM foo ORDER BY 1

    sorts with column b, because this is the first value rendered
    (columns after SELECT); whereas

    SELECT b, a FROM foo ORDER BY @1

    sorts with column a, because this is the first column in the data
    source (columns after FROM).

This feature is needed for #10632.

cc @nvanbenschoten.


This change is Reviewable

@knz
Copy link
Contributor Author

knz commented Nov 16, 2016

I just realized I forgot to add the tests. Please comment on the general direction, I'll add the tests in parallel.

@RaduBerinde
Copy link
Member

Nice, I like this idea! Seems like something we should use for distsql (I'm currently using $0, $1, ..). Though I would still need to have my own visitor which does just the OrdinalReference->IndexedVar conversion.


Review status: 0 of 7 files reviewed at latest revision, all discussions resolved, some commit checks failed.


Comments from Reviewable

@nvanbenschoten
Copy link
Member

LGTM so far, but as you mentioned this needs lots of testing.

Also, this is introducing new syntax into SQL, so I'd like to think through the syntax as a user-facing feature here as well. A full RFC isn't necessary, but do you mind writing up a brief description of these new Ordinal References (something like what you'd find in our docs). I get the motivation for their introduction in regard to #10632, but talking through their introduction as a user-facing feature would help justify this approach over others.


Reviewed 7 of 7 files at r1.
Review status: all files reviewed at latest revision, 2 unresolved discussions, some commit checks failed.


pkg/sql/parser/expr.go, line 548 at r1 (raw file):

func (DefaultVal) ResolvedType() Type { return nil }

var _ VariableExpr = &Placeholder{}

Looks like this assertion got separated from its declaration.


pkg/sql/parser/sql.y, line 32 at r1 (raw file):

const maxUint = ^uint(0)
const maxInt = int(^uint(0) >> 1)

s/^uint(0)/maxUint/?


Comments from Reviewable

@RaduBerinde
Copy link
Member

RaduBerinde commented Nov 17, 2016

Didn't we conclude that it makes more sense to make all node types pointers? E.g. OrdinalReference.Walk returns a new instance.

Anyway, I was thinking that it would be great if we didn't need to modify trees to convert between OrdinalReferences and IndexedVars. Couldn't we simply use IndexedVar for this? We could make an IndexedVar with nil container act as an OrdinalReference (i.e. format as @(idx+1), and disallow evaluation). We would still to populate the container and check the indices, but we wouldn't need to change the tree itself.

@knz
Copy link
Contributor Author

knz commented Nov 18, 2016

@RaduBerinde thanks for your suggestions.
Indeed a pointer type was needed, that is fixed.

Regarding using IndexedVar I will try it and see if it fits. I like the idea!

@nvanbenschoten: I will add some more text in the commit message to explain with examples.


Review status: all files reviewed at latest revision, 2 unresolved discussions, some commit checks failed.


pkg/sql/parser/expr.go, line 548 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Looks like this assertion got separated from its declaration.

Done.

pkg/sql/parser/sql.y, line 32 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

s/^uint(0)/maxUint/?

Yes that's what I meant. Thanks.

Comments from Reviewable

@knz knz force-pushed the col-ref branch 4 times, most recently from b98aa48 to 1be19c9 Compare November 20, 2016 19:24
@knz knz added the docs-todo label Nov 20, 2016
@knz
Copy link
Contributor Author

knz commented Nov 20, 2016

@RaduBerinde I have simplified to use only IndexedVar as you suggested. PTAL.

@nvanbenschoten I have extended the commit message (and the header of this PR) with your suggestions. PTAL.

@knz
Copy link
Contributor Author

knz commented Nov 20, 2016

Note that there are two commits now but the first one will disappear after #10799 gets merged.

@RaduBerinde
Copy link
Member

Awesome!! :lgtm_strong:


Review status: 0 of 22 files reviewed at latest revision, 2 unresolved discussions, some commit checks failed.


Comments from Reviewable

@RaduBerinde
Copy link
Member

Review status: 0 of 23 files reviewed at latest revision, 4 unresolved discussions, some commit checks failed.


pkg/sql/parser/indexed_vars.go, line 93 at r3 (raw file):

}

// IndexedVarHelper is a structure that helps with initialization of IndexVars.

IndexedVars


pkg/sql/parser/indexed_vars.go, line 101 at r3 (raw file):

// BindIfUnbound attaches an IndexedVar to an existing container.
// This is needed for standalone column ordinals created during parsing.
func (h *IndexedVarHelper) BindIfUnbound(ivar *IndexedVar) error {

It feels a bit off that we need to allocate a set of IndexedVars (as part of IndexedVarHelper) just to bind a container to existing IndexedVars. I don't have a better suggestion though (given that we need to support IndexedVarUsed)..


Comments from Reviewable

@nvanbenschoten
Copy link
Member

:lgtm: I like this approach better than adding a new Expr type.


Review status: 0 of 23 files reviewed at latest revision, 3 unresolved discussions, some commit checks failed.


pkg/sql/parser/expr.go, line 548 at r1 (raw file):

Previously, knz (kena) wrote… > Done.
The move isn't necessary now that you removed the new expression type. I'd revert the change.

Comments from Reviewable

@knz knz force-pushed the col-ref branch 2 times, most recently from 24399fc to 0929a77 Compare November 21, 2016 08:09
@knz
Copy link
Contributor Author

knz commented Nov 21, 2016

TFYRs!


Review status: 0 of 23 files reviewed at latest revision, 3 unresolved discussions.


pkg/sql/parser/expr.go, line 548 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote… > The move isn't necessary now that you removed the new expression type. I'd revert the change.
Done.

pkg/sql/parser/indexed_vars.go, line 93 at r3 (raw file):

Previously, RaduBerinde wrote… > IndexedVars
Done.

pkg/sql/parser/indexed_vars.go, line 101 at r3 (raw file):

Previously, RaduBerinde wrote… > It feels a bit off that we need to allocate a set of `IndexedVar`s (as part of `IndexedVarHelper`) just to bind a container to existing `IndexedVars`. I don't have a better suggestion though (given that we need to support IndexedVarUsed)..
Do not forget that the helper is also supporting the indexedvars created by name resolution (`a` -> IndexedVar), and this will be the more common case. I do not feel anything off here.

Comments from Reviewable

@knz knz force-pushed the col-ref branch 2 times, most recently from b147b8b to ee6d44a Compare November 21, 2016 21:36
This patch introduces a new semantic concept that is useful for
pretty-printing the AST back to valid SQL and again back to AST; this
will be useful during development work on optimizations. The
motivation is detailed below.

A side effect is that it also introduces a user-visible feature.
Because it is visible to user it will need to be documented, but some
caveats apply. More details are given below too.

**Motivation**

During optimizations, column references (IndexedVar) can move around
and be substituted in ways that may lose track of the original name
used in the query to name the column. More specifically, we can move a
reference to a position where there is no name known for the column
*yet*. For example we want to go from:

  `SELECT * FROM (SELECT * FROM (VALUES (1), (2))) AS a(x) WHERE x > 2`

to:

  `SELECT * FROM (SELECT * FROM ... WHERE ??? > 2) AS a(x)`

Now although we can move the IndexedVar to the position in the `???`
it is not so clear which *name* to give this indexed var if we
pretty-print the resulting tree -- in this example the name "x" does
not exist at that position.

This patch makes it possible to side-step this issue, by introducing a
new SQL syntax that refers to columns at the "current level" by ordinal
position: "@n" where N is an integer literal.

With this patch the example above can then be reliably serialized to:

  `SELECT * FROM (SELECT * FROM ... WHERE @1 > 2) AS a(x)`

**User-visible changes**

SQL already has a traditional, limited way to do refer to column
numerically in some syntactic positions, specifically ORDER BY and
GROUP BY.  For example, in `SELECT a + b FROM foo ORDER BY 1`, "1"
refers to the first value rendered, i.e. `a + b`.

These are called "column ordinals"; they are supported in *some* SQL
engines, sometimes for backward compatibility, sometimes because of
historical reasons.

The feature added in this patch complements and extends this mechanism.

**However, the use of column ordinals by client applications is also
customarily strongly discouraged.** The use of the new column ordinal
references added in this patch should be equally discouraged. The
reason why is that they are not robust against schema updates. Say, a
table is initially created with columns `a, b, c` in this order. Then
a query is designed to refer to column `a` by position, with number
1. Then later, independently a DB admin changes the schema and removes
column `a`, and adds a new version of column `a` with e.g. a different
type. Now the schema is `b, c, a`, and all the queries that expect to
refer to `a` by position 1 are now broken. The new feature in this
patch is also subject to this limitation. It is intended primarily for
use during development when the schema updates are tightly controlled
by the operator manipulating the query.

Meanwhile, since the feature is visible to users it should still be
(minimally) documented. The salient aspects that should be
communicated are:

1) don't use this feature in client applications unless you 100%
   understand the limitation described above.

2) **the @ notation refers to a column number in the data source, not
   in the rendered columns**. The data source is the thing named after
   FROM.  For example, suppose a table `foo` has columns `a` and `b`
   in this order. Then the query

     `SELECT b, a FROM foo WHERE @2 = 123`

   is equivalent to `SELECT b, a FROM foo WHERE b = 123`.

3) point 2 above means that there is a difference between the new
   column ordinal references and the traditional SQL ordinals, which
   can be illustrated as follows. With SQL ordinals, the query

     `SELECT b, a FROM foo ORDER BY 1`

   sorts with column `b`, because this is the first value rendered
   (columns after SELECT); whereas

     `SELECT b, a FROM foo ORDER BY @1`

   sorts with column `a`, because this is the first column in the data
   source (columns after FROM).
@knz knz merged commit 172f1d8 into cockroachdb:master Nov 22, 2016
@knz knz deleted the col-ref branch November 22, 2016 00:31
mgartner added a commit to mgartner/cockroach that referenced this pull request Dec 16, 2022
This commit disallows ordinal column references by default. Any
statements using them will result in an error. The session setting
`allow_ordinal_column_references` can be set to true to revert to
previous behavior.

Note that numeric tuple indexing (e.g., `SELECT ((1,2,3)).@2`) is still
allowed.

This deprecation is motivated by the inconsistent and surprising
behavior that ordinal column references provide. Ordinal column
references (e.g., `SELECT @1 FROM t`) are a vestigial and undocumented
feature originally motivated to aid the heuristic planner. The PR that
added them, cockroachdb#10729, discourages their use by users because "they are not
robust against schema changes". It also points out a subtle difference
between ordinal column references and SQL standard ordinals that could
confuse users. As an example, the following statements are not
equivalent:

```sql
SELECT @2 FROM t ORDER BY @1;

SELECT @2 FROM t ORDER BY 1;
```

In the current implementation, an ordinal column reference `@n` resolves
to the n-th column in the current scope's slice of columns. This has
several implications:

1. Ordinal columns can refer to any column, not just columns of data
   sources in `FROM` clauses, as was originally intended. This makes it
   hard to reason about the resolution of an ordinal column reference.
   For example, it's somewhat surprising that the result of the `SELECT`
   below is not `(10, 1)`.

```
CREATE TABLE t (a INT, b INT);
INSERT INTO t VALUES (1, 10);

SELECT @2, @1 FROM (SELECT @2, @1 FROM t);
  ?column? | ?column?
-----------+-----------
         1 |       10
(1 row)
```

2. The ordering of columns in the scope's slice is not guaranteed to be
   consistent, so any reasonable ordinal column resolution occurs
   more-or-less by chance. As an example of unexpected behavior,
   consider:

```
CREATE TABLE t (a INT PRIMARY KEY, INDEX ((a+10)));
INSERT INTO t(a) VALUES (1);
ALTER TABLE t ADD COLUMN b INT;

SELECT @1, @2 FROM t;
  ?column? | ?column?
-----------+-----------
         1 |       11
(1 row)
```

Most users would expect the result of the `SELECT` statement to be
`(1, NULL)` because `@1` should resolve to `a` and `@2` should resolve
to `b`. Instead, the ordinal column reference `@2` circumvents logic
that keeps inaccessible columns from being referenced, and resolves to
the virtual computed column backing the secondary index.

Epic: None

Release note (sql change): Ordinal column references (e.g.,
`SELECT @1, @2 FROM t`) are now deprecated. By default, statements using
this syntax will now result in an error. They can be allowed using the
session setting `SET allow_ordinal_column_references=true`. Support for
ordinal column references is scheduled to be removed in version 23.2.
craig bot pushed a commit that referenced this pull request Dec 20, 2022
93754: sql: deprecate ordinal column references r=mgartner a=mgartner

#### sql: deprecate ordinal column references

This commit disallows ordinal column references by default. Any
statements using them will result in an error. The session setting
`allow_ordinal_column_references` can be set to true to revert to
previous behavior.

Note that numeric tuple indexing (e.g., `SELECT ((1,2,3)).`@2`)` is still
allowed.

This deprecation is motivated by the inconsistent and surprising
behavior that ordinal column references provide. Ordinal column
references (e.g., `SELECT `@1` FROM t`) are a vestigial and undocumented
feature originally motivated to aid the heuristic planner. The PR that
added them, #10729, discourages their use by users because "they are not
robust against schema changes". It also points out a subtle difference
between ordinal column references and SQL standard ordinals that could
confuse users. As an example, the following statements are not
equivalent:

```sql
SELECT `@2` FROM t ORDER BY `@1;`

SELECT `@2` FROM t ORDER BY 1;
```

In the current implementation, an ordinal column reference ``@n`` resolves
to the n-th column in the current scope's slice of columns. This has
several implications:

1. Ordinal columns can refer to any column, not just columns of data
   sources in `FROM` clauses, as was originally intended. This makes it
   hard to reason about the resolution of an ordinal column reference.
   For example, it's somewhat surprising that the result of the `SELECT`
   below is not `(10, 1)`.

```
CREATE TABLE t (a INT, b INT);
INSERT INTO t VALUES (1, 10);

SELECT `@2,` `@1` FROM (SELECT `@2,` `@1` FROM t);
  ?column? | ?column?
-----------+-----------
         1 |       10
(1 row)
```

2. The ordering of columns in the scope's slice is not guaranteed to be
   consistent, so any reasonable ordinal column resolution occurs
   more-or-less by chance. As an example of unexpected behavior,
   consider:

```
CREATE TABLE t (a INT PRIMARY KEY, INDEX ((a+10)));
INSERT INTO t(a) VALUES (1);
ALTER TABLE t ADD COLUMN b INT;

SELECT `@1,` `@2` FROM t;
  ?column? | ?column?
-----------+-----------
         1 |       11
(1 row)
```

Most users would expect the result of the `SELECT` statement to be
`(1, NULL)` because ``@1`` should resolve to `a` and ``@2`` should resolve
to `b`. Instead, the ordinal column reference ``@2`` circumvents logic
that keeps inaccessible columns from being referenced, and resolves to
the virtual computed column backing the secondary index.

Epic: None

Release note (sql change): Ordinal column references (e.g.,
`SELECT `@1,` `@2` FROM t`) are now deprecated. By default, statements using
this syntax will now result in an error. They can be allowed using the
session setting `SET allow_ordinal_column_references=true`. Support for
ordinal column references is scheduled to be removed in version 23.2.


Co-authored-by: Marcus Gartner <marcus@cockroachlabs.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants