Commit
sql: enable referring to columns symbolically
This patch introduces a new semantic concept that is useful for pretty-printing the AST back to valid SQL and parsing that SQL back into an AST; this will be useful during development work on optimizations. The motivation is detailed below. As a side effect, the patch also introduces a user-visible feature. Because it is visible to users it will need to be documented, but some caveats apply; these are also detailed below.

**Motivation**

During optimizations, column references (IndexedVar) can move around and be substituted in ways that may lose track of the original name used in the query to name the column. More specifically, we can move a reference to a position where there is no name known for the column *yet*. For example, we want to go from:

`SELECT * FROM (SELECT * FROM (VALUES (1), (2))) AS a(x) WHERE x > 2`

to:

`SELECT * FROM (SELECT * FROM ... WHERE ??? > 2) AS a(x)`

Although we can move the IndexedVar to the position marked `???`, it is not clear which *name* to give this IndexedVar when pretty-printing the resulting tree: in this example the name "x" does not exist at that position.

This patch makes it possible to side-step this issue by introducing a new SQL syntax that refers to columns at the "current level" by ordinal position: `@N`, where N is an integer literal. With this patch the example above can be reliably serialized to:

`SELECT * FROM (SELECT * FROM ... WHERE @1 > 2) AS a(x)`

**User-visible changes**

SQL already has a traditional, limited way to refer to columns numerically in some syntactic positions, specifically ORDER BY and GROUP BY. For example, in `SELECT a + b FROM foo ORDER BY 1`, "1" refers to the first value rendered, i.e. `a + b`. These are called "column ordinals"; they are supported in *some* SQL engines, sometimes for backward compatibility, sometimes for historical reasons. The feature added in this patch complements and extends this mechanism.
**However, the use of column ordinals by client applications is also customarily strongly discouraged.** The use of the new column ordinal references added in this patch should be equally discouraged.

The reason is that they are not robust against schema updates. Say a table is initially created with columns `a, b, c` in this order, and a query is designed to refer to column `a` by position, with number 1. Later, a DB admin independently changes the schema: they remove column `a` and add a new version of column `a` with, e.g., a different type. Now the schema is `b, c, a`, and all the queries that expect to refer to `a` by position 1 are broken.

The new feature in this patch is also subject to this limitation. It is intended primarily for use during development, when schema updates are tightly controlled by the operator manipulating the query.

Meanwhile, since the feature is visible to users, it should still be (minimally) documented. The salient aspects that should be communicated are:

1) Don't use this feature in client applications unless you 100% understand the limitation described above.

2) **The @ notation refers to a column number in the data source, not in the rendered columns.** The data source is the thing named after FROM. For example, suppose a table `foo` has columns `a` and `b` in this order. Then the query `SELECT b, a FROM foo WHERE @2 = 123` is equivalent to `SELECT b, a FROM foo WHERE b = 123`.

3) Point 2 above means that there is a difference between the new column ordinal references and the traditional SQL ordinals, which can be illustrated as follows. With SQL ordinals, the query `SELECT b, a FROM foo ORDER BY 1` sorts by column `b`, because this is the first value rendered (columns after SELECT); whereas `SELECT b, a FROM foo ORDER BY @1` sorts by column `a`, because this is the first column in the data source (columns after FROM).
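The difference between the two kinds of ordinals, and the schema fragility described above, can be illustrated with a small Python sketch on top of SQLite. This is not how the patch works: the patch handles `@N` in the SQL parser, whereas the sketch emulates it with a naive textual rewrite (it would also rewrite an `@N` inside a string literal). The helper names `source_columns` and `expand_ordinal_refs` are hypothetical and exist only for this illustration; the traditional `ORDER BY 1` ordinal is the one SQLite supports natively.

```python
import re
import sqlite3


def source_columns(conn, table):
    """Return the column names of `table` in declaration order (hypothetical helper)."""
    # PRAGMA table_info yields one row per column: (cid, name, type, ...).
    return [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]


def expand_ordinal_refs(sql, columns):
    """Emulate @N by substituting the N-th (1-based) data source column name.

    Naive textual rewrite, for illustration only.
    """
    def repl(match):
        n = int(match.group(1))
        if not 1 <= n <= len(columns):
            raise ValueError(f"invalid column ordinal: @{n}")
        return columns[n - 1]

    return re.sub(r"@(\d+)", repl, sql)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE foo (a INT, b INT)")
conn.executemany("INSERT INTO foo VALUES (?, ?)", [(1, 30), (2, 20), (3, 10)])
cols = source_columns(conn, "foo")

# Traditional ordinal: 1 is the first *rendered* value, i.e. b.
by_rendered = conn.execute("SELECT b, a FROM foo ORDER BY 1").fetchall()

# Emulated @1: the first column of the *data source*, i.e. a.
by_source = conn.execute(
    expand_ordinal_refs("SELECT b, a FROM foo ORDER BY @1", cols)
).fetchall()

# Schema fragility: the same @1 names a different column once the
# data source's column order changes from (a, b, c) to (b, c, a).
before = expand_ordinal_refs("SELECT * FROM t WHERE @1 = 123", ["a", "b", "c"])
after = expand_ordinal_refs("SELECT * FROM t WHERE @1 = 123", ["b", "c", "a"])
```

With the rows above, `by_rendered` is ordered by `b` and `by_source` by `a`, so the two result lists come out in opposite order. `before` filters on `a` while `after` silently filters on `b`, which is the failure mode the caveat warns about.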