rfc: computed columns by justinj · Pull Request #20735 · cockroachdb/cockroach

justinj · 2017-12-14T20:59:42Z

Release note: None

cockroach-teamcity · 2017-12-14T20:59:46Z

petermattis · 2017-12-14T21:43:53Z

Excited to see this RFC. A few questions below. Nothing major.

Review status: 0 of 1 files reviewed at latest revision, 7 unresolved discussions, all commit checks successful.

docs/RFCS/20171214_computed_columns.md, line 17 at r1 (raw file):

columns which are only computed when read (*non-materialized*).

Computed indexes are considered out of scope. Many of the benefits can be had

Given that this RFC is about computed columns, I'm not sure why this paragraph is needed here. Perhaps move it to the end of the document under a Related work section. Also, it isn't obvious to me why an index on a non-materialized computed column is easier than a computed index. Can you briefly expand on that?

docs/RFCS/20171214_computed_columns.md, line 32 at r1 (raw file):

The grammar for a computed column is `column_name <type> AS <expr>
[PERSISTED]`, where `<expr>` is a pure function of non-computed columns in the

We use the term STORING for additional columns to be stored with an index. That's an argument for using the term STORED here.

docs/RFCS/20171214_computed_columns.md, line 34 at r1 (raw file):

[PERSISTED]`, where `<expr>` is a pure function of non-computed columns in the
same table. `PERSISTED` indicates a materialized computed column and is
required if that columns appears in the primary key.

Why does the primary key require materialized columns?

docs/RFCS/20171214_computed_columns.md, line 53 at r1 (raw file):

    WHEN country IN ('ca', 'mx', 'us') THEN 'north_america'
    WHEN country IN ('au', 'nz') THEN 'australia'
  END,

This column will need to be PERSISTED given its use in the primary key, right?
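For illustration only, the quoted `CASE` expression behaves like a pure function of `country`. The following is a minimal Python sketch, assuming only the two `WHEN` branches visible in the excerpt. The name `region_for` is hypothetical, and a `CASE` with no matching branch evaluates to NULL, modeled here as `None`.

```python
from typing import Optional

# Mapping copied from the two WHEN branches quoted in the excerpt;
# any other branches in the full RFC example are not modeled.
_REGIONS = {
    "north_america": ("ca", "mx", "us"),
    "australia": ("au", "nz"),
}


def region_for(country: str) -> Optional[str]:
    """Return the region for a country code, or None (SQL NULL) if no branch matches."""
    for region, countries in _REGIONS.items():
        if country in countries:
            return region
    return None
```

Because the function depends only on `country`, its result can be stored next to the row and used as a partitioning prefix.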

docs/RFCS/20171214_computed_columns.md, line 124 at r1 (raw file):

In addition, users are stopped from dropping a column if it’s referenced by a
computed column. Computed column expressions for dependents of a column are

Also seems bad to drop a computed column if it is referenced by a partition.
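The drop restriction quoted above amounts to a dependency check. This is a rough Python sketch under stated assumptions, not the actual schema-change code: expressions are modeled as nested tuples `(op, arg, ...)`, a bare string is a column reference, and the names `referenced_columns` and `check_drop` are hypothetical.

```python
from typing import Dict, Set


def referenced_columns(expr) -> Set[str]:
    """Collect the column names that appear in an expression tree.

    Assumption: strings are column references, tuples are (op, arg, ...),
    and anything else (for example an int) is a literal.
    """
    if isinstance(expr, str):
        return {expr}
    if isinstance(expr, tuple):
        cols: Set[str] = set()
        for arg in expr[1:]:
            cols |= referenced_columns(arg)
        return cols
    return set()


def check_drop(column: str, computed: Dict[str, object]) -> None:
    """Raise if any computed column's expression references `column`."""
    dependents = sorted(
        name for name, expr in computed.items()
        if column in referenced_columns(expr)
    )
    if dependents:
        raise ValueError(
            f"cannot drop {column!r}: referenced by computed column(s) {dependents}"
        )
```

A check for partitions that reference the column, as raised in this comment, would be a second lookup of the same shape.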

docs/RFCS/20171214_computed_columns.md, line 131 at r1 (raw file):

- Adding this feature means the story of what a column is is complicated
  slightly. This could have implications for upcoming features, like the query
  optimizer.

I don't see this as being a significant complication. For non-materialized columns, query planning would likely just expand any references to the column to be the underlying expression.
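As a sketch of the expansion described here (an illustration only, not the planner's code), references to a non-materialized column can be rewritten into the underlying expression before planning. Expressions are again modeled as nested tuples, and `expand` is a hypothetical name.

```python
def expand(expr, virtual):
    """Replace references to virtual computed columns with their expressions.

    `virtual` maps a column name to its defining expression. A single pass
    is enough because, per the RFC, a computed column's expression may only
    reference non-computed columns.
    """
    if isinstance(expr, str):
        # A column reference: substitute if it names a virtual column.
        return virtual.get(expr, expr)
    if isinstance(expr, tuple):
        # (op, arg, ...): keep the operator, expand each argument.
        return (expr[0],) + tuple(expand(arg, virtual) for arg in expr[1:])
    return expr  # literal
```

For example, a filter on `inventory_value` becomes a filter on `qty_available * unit_price`, after which the planner never sees the computed column.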

docs/RFCS/20171214_computed_columns.md, line 152 at r1 (raw file):

| ---------------------- | --------------------------------------------------------------- | ---------------------------------------------------------------- |
| CockroachDB (Proposed) | `inventory_value INT AS qty_available * unit_price`             | `inventory_value INT AS qty_available * unit_price PERSISTED`    |
| MySQL                  | `inventory_value INT AS (qty_available * unit_price) [VIRTUAL]` | `inventory_value INT AS qty_available * unit_price STORED`       |

Are the parentheses really needed for the Non-materialized MySQL syntax? You're not showing them for the Materialized syntax.

Comments from Reviewable

justinj · 2017-12-14T22:31:41Z

Review status: 0 of 1 files reviewed at latest revision, 7 unresolved discussions, some commit checks pending.

docs/RFCS/20171214_computed_columns.md, line 17 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Given that this RFC is about computed columns, I'm not sure why this paragraph is needed here. Perhaps move it to the end of the document under a Related work section. Also, it isn't obvious to me why an index on a non-materialized computed column is easier than a computed index. Can you briefly expand on that?

I think the rationale for this paragraph was that the problems this feature solves for JSON were often discussed in terms of computed indexes before this, and this closes the loop on those conversations. It might make more sense to move it to the alternatives section though!

docs/RFCS/20171214_computed_columns.md, line 32 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

We use the term STORING for additional columns to be stored with an index. That's an argument for using the term STORED here.

I don't have strong feelings on this, and I don't think Dan does either. STORED is fine with me, but I'll hold off on changing it until anyone else who might have an opinion has had a chance to weigh in.

docs/RFCS/20171214_computed_columns.md, line 34 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Why does the primary key require materialized columns?

It has to be materialized so that it can form the primary index. This is as opposed to a non-materialized column that is part of a secondary index - it doesn't have to be materialized for its existence in the primary index, but it will have to be materialized within the secondary index. It's probably debatable if "materialized" has the same connotation when the value exists in the key or the value of a kv-entry.

docs/RFCS/20171214_computed_columns.md, line 53 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

This column will need to be PERSISTED given its use in the primary key, right?

You are correct! Good catch.

docs/RFCS/20171214_computed_columns.md, line 124 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Also seems bad to drop a computed column if it is referenced by a partition.

I believe this is covered under whatever usual behaviour exists around dropping columns referenced by partitions, even if they aren't computed.

docs/RFCS/20171214_computed_columns.md, line 131 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

I don't see this as being a significant complication. For non-materialized columns, query planning would likely just expand any references to the column to be the underlying expression.

Ok, sounds good. I'll remove this as a concern.

docs/RFCS/20171214_computed_columns.md, line 152 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Are the parentheses really needed for the Non-materialized MySQL syntax? You're not showing them for the Materialized syntax.

Good catch - they're required in both cases.

benesch · 2017-12-14T22:31:51Z

Reviewed 1 of 1 files at r1.
Review status: 0 of 1 files reviewed at latest revision, 10 unresolved discussions, some commit checks pending.

docs/RFCS/20171214_computed_columns.md, line 32 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

We use the term STORING for additional columns to be stored with an index. That's an argument for using the term STORED here.

Views can also be MATERIALIZED in Postgres. I'd vote for either of those over PERSISTED.

docs/RFCS/20171214_computed_columns.md, line 34 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Why does the primary key require materialized columns?

(And if it is required, shouldn't it be required for any columns used in a CREATE INDEX, too?)

docs/RFCS/20171214_computed_columns.md, line 38 at r1 (raw file):

## Partitioning

[Partitioning](https://github.com/cockroachdb/cockroach/blob/aa61db043e9c54c0b83a405cd76ce0ec7cc6a35d/docs/RFCS/20170921_sql_partitioning.md)

nit: consider pulling this out into a reference-style link

docs/RFCS/20171214_computed_columns.md, line 67 at r1 (raw file):

## JSON

When the primary source of truth for a table lives in a JSON blob, it can be desirable to put an index on a particular field of a JSON document. In particular, computed columns allow for the following use-case: a two column table, with a primary key column and a payload column, whose primary key is computed as some field from the payload column. This alleviates the need for the client to manually separate their JSON blobs from their primary keys.

nit: consider wrapping
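As a rough illustration of the JSON use case quoted above, the following Python sketch models a two-column table whose primary key is derived from the payload, so the client only ever supplies the JSON blob. The names `document_key` and `insert_document` are hypothetical, and a plain `dict` stands in for the table.

```python
import json
from typing import Dict, Optional


def document_key(payload: str) -> Optional[str]:
    """Model of payload->>'id': the top-level "id" field as text, or None (NULL)."""
    doc = json.loads(payload)
    value = doc.get("id") if isinstance(doc, dict) else None
    if value is None:
        return None
    # ->> yields text: strings are returned unquoted, other values as JSON text.
    return value if isinstance(value, str) else json.dumps(value)


def insert_document(table: Dict[str, str], payload: str) -> str:
    """Insert a payload, computing its primary key from the payload itself."""
    key = document_key(payload)
    if key is None:
        raise ValueError("primary key column cannot be NULL")
    if key in table:
        raise ValueError(f"duplicate primary key {key!r}")
    table[key] = payload
    return key
```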

docs/RFCS/20171214_computed_columns.md, line 89 at r1 (raw file):

- Computed columns behave like any other column, with the exception that they
  cannot be written to directly. Performing an `INSERT` or `UPDATE` which
  specifies a computed column is an error.

I think you should be able to specify DEFAULT. This is how MySQL works. Consider:

CREATE TABLE t (a INT, b INT AS (a + 1) STORED, c INT);
INSERT INTO t (a, c) VALUES (1, 2)  -- Great! Easy!
INSERT INTO t VALUES (1, ???, 2) -- Eek!

Yes, implicit column lists are evil, but we should still support an escape hatch so you can specify the columns after a persisted column in a VALUES clause.

INSERT INTO t VALUES (1, DEFAULT, 2)

docs/RFCS/20171214_computed_columns.md, line 152 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Are the parentheses really needed for the Non-materialized MySQL syntax? You're not showing them for the Materialized syntax.

In my tests they're required for both.

benesch · 2017-12-14T22:40:39Z

Review status: 0 of 1 files reviewed at latest revision, 10 unresolved discussions, some commit checks pending.

docs/RFCS/20171214_computed_columns.md, line 17 at r1 (raw file):

Previously, justinj (Justin Jaffray) wrote…

I think the rationale for this paragraph was that the problems this feature solves for JSON were often discussed in terms of computed indexes before this, and this closes the loop on those conversations. It might make more sense to move it to the alternatives section though!

The way I understood it is that with computed indices, you might expect the SELECT query in the following example to use the index:

CREATE TABLE t (a INT, b INT);
CREATE INDEX ON t ((a + b));
SELECT * FROM t WHERE b + a > 10;

When in fact it's not trivial to prove that a + b and b + a are equivalent. (Ok, well, it is in this case, but not in general.)

If you force users to create a stored column instead

CREATE TABLE t (a INT, b INT, c INT AS (a + b) STORED);
CREATE INDEX ON t (c);

it's obvious to the user and the optimizer that SELECT * FROM t WHERE c > 10 will use the index, but there's less expectation that SELECT * FROM t WHERE a + b > 10 will use the index.
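To make the matching problem concrete, here is a small Python sketch. It is an illustration only, not how any real optimizer works: expressions are nested tuples, `normalize` is a hypothetical helper that sorts the arguments of commutative operators, and the two match functions stand in for the optimizer's test of whether an index expression covers a predicate.

```python
COMMUTATIVE = {"+", "*"}


def normalize(expr):
    """Sort the arguments of commutative operators into a canonical order."""
    if isinstance(expr, tuple):
        op = expr[0]
        args = [normalize(arg) for arg in expr[1:]]
        if op in COMMUTATIVE:
            args.sort(key=repr)  # repr gives a total order over mixed types
        return (op, *args)
    return expr


def matches_syntactically(index_expr, query_expr) -> bool:
    """Plain structural equality: a + b does not match b + a."""
    return index_expr == query_expr


def matches_after_normalizing(index_expr, query_expr) -> bool:
    """Handles reordered operands, but not equivalence in general."""
    return normalize(index_expr) == normalize(query_expr)
```

Normalization rescues the `a + b` versus `b + a` case, but `2 * a` and `a + a` still fail to match even though they are equivalent. With an index on a named stored column, the match reduces to comparing column names.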

docs/RFCS/20171214_computed_columns.md, line 32 at r1 (raw file):

Previously, benesch (Nikhil Benesch) wrote…

Views can also be MATERIALIZED in Postgres. I'd vote for either of those over PERSISTED.

Didn't realize there was precedent for both STORED and PERSISTED. I'll also vote for STORED.

docs/RFCS/20171214_computed_columns.md, line 34 at r1 (raw file):

Previously, benesch (Nikhil Benesch) wrote…

(And if it is required, shouldn't it be required for any columns used in a CREATE INDEX, too?)

Yeah, in my mind it's perfectly reasonable that a column used in an index is essentially STORED, regardless of whether STORED was specified. I'll cast a vote in favor of allowing any computed column, STORED or not, to be used in any index, primary or not. Then STORED just indicates whether you want to trade space for CPU on non-indexed computed columns (i.e., when the column is really expensive to compute).

docs/RFCS/20171214_computed_columns.md, line 124 at r1 (raw file):

Previously, justinj (Justin Jaffray) wrote…

I believe this is covered under whatever usual behaviour exists around dropping columns referenced by partitions, even if they aren't computed.

I don't think that logic exists yet :S. It should!

vivekmenezes · 2017-12-20T02:49:59Z

+
+```protobuf
+message ColumnDescriptor {
+  ...


what will its name be?

vivekmenezes · 2017-12-20T02:55:19Z

+CREATE TABLE documents (
+  id STRING PRIMARY KEY AS payload->>'id' PERSISTED,
+  payload JSONB
+)


users might prefer using the postgres variant

CREATE TABLE documents ( payload JSONB UNIQUE (payload->>'id') )

I see the JSON use case as not necessarily needing the computed column but a computed index.

vivekmenezes · 2017-12-20T02:59:07Z

+
+The grammar for a computed column is `column_name <type> AS <expr>
+[PERSISTED]`, where `<expr>` is a pure function of non-computed columns in the
+same table. `PERSISTED` indicates a materialized computed column and is


Do we really need PERSISTED? I ask because it appears that we're only using it for indexes.

jordanlewis · 2017-12-20T03:55:07Z

Great RFC!

Will you support ALTER TABLE ALTER COLUMN on these kinds of columns? What will the limitations be?

Reviewed 1 of 1 files at r2.
Review status: all files reviewed at latest revision, 17 unresolved discussions, all commit checks successful.

docs/RFCS/20171214_computed_columns.md, line 32 at r1 (raw file):

Previously, benesch (Nikhil Benesch) wrote…

Didn't realize there was precedent for both STORED and PERSISTED. I'll also vote for STORED.

For this syntax bikeshedding to be successful, I think it's important to examine what other databases do.

MySQL: [ PERSISTED | STORED | VIRTUAL ]. PERSISTED and STORED are synonyms. VIRTUAL is the default.
SQL Server: PERSISTED vs nothing.
Oracle: No support for persisted computed columns.
Postgres: No support for computed columns at all.

Looks like PERSISTED is the winner for frequency, but I buy the argument about consistency - STORED makes sense since we say STORING elsewhere.

I think we should support STORED. I also think we should support VIRTUAL as the spelled-out version of no modifier.

(whoops, didn't realize you did this analysis at the bottom - please mention that up here!)

docs/RFCS/20171214_computed_columns.md, line 34 at r1 (raw file):

Previously, benesch (Nikhil Benesch) wrote…

Yeah, in my mind it's perfectly reasonable that a column used in an index is essentially STORED, regardless of whether STORED was specified. I'll cast a vote in favor of allowing any computed column, STORED or not, to be used in any index, primary or not. Then STORED just indicates whether you want to trade space for CPU on non-indexed computed columns (i.e., when the column is really expensive to compute).

nit: s/columns/column/.

docs/RFCS/20171214_computed_columns.md, line 67 at r1 (raw file):

Previously, benesch (Nikhil Benesch) wrote…

nit: consider wrapping

+1

docs/RFCS/20171214_computed_columns.md, line 89 at r1 (raw file):

Previously, benesch (Nikhil Benesch) wrote…

I think you should be able to specify DEFAULT. This is how MySQL works. Consider:
CREATE TABLE t (a INT, b INT AS (a + 1) STORED, c INT);
INSERT INTO t (a, c) VALUES (1, 2)  -- Great! Easy!
INSERT INTO t VALUES (1, ???, 2) -- Eek! 
Yes, implicit column lists are evil, but we should still support an escape hatch so you can specify the columns after a persisted column in a VALUES clause.
INSERT INTO t VALUES (1, DEFAULT, 2)

Agreed that DEFAULT would be a nice-to-have, but it seems okay to live without it, since you can't accidentally screw up. INSERT INTO t VALUES (1, 2) should fail (not enough values for table), and INSERT INTO t VALUES (1,2,3) should fail (can't insert into a computed column).
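The insert semantics discussed in this thread can be sketched as follows. This is a minimal Python model, not CockroachDB's implementation: `Column`, `DEFAULT`, and `build_row` are hypothetical names, and regular columns are assumed to have no default expression, so `DEFAULT` on them means NULL.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

DEFAULT = object()  # sentinel standing in for the SQL keyword DEFAULT


@dataclass
class Column:
    name: str
    # For a computed column, a function from the row's regular columns to a value.
    compute: Optional[Callable[[Dict[str, object]], object]] = None


def build_row(columns: List[Column], values: List[object]) -> Dict[str, object]:
    """Validate an INSERT with an implicit column list and fill in computed columns."""
    if len(values) != len(columns):
        raise ValueError("wrong number of values for table")
    row: Dict[str, object] = {}
    for col, val in zip(columns, values):
        if col.compute is not None:
            if val is not DEFAULT:
                raise ValueError(f"cannot write directly to computed column {col.name!r}")
        else:
            row[col.name] = None if val is DEFAULT else val
    # Computed columns only depend on regular columns, so evaluate them last.
    for col in columns:
        if col.compute is not None:
            row[col.name] = col.compute(row)
    return row
```

With the table `t (a, b AS a + 1, c)`, the values `(1, DEFAULT, 2)` are accepted and `b` is filled in as `2`, while `(1, 2)` and `(1, 2, 3)` are both rejected.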

docs/RFCS/20171214_computed_columns.md, line 152 at r1 (raw file):

Previously, benesch (Nikhil Benesch) wrote…

In my tests they're required for both.

MySQL also supports the PERSISTED syntax.

docs/RFCS/20171214_computed_columns.md, line 14 at r2 (raw file):

supported in several major databases, including MySQL, Oracle, and SQL Server.
This RFC covers both the version of computed columns that are physically stored
alongside other columns (referred to here as *materialized*) and computed

Can we pick a terminology here? the industry standard seems to be stored/persisted vs virtual. I suggest we do the same.

docs/RFCS/20171214_computed_columns.md, line 37 at r2 (raw file):

requires that partitions are defined using columns that are a prefix of the
primary key. In the case of geo-partitioning, some applications will want to
collapse the number of possible values in this columns, to make certain classes

nit: "this columns" seems to be incorrect grammar.

docs/RFCS/20171214_computed_columns.md, line 93 at r2 (raw file):

# Reference-level explanation
## Detailed Design

Just a note - don't forget to update information_schema.columns with the fact that the columns are virtual/persisted!

docs/RFCS/20171214_computed_columns.md, line 99 at r2 (raw file):

Previously, vivekmenezes wrote…

what will its name be?

Seems like computed? It's at the end of the message.

docs/RFCS/20171214_computed_columns.md, line 102 at r2 (raw file):

  message Computed {
    string expr = 1;
    bool materialized = 2;

Please keep the terminology consistent as mentioned above. (and throughout - I think we should globally s/materialized/stored/ if we decided on stored as the syntax word)

knz · 2017-12-21T11:11:48Z

LGTM modulo resolution of the points already raised. 👍 on STORED/VIRTUAL vs PERSISTED/nothing.

Review status: all files reviewed at latest revision, 18 unresolved discussions, all commit checks successful.

docs/RFCS/20171214_computed_columns.md, line 141 at r2 (raw file):

  Oracle implement it slightly differently. Postgres does not have this
  feature.
- Computed indexes can also solve many of the same problems as computed

"Computed indexes solve the same problems as computed indexes" - tautology

knz · 2017-12-21T11:11:57Z

Reviewed 1 of 1 files at r2.
Review status: all files reviewed at latest revision, 18 unresolved discussions, all commit checks successful.

justinj · 2018-01-05T16:41:19Z

Review status: all files reviewed at latest revision, 18 unresolved discussions, all commit checks successful.

docs/RFCS/20171214_computed_columns.md, line 32 at r1 (raw file):

Previously, jordanlewis (Jordan Lewis) wrote…

For this syntax bikeshedding to be successful, I think it's important to examine what other databases do.

MySQL: [ PERSISTED | STORED | VIRTUAL ]. PERSISTED and STORED are synonyms. VIRTUAL is the default.
SQL Server: PERSISTED vs nothing.
Oracle: No support for persisted computed columns.
Postgres: No support for computed columns at all.

Looks like PERSISTED is the winner for frequency, but I buy the argument about consistency - STORED makes sense since we say STORING elsewhere.

I think we should support STORED. I also think we should support VIRTUAL as the spelled-out version of no modifier.

(whoops, didn't realize you did this analysis at the bottom - please mention that up here!)

Done.

docs/RFCS/20171214_computed_columns.md, line 34 at r1 (raw file):

Previously, jordanlewis (Jordan Lewis) wrote…

nit: s/columns/column/.

We discussed a bit in person - I think we've settled on the position that since a column in the primary key will need to be "stored" in the primary index regardless of whether it's specified, we'd prefer to require them to be stored in that case. This matches MySQL's behaviour.
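The rule settled on here can be expressed as a table-definition check. The following is a hedged Python sketch in which the `Column` class and `validate_primary_key` are hypothetical names rather than anything from the codebase.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Column:
    name: str
    expr: Optional[str] = None  # None means a regular (non-computed) column
    stored: bool = False        # only meaningful when expr is set


def validate_primary_key(columns: List[Column], primary_key: List[str]) -> None:
    """Reject a primary key that includes a computed column not marked STORED."""
    by_name = {col.name: col for col in columns}
    for name in primary_key:
        col = by_name[name]
        if col.expr is not None and not col.stored:
            raise ValueError(
                f"computed column {name!r} must be STORED to appear in the primary key"
            )
```

Requiring the keyword, rather than treating primary-key columns as implicitly stored, keeps the declaration honest about where the value lives.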

docs/RFCS/20171214_computed_columns.md, line 38 at r1 (raw file):

Previously, benesch (Nikhil Benesch) wrote…

nit: consider pulling this out into a reference-style link

Done.

docs/RFCS/20171214_computed_columns.md, line 67 at r1 (raw file):

Previously, jordanlewis (Jordan Lewis) wrote…

+1

Done.

docs/RFCS/20171214_computed_columns.md, line 89 at r1 (raw file):

Previously, jordanlewis (Jordan Lewis) wrote…

Agreed that DEFAULT would be a nice-to-have, but it seems okay to live without it, since you can't accidentally screw up. INSERT INTO t VALUES (1, 2) should fail (not enough values for table), and INSERT INTO t VALUES (1,2,3) should fail (can't insert into a computed column).

I actually missed this syntax! That's useful and I've included it now.

docs/RFCS/20171214_computed_columns.md, line 14 at r2 (raw file):

Previously, jordanlewis (Jordan Lewis) wrote…

Can we pick a terminology here? the industry standard seems to be stored/persisted vs virtual. I suggest we do the same.

Done.

docs/RFCS/20171214_computed_columns.md, line 29 at r2 (raw file):

Previously, vivekmenezes wrote…

Do we really need PERSISTED? I ask because it appears that we're only using it for indexes.

It's true that for our uses in indexes we could get away with only one of persisted/not persisted, and we're only planning to implement persisted for the time being (just because it seems easier to implement). This RFC covers both for completeness.

docs/RFCS/20171214_computed_columns.md, line 37 at r2 (raw file):

Previously, jordanlewis (Jordan Lewis) wrote…

nit: "this columns" seems to be incorrect grammar.

Done.

docs/RFCS/20171214_computed_columns.md, line 69 at r2 (raw file):

Previously, vivekmenezes wrote…

users might prefer using the postgres variant
CREATE TABLE documents (
   payload JSONB
   UNIQUE (payload->>'id')
 )
I see the JSON use case as not necessarily needing the computed column but a computed index.

Yeah, that's correct, but we decided that for JSON this was a simpler implementation at the moment, and has some overlap with the features needed by partitioning.

docs/RFCS/20171214_computed_columns.md, line 102 at r2 (raw file):

Previously, jordanlewis (Jordan Lewis) wrote…

Please keep the terminology consistent as mentioned above. (and throughout - I think we should globally s/materialized/stored/ if we decided on stored as the syntax word)

Done.

docs/RFCS/20171214_computed_columns.md, line 141 at r2 (raw file):

Previously, knz (kena) wrote…

"Computed indexes solve the same problems as computed indexes" - tautology

Fixed.

jordanlewis · 2018-01-05T16:47:55Z

Review status: 0 of 1 files reviewed at latest revision, 17 unresolved discussions, some commit checks pending.

docs/RFCS/20171214_computed_columns.md, line 15 at r3 (raw file):

This RFC covers both the version of computed columns that are physically stored
alongside other columns (referred to here as *stored*) and computed
columns which are only computed when read (*non-stored*).

what about virtual? and s/non-stored/virtual/ throughout so our terminology is consistent with our syntax.

docs/RFCS/20171214_computed_columns.md, line 28 at r3 (raw file):

The grammar for a computed column is `column_name <type> AS <expr>
<VIRTUAL|STORED>`, where `<expr>` is a pure function of non-computed columns in the

Shouldn't this be [VIRTUAL | STORED], since it's optional to specify these modifiers? I thought the default with no modifier would be VIRTUAL.

jordanlewis · 2018-01-05T16:48:04Z

Reviewed 1 of 1 files at r3.
Review status: all files reviewed at latest revision, 17 unresolved discussions, some commit checks pending.

vivekmenezes · 2018-01-05T19:09:20Z

LGTM!

Release note: None

justinj · 2018-01-09T18:14:48Z

Entering final comment period - targeting merge on Thursday!

Review status: all files reviewed at latest revision, 17 unresolved discussions, all commit checks successful.

docs/RFCS/20171214_computed_columns.md, line 15 at r3 (raw file):

Previously, jordanlewis (Jordan Lewis) wrote…

what about virtual? and s/non-stored/virtual/ throughout so our terminology is consistent with our syntax.

Done.

docs/RFCS/20171214_computed_columns.md, line 28 at r3 (raw file):

Previously, jordanlewis (Jordan Lewis) wrote…

Shouldn't this be [VIRTUAL | STORED], since it's optional to specify these modifiers? I thought the default with no modifier would be VIRTUAL.

Yes, good point!

benesch · 2018-01-09T21:14:46Z

Reviewed 1 of 1 files at r4.
Review status: all files reviewed at latest revision, 14 unresolved discussions, all commit checks successful.

justinj · 2018-01-12T00:21:01Z

Thanks for the reviews everyone!

justinj requested a review from a team as a code owner December 14, 2017 20:59

justinj force-pushed the cc-rfc branch from 55104db to c26ba99 Compare December 14, 2017 22:28

benesch mentioned this pull request Dec 19, 2017

sql: implement computed columns #20882

Closed

vivekmenezes reviewed Dec 20, 2017

justinj force-pushed the cc-rfc branch from c26ba99 to e9ef8c9 Compare January 5, 2018 16:41

rfc: computed columns

a37eaee

Release note: None

justinj force-pushed the cc-rfc branch from e9ef8c9 to a37eaee Compare January 9, 2018 18:14

justinj merged commit 79db1cb into cockroachdb:master Jan 12, 2018

justinj deleted the cc-rfc branch January 12, 2018 00:21

7 participants