rfc: computed columns #20735

Merged
justinj merged 1 commit into cockroachdb:master from justinj:cc-rfc
Jan 12, 2018

Conversation

@justinj
Contributor

justinj commented Dec 14, 2017

Release note: None

@justinj requested a review from a team as a code owner December 14, 2017 20:59
@cockroach-teamcity
Member

This change is Reviewable

@petermattis
Collaborator

Excited to see this RFC. A few questions below. Nothing major.


Review status: 0 of 1 files reviewed at latest revision, 7 unresolved discussions, all commit checks successful.


docs/RFCS/20171214_computed_columns.md, line 17 at r1 (raw file):

columns which are only computed when read (*non-materialized*).

Computed indexes are considered out of scope. Many of the benefits can be had

Given that this RFC is about computed columns, I'm not sure why this paragraph is needed here. Perhaps move it to the end of the document under a Related work section. Also, it isn't obvious to me why an index on a non-materialized computed column is easier than a computed index. Can you briefly expand on that?


docs/RFCS/20171214_computed_columns.md, line 32 at r1 (raw file):

The grammar for a computed column is `column_name <type> AS <expr>
[PERSISTED]`, where `<expr>` is a pure function of non-computed columns in the

We use the term STORING for additional columns to be stored with an index. That's an argument for using the term STORED here.


docs/RFCS/20171214_computed_columns.md, line 34 at r1 (raw file):

[PERSISTED]`, where `<expr>` is a pure function of non-computed columns in the
same table. `PERSISTED` indicates a materialized computed column and is
required if that columns appears in the primary key.

Why does the primary key require materialized columns?


docs/RFCS/20171214_computed_columns.md, line 53 at r1 (raw file):

    WHEN country IN ('ca', 'mx', 'us') THEN 'north_america'
    WHEN country IN ('au', 'nz') THEN 'australia'
  END,

This column will need to be PERSISTED given its use in the primary key, right?


docs/RFCS/20171214_computed_columns.md, line 124 at r1 (raw file):

In addition, users are stopped from dropping a column if it’s referenced by a
computed column. Computed column expressions for dependents of a column are

Also seems bad to drop a computed column if it is referenced by a partition.
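
The two drop-column guards raised in this thread (a column referenced by a computed column, and a column referenced by a partition) could be sketched roughly as follows. This is an illustrative Python model only: `Table`, `check_drop_column`, and the dependency map are hypothetical names, not CockroachDB code, and the partition check is an assumption about how the suggestion might be handled.

```python
from dataclasses import dataclass, field


@dataclass
class Table:
    # Column name -> names of the columns its computed expression reads
    # (an empty tuple for an ordinary, non-computed column).
    columns: dict = field(default_factory=dict)
    # Columns used to define partitions (hypothetical representation).
    partition_columns: tuple = ()


def check_drop_column(table: Table, name: str) -> None:
    """Raise ValueError if dropping `name` would break a dependent."""
    if name not in table.columns:
        raise ValueError(f"column {name!r} does not exist")
    # Any computed column whose expression reads `name` blocks the drop.
    dependents = sorted(
        col for col, deps in table.columns.items() if name in deps
    )
    if dependents:
        raise ValueError(
            f"column {name!r} is referenced by computed column(s) {dependents}"
        )
    # A column used by a partition blocks the drop, computed or not.
    if name in table.partition_columns:
        raise ValueError(f"column {name!r} is referenced by a partition")
```

With a table whose `region` column is computed from `country` and used for partitioning, both `country` and `region` would be rejected, while an unrelated column could still be dropped.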


docs/RFCS/20171214_computed_columns.md, line 131 at r1 (raw file):

- Adding this feature means the story of what a column is is complicated
  slightly. This could have implications for upcoming features, like the query
  optimizer.

I don't see this as being a significant complication. For non-materialized columns, query planning would likely just expand any references to the column to be the underlying expression.
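
As a rough illustration of the expansion described here, the sketch below substitutes references to non-materialized columns with their defining expressions before planning. It is a toy Python model under stated assumptions: the tuple-based expression encoding and the `expand` function are hypothetical, and the example column borrows `inventory_value` from the RFC's comparison table.

```python
def expand(expr, virtual_columns):
    """Replace references to virtual columns with their defining expressions.

    Expressions are nested tuples: ("col", name), ("const", value) or
    ("op", symbol, left, right).
    """
    kind = expr[0]
    if kind == "col":
        name = expr[1]
        # The RFC requires <expr> to read only non-computed columns, so one
        # substitution is enough; the definition itself needs no expansion.
        return virtual_columns.get(name, expr)
    if kind == "op":
        _, symbol, left, right = expr
        return (
            "op",
            symbol,
            expand(left, virtual_columns),
            expand(right, virtual_columns),
        )
    return expr  # constants are returned unchanged


# inventory_value INT AS qty_available * unit_price
virtual = {
    "inventory_value": ("op", "*", ("col", "qty_available"), ("col", "unit_price"))
}
# WHERE inventory_value > 100
query = ("op", ">", ("col", "inventory_value"), ("const", 100))
expanded = expand(query, virtual)
```

After expansion, the filter refers only to `qty_available` and `unit_price`, so the rest of planning never needs to know the column was computed.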


docs/RFCS/20171214_computed_columns.md, line 152 at r1 (raw file):

| ---------------------- | --------------------------------------------------------------- | ---------------------------------------------------------------- |
| CockroachDB (Proposed) | `inventory_value INT AS qty_available * unit_price`             | `inventory_value INT AS qty_available * unit_price PERSISTED`    |
| MySQL                  | `inventory_value INT AS (qty_available * unit_price) [VIRTUAL]` | `inventory_value INT AS qty_available * unit_price STORED`       |

Are the parentheses really needed for the Non-materialized MySQL syntax? You're not showing them for the Materialized syntax.


Comments from Reviewable

@justinj
Contributor Author

justinj commented Dec 14, 2017

Review status: 0 of 1 files reviewed at latest revision, 7 unresolved discussions, some commit checks pending.


docs/RFCS/20171214_computed_columns.md, line 17 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Given that this RFC is about computed columns, I'm not sure why this paragraph is needed here. Perhaps move it to the end of the document under a Related work section. Also, it isn't obvious to me why an index on a non-materialized computed column is easier than a computed index. Can you briefly expand on that?

I think the rationale for this paragraph was that the problems this feature solves for JSON were often discussed in terms of computed indexes before this, and this closes the loop on those conversations. It might make more sense to move it to the alternatives section though!


docs/RFCS/20171214_computed_columns.md, line 32 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

We use the term STORING for additional columns to be stored with an index. That's an argument for using the term STORED here.

I don't have strong feelings on this, and I don't think Dan does either. STORED is fine with me, but I'll hold off on changing it until anyone else who might have an opinion has had a chance to weigh in.


docs/RFCS/20171214_computed_columns.md, line 34 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Why does the primary key require materialized columns?

It has to be materialized so that it can form the primary index. This is as opposed to a non-materialized column that is part of a secondary index - it doesn't have to be materialized for its existence in the primary index, but it will have to be materialized within the secondary index. It's probably debatable if "materialized" has the same connotation when the value exists in the key or the value of a kv-entry.
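
The rule as stated at this point in the review (a computed column in the primary key must be materialized) amounts to a schema-validation check. A minimal Python sketch, assuming a hypothetical `Column` record and `validate_primary_key` helper that are not part of CockroachDB:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Column:
    name: str
    computed_expr: Optional[str] = None  # None for an ordinary column
    persisted: bool = False              # meaningful only when computed


def validate_primary_key(columns, primary_key):
    """Reject primary keys that include a non-materialized computed column."""
    by_name = {c.name: c for c in columns}
    for name in primary_key:
        col = by_name[name]
        if col.computed_expr is not None and not col.persisted:
            raise ValueError(
                f"computed column {name!r} must be PERSISTED "
                "to appear in the primary key"
            )
```

Ordinary columns and persisted computed columns pass the check; only a computed column without the modifier is rejected.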


docs/RFCS/20171214_computed_columns.md, line 53 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

This column will need to be PERSISTED given its use in the primary key, right?

You are correct! Good catch.


docs/RFCS/20171214_computed_columns.md, line 124 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Also seems bad to drop a computed column if it is referenced by a partition.

I believe this is covered under whatever usual behaviour exists around dropping columns referenced by partitions, even if they aren't computed.


docs/RFCS/20171214_computed_columns.md, line 131 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

I don't see this as being a significant complication. For non-materialized columns, query planning would likely just expand any references to the column to be the underlying expression.

Ok, sounds good. I'll remove this as a concern.


docs/RFCS/20171214_computed_columns.md, line 152 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Are the parentheses really needed for the Non-materialized MySQL syntax? You're not showing them for the Materialized syntax.

Good catch - they're required in both cases.


Comments from Reviewable

@benesch
Contributor

benesch commented Dec 14, 2017

Reviewed 1 of 1 files at r1.
Review status: 0 of 1 files reviewed at latest revision, 10 unresolved discussions, some commit checks pending.


docs/RFCS/20171214_computed_columns.md, line 32 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

We use the term STORING for additional columns to be stored with an index. That's an argument for using the term STORED here.

Views can also be MATERIALIZED in Postgres. I'd vote for either of those over PERSISTED.


docs/RFCS/20171214_computed_columns.md, line 34 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Why does the primary key require materialized columns?

(And if it is required, shouldn't it be required for any columns used in a CREATE INDEX, too?)


docs/RFCS/20171214_computed_columns.md, line 38 at r1 (raw file):

## Partitioning

[Partitioning](https://github.com/cockroachdb/cockroach/blob/aa61db043e9c54c0b83a405cd76ce0ec7cc6a35d/docs/RFCS/20170921_sql_partitioning.md)

nit: consider pulling this out into a reference-style link


docs/RFCS/20171214_computed_columns.md, line 67 at r1 (raw file):

## JSON

When the primary source of truth for a table lives in a JSON blob, it can be desirable to put an index on a particular field of a JSON document. In particular, computed columns allow for the following use-case: a two column table, with a primary key column and a payload column, whose primary key is computed as some field from the payload column. This alleviates the need for the client to manually separate their JSON blobs from their primary keys.

nit: consider wrapping


docs/RFCS/20171214_computed_columns.md, line 89 at r1 (raw file):

- Computed columns behave like any other column, with the exception that they
  cannot be written to directly. Performing an `INSERT` or `UPDATE` which
  specifies a computed column is an error.

I think you should be able to specify DEFAULT. This is how MySQL works. Consider:

CREATE TABLE t (a INT, b INT AS (a + 1) STORED, c INT);
INSERT INTO t (a, c) VALUES (1, 2);  -- Great! Easy!
INSERT INTO t VALUES (1, ???, 2);  -- Eek!

Yes, implicit column lists are evil, but we should still support an escape hatch so you can specify the columns after a persisted column in a VALUES clause.

INSERT INTO t VALUES (1, DEFAULT, 2)
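
To spell out the intended behaviour of `DEFAULT` with an implicit column list, here is a small Python model of row construction for the table `t` above. The `DEFAULT` sentinel and `build_row` are hypothetical names for illustration, and the exact error conditions are an assumption about how such an escape hatch might behave, not a description of any implementation.

```python
DEFAULT = object()  # sentinel standing in for the SQL keyword DEFAULT


def build_row(columns, values):
    """Build a row dict from a VALUES tuple given with an implicit column list.

    `columns` is an ordered list of (name, compute) pairs, where `compute` is
    None for an ordinary column or a function of the row for a computed one.
    """
    if len(values) != len(columns):
        raise ValueError("wrong number of values for table")
    row = {}
    for (name, compute), value in zip(columns, values):
        if compute is None:
            # Sketch only: real column defaults are not modelled here.
            row[name] = None if value is DEFAULT else value
        elif value is not DEFAULT:
            raise ValueError(f"cannot write directly to computed column {name!r}")
    # Fill in computed columns once the ordinary columns are known.
    for name, compute in columns:
        if compute is not None:
            row[name] = compute(row)
    return row


# CREATE TABLE t (a INT, b INT AS (a + 1) STORED, c INT)
t = [("a", None), ("b", lambda r: r["a"] + 1), ("c", None)]
row = build_row(t, (1, DEFAULT, 2))
```

In this model the `DEFAULT` placeholder is the only value accepted in a computed column's position, so a client can still list values positionally without being able to overwrite the computed value.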

docs/RFCS/20171214_computed_columns.md, line 152 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Are the parentheses really needed for the Non-materialized MySQL syntax? You're not showing them for the Materialized syntax.

In my tests they're required for both.


Comments from Reviewable

@benesch
Contributor

benesch commented Dec 14, 2017

Review status: 0 of 1 files reviewed at latest revision, 10 unresolved discussions, some commit checks pending.


docs/RFCS/20171214_computed_columns.md, line 17 at r1 (raw file):

Previously, justinj (Justin Jaffray) wrote…

I think the rationale for this paragraph was that the problems this feature solves for JSON were often discussed in terms of computed indexes before this, and this closes the loop on those conversations. It might make more sense to move it to the alternatives section though!

The way I understood it is that with computed indices, you might expect the SELECT query in the following example to use the index:

CREATE TABLE t (a INT, b INT);
CREATE INDEX ON t ((a + b));
SELECT * FROM t WHERE b + a > 10;

When in fact it's not trivial to prove that a + b and b + a are equivalent. (Ok, well, it is in this case, but not in general.)

If you force users to create a stored column instead

CREATE TABLE t (a INT, b INT, c INT AS (a + b) STORED);
CREATE INDEX ON t (c);

it's obvious to the user and the optimizer that SELECT * FROM t WHERE c > 10 will use the index, but there's less expectation that SELECT * FROM t WHERE a + b > 10 will use the index.
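
The difference can be seen with a deliberately naive matcher that compares expressions structurally. This is a Python sketch for illustration, not how any real optimizer works; `usable_index` and the tuple encoding of expressions are hypothetical.

```python
def usable_index(predicate_expr, indexed_exprs):
    """Return the first indexed expression equal to `predicate_expr`, else None.

    Matching is purely structural: no algebraic reasoning is attempted.
    """
    for candidate in indexed_exprs:
        if candidate == predicate_expr:
            return candidate
    return None


# CREATE INDEX ON t ((a + b)), queried with b + a: no structural match.
computed_index_hit = usable_index(("+", "b", "a"), [("+", "a", "b")])

# CREATE INDEX ON t (c), queried with c: a plain column reference matches.
stored_column_hit = usable_index("c", ["c"])
```

Closing the gap for the computed-index case means teaching the matcher about commutativity and, in general, arbitrary expression equivalence, which is the non-trivial part. With a stored column the user names `c` directly and the question does not arise.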


docs/RFCS/20171214_computed_columns.md, line 32 at r1 (raw file):

Previously, benesch (Nikhil Benesch) wrote…

Views can also be MATERIALIZED in Postgres. I'd vote for either of those over PERSISTED.

Didn't realize there was precedent for both STORED and PERSISTED. I'll also vote for STORED.


docs/RFCS/20171214_computed_columns.md, line 34 at r1 (raw file):

Previously, benesch (Nikhil Benesch) wrote…

(And if it is required, shouldn't it be required for any columns used in a CREATE INDEX, too?)

Yeah, in my mind it's perfectly reasonable that a column used in an index is essentially STORED, regardless of whether STORED was specified. I'll cast a vote in favor of allowing any computed column, STORED or not, to be used in any index, primary or not. Then STORED just indicates whether you want to trade space for CPU on non-indexed computed columns (i.e., when the column is really expensive to compute).


docs/RFCS/20171214_computed_columns.md, line 124 at r1 (raw file):

Previously, justinj (Justin Jaffray) wrote…

I believe this is covered under whatever usual behaviour exists around dropping columns referenced by partitions, even if they aren't computed.

I don't think that logic exists yet :S. It should!


Comments from Reviewable


```protobuf
message ColumnDescriptor {
...
```

Contributor

what will its name be?

CREATE TABLE documents (
id STRING PRIMARY KEY AS payload->>'id' PERSISTED,
payload JSONB
)
Contributor

users might prefer using the postgres variant

CREATE TABLE documents (
   payload JSONB
   UNIQUE (payload->>'id')
 )

I see the JSON use case as not necessarily needing the computed column but a computed index.

Comment thread docs/RFCS/20171214_computed_columns.md Outdated

The grammar for a computed column is `column_name <type> AS <expr>
[PERSISTED]`, where `<expr>` is a pure function of non-computed columns in the
same table. `PERSISTED` indicates a materialized computed column and is
Contributor

Do we really need PERSISTED? I ask because it appears that we're only using it for indexes.

@jordanlewis
Member

Great RFC!

Will you support ALTER TABLE ALTER COLUMN on these kinds of columns? What will the limitations be?


Reviewed 1 of 1 files at r2.
Review status: all files reviewed at latest revision, 17 unresolved discussions, all commit checks successful.


docs/RFCS/20171214_computed_columns.md, line 32 at r1 (raw file):

Previously, benesch (Nikhil Benesch) wrote…

Didn't realize there was precedent for both STORED and PERSISTED. I'll also vote for STORED.

For this syntax bikeshedding to be successful, I think it's important to examine what other databases do.

MySQL: [ PERSISTED | STORED | VIRTUAL ]. PERSISTED and STORED are synonyms. VIRTUAL is the default.
SQL Server: PERSISTED vs nothing.
Oracle: No support for persisted computed columns.
Postgres: No support for computed columns at all.

Looks like PERSISTED is the winner for frequency, but I buy the argument about consistency - STORED makes sense since we say STORING elsewhere.

I think we should support STORED. I also think we should support VIRTUAL as the spelled-out version of no modifier.

(whoops, didn't realize you did this analysis at the bottom - please mention that up here!)


docs/RFCS/20171214_computed_columns.md, line 34 at r1 (raw file):

Previously, benesch (Nikhil Benesch) wrote…

Yeah, in my mind it's perfectly reasonable that a column used in an index is essentially STORED, regardless of whether STORED was specified. I'll cast a vote in favor of allowing any computed column, STORED or not, to be used in any index, primary or not. Then STORED just indicates whether you want to trade space for CPU on non-indexed computed columns (i.e., when the column is really expensive to compute).

nit: s/columns/column/.


docs/RFCS/20171214_computed_columns.md, line 67 at r1 (raw file):

Previously, benesch (Nikhil Benesch) wrote…

nit: consider wrapping

+1


docs/RFCS/20171214_computed_columns.md, line 89 at r1 (raw file):

Previously, benesch (Nikhil Benesch) wrote…

I think you should be able to specify DEFAULT. This is how MySQL works. Consider:

CREATE TABLE t (a INT, b INT AS (a + 1) STORED, c INT);
INSERT INTO t (a, c) VALUES (1, 2);  -- Great! Easy!
INSERT INTO t VALUES (1, ???, 2);  -- Eek!

Yes, implicit column lists are evil, but we should still support an escape hatch so you can specify the columns after a persisted column in a VALUES clause.

INSERT INTO t VALUES (1, DEFAULT, 2)

Agreed that DEFAULT would be a nice to have, but it seems okay to live without it, since you can't accidentally screw up. INSERT INTO t VALUES (1, 2) should fail (not enough values for table), and INSERT INTO t VALUES (1,2,3) should fail (can't insert into a computed column).


docs/RFCS/20171214_computed_columns.md, line 152 at r1 (raw file):

Previously, benesch (Nikhil Benesch) wrote…

In my tests they're required for both.

MySQL also supports the PERSISTED syntax.


docs/RFCS/20171214_computed_columns.md, line 14 at r2 (raw file):

supported in several major databases, including MySQL, Oracle, and SQL Server.
This RFC covers both the version of computed columns that are physically stored
alongside other columns (referred to here as *materialized*) and computed

Can we pick a terminology here? the industry standard seems to be stored/persisted vs virtual. I suggest we do the same.


docs/RFCS/20171214_computed_columns.md, line 37 at r2 (raw file):

requires that partitions are defined using columns that are a prefix of the
primary key. In the case of geo-partitioning, some applications will want to
collapse the number of possible values in this columns, to make certain classes

nit: "this columns" seems to be incorrect grammar.


docs/RFCS/20171214_computed_columns.md, line 93 at r2 (raw file):

# Reference-level explanation
## Detailed Design

Just a note - don't forget to update information_schema.columns with the fact that the columns are virtual/persisted!


docs/RFCS/20171214_computed_columns.md, line 99 at r2 (raw file):

Previously, vivekmenezes wrote…

what will its name be?

Seems like computed? It's at the end of the message.


docs/RFCS/20171214_computed_columns.md, line 102 at r2 (raw file):

  message Computed {
    expr string = 1;
    bool materialized = 2;

Please keep the terminology consistent as mentioned above. (and throughout - I think we should globally s/materialized/stored/ if we decided on stored as the syntax word)


Comments from Reviewable

@knz
Contributor

knz commented Dec 21, 2017

LGTM modulo resolution of the points already raised. 👍 on STORED/VIRTUAL vs PERSISTED/nothing.


Review status: all files reviewed at latest revision, 18 unresolved discussions, all commit checks successful.


docs/RFCS/20171214_computed_columns.md, line 141 at r2 (raw file):

  Oracle implement it slightly differently. Postgres does not have this
  feature.
- Computed indexes can also solve many of the same problems as computed

"Computed indexes solve the same problems as computed indexes" - tautology


Comments from Reviewable

@knz
Contributor

knz commented Dec 21, 2017

Reviewed 1 of 1 files at r2.
Review status: all files reviewed at latest revision, 18 unresolved discussions, all commit checks successful.


Comments from Reviewable

@justinj
Contributor Author

justinj commented Jan 5, 2018

Review status: all files reviewed at latest revision, 18 unresolved discussions, all commit checks successful.


docs/RFCS/20171214_computed_columns.md, line 32 at r1 (raw file):

Previously, jordanlewis (Jordan Lewis) wrote…

For this syntax bikeshedding to be successful, I think it's important to examine what other databases do.

MySQL: [ PERSISTED | STORED | VIRTUAL ]. PERSISTED and STORED are synonyms. VIRTUAL is the default.
SQL Server: PERSISTED vs nothing.
Oracle: No support for persisted computed columns.
Postgres: No support for computed columns at all.

Looks like PERSISTED is the winner for frequency, but I buy the argument about consistency - STORED makes sense since we say STORING elsewhere.

I think we should support STORED. I also think we should support VIRTUAL as the spelled-out version of no modifier.

(whoops, didn't realize you did this analysis at the bottom - please mention that up here!)

Done.


docs/RFCS/20171214_computed_columns.md, line 34 at r1 (raw file):

Previously, jordanlewis (Jordan Lewis) wrote…

nit: s/columns/column/.

We discussed a bit in person - I think we've settled on the position that since a column in the primary key will need to be "stored" in the primary index regardless of whether it's specified, we'd prefer to require them to be stored in that case. This matches MySQL's behaviour.


docs/RFCS/20171214_computed_columns.md, line 38 at r1 (raw file):

Previously, benesch (Nikhil Benesch) wrote…

nit: consider pulling this out into a reference-style link

Done.


docs/RFCS/20171214_computed_columns.md, line 67 at r1 (raw file):

Previously, jordanlewis (Jordan Lewis) wrote…

+1

Done.


docs/RFCS/20171214_computed_columns.md, line 89 at r1 (raw file):

Previously, jordanlewis (Jordan Lewis) wrote…

Agreed that DEFAULT would be a nice to have, but it seems okay to live without it, since you can't accidentally screw up. INSERT INTO t VALUES (1, 2) should fail (not enough values for table), and INSERT INTO t VALUES (1,2,3) should fail (can't insert into a computed column).

I actually missed this syntax! That's useful and I've included it now.


docs/RFCS/20171214_computed_columns.md, line 14 at r2 (raw file):

Previously, jordanlewis (Jordan Lewis) wrote…

Can we pick a terminology here? the industry standard seems to be stored/persisted vs virtual. I suggest we do the same.

Done.


docs/RFCS/20171214_computed_columns.md, line 29 at r2 (raw file):

Previously, vivekmenezes wrote…

Do we really need PERSISTED? I ask because it appears that we're only using it for indexes.

It's true that for our uses in indexes we could get away with only one of persisted/not persisted, and we're only planning to implement persisted for the time being (just because it seems easier to implement). This RFC covers both for completeness.


docs/RFCS/20171214_computed_columns.md, line 37 at r2 (raw file):

Previously, jordanlewis (Jordan Lewis) wrote…

nit: "this columns" seems to be incorrect grammar.

Done.


docs/RFCS/20171214_computed_columns.md, line 69 at r2 (raw file):

Previously, vivekmenezes wrote…

users might prefer using the postgres variant

CREATE TABLE documents (
   payload JSONB
   UNIQUE (payload->>'id')
 )

I see the JSON use case as not necessarily needing the computed column but a computed index.

Yeah, that's correct, but we decided that for JSON this was a simpler implementation at the moment, and has some overlap with the features needed by partitioning.


docs/RFCS/20171214_computed_columns.md, line 102 at r2 (raw file):

Previously, jordanlewis (Jordan Lewis) wrote…

Please keep the terminology consistent as mentioned above. (and throughout - I think we should globally s/materialized/stored/ if we decided on stored as the syntax word)

Done.


docs/RFCS/20171214_computed_columns.md, line 141 at r2 (raw file):

Previously, knz (kena) wrote…

"Computed indexes solve the same problems as computed indexes" - tautology

Fixed.


Comments from Reviewable

@jordanlewis
Member

Review status: 0 of 1 files reviewed at latest revision, 17 unresolved discussions, some commit checks pending.


docs/RFCS/20171214_computed_columns.md, line 15 at r3 (raw file):

This RFC covers both the version of computed columns that are physically stored
alongside other columns (referred to here as *stored*) and computed
columns which are only computed when read (*non-stored*).

what about virtual? and s/non-stored/virtual/ throughout so our terminology is consistent with our syntax.


docs/RFCS/20171214_computed_columns.md, line 28 at r3 (raw file):

The grammar for a computed column is `column_name <type> AS <expr>
<VIRTUAL|STORED>`, where `<expr>` is a pure function of non-computed columns in the

Shouldn't this be [VIRTUAL | STORED], since it's optional to specify these modifiers? I thought the default with no modifier would be VIRTUAL.


Comments from Reviewable

@jordanlewis
Member

LGTM


Reviewed 1 of 1 files at r3.
Review status: all files reviewed at latest revision, 17 unresolved discussions, some commit checks pending.


Comments from Reviewable

@vivekmenezes
Contributor

LGTM!

Release note: None
@justinj
Contributor Author

justinj commented Jan 9, 2018

Entering final comment period - targeting merge on Thursday!


Review status: all files reviewed at latest revision, 17 unresolved discussions, all commit checks successful.


docs/RFCS/20171214_computed_columns.md, line 15 at r3 (raw file):

Previously, jordanlewis (Jordan Lewis) wrote…

what about virtual? and s/non-stored/virtual/ throughout so our terminology is consistent with our syntax.

Done.


docs/RFCS/20171214_computed_columns.md, line 28 at r3 (raw file):

Previously, jordanlewis (Jordan Lewis) wrote…

Shouldn't this be [VIRTUAL | STORED], since it's optional to specify these modifiers? I thought the default with no modifier would be VIRTUAL.

Yes, good point!


Comments from Reviewable

@benesch
Contributor

benesch commented Jan 9, 2018

LGTM


Reviewed 1 of 1 files at r4.
Review status: all files reviewed at latest revision, 14 unresolved discussions, all commit checks successful.


Comments from Reviewable

@justinj
Contributor Author

justinj commented Jan 12, 2018

Thanks for the reviews everyone!

@justinj merged commit 79db1cb into cockroachdb:master Jan 12, 2018
@justinj deleted the cc-rfc branch January 12, 2018 00:21