rfc: SQL column groups by danhhz · Pull Request #6712 · cockroachdb/cockroach

danhhz · 2016-05-16T18:51:43Z

This replaces #3435. The first revision is exactly as #3435 left off to make diffs more obvious.

This change is

petermattis · 2016-05-16T19:47:47Z

Realize this is going to get some additional attention. Should add a discussion about implicitly grouping columns vs adding some additional syntax.

Previously, paperstreet (Dan Harrison) wrote…

[DO NOT MERGE] rfc: SQL column groups

This replaces #3435. The first revision is exactly as #3435 left off to make diffs more obvious.

Review status: 0 of 1 files reviewed at latest revision, 2 unresolved discussions, some commit checks failed.

docs/RFCS/sql_column_groups.md, line 34 [r1] (raw file):

  optional fixed32 checksum;
  optional ValueType tag;
}

This description of the Value proto is out of date.

docs/RFCS/sql_column_groups.md, line 116 [r1] (raw file):

existing system. The downside to this approach is that it appears to
be much more invasive than the column group change. Would every
consumer of the KV api need to change?

Another alternative would be to omit the sentinel key when there is non-NULL/non-primary-key column. For example, we could omit the sentinel key in the following table because we know there will always be one KV pair:

CREATE TABLE kv (
  k INT PRIMARY KEY,
  v INT NOT NULL
)

I mention this as an alternative that should be added to the doc, but it won't provide nearly as much benefit as the column groups approach.

Comments from Reviewable

danhhz · 2016-05-17T20:18:49Z

Okay reopening this for discussion. Revision 1 is exactly as Peter left off in #3435 and revision 2 contains my initial changes.

I think some of the original objections related to how much of this gets pushed down into kv may have clearer answers now that we've started work on distributed sql.

cc @andreimatei @bdarnell @dt @RaduBerinde @tschottdorf who all commented on the original PR.

petermattis · 2016-05-17T21:43:46Z

Review status: 0 of 1 files reviewed at latest revision, 5 unresolved discussions, some commit checks failed.

docs/RFCS/sql_column_groups.md, line 88 [r2] (raw file):

assignment will be done with a set of heuristics, but will be
override-able by the user at column creation using a SQL syntax
extension. Note that assigning every column to its own group would be

Any thoughts on what the syntax would look like?

docs/RFCS/sql_column_groups.md, line 148 [r2] (raw file):

* We've essentially developed our own encoding to use in the raw_bytes
field of the Value proto, which is giving up a lot of the benefits of
protos. Should we move away from the protobuf completely?

roachpb.Value also contains a Timestamp field. While the timestamp is associated with the value, it is encoded into the key down at the physical storage layer. Changing this would be irritating.

docs/RFCS/sql_column_groups.md, line 152 [r2] (raw file):

* Should column groups have names and metadata?

* Should changing the group of a column be available as an ALTER COLUMN command?

Eventually, yes. Doesn't seem necessary initially.

Comments from Reviewable

RaduBerinde · 2016-05-18T13:46:56Z

Review status: 0 of 1 files reviewed at latest revision, 9 unresolved discussions, some commit checks failed.

docs/RFCS/sql_column_groups.md, line 31 [r2] (raw file):

message Value {

This is not that useful now by itself, it would be helpful to mention the format of the raw_bytes buffer (CRC, tag, value).
Ah, I see you describe it below, maybe move/copy some of that description here.

docs/RFCS/sql_column_groups.md, line 74 [r2] (raw file):

the `<columnVal>`. This latter part will be replaced by a series of
`<columnID>/<columnVal>` pairs where `<columnID>` is varint encoded
and `<columnVal>` is encoded using the same "value" encoding as

This won't work with the same value encoding as before. For some types, the decoding function needs to know the length of the encoding, e.g. bytes/strings (Value.GetBytes), decimals (I think). We will need to add new tag types with new encodings for those types where we need to add an encoded length (e.g. new tag ValueType_BYTES_WITH_LEN).

docs/RFCS/sql_column_groups.md, line 89 [r2] (raw file):

override-able by the user at column creation using a SQL syntax
extension. Note that assigning every column to its own group would be
effectively equivalent to the current SQL to KV mapping.

Effectively or exactly equivalent? Specifically - when we have single-column groups, will we use the old-style value (without a ColumnID tag?) Seems like we would need to for compatibility.

docs/RFCS/sql_column_groups.md, line 97 [r2] (raw file):

scope for the first commit of this RFC. Instead, the previous behavior
of one group per column will be used if not user specified. In
parallel to the implementation, heuristics will be reopened for

I like this plan. We can then introduce conservative heuristics "one at a time", for example:

put one (preferably non-null) column in group 0 (sentinel key)
group at most a couple of values that are known to be small (like ints) and don't group any variable sized values.

docs/RFCS/sql_column_groups.md, line 148 [r2] (raw file):

Previously, petermattis (Peter Mattis) wrote…

roachpb.Value also contains a Timestamp field. While the timestamp is associated with the value, it is encoded into the key down at the physical storage layer. Changing this would be irritating.

I think regardless of this question, given that we will have multiple SQL values per `roachpb.Value`, we will probably want to move all the value encoding/decoding stuff to a new type, and have them operate on byte slices instead of `Value`s.

Comments from Reviewable

RaduBerinde · 2016-05-18T14:10:14Z

Review status: 0 of 1 files reviewed at latest revision, 9 unresolved discussions, some commit checks failed.

docs/RFCS/sql_column_groups.md, line 74 [r2] (raw file):

Previously, RaduBerinde wrote…

This won't work with the same value encoding as before. For some types, the decoding function needs to know the length of the encoding, e.g. bytes/strings (Value.GetBytes), decimals (I think). We will need to add new tag types with new encodings for those types where we need to add an encoded length (e.g. new tag ValueType_BYTES_WITH_LEN).

An easy way to solve the above is to use a high bit in the tag to indicate that a varint length follows the tag (and precedes the value). That will effectively create multiple new tag types but with a shared implementation for encoding/decoding the length.

Comments from Reviewable

bdarnell · 2016-05-18T16:25:15Z

Review status: 0 of 1 files reviewed at latest revision, 13 unresolved discussions, some commit checks failed.

docs/RFCS/sql_column_groups.md, line 88 [r2] (raw file):

Previously, petermattis (Peter Mattis) wrote…

Any thoughts on what the syntax would look like?

As a strawman I would suggest following the syntax for indexes: `create table t (a int primary key, b int, c string, primary group (a, b) , group (c)` (where `primary group` means group 0; other groups are not explicitly identified except by their members. Or we could use `constraint primary group (a, b)`, similar to naming indexes).

docs/RFCS/sql_column_groups.md, line 96 [r2] (raw file):

automatically assigning columns to groups will be considered out of
scope for the first commit of this RFC. Instead, the previous behavior
of one group per column will be used if not user specified. In

One column per group, and no columns in group 0, right?

docs/RFCS/sql_column_groups.md, line 105 [r2] (raw file):

* This will change the on disk format of SQL tables. To maintain the
beta guarantees, we will either have to support both the old and the
new version or provide a migration tool.

The migration is my biggest concern here. As long as we keep the old encoding when there is one column per group, it seems tractable, but migrating old data to the new scheme seems very difficult. If we're not committing to supporting the old format needs a more detailed plan for the migration.

docs/RFCS/sql_column_groups.md, line 132 [r2] (raw file):

consumer of the KV api need to change?

* Another alternative would be to omit the sentinel key when there is

s/is/is no/

docs/RFCS/sql_column_groups.md, line 150 [r2] (raw file):

protos. Should we move away from the protobuf completely?

* Should column groups have names and metadata?

It would be good to allow room for this. This would let us map some groups to an alternative storage model (perhaps using rocksdb column families). This was very useful with bigtable column families/locality groups.

docs/RFCS/sql_column_groups.md, line 152 [r2] (raw file):

Previously, petermattis (Peter Mattis) wrote…

Eventually, yes. Doesn't seem necessary initially.

If the column is to remain available during the move (and I think that's important), this seems very complicated. But I agree we'll eventually need it.

Comments from Reviewable

tbg · 2016-05-18T16:46:40Z

Reviewed 1 of 1 files at r1, 1 of 1 files at r2.
Review status: all files reviewed at latest revision, 13 unresolved discussions, some commit checks failed.

Comments from Reviewable

petermattis · 2016-05-18T17:01:20Z

Review status: all files reviewed at latest revision, 13 unresolved discussions, some commit checks failed.

docs/RFCS/sql_column_groups.md, line 88 [r2] (raw file):

Previously, bdarnell (Ben Darnell) wrote…

As a strawman I would suggest following the syntax for indexes: create table t (a int primary key, b int, c string, primary group (a, b) , group (c) (where primary group means group 0; other groups are not explicitly identified except by their members. Or we could use constraint primary group (a, b), similar to naming indexes).

And what about syntax for adding a column to a table? Seems like you'd want to "name" the group in that case. Similar to constraints, groups should have names, perhaps auto-generated if not explicitly specified.

docs/RFCS/sql_column_groups.md, line 105 [r2] (raw file):

Previously, bdarnell (Ben Darnell) wrote…

The migration is my biggest concern here. As long as we keep the old encoding when there is one column per group, it seems tractable, but migrating old data to the new scheme seems very difficult. If we're not committing to supporting the old format needs a more detailed plan for the migration.

I think we definitely need to support the old data scheme. I don't think doing so will be too onerous.

Comments from Reviewable

petermattis · 2016-05-18T17:04:50Z

Review status: all files reviewed at latest revision, 13 unresolved discussions, some commit checks failed.

docs/RFCS/sql_column_groups.md, line 152 [r2] (raw file):

Previously, bdarnell (Ben Darnell) wrote…

If the column is to remain available during the move (and I think that's important), this seems very complicated. But I agree we'll eventually need it.

Not sure if this is any more complicated than other schema changes, though that isn't to say it is easy. I think this would be done as:

Mark column as write-only in new group.
Write column to both old and new groups for queries, reading it only from old group.
Backfill column data to new group.
Remove column from old group, mark readable in new group.

Comments from Reviewable

RaduBerinde · 2016-05-18T17:10:20Z

Review status: all files reviewed at latest revision, 13 unresolved discussions, some commit checks failed.

docs/RFCS/sql_column_groups.md, line 152 [r2] (raw file):

Previously, petermattis (Peter Mattis) wrote…

Not sure if this is any more complicated than other schema changes, though that isn't to say it is easy. I think this would be done as:

Mark column as write-only in new group.

Write column to both old and new groups for queries, reading it only from old group.

Backfill column data to new group.

Remove column from old group, mark readable in new group.

I think it's fairly easy for the row decoder to expect a column value in either the old or the new group (and prefer the value in the new group). So we can read from wherever and write to new group, and move from old group to new group in the background.

Comments from Reviewable

danhhz · 2016-05-18T21:26:34Z

Thanks for the reviews! This is ready for another look

Previously, paperstreet (Daniel Harrison) wrote…

Okay reopening this for discussion. Revision 1 is exactly as Peter left off in #3435 and revision 2 contains my initial changes.

I think some of the original objections related to how much of this gets pushed down into kv may have clearer answers now that we've started work on distributed sql.

cc @andreimatei @bdarnell @dt @RaduBerinde @tschottdorf who all commented on the original PR.

Review status: 0 of 1 files reviewed at latest revision, 13 unresolved discussions, some commit checks pending.

docs/RFCS/sql_column_groups.md, line 34 [r1] (raw file):

Previously, petermattis (Peter Mattis) wrote…

This description of the Value proto is out of date.

Done.

docs/RFCS/sql_column_groups.md, line 116 [r1] (raw file):

Previously, petermattis (Peter Mattis) wrote…

Another alternative would be to omit the sentinel key when there is non-NULL/non-primary-key column. For example, we could omit the sentinel key in the following table because we know there will always be one KV pair:
CREATE TABLE kv (
  k INT PRIMARY KEY,
  v INT NOT NULL
)
I mention this as an alternative that should be added to the doc, but it won't provide nearly as much benefit as the column groups approach.

Done.

docs/RFCS/sql_column_groups.md, line 31 [r2] (raw file):

Previously, RaduBerinde wrote…

This is not that useful now by itself, it would be helpful to mention the format of the raw_bytes buffer (CRC, tag, value).
Ah, I see you describe it below, maybe move/copy some of that description here.

Done.

docs/RFCS/sql_column_groups.md, line 74 [r2] (raw file):

Previously, RaduBerinde wrote…

An easy way to solve the above is to use a high bit in the tag to indicate that a varint length follows the tag (and precedes the value). That will effectively create multiple new tag types but with a shared implementation for encoding/decoding the length.

Ah, right. I was thinking that you'd done this already, but just looked and the value encodings are a TODO in EncDatum. Done

docs/RFCS/sql_column_groups.md, line 88 [r2] (raw file):

Previously, petermattis (Peter Mattis) wrote…

And what about syntax for adding a column to a table? Seems like you'd want to "name" the group in that case. Similar to constraints, groups should have names, perhaps auto-generated if not explicitly specified.

I agree that groups should have names, autogenerated if omitted. Done.

docs/RFCS/sql_column_groups.md, line 89 [r2] (raw file):

Previously, RaduBerinde wrote…

Effectively or exactly equivalent? Specifically - when we have single-column groups, will we use the old-style value (without a ColumnID tag?) Seems like we would need to for compatibility.

Done.

docs/RFCS/sql_column_groups.md, line 96 [r2] (raw file):

Previously, bdarnell (Ben Darnell) wrote…

One column per group, and no columns in group 0, right?

Interesting question. For existing tables, yes. For new tables, it seems okay to use group 0 for the first non-primary key, fixed size column?

docs/RFCS/sql_column_groups.md, line 105 [r2] (raw file):

Previously, petermattis (Peter Mattis) wrote…

I think we definitely need to support the old data scheme. I don't think doing so will be too onerous.

I'm convinced. Updated above to detail how this works.

docs/RFCS/sql_column_groups.md, line 132 [r2] (raw file):

Previously, bdarnell (Ben Darnell) wrote…

s/is/is no/

Done.

docs/RFCS/sql_column_groups.md, line 148 [r2] (raw file):

Previously, RaduBerinde wrote…

I think regardless of this question, given that we will have multiple SQL values per roachpb.Value, we will probably want to move all the value encoding/decoding stuff to a new type, and have them operate on byte slices instead of Values.

So I brought this up because it came up in the discussion of the last proposal. I understood the question to be "why are the values we put in rocksdb a serialized proto with one bytes field set (as opposed to just the bytes)?"

It only seemed relevant if we were going to require a migration tool anyway. Since it looks like we can do colgroups in a reasonable, backwards compatible way, this question seems to be to fall out of scope. Removed.

docs/RFCS/sql_column_groups.md, line 150 [r2] (raw file):

Previously, bdarnell (Ben Darnell) wrote…

It would be good to allow room for this. This would let us map some groups to an alternative storage model (perhaps using rocksdb column families). This was very useful with bigtable column families/locality groups.

Done.

docs/RFCS/sql_column_groups.md, line 152 [r2] (raw file):

Previously, RaduBerinde wrote…

I think it's fairly easy for the row decoder to expect a column value in either the old or the new group (and prefer the value in the new group). So we can read from wherever and write to new group, and move from old group to new group in the background.

Done.

Comments from Reviewable

petermattis · 2016-05-18T21:30:02Z

Review status: 0 of 1 files reviewed at latest revision, 11 unresolved discussions, some commit checks pending.

docs/RFCS/sql_column_groups.md, line 148 [r2] (raw file):

Previously, paperstreet (Daniel Harrison) wrote…

So I brought this up because it came up in the discussion of the last proposal. I understood the question to be "why are the values we put in rocksdb a serialized proto with one bytes field set (as opposed to just the bytes)?"

It only seemed relevant if we were going to require a migration tool anyway. Since it looks like we can do colgroups in a reasonable, backwards compatible way, this question seems to be to fall out of scope. Removed.

Note that we don't actually serialize the `Value` proto for storage into RocksDB. We only store the `Value.RawBytes` field. The `Value` proto is used for passing value+timestamp between nodes.

Comments from Reviewable

bdarnell · 2016-05-18T21:32:30Z

Review status: 0 of 1 files reviewed at latest revision, 11 unresolved discussions, some commit checks pending.

docs/RFCS/sql_column_groups.md, line 96 [r2] (raw file):

Previously, paperstreet (Daniel Harrison) wrote…

Interesting question. For existing tables, yes. For new tables, it seems okay to use group 0 for the first non-primary key, fixed size column?

Yeah, for new tables we should be able to put one column in group 0.

Comments from Reviewable

danhhz · 2016-05-18T21:35:49Z

Review status: 0 of 1 files reviewed at latest revision, 11 unresolved discussions, some commit checks pending.

docs/RFCS/sql_column_groups.md, line 148 [r2] (raw file):

Previously, petermattis (Peter Mattis) wrote…

Note that we don't actually serialize the Value proto for storage into RocksDB. We only store the Value.RawBytes field. The Value proto is used for passing value+timestamp between nodes.

Ah, I missed that. 👍

Comments from Reviewable

petermattis · 2016-05-18T21:36:34Z

Previously, paperstreet (Daniel Harrison) wrote…

Thanks for the reviews! This is ready for another look

Review status: 0 of 1 files reviewed at latest revision, 13 unresolved discussions, some commit checks pending.

docs/RFCS/sql_column_groups.md, line 105 [r3] (raw file):

## SQL Syntax Examples

`CREATE TABLE t (a INT PRIMARY KEY, b INT, c STRING, PRIMARY GROUP pri (a, b), GROUP (c)`

This would be more readable as a multi-line statement.

What does PRIMARY do? Force that group to have ID 0? Do we need that?

docs/RFCS/sql_column_groups.md, line 168 [r3] (raw file):

cluster upgrade. Is there a reasonable mechanism for this? Or should
it be hidden as a TableDescriptor transformation in the
`getAliasedTableLease` call?

There is no current mechanism for doing such backfilling when a cluster upgrades. Performing the transformation on-demand when a table descriptor is loaded seems easier.

Comments from Reviewable

RaduBerinde · 2016-05-18T21:38:46Z

Previously, petermattis (Peter Mattis) wrote…

:lgtm:

Review status: 0 of 1 files reviewed at latest revision, 13 unresolved discussions, some commit checks pending.

docs/RFCS/sql_column_groups.md, line 92 [r3] (raw file):

column values will not be present in the value.

For backwards compatibility, groups with only a single column will use

One idea here that might be useful - perhaps the columnID can be preceded by a special tag byte. That way we can decode any value without knowing if it's a single-column or a group. It might make decoding more robust and easier to decode values without context (e.g. when dumping KVs for debugging). It would also allow omitting the columnID for a column in a group (e.g. one that we know will be populated often, like a not null).

Nothing groundbreaking so it doesn't need to go into the RFC but maybe keep it in mind.

Comments from Reviewable

bdarnell · 2016-05-18T21:40:57Z

Review status: 0 of 1 files reviewed at latest revision, 13 unresolved discussions, all commit checks successful.

docs/RFCS/sql_column_groups.md, line 105 [r3] (raw file):

Previously, petermattis (Peter Mattis) wrote…

This would be more readable as a multi-line statement.

What does PRIMARY do? Force that group to have ID 0? Do we need that?

My thinking when I proposed that was that `primary group` would be group 0 and no other groups would be named. But now if groups are named then we should probably just reserve the name "primary" for group 0 instead of a special keyword.

Comments from Reviewable

danhhz · 2016-05-19T15:31:44Z

Review status: 0 of 1 files reviewed at latest revision, 13 unresolved discussions, some commit checks pending.

docs/RFCS/sql_column_groups.md, line 105 [r3] (raw file):

Previously, bdarnell (Ben Darnell) wrote…

My thinking when I proposed that was that primary group would be group 0 and no other groups would be named. But now if groups are named then we should probably just reserve the name "primary" for group 0 instead of a special keyword.

Done.

Comments from Reviewable

danhhz force-pushed the rfc_colgroups branch from 8b64818 to 98d31f7 Compare May 17, 2016 20:13

danhhz changed the title ~~[DO NOT MERGE] rfc: SQL column groups~~ rfc: SQL column groups May 17, 2016

danhhz force-pushed the rfc_colgroups branch from 98d31f7 to f47a489 Compare May 18, 2016 21:25

danhhz force-pushed the rfc_colgroups branch from f47a489 to c71994b Compare May 19, 2016 15:31

RFC: SQL column groups.

b0299d0

danhhz force-pushed the rfc_colgroups branch from c71994b to b0299d0 Compare May 19, 2016 15:31

danhhz merged commit 663d964 into cockroachdb:master May 19, 2016

danhhz deleted the rfc_colgroups branch May 19, 2016 15:48

danhhz mentioned this pull request May 19, 2016

RFC: SQL column groups. #3435

Closed

Conversation

danhhz commented May 16, 2016 • edited by petermattis Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

petermattis commented May 16, 2016

[DO NOT MERGE] rfc: SQL column groups

Uh oh!

danhhz commented May 17, 2016

Uh oh!

petermattis commented May 17, 2016

Uh oh!

RaduBerinde commented May 18, 2016

Uh oh!

RaduBerinde commented May 18, 2016

Uh oh!

bdarnell commented May 18, 2016

Uh oh!

tbg commented May 18, 2016

Uh oh!

petermattis commented May 18, 2016

Uh oh!

petermattis commented May 18, 2016

Uh oh!

RaduBerinde commented May 18, 2016

Uh oh!

danhhz commented May 18, 2016

Uh oh!

petermattis commented May 18, 2016

Uh oh!

bdarnell commented May 18, 2016

Uh oh!

danhhz commented May 18, 2016

Uh oh!

petermattis commented May 18, 2016

Uh oh!

RaduBerinde commented May 18, 2016

Uh oh!

bdarnell commented May 18, 2016

Uh oh!

danhhz commented May 19, 2016

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

danhhz commented May 16, 2016 •

edited by petermattis

Loading