Skip to content

rfc: SQL column groups#6712

Merged
danhhz merged 1 commit intocockroachdb:masterfrom
danhhz:rfc_colgroups
May 19, 2016
Merged

rfc: SQL column groups#6712
danhhz merged 1 commit intocockroachdb:masterfrom
danhhz:rfc_colgroups

Conversation

@danhhz
Copy link
Copy Markdown
Contributor

@danhhz danhhz commented May 16, 2016

This replaces #3435. The first revision is exactly as #3435 left off to make diffs more obvious.


This change is Reviewable

@petermattis
Copy link
Copy Markdown
Collaborator

:lgtm:

Realize this is going to get some additional attention. Should add a discussion about implicitly grouping columns vs adding some additional syntax.

Previously, paperstreet (Dan Harrison) wrote…

[DO NOT MERGE] rfc: SQL column groups

This replaces #3435. The first revision is exactly as #3435 left off to make diffs more obvious.


Review status: 0 of 1 files reviewed at latest revision, 2 unresolved discussions, some commit checks failed.


docs/RFCS/sql_column_groups.md, line 34 [r1] (raw file):

  optional fixed32 checksum;
  optional ValueType tag;
}

This description of the Value proto is out of date.


docs/RFCS/sql_column_groups.md, line 116 [r1] (raw file):

existing system. The downside to this approach is that it appears to
be much more invasive than the column group change. Would every
consumer of the KV api need to change?

Another alternative would be to omit the sentinel key when there is non-NULL/non-primary-key column. For example, we could omit the sentinel key in the following table because we know there will always be one KV pair:

CREATE TABLE kv (
  k INT PRIMARY KEY,
  v INT NOT NULL
)

I mention this as an alternative that should be added to the doc, but it won't provide nearly as much benefit as the column groups approach.


Comments from Reviewable

@danhhz
Copy link
Copy Markdown
Contributor Author

danhhz commented May 17, 2016

Okay reopening this for discussion. Revision 1 is exactly as Peter left off in #3435 and revision 2 contains my initial changes.

I think some of the original objections related to how much of this gets pushed down into kv may have clearer answers now that we've started work on distributed sql.

cc @andreimatei @bdarnell @dt @RaduBerinde @tschottdorf who all commented on the original PR.

@danhhz danhhz changed the title [DO NOT MERGE] rfc: SQL column groups rfc: SQL column groups May 17, 2016
@petermattis
Copy link
Copy Markdown
Collaborator

Review status: 0 of 1 files reviewed at latest revision, 5 unresolved discussions, some commit checks failed.


docs/RFCS/sql_column_groups.md, line 88 [r2] (raw file):

assignment will be done with a set of heuristics, but will be
override-able by the user at column creation using a SQL syntax
extension. Note that assigning every column to its own group would be

Any thoughts on what the syntax would look like?


docs/RFCS/sql_column_groups.md, line 148 [r2] (raw file):

* We've essentially developed our own encoding to use in the raw_bytes
field of the Value proto, which is giving up a lot of the benefits of
protos. Should we move away from the protobuf completely?

roachpb.Value also contains a Timestamp field. While the timestamp is associated with the value, it is encoded into the key down at the physical storage layer. Changing this would be irritating.


docs/RFCS/sql_column_groups.md, line 152 [r2] (raw file):

* Should column groups have names and metadata?

* Should changing the group of a column be available as an ALTER COLUMN command?

Eventually, yes. Doesn't seem necessary initially.


Comments from Reviewable

@RaduBerinde
Copy link
Copy Markdown
Member

Review status: 0 of 1 files reviewed at latest revision, 9 unresolved discussions, some commit checks failed.


docs/RFCS/sql_column_groups.md, line 31 [r2] (raw file):

message Value {

This is not that useful now by itself, it would be helpful to mention the format of the raw_bytes buffer (CRC, tag, value).
Ah, I see you describe it below, maybe move/copy some of that description here.


docs/RFCS/sql_column_groups.md, line 74 [r2] (raw file):

the `<columnVal>`. This latter part will be replaced by a series of
`<columnID>/<columnVal>` pairs where `<columnID>` is varint encoded
and `<columnVal>` is encoded using the same "value" encoding as

This won't work with the same value encoding as before. For some types, the decoding function needs to know the length of the encoding, e.g. bytes/strings (Value.GetBytes), decimals (I think). We will need to add new tag types with new encodings for those types where we need to add an encoded length (e.g. new tag ValueType_BYTES_WITH_LEN).


docs/RFCS/sql_column_groups.md, line 89 [r2] (raw file):

override-able by the user at column creation using a SQL syntax
extension. Note that assigning every column to its own group would be
effectively equivalent to the current SQL to KV mapping.

Effectively or exactly equivalent? Specifically - when we have single-column groups, will we use the old-style value (without a ColumnID tag?) Seems like we would need to for compatibility.


docs/RFCS/sql_column_groups.md, line 97 [r2] (raw file):

scope for the first commit of this RFC. Instead, the previous behavior
of one group per column will be used if not user specified. In
parallel to the implementation, heuristics will be reopened for

I like this plan. We can then introduce conservative heuristics "one at a time", for example:

  • put one (preferably non-null) column in group 0 (sentinel key)
  • group at most a couple of values that are known to be small (like ints) and don't group any variable sized values.

docs/RFCS/sql_column_groups.md, line 148 [r2] (raw file):

Previously, petermattis (Peter Mattis) wrote…

roachpb.Value also contains a Timestamp field. While the timestamp is associated with the value, it is encoded into the key down at the physical storage layer. Changing this would be irritating.

I think regardless of this question, given that we will have multiple SQL values per `roachpb.Value`, we will probably want to move all the value encoding/decoding stuff to a new type, and have them operate on byte slices instead of `Value`s.

Comments from Reviewable

@RaduBerinde
Copy link
Copy Markdown
Member

Review status: 0 of 1 files reviewed at latest revision, 9 unresolved discussions, some commit checks failed.


docs/RFCS/sql_column_groups.md, line 74 [r2] (raw file):

Previously, RaduBerinde wrote…

This won't work with the same value encoding as before. For some types, the decoding function needs to know the length of the encoding, e.g. bytes/strings (Value.GetBytes), decimals (I think). We will need to add new tag types with new encodings for those types where we need to add an encoded length (e.g. new tag ValueType_BYTES_WITH_LEN).

An easy way to solve the above is to use a high bit in the tag to indicate that a varint length follows the tag (and precedes the value). That will effectively create multiple new tag types but with a shared implementation for encoding/decoding the length.

Comments from Reviewable

@bdarnell
Copy link
Copy Markdown
Contributor

Review status: 0 of 1 files reviewed at latest revision, 13 unresolved discussions, some commit checks failed.


docs/RFCS/sql_column_groups.md, line 88 [r2] (raw file):

Previously, petermattis (Peter Mattis) wrote…

Any thoughts on what the syntax would look like?

As a strawman I would suggest following the syntax for indexes: `create table t (a int primary key, b int, c string, primary group (a, b) , group (c)` (where `primary group` means group 0; other groups are not explicitly identified except by their members. Or we could use `constraint primary group (a, b)`, similar to naming indexes).

docs/RFCS/sql_column_groups.md, line 96 [r2] (raw file):

automatically assigning columns to groups will be considered out of
scope for the first commit of this RFC. Instead, the previous behavior
of one group per column will be used if not user specified. In

One column per group, and no columns in group 0, right?


docs/RFCS/sql_column_groups.md, line 105 [r2] (raw file):

* This will change the on disk format of SQL tables. To maintain the
beta guarantees, we will either have to support both the old and the
new version or provide a migration tool.

The migration is my biggest concern here. As long as we keep the old encoding when there is one column per group, it seems tractable, but migrating old data to the new scheme seems very difficult. If we're not committing to supporting the old format needs a more detailed plan for the migration.


docs/RFCS/sql_column_groups.md, line 132 [r2] (raw file):

consumer of the KV api need to change?

* Another alternative would be to omit the sentinel key when there is

s/is/is no/


docs/RFCS/sql_column_groups.md, line 150 [r2] (raw file):

protos. Should we move away from the protobuf completely?

* Should column groups have names and metadata?

It would be good to allow room for this. This would let us map some groups to an alternative storage model (perhaps using rocksdb column families). This was very useful with bigtable column families/locality groups.


docs/RFCS/sql_column_groups.md, line 152 [r2] (raw file):

Previously, petermattis (Peter Mattis) wrote…

Eventually, yes. Doesn't seem necessary initially.

If the column is to remain available during the move (and I think that's important), this seems very complicated. But I agree we'll eventually need it.

Comments from Reviewable

@tbg
Copy link
Copy Markdown
Member

tbg commented May 18, 2016

Reviewed 1 of 1 files at r1, 1 of 1 files at r2.
Review status: all files reviewed at latest revision, 13 unresolved discussions, some commit checks failed.


Comments from Reviewable

@petermattis
Copy link
Copy Markdown
Collaborator

Review status: all files reviewed at latest revision, 13 unresolved discussions, some commit checks failed.


docs/RFCS/sql_column_groups.md, line 88 [r2] (raw file):

Previously, bdarnell (Ben Darnell) wrote…

As a strawman I would suggest following the syntax for indexes: create table t (a int primary key, b int, c string, primary group (a, b) , group (c) (where primary group means group 0; other groups are not explicitly identified except by their members. Or we could use constraint primary group (a, b), similar to naming indexes).

And what about syntax for adding a column to a table? Seems like you'd want to "name" the group in that case. Similar to constraints, groups should have names, perhaps auto-generated if not explicitly specified.

docs/RFCS/sql_column_groups.md, line 105 [r2] (raw file):

Previously, bdarnell (Ben Darnell) wrote…

The migration is my biggest concern here. As long as we keep the old encoding when there is one column per group, it seems tractable, but migrating old data to the new scheme seems very difficult. If we're not committing to supporting the old format needs a more detailed plan for the migration.

I think we definitely need to support the old data scheme. I don't think doing so will be too onerous.

Comments from Reviewable

@petermattis
Copy link
Copy Markdown
Collaborator

Review status: all files reviewed at latest revision, 13 unresolved discussions, some commit checks failed.


docs/RFCS/sql_column_groups.md, line 152 [r2] (raw file):

Previously, bdarnell (Ben Darnell) wrote…

If the column is to remain available during the move (and I think that's important), this seems very complicated. But I agree we'll eventually need it.

Not sure if this is any more complicated than other schema changes, though that isn't to say it is easy. I think this would be done as:
  • Mark column as write-only in new group.
  • Write column to both old and new groups for queries, reading it only from old group.
  • Backfill column data to new group.
  • Remove column from old group, mark readable in new group.

Comments from Reviewable

@RaduBerinde
Copy link
Copy Markdown
Member

Review status: all files reviewed at latest revision, 13 unresolved discussions, some commit checks failed.


docs/RFCS/sql_column_groups.md, line 152 [r2] (raw file):

Previously, petermattis (Peter Mattis) wrote…

Not sure if this is any more complicated than other schema changes, though that isn't to say it is easy. I think this would be done as:

  • Mark column as write-only in new group.
  • Write column to both old and new groups for queries, reading it only from old group.
  • Backfill column data to new group.
  • Remove column from old group, mark readable in new group.
I think it's fairly easy for the row decoder to expect a column value in either the old or the new group (and prefer the value in the new group). So we can read from wherever and write to new group, and move from old group to new group in the background.

Comments from Reviewable

@danhhz
Copy link
Copy Markdown
Contributor Author

danhhz commented May 18, 2016

Thanks for the reviews! This is ready for another look

Previously, paperstreet (Daniel Harrison) wrote…

Okay reopening this for discussion. Revision 1 is exactly as Peter left off in #3435 and revision 2 contains my initial changes.

I think some of the original objections related to how much of this gets pushed down into kv may have clearer answers now that we've started work on distributed sql.

cc @andreimatei @bdarnell @dt @RaduBerinde @tschottdorf who all commented on the original PR.


Review status: 0 of 1 files reviewed at latest revision, 13 unresolved discussions, some commit checks pending.


docs/RFCS/sql_column_groups.md, line 34 [r1] (raw file):

Previously, petermattis (Peter Mattis) wrote…

This description of the Value proto is out of date.

Done.

docs/RFCS/sql_column_groups.md, line 116 [r1] (raw file):

Previously, petermattis (Peter Mattis) wrote…

Another alternative would be to omit the sentinel key when there is non-NULL/non-primary-key column. For example, we could omit the sentinel key in the following table because we know there will always be one KV pair:

CREATE TABLE kv (
  k INT PRIMARY KEY,
  v INT NOT NULL
)

I mention this as an alternative that should be added to the doc, but it won't provide nearly as much benefit as the column groups approach.

Done.

docs/RFCS/sql_column_groups.md, line 31 [r2] (raw file):

Previously, RaduBerinde wrote…

This is not that useful now by itself, it would be helpful to mention the format of the raw_bytes buffer (CRC, tag, value).
Ah, I see you describe it below, maybe move/copy some of that description here.

Done.

docs/RFCS/sql_column_groups.md, line 74 [r2] (raw file):

Previously, RaduBerinde wrote…

An easy way to solve the above is to use a high bit in the tag to indicate that a varint length follows the tag (and precedes the value). That will effectively create multiple new tag types but with a shared implementation for encoding/decoding the length.

Ah, right. I was thinking that you'd done this already, but just looked and the value encodings are a TODO in EncDatum. Done

docs/RFCS/sql_column_groups.md, line 88 [r2] (raw file):

Previously, petermattis (Peter Mattis) wrote…

And what about syntax for adding a column to a table? Seems like you'd want to "name" the group in that case. Similar to constraints, groups should have names, perhaps auto-generated if not explicitly specified.

I agree that groups should have names, autogenerated if omitted. Done.

docs/RFCS/sql_column_groups.md, line 89 [r2] (raw file):

Previously, RaduBerinde wrote…

Effectively or exactly equivalent? Specifically - when we have single-column groups, will we use the old-style value (without a ColumnID tag?) Seems like we would need to for compatibility.

Done.

docs/RFCS/sql_column_groups.md, line 96 [r2] (raw file):

Previously, bdarnell (Ben Darnell) wrote…

One column per group, and no columns in group 0, right?

Interesting question. For existing tables, yes. For new tables, it seems okay to use group 0 for the first non-primary key, fixed size column?

docs/RFCS/sql_column_groups.md, line 105 [r2] (raw file):

Previously, petermattis (Peter Mattis) wrote…

I think we definitely need to support the old data scheme. I don't think doing so will be too onerous.

I'm convinced. Updated above to detail how this works.

docs/RFCS/sql_column_groups.md, line 132 [r2] (raw file):

Previously, bdarnell (Ben Darnell) wrote…

s/is/is no/

Done.

docs/RFCS/sql_column_groups.md, line 148 [r2] (raw file):

Previously, RaduBerinde wrote…

I think regardless of this question, given that we will have multiple SQL values per roachpb.Value, we will probably want to move all the value encoding/decoding stuff to a new type, and have them operate on byte slices instead of Values.

So I brought this up because it came up in the discussion of the last proposal. I understood the question to be "why are the values we put in rocksdb a serialized proto with one bytes field set (as opposed to just the bytes)?"

It only seemed relevant if we were going to require a migration tool anyway. Since it looks like we can do colgroups in a reasonable, backwards compatible way, this question seems to be to fall out of scope. Removed.


docs/RFCS/sql_column_groups.md, line 150 [r2] (raw file):

Previously, bdarnell (Ben Darnell) wrote…

It would be good to allow room for this. This would let us map some groups to an alternative storage model (perhaps using rocksdb column families). This was very useful with bigtable column families/locality groups.

Done.

docs/RFCS/sql_column_groups.md, line 152 [r2] (raw file):

Previously, RaduBerinde wrote…

I think it's fairly easy for the row decoder to expect a column value in either the old or the new group (and prefer the value in the new group). So we can read from wherever and write to new group, and move from old group to new group in the background.

Done.

Comments from Reviewable

@petermattis
Copy link
Copy Markdown
Collaborator

Review status: 0 of 1 files reviewed at latest revision, 11 unresolved discussions, some commit checks pending.


docs/RFCS/sql_column_groups.md, line 148 [r2] (raw file):

Previously, paperstreet (Daniel Harrison) wrote…

So I brought this up because it came up in the discussion of the last proposal. I understood the question to be "why are the values we put in rocksdb a serialized proto with one bytes field set (as opposed to just the bytes)?"

It only seemed relevant if we were going to require a migration tool anyway. Since it looks like we can do colgroups in a reasonable, backwards compatible way, this question seems to be to fall out of scope. Removed.

Note that we don't actually serialize the `Value` proto for storage into RocksDB. We only store the `Value.RawBytes` field. The `Value` proto is used for passing value+timestamp between nodes.

Comments from Reviewable

@bdarnell
Copy link
Copy Markdown
Contributor

Review status: 0 of 1 files reviewed at latest revision, 11 unresolved discussions, some commit checks pending.


docs/RFCS/sql_column_groups.md, line 96 [r2] (raw file):

Previously, paperstreet (Daniel Harrison) wrote…

Interesting question. For existing tables, yes. For new tables, it seems okay to use group 0 for the first non-primary key, fixed size column?

Yeah, for new tables we should be able to put one column in group 0.

Comments from Reviewable

@danhhz
Copy link
Copy Markdown
Contributor Author

danhhz commented May 18, 2016

Review status: 0 of 1 files reviewed at latest revision, 11 unresolved discussions, some commit checks pending.


docs/RFCS/sql_column_groups.md, line 148 [r2] (raw file):

Previously, petermattis (Peter Mattis) wrote…

Note that we don't actually serialize the Value proto for storage into RocksDB. We only store the Value.RawBytes field. The Value proto is used for passing value+timestamp between nodes.

Ah, I missed that. 👍

Comments from Reviewable

@petermattis
Copy link
Copy Markdown
Collaborator

:lgtm:

Previously, paperstreet (Daniel Harrison) wrote…

Thanks for the reviews! This is ready for another look


Review status: 0 of 1 files reviewed at latest revision, 13 unresolved discussions, some commit checks pending.


docs/RFCS/sql_column_groups.md, line 105 [r3] (raw file):

## SQL Syntax Examples

`CREATE TABLE t (a INT PRIMARY KEY, b INT, c STRING, PRIMARY GROUP pri (a, b), GROUP (c)`

This would be more readable as a multi-line statement.

What does PRIMARY do? Force that group to have ID 0? Do we need that?


docs/RFCS/sql_column_groups.md, line 168 [r3] (raw file):

cluster upgrade. Is there a reasonable mechanism for this? Or should
it be hidden as a TableDescriptor transformation in the
`getAliasedTableLease` call?

There is no current mechanism for doing such backfilling when a cluster upgrades. Performing the transformation on-demand when a table descriptor is loaded seems easier.


Comments from Reviewable

@RaduBerinde
Copy link
Copy Markdown
Member

:lgtm:

Previously, petermattis (Peter Mattis) wrote…

:lgtm:


Review status: 0 of 1 files reviewed at latest revision, 13 unresolved discussions, some commit checks pending.


docs/RFCS/sql_column_groups.md, line 92 [r3] (raw file):

column values will not be present in the value.

For backwards compatibility, groups with only a single column will use

One idea here that might be useful - perhaps the columnID can be preceded by a special tag byte. That way we can decode any value without knowing if it's a single-column or a group. It might make decoding more robust and easier to decode values without context (e.g. when dumping KVs for debugging). It would also allow omitting the columnID for a column in a group (e.g. one that we know will be populated often, like a not null).

Nothing groundbreaking so it doesn't need to go into the RFC but maybe keep it in mind.


Comments from Reviewable

@bdarnell
Copy link
Copy Markdown
Contributor

Review status: 0 of 1 files reviewed at latest revision, 13 unresolved discussions, all commit checks successful.


docs/RFCS/sql_column_groups.md, line 105 [r3] (raw file):

Previously, petermattis (Peter Mattis) wrote…

This would be more readable as a multi-line statement.

What does PRIMARY do? Force that group to have ID 0? Do we need that?

My thinking when I proposed that was that `primary group` would be group 0 and no other groups would be named. But now if groups are named then we should probably just reserve the name "primary" for group 0 instead of a special keyword.

Comments from Reviewable

@danhhz
Copy link
Copy Markdown
Contributor Author

danhhz commented May 19, 2016

Review status: 0 of 1 files reviewed at latest revision, 13 unresolved discussions, some commit checks pending.


docs/RFCS/sql_column_groups.md, line 105 [r3] (raw file):

Previously, bdarnell (Ben Darnell) wrote…

My thinking when I proposed that was that primary group would be group 0 and no other groups would be named. But now if groups are named then we should probably just reserve the name "primary" for group 0 instead of a special keyword.

Done.

Comments from Reviewable

@danhhz danhhz merged commit 663d964 into cockroachdb:master May 19, 2016
@danhhz danhhz deleted the rfc_colgroups branch May 19, 2016 15:48
@danhhz danhhz mentioned this pull request May 19, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants