rfcs: add an RFC for enum types#47070
Conversation
otan
left a comment
cursory glance. the enum encoding from both postgres and us seems big brain.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @jordanlewis, @lucy-zhang, and @rohany)
docs/RFCS/20200331_enums.md, line 33 at r1 (raw file):
To implement enum types in CockroachDB, we have to touch many layers of the system. In particlar, we need to introduce a way of storing
nit: particular
docs/RFCS/20200331_enums.md, line 63 at r1 (raw file):
The SQL Features team
that's news to me, haha!
docs/RFCS/20200331_enums.md, line 112 at r1 (raw file):
[online schema changes RFC](https://github.com/cockroachdb/cockroach/blob/master/docs/RFCS/20151014_online_schema_change.md). We propose a similar state progression to adding new elements to an enum type. When a new value is added
this may be nicer as dot points / numbering (or picture if you're bothered)
docs/RFCS/20200331_enums.md, line 192 at r1 (raw file):
e 2/1/ d 2/2/ c 3/
is there a case where 0xFF becomes full? i think we can extend that by infinitely adding 0xFF but wanted to make sure.
similar question for 0x00.
rohany
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @jordanlewis, @lucy-zhang, and @otan)
docs/RFCS/20200331_enums.md, line 63 at r1 (raw file):
Previously, otan (Oliver Tan) wrote…
The SQL Features team
that's news to me, haha!
wops
docs/RFCS/20200331_enums.md, line 112 at r1 (raw file):
Previously, otan (Oliver Tan) wrote…
this may be nicer as dot points / numbering (or picture if you're bothered)
Done.
docs/RFCS/20200331_enums.md, line 192 at r1 (raw file):
Previously, otan (Oliver Tan) wrote…
is there a case where 0xFF becomes full? i think we can extend that by infinitely adding 0xFF but wanted to make sure.
similar question for 0x00.
We can always keep adding 0xFF to get larger byte strings -- same idea as adding 0x00 (except in the 0x00 case we have to make sure that we never end with 0x00).
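The boundary behavior described here (growing past the current maximum by appending 0xFF, and squeezing below the current minimum without ever ending in 0x00) can be sketched in Python. Function names are illustrative, not from the actual implementation:

```python
def key_after(max_key: bytes) -> bytes:
    """Return a byte string that sorts after the current maximum.

    If the last byte has room we bump it; otherwise we append another
    byte. Either way the result is strictly larger, so we can keep
    extending past 0xFF indefinitely.
    """
    if max_key and max_key[-1] < 0xFF:
        return max_key[:-1] + bytes([max_key[-1] + 1])
    return max_key + b"\xff"


def key_before(min_key: bytes) -> bytes:
    """Return a byte string that sorts before the current minimum.

    The result never ends in 0x00 (keys are assumed to share that
    invariant), so something smaller can always be generated later
    by appending more bytes.
    """
    last = min_key[-1]
    if last > 1:
        return min_key[:-1] + bytes([last // 2])
    # last byte is 0x01: extend with 0x00 followed by a mid-range byte
    return min_key[:-1] + b"\x00\x80"
```

For example, `key_before(b"\x01")` yields `b"\x00\x80"`, which sorts below `b"\x01"` and still leaves room below itself.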
❌ The GitHub CI (Cockroach) build has failed on f174afda. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. I am experimental - my owner is otan.
ajwerner
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @jordanlewis, @lucy-zhang, @otan, and @rohany)
docs/RFCS/20200331_enums.md, line 137 at r2 (raw file):
```sql
CREATE TYPE t ENUM AS ('v1', 'v2');
ALTER TYPE t ADD VALUE 'v1.5' AFTER 'v2'
```
did you mean to say AFTER 'v1' here?
rohany
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @jordanlewis, @lucy-zhang, and @otan)
docs/RFCS/20200331_enums.md, line 137 at r2 (raw file):
Previously, ajwerner wrote…
did you mean to say `AFTER 'v1'` here?
yes, woops. updated
❌ The GitHub CI (Cockroach) build has failed on 85bdb5f9. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. I am experimental - my owner is otan.
knz
left a comment
I like the idea of adding some unit tests alongside the RFC.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @jordanlewis, @lucy-zhang, @otan, and @rohany)
docs/RFCS/20200331_enums.md, line 48 at r3 (raw file):
We propose storing the metadata about an enum in a new descriptor called an `EnumDescriptor`. This descriptor will be added to the descriptor union
I think this idea to have enum-specific descriptors is short-sighted.
Look, the elephant in the china shop here is that there are two separate projects going on in the RFC:
1. creating support for user-defined types, which implies creating and maintaining stored metadata for custom data types.
2. creating support for ENUM types.
Now, it is true that (2) cannot be achieved without (1). So there's no way around (1).
But making the RFC combine the two problems is short-sighted.
With the current phrasing, the day when we want to support RANGE types (#27791) or DOMAIN types (#27796), we'd need to start from scratch and define yet more descriptor types.
What about a `UserDefinedTypeDescriptor` instead?
docs/RFCS/20200331_enums.md, line 221 at r3 (raw file):
Also please consider the comment at the top of this issue: #25123
Whenever a user-defined type is created (with CREATE TYPE), PostgreSQL automatically creates an associated array type, whose name consists of the element type's name prepended with an underscore, and truncated if necessary to keep it less than NAMEDATALEN bytes long. (If the name so generated collides with an existing type name, the process is repeated until a non-colliding name is found.) This implicitly-created array type is variable length and uses the built-in input and output functions array_in and array_out. The array type tracks any changes in its element type's owner or schema, and is dropped if the element type is.
Where is this discussed in the RFC?
knz
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @jordanlewis, @lucy-zhang, @otan, and @rohany)
docs/RFCS/20200331_enums.md, line 68 at r3 (raw file):
Enums are scoped within a database and a schema. However, enums cannot be accessed from other databases.
Can you explain why "enums cannot be accessed from other databases"? Is that a restriction in pg's dialect? or is that a limitation of your proposed approach?
docs/RFCS/20200331_enums.md, line 198 at r3 (raw file):
This strategy can be extended indefinitely as long as this pattern is followed to reserve
the minimum byte. A prototype of the exact algorithm is included as part of the RFC PR.
This is clever. It's also close to what is being done for some of the other low-level value encodings.
Maybe you can reuse some of the algorithms already used in the `pkg/util/encoding` package.
docs/RFCS/20200331_enums.md, line 223 at r3 (raw file):
two `DEnum`s have the same enum ID. ## Optimizer Changes
I think there should be a discussion on the following topics too:
- what OIDs are chosen for enum types? How do you guarantee OID stability for them?
- what happens when DROP TYPE is used - does CASCADE work? does it recursively delete all tables using the type?
- what happens when DROP DATABASE CASCADE is used - does it delete the type? How are all the cached copies of the type's metadata invalidated?
rohany
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @jordanlewis, @knz, @lucy-zhang, and @otan)
docs/RFCS/20200331_enums.md, line 48 at r3 (raw file):
Previously, knz (kena) wrote…
I think this idea to have enum-specific descriptors is short-sighted.
Look, the elephant in the china shop here is that there are two separate projects going on in the RFC:
1. creating support for user-defined types, which implies creating and maintaining stored metadata for custom data types.
2. creating support for ENUM types.
Now, it is true that (2) cannot be achieved without (1). So there's no way around (1).
But making the RFC combine the two problems is short-sighted.
With the current phrasing, the day when we want to support RANGE types (#27791) or DOMAIN types (#27796), we'd need to start from scratch and define yet more descriptor types. What about a `UserDefinedTypeDescriptor` instead?
That makes a lot of sense -- it would be good to get a lot of this groundwork in for the future.
docs/RFCS/20200331_enums.md, line 68 at r3 (raw file):
Previously, knz (kena) wrote…
Can you explain why "enums cannot be accessed from other databases"? Is that a restriction in pg's dialect? or is that a limitation of your proposed approach?
This is a restriction in postgres (i'll update the doc to make it clearer).
docs/RFCS/20200331_enums.md, line 198 at r3 (raw file):
Previously, knz (kena) wrote…
This is clever. It's also close to what is being done for some of the other low-level value encodings.
Maybe you can reuse some of the algorithms already used in the `pkg/util/encoding` package.
I believe it is similar to how decimals are encoded -- i'll look if there is any easily reusable code
docs/RFCS/20200331_enums.md, line 221 at r3 (raw file):
Previously, knz (kena) wrote…
Also please consider the comment at the top of this issue: #25123
Whenever a user-defined type is created (with CREATE TYPE), PostgreSQL automatically creates an associated array type, whose name consists of the element type's name prepended with an underscore, and truncated if necessary to keep it less than NAMEDATALEN bytes long. (If the name so generated collides with an existing type name, the process is repeated until a non-colliding name is found.) This implicitly-created array type is variable length and uses the built-in input and output functions array_in and array_out. The array type tracks any changes in its element type's owner or schema, and is dropped if the element type is.
Where is this discussed in the RFC?
added
docs/RFCS/20200331_enums.md, line 223 at r3 (raw file):
Previously, knz (kena) wrote…
I think there should be a discussion on the following topics too:
- what OIDs are chosen for enum types? How do you guarantee OID stability for them?
- what happens when DROP TYPE is used - does CASCADE work? does it recursively delete all tables using the type?
- what happens when DROP DATABASE CASCADE is used - does it delete the type? How are all the cached copies of the type's metadata invalidated?
Added
jordanlewis
left a comment
Dismissed @otan from a discussion.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @jordanlewis, @knz, @lucy-zhang, @otan, and @rohany)
docs/RFCS/20200331_enums.md, line 51 at r4 (raw file):
metadata about an enum in a new descriptor called a `UserDefinedTypeDescriptor`. This descriptor will be added to the descriptor union along side Table and Database descriptors and will reuse the same leasing
I feel like it might be useful to store the type descriptors in a different system table. There's a lot of stuff that scans all of the descriptors in the descriptors table - this is going to make that table even bigger. And to get all types we'll all of a sudden have to be scanning all the descriptors again. Can we make a new system table for this? system.typedescriptors or something. Or will that introduce more complications?
docs/RFCS/20200331_enums.md, line 147 at r4 (raw file):
If an enum is dropped without `CASCADE`, the operation will not succeed if there are any tables that use the enum. If an enum is dropped with `CASCADE`, all dependent tables are dropped as well. If the database
Hmm, I think it probably should just drop the columns, no? Dropping the tables seems ... excessive.
docs/RFCS/20200331_enums.md, line 269 at r4 (raw file):
that is incremented transactionally as new types are created. This counter will start at a value larger than all existing Postgres type OIDs to avoid collisions with existing types.
This is another reason to not use the regular descriptor table. You really want to avoid having to think about namespace collisions with tables and indexes. Indexes have large OIDs as well.
rohany
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @jordanlewis, @knz, @lucy-zhang, and @otan)
docs/RFCS/20200331_enums.md, line 192 at r1 (raw file):
Previously, rohany (Rohan Yadav) wrote…
We can always keep adding `0xFF` to get larger byte strings -- same idea as adding `0x00` (except in the `0x00` case we have to make sure that we never end with `0x00`).
Done.
docs/RFCS/20200331_enums.md, line 198 at r3 (raw file):
Previously, rohany (Rohan Yadav) wrote…
I believe it is similar to how decimals are encoded -- i'll look if there is any easily reusable code
Done.
docs/RFCS/20200331_enums.md, line 221 at r3 (raw file):
Previously, rohany (Rohan Yadav) wrote…
added
Done.
docs/RFCS/20200331_enums.md, line 223 at r3 (raw file):
Previously, rohany (Rohan Yadav) wrote…
Added
Done.
docs/RFCS/20200331_enums.md, line 51 at r4 (raw file):
Previously, jordanlewis (Jordan Lewis) wrote…
I feel like it might be useful to store the type descriptors in a different system table. There's a lot of stuff that scans all of the descriptors in the descriptors table - this is going to make that table even bigger. And to get all types we'll all of a sudden have to be scanning all the descriptors again. Can we make a new system table for this?
`system.typedescriptors` or something. Or will that introduce more complications?
I think that makes sense -- I'm not sure of the complications though. Does anything stand out @ajwerner ?
docs/RFCS/20200331_enums.md, line 147 at r4 (raw file):
Previously, jordanlewis (Jordan Lewis) wrote…
Hmm, I think it probably should just drop the columns, no? Dropping the tables seems ... excessive.
Yeah you're right i misread this. Updated.
docs/RFCS/20200331_enums.md, line 269 at r4 (raw file):
Previously, jordanlewis (Jordan Lewis) wrote…
This is another reason to not use the regular descriptor table. You really want to avoid having to think about namespace collisions with tables and indexes. Indexes have large OIDs as well.
Solving this point makes it seem like we need another namespace table, but for types. I'm not sure yet what kind of duplication of resolution/caching this would incur.
knz
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @jordanlewis, @knz, @lucy-zhang, and @otan)
docs/RFCS/20200331_enums.md, line 51 at r4 (raw file):
Previously, rohany (Rohan Yadav) wrote…
I think that makes sense -- I'm not sure of the complications though. Does anything stand out @ajwerner ?
Hold your horses, not so fast. Please take the following into account:
- table names and type names are in the same namespace. You can't create a type and a table of the same name in the same schema.
- each table in pg's dialect is also a record type, whose fields are typed like the table's columns:
CREATE TABLE abc(x INT, y TEXT);
SELECT '(1,"xy")'::abc; -- this is valid
This means that table descriptors are type descriptors (i.e. looking up a type by name should also accept a table as a type descriptor).
(Ditto for sequences and other object types.)
And the OIDs for tables/seqs/views/etc and the OIDs for types also share a common OID pool. Therefore it's reasonable to have them share a common ID generator.
So while I am generally open to the idea to store type descriptors "somewhere else" than table descriptors, if/when we do this we must ensure that:
- namespace entries for types are still stored in `system.namespace` (so that name conflicts are properly detected)
- the ID generator remains shared
- the type name lookup algorithm properly considers tables as candidates
knz
left a comment
Reviewed 2 of 3 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @jordanlewis, @lucy-zhang, @otan, and @rohany)
docs/RFCS/20200331_enums.md, line 147 at r4 (raw file):
Previously, rohany (Rohan Yadav) wrote…
Yeah you're right i misread this. Updated.
You can mention here that this is pg's behavior (with an example).
docs/RFCS/20200331_enums.md, line 249 at r5 (raw file):
I am currently investigating the extent of the changes that need to be made in this area.
You can link to the PR here as "example implementation".
rohany
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @jordanlewis, @lucy-zhang, @otan, and @rohany)
docs/RFCS/20200331_enums.md, line 51 at r4 (raw file):
Previously, knz (kena) wrote…
Hold your horses, not so fast. Please take the following into account:
table names and type names are in the same namespace. You can't create a type and a table of the same name in the same schema.
each table in pg's dialect is also a record type, whose fields are typed like the table's columns:
CREATE TABLE abc(x INT, y TEXT); SELECT '(1,"xy")'::abc; -- this is valid
This means that table descriptors are type descriptors (i.e. looking up a type by name should also accept a table as a type descriptor).
(Ditto for sequences and other object types.)
And the OIDs for tables/seqs/views/etc and the OIDs for types also share a common OID pool. Therefore it's reasonable to have them share a common ID generator.
So while I am generally open to the idea to store type descriptors "somewhere else" than table descriptors, if/when we do this we must ensure that:
- namespace entries for types are still stored in `system.namespace` (so that name conflicts are properly detected)
- the ID generator remains shared
- the type name lookup algorithm properly considers tables as candidates
Good point. I think we can proceed with storing the type descriptors themselves in a separate descriptor table, but put the types in the existing namespace table.
❌ The GitHub CI (Cockroach) build has failed on d795212b. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.
ajwerner
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @jordanlewis, @lucy-zhang, @otan, and @rohany)
docs/RFCS/20200331_enums.md, line 269 at r4 (raw file):
Previously, rohany (Rohan Yadav) wrote…
Solving this point makes it seem like we need another namespace table, but for types. I'm not sure yet what kind of duplication of resolution/caching this would incur.
This seems like a sticking point in this RFC. Here's a proposal:
We limit the number of table descriptors to math.MaxInt32 and then we set the high order bit for user defined types and use the descriptor ID as the OID? That way it will be easy when debugging to link the type back to a descriptor and we don't need another ID allocation mechanism.
In postgres, the OID corresponding to the type of a table is a new OID. When a table is created, it is given an oid, its type is given an oid, and the implicit array type that mirrors it is given an oid.
I'm glad to hear that the type of a table and the table itself are two different entries in PostgreSQL with distinct OIDs. That gives us more margin to operate. (@ajwerner FYI OIDs are 32-bit values, so we don't have as much flexibility here. We also already have a special purpose for the sign bit.)
rohany
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @andy-kimball, @jordanlewis, @lucy-zhang, @otan, @RaduBerinde, and @rohany)
docs/RFCS/20200331_enums.md, line 152 at r6 (raw file):
Previously, RaduBerinde wrote…
There are some implications here related to plan caching in the optimizer. The optimizer needs to be able to tell if the type that was resolved when a cached plan was created is the same as what the type would be if we resolve it now. We currently use `cat.Object.Equals` to verify this.
We'll be versioning however we store the type data on disk, so i think for DataSources we could store all the user defined types that the source uses, and check these in that method.
What I'm not sure about (you might be able to help) are expressions that contain values of the user defined type. We'd also need to invalidate a plan if any expressions contain the user-defined type and it changed right?
docs/RFCS/20200331_enums.md, line 237 at r6 (raw file):
Previously, RaduBerinde wrote…
I think the algorithm could be tweaked so that instead of using the midpoint between the previous max and 256, it would just increment the previous max (perhaps by a small constant, to leave some gaps). Then we would be able to add quite a few values at the end without getting to multi-byte values.
this makes sense. I think we'd want to do this even spacing when making the original set of enums, and then do this constant shifting when adding to the front or back of the set.
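That combination (even spacing through the one-byte range at CREATE TYPE, then small constant steps when appending at either end) could be sketched like this; the spacing constants are illustrative only:

```python
def initial_keys(n: int) -> list:
    """Space the n initial enum members evenly through the one-byte
    range, leaving gaps on both sides for later insertions."""
    gap = 256 // (n + 1)
    return [bytes([gap * (i + 1)]) for i in range(n)]


def append_key(max_key: bytes, step: int = 8) -> bytes:
    """Add after the current maximum by bumping the last byte a small
    constant instead of jumping to the midpoint of (max, 256), so that
    quite a few appends fit in a single byte before spilling into a
    second one."""
    last = max_key[-1]
    if last + step <= 0xFF:
        return max_key[:-1] + bytes([last + step])
    return max_key + bytes([step])  # out of room: extend to a new byte
```

For example, `initial_keys(3)` spaces three members at `0x40`, `0x80`, `0xc0`, and repeated `append_key` calls then advance past `0xc0` in steps of 8.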
pkg/sql/sem/tree/enum_test.go, line 78 at r6 (raw file):
Previously, RaduBerinde wrote…
An idea for testing: generate a random permutation of 1,2,3..N. Pretend we start with an empty enum and add one value at a time, in the order of the permutation. We would maintain the values generated so far and find the correct next and prev each time. At the end the byte values should be in correct order.
this is a great idea, thanks radu!
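Radu's test could look roughly like the following, with a simple midpoint-based `key_between` standing in for the RFC's actual encoder (all names here are illustrative):

```python
import bisect
import random


def key_between(prev: bytes, nxt) -> bytes:
    """Return k with prev < k < nxt, never ending in 0x00.

    prev=b"" means unbounded below; nxt=None means unbounded above.
    A simple byte-midpoint scheme, not the RFC's exact algorithm.
    """
    out = bytearray()
    i = 0
    bounded = nxt is not None
    while True:
        p = prev[i] if i < len(prev) else 0
        n = (nxt[i] if i < len(nxt) else 0) if bounded else 256
        if n - p > 1:
            out.append((p + n) // 2)  # midpoint byte, always >= 1 here
            return bytes(out)
        out.append(p)
        if bounded and p < n:
            bounded = False  # already strictly below nxt at this byte
        i += 1


def test_random_insertion_order(n: int = 200, seed: int = 0) -> None:
    """Insert 0..n-1 in a random order, generating a byte key between
    the logical neighbors each time; at the end, byte order must match
    logical order."""
    order = list(range(n))
    random.Random(seed).shuffle(order)
    keys = {}      # logical value -> generated byte key
    present = []   # sorted logical values inserted so far
    for v in order:
        i = bisect.bisect_left(present, v)
        prev = keys[present[i - 1]] if i > 0 else b""
        nxt = keys[present[i]] if i < len(present) else None
        keys[v] = key_between(prev, nxt)
        present.insert(i, v)
    ordered = [keys[v] for v in range(n)]
    assert ordered == sorted(ordered)
```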
andy-kimball
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @jordanlewis, @lucy-zhang, @otan, @RaduBerinde, and @rohany)
docs/RFCS/20200331_enums.md, line 261 at r6 (raw file):
Previously, rohany (Rohan Yadav) wrote…
These descriptors need to also store metadata that is really unrelated to the types themselves, like what tables and columns use the type, and the parent schema and database the type resides in.
OK, definitely put those details into the RFC.
What will the type proto stored on disk look like, then, for UDT's? Just an OID that will need to be resolved when we hydrate its container (e.g. TableDescriptor, etc)?
rohany
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @jordanlewis, @lucy-zhang, @otan, @RaduBerinde, and @rohany)
docs/RFCS/20200331_enums.md, line 261 at r6 (raw file):
Previously, andy-kimball (Andy Kimball) wrote…
OK, definitely put those details into the RFC.
What will the type proto stored on disk look like, then, for UDT's? Just an OID that will need to be resolved when we hydrate its container (e.g. TableDescriptor, etc)?
Will do!
Yeah, it seems like some oid or stable identifier for the udt info on disk is the way to go.
docs/RFCS/20200331_enums.md, line 261 at r6 (raw file):
Previously, rohany (Rohan Yadav) wrote…
Also, why would the type descriptor need to store what tables, etc. use the type? Is it so that we can bump the table version when one of its types changes? Whatever the reason, make sure to explain it in the RFC.
RaduBerinde
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @jordanlewis, @lucy-zhang, @otan, @RaduBerinde, and @rohany)
docs/RFCS/20200331_enums.md, line 261 at r6 (raw file):
Previously, andy-kimball (Andy Kimball) wrote…
Also, why would the type descriptor need to store what tables, etc. use the type? Is it so that we can bump the table version when one of its types changes? Whatever the reason, make sure to explain it in the RFC.
For the drop behavior (see Changing Enum Definitions section). You can drop the type with cascade and that will drop all the columns that use it. Even without cascade you need to know if you let the drop happen or error out.
petermattis
left a comment
I haven't look at the RFC in detail (so perhaps this is already accounted for), but let's make sure that the custom types can be generalized beyond enums in the future. I'm thinking of "record" types, and/or proto.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @jordanlewis, @lucy-zhang, @otan, @RaduBerinde, and @rohany)
thoszhang
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @jordanlewis, @lucy-zhang, @otan, @RaduBerinde, and @rohany)
docs/RFCS/20200331_enums.md, line 49 at r7 (raw file):
Enums themselves are a special case of user-defined types. In order to lay the groundwork for future work in this area, we propose storing metadata about an enum in a new descriptor called a `TypeDescriptor`.
It might be useful to write out the proto definition, for concreteness.
docs/RFCS/20200331_enums.md, line 150 at r7 (raw file):
into it in the same transaction. Depending on how we represent the metadata for an enum, we can either reuse the existing schema change infrastructure or write a simpler variant for managing changes on enums.
It might be useful to elaborate on the implementation a bit more. I tentatively think that a separate implementation of an enum schema changer would be best, since the only "multi-step" schema change we need is for adding new elements, and implementing that would be simpler than trying to make the existing table schema changer more abstract just to reuse the existing tiny amount of code that moves the mutation from one state to another.
There's also the related but separate question of how to store schema change state on the enum descriptor itself. How is a "read-only" value represented? If we model this after table descriptors, we'll have a mutation queue, and read-only values will only be on the mutations list. But there's the alternative of just having a field on the value itself indicating whether it's read-only or public; it's not clear to me that there's much of an advantage to having the mutations in a list form. Do we actually need mutations on enums?
My overall intuition is that we shouldn't try to generalize the existing table schema changer too much, since if we have to add more descriptors, etc. that need "schema changes," then we'll have a better sense of what our needs are at that point, and whatever we end up implementing for enums will be simple and thus hopefully easy to migrate. But maybe I'm underestimating how much of the existing schema changer state machine implementation actually needs to be reused/reimplemented; in my mind, the answer is "basically none" (aside from the leasing API, which we do very much want to use, but that's separate from the schema changer per se).
To be a little more specific about what I have in mind: I think we'd end up with a new job resumer for enum schema changes, and all it would do is: (1) if the value is in read-only, wait for one version and then make it public; (2) wait for one version at the end. Maybe it wouldn't even be cancellable. The rollback hook consists of deleting the read-only value from the enum entirely.
(As a separate note, it's kind of a misuse of the jobs framework to have such short-lived jobs that basically only wait for leases to drain, but this is an existing problem with schema-changes-as-jobs, and if we solve it by finding some better alternative, we'll solve it for both tables and enums. So I don't think we have to think about that now.)
rohany
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @jordanlewis, @knz, @lucy-zhang, @otan, @RaduBerinde, and @rohany)
docs/RFCS/20200331_enums.md, line 147 at r4 (raw file):
Previously, knz (kena) wrote…
You can mention here that this is pg's behavior (with an example).
Done.
docs/RFCS/20200331_enums.md, line 249 at r5 (raw file):
Previously, knz (kena) wrote…
You can link to the PR here as "example implementation".
Done.
docs/RFCS/20200331_enums.md, line 49 at r7 (raw file):
Previously, lucy-zhang (Lucy Zhang) wrote…
It might be useful to write out the proto definition, for concreteness.
Done.
docs/RFCS/20200331_enums.md, line 150 at r7 (raw file):
Previously, lucy-zhang (Lucy Zhang) wrote…
It might be useful to elaborate on the implementation a bit more. I tentatively think that a separate implementation of an enum schema changer would be best, since the only "multi-step" schema change we need is for adding new elements, and implementing that would be simpler than trying to make the existing table schema changer more abstract just to reuse the existing tiny amount of code that moves the mutation from one state to another.
There's also the related but separate question of how to store schema change state on the enum descriptor itself. How is a "read-only" value represented? If we model this after table descriptors, we'll have a mutation queue, and read-only values will only be on the mutations list. But there's the alternative of just having a field on the value itself indicating whether it's read-only or public; it's not clear to me that there's much of an advantage to having the mutations in a list form. Do we actually need mutations on enums?
My overall intuition is that we shouldn't try to generalize the existing table schema changer too much, since if we have to add more descriptors, etc. that need "schema changes," then we'll have a better sense of what our needs are at that point, and whatever we end up implementing for enums will be simple and thus hopefully easy to migrate. But maybe I'm underestimating how much of the existing schema changer state machine implementation actually needs to be reused/reimplemented; in my mind, the answer is "basically none" (aside from the leasing API, which we do very much want to use, but that's separate from the schema changer per se).
To be a little more specific about what I have in mind: I think we'd end up with a new job resumer for enum schema changes, and all it would do is: (1) if the value is in read-only, wait for one version and then make it public; (2) wait for one version at the end. Maybe it wouldn't even be cancellable. The rollback hook consists of deleting the read-only value from the enum entirely.
(As a separate note, it's kind of a misuse of the jobs framework to have such short-lived jobs that basically only wait for leases to drain, but this is an existing problem with schema-changes-as-jobs, and if we solve it by finding some better alternative, we'll solve it for both tables and enums. So I don't think we have to think about that now.)
I agree. I've updated the document to reflect this. Both mutations and using the existing schema changer are overkill for this purpose. While the idea is similar, the actual implementation can be separate.
yuzefovich
left a comment
I couldn't resist leaving some nits when I was reading the RFC :)
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @jordanlewis, @knz, @lucy-zhang, @otan, @RaduBerinde, and @rohany)
docs/RFCS/20200331_enums.md, line 86 at r8 (raw file):
different descriptor types, and we will take advantage of these systems once they are available. The leasing system will enable caching and cache invalidation of of type descriptors. Until the leasing system
nit: s/of of/of/g.
docs/RFCS/20200331_enums.md, line 95 at r8 (raw file):
Enums are scoped within a database and a schema. In Postgres, enums cannot be accessed from other databases -- they can only be accessed from different schemas in in the same database. However, there is no core reason
nit: s/in in/in/g.
docs/RFCS/20200331_enums.md, line 115 at r8 (raw file):
All user defined types will need a stable ID that they are uniquely addressable by from within CockroachDB, as well as an OID that can be used for Postgres compliant operations. Importantly, the OIDs cannot overlap with conflict
nit: "overlap with conflict with".
docs/RFCS/20200331_enums.md, line 315 at r8 (raw file):
is altered. A new `Datum` `DEnum` will be introduced to hold to represent values of the
nit: "to hold to represent".
rohany
left a comment
Not sure how I had so many duplicate words :derp:
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @jordanlewis, @knz, @lucy-zhang, @otan, @RaduBerinde, and @rohany)
Hi all, I think all of the open questions have been resolved, so I'm looking to merge this RFC. Is there anything that folks are still thinking about?
knz
left a comment
Rohan, you have organized some discussion in the PR thread (namespace vs non-namespace, system.descriptor) which is not present in the RFC. Do you mind propagating the summary of the PR discussion into the RFC, and adding the options that were considered but not selected (with the rationales) to the "Alternatives" section?
thanks
Reviewed 1 of 1 files at r9.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @jordanlewis, @lucy-zhang, @otan, @RaduBerinde, and @rohany)
docs/RFCS/20200331_enums.md, line 269 at r4 (raw file):
Previously, ajwerner wrote…
This seems like a sticking point in this RFC. Here's a proposal:
We limit the number of table descriptors to `math.MaxInt32` and then we set the high-order bit for user-defined types and use the descriptor ID as the OID? That way it will be easy when debugging to link the type back to a descriptor, and we don't need another ID allocation mechanism.
the OID of the vtables themselves is already generated by setting the high-order bit.
We can do something more clever (use more than 1 bit at the beginning to identify the type) but that needs to be spelled out explicitly.
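The high-order-bit scheme discussed above can be sketched as follows. This is a minimal illustration of the proposal, not CockroachDB code; the function names and the single-bit mask are assumptions based on the comment (descriptor IDs capped at `math.MaxInt32`, user-defined type OIDs marked by the high bit of a 32-bit value):

```go
package main

import (
	"fmt"
	"math"
)

// Hypothetical sketch: descriptor IDs are limited to math.MaxInt32,
// so the high-order bit of a 32-bit OID is always free to mark
// "this OID belongs to a user-defined type".
const userDefinedTypeOIDMask uint32 = 1 << 31

// typeDescIDToOID derives a Postgres-style OID from a type descriptor ID
// by setting the high-order bit.
func typeDescIDToOID(descID uint32) uint32 {
	return descID | userDefinedTypeOIDMask
}

// oidToTypeDescID recovers the descriptor ID by clearing that bit,
// which is what makes debugging straightforward.
func oidToTypeDescID(oid uint32) uint32 {
	return oid &^ userDefinedTypeOIDMask
}

func main() {
	const descID = 52
	oid := typeDescIDToOID(descID)
	fmt.Println(oid > math.MaxInt32, oidToTypeDescID(oid) == descID)
}
```

Because all derived OIDs exceed `math.MaxInt32`, they cannot collide with the statically known Postgres type OIDs, which are small.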
docs/RFCS/20200331_enums.md, line 98 at r9 (raw file):
that CockroachDB cannot support this. In fact, we might need to support references of types across databases to be in line with other cross database references that we currently support.
This paragraph should be updated in light of the recent discussion about dropping cross-db references (user-defined schemas RFC).
docs/RFCS/20200331_enums.md, line 122 at r9 (raw file):
if a new allocated OID conflicts with one of the statically known OIDs. These type OID's are intended to only be used for display purposes in catalog tables like `pg_catalog.pg_type`.
Can you outline how the vtable generators are going to derive an OID from the type desc IDs?
docs/RFCS/20200331_enums.md, line 312 at r9 (raw file):
extra fields will not be serialized as part of the proto. Instead, when a type is resolved, the returned `*types.T` will be hydrated to populate these fields.
Are you sure about this explanation? Did you consider having the enum ID stored in the `types.T`, then looking up the additional enum information from the SemaContext?
docs/RFCS/20200331_enums.md, line 355 at r9 (raw file):
expressions. The hydration of `*types.T` objects can be done at operator initialization. The trickier problem is type checking serialized expressions -- we don't want to pay the cost of name resolution again. Our proposed solution
The cost of name resolution would be alleviated if you were able to refer to an enum type by ID, like we already have for tables, columns and indexes.
If you were to split the enum metadata in the SemaContext as suggested above, then you could serialize that metadata alongside the distsql processor protos, just once, and index that metadata by the enum ID in the expressions.
docs/RFCS/20200331_enums.md, line 363 at r9 (raw file):
new `crdb_internal` builtin `crdb_internal.create_user_defined_type(str, uint32)` that returns the input string casted to the type with the input ID. So, `'hello'` would get serialized as `crdb_internal.create_user_defined_type('hello', 10)`,
that sounds slightly more complicated. (if anything, it makes the serialized expressions much longer)
rohany
left a comment
There was a problem hiding this comment.
Good point, moved a summary of those discussions into the RFC.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @jordanlewis, @knz, @lucy-zhang, @otan, @RaduBerinde, and @rohany)
docs/RFCS/20200331_enums.md, line 98 at r9 (raw file):
Previously, knz (kena) wrote…
This paragraph should be updated in light of the recent discussion about dropping cross-db references (user-defined schemas RFC).
Done.
docs/RFCS/20200331_enums.md, line 122 at r9 (raw file):
Previously, knz (kena) wrote…
Can you outline how the vtable generators are going to derive an OID from the type desc IDs?
Done, I am going to store the OID in the type desc as well.
docs/RFCS/20200331_enums.md, line 312 at r9 (raw file):
Previously, knz (kena) wrote…
Are you sure about this explanation? Did you consider having the enum ID stored in the `types.T`, then looking up the additional enum information from the SemaContext?
Yeah, I considered this in some prototypes I've been working on. I already store the enum ID in the `types.T`, but it's not feasible to store the metadata in the sema/eval context. There are many places (encoding/decoding) where it's not possible to plumb an eval/sema context down to. The implementation is cleaner this way -- if you have a `types.T`, you can do whatever enum operations you want on it. Additionally, I plan on having this metadata be a pointer to cached information anyway, so it's not like every leaf node in the AST gets a new copy of the enum metadata.
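The hydration idea described in this thread can be sketched roughly as below. All names here are illustrative stand-ins, not actual CockroachDB types: the serialized type carries only a stable ID, and resolution fills in a pointer to cached enum metadata, so AST leaves share one copy rather than each holding the representations.

```go
package main

import "fmt"

// EnumMetadata stands in for the enum information that is not
// serialized with the type (e.g. the logical representations).
type EnumMetadata struct {
	LogicalReps []string
}

// T is a toy stand-in for *types.T: only StableID is serialized;
// EnumMeta is hydrated when the type is resolved.
type T struct {
	StableID uint32
	EnumMeta *EnumMetadata // nil until hydrated
}

// metadataCache stands in for the (leased) type descriptor cache.
var metadataCache = map[uint32]*EnumMetadata{
	10: {LogicalReps: []string{"hello", "howdy", "hi"}},
}

// hydrate populates the non-serialized fields from the cache. Every
// hydrated T for the same enum shares the same metadata pointer, so
// hydration adds no per-leaf copies.
func hydrate(t *T) {
	if t.EnumMeta == nil {
		t.EnumMeta = metadataCache[t.StableID]
	}
}

func main() {
	a, b := &T{StableID: 10}, &T{StableID: 10}
	hydrate(a)
	hydrate(b)
	fmt.Println(a.EnumMeta.LogicalReps[0], a.EnumMeta == b.EnumMeta)
}
```

The pointer-sharing shown in `main` is the property claimed above: two hydrated types for the same enum point at the same cached metadata.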
docs/RFCS/20200331_enums.md, line 355 at r9 (raw file):
Previously, knz (kena) wrote…
The cost of name resolution would be alleviated if you were able to refer to an enum type by ID, like we already have for tables, columns and indexes.
If you were to split the enum metadata in the SemaContext as suggested above, then you could serialize that metadata alongside the distsql processor protos, just once, and index that metadata by the enum ID in the expressions.
I considered introducing some way of referring to a type by ID in the syntax, but I think that this is more complicated and can lead to unknown side effects (for example, we can't support the postgres builtin '@' because of ambiguity with our column references in distsql). As a result of planning, all types present in a query will know their IDs, and all nodes in the cluster will be able to resolve IDs into type descriptors. There isn't a need to serialize metadata in protos when all nodes will already have this information.
However, I could potentially see a strategy like that needed for types created in the same transaction. To do something like this the semacontext would have to collect all user defined types resolved during planning, and then serialize those as part of the distsql processor specs.
I haven't yet prototyped this part, so I can't say which will work better.
docs/RFCS/20200331_enums.md, line 363 at r9 (raw file):
Previously, knz (kena) wrote…
that sounds slightly more complicated. (if anything, it makes the serialized expressions much longer)
More complicated than what? The reg builtins do the same thing (`crdb_internal.create_reg...(oid, name)`)
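To make the builtin-based serialization concrete, here is a small sketch of how an enum datum might be rendered through the proposed builtin. The format mirrors the `crdb_internal.create_user_defined_type('hello', 10)` example in the thread; `serializeEnumDatum` is an illustrative helper, not an actual CockroachDB function:

```go
package main

import "fmt"

// serializeEnumDatum renders an enum value as a call to the proposed
// builtin, so a node deserializing the expression can resolve the
// type by its stable ID without repeating name resolution.
func serializeEnumDatum(logical string, typeID uint32) string {
	return fmt.Sprintf(
		"crdb_internal.create_user_defined_type('%s', %d)", logical, typeID)
}

func main() {
	// 'hello' of the enum type with stable ID 10, as in the thread.
	fmt.Println(serializeEnumDatum("hello", 10))
}
```

As noted below, dedicated syntax could be more compact than a builtin call, at the cost of new grammar.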
thoszhang
left a comment
There was a problem hiding this comment.
I don't have any further comments.
Reviewed 3 of 3 files at r10.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @jordanlewis, @knz, @otan, @RaduBerinde, and @rohany)
docs/RFCS/20200331_enums.md, line 150 at r7 (raw file):
Previously, rohany (Rohan Yadav) wrote…
I agree. I've updated the document to reflect this. Both mutations and using the existing schema changer are overkill for this purpose. While the idea is similar, the actual implementation can be separate.
I originally didn't think about renaming/dropping the types themselves when I commented, but I think we may eventually want some abstraction for interacting with the namespace table (including going through the draining names process) that's common to all descriptors. That doesn't have to happen immediately, though.
knz
left a comment
There was a problem hiding this comment.
LGTM
Another thing I just noticed with pg: renaming a table also renames its associated type. Don't forget to add testing for that.
Reviewed 3 of 3 files at r10.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @jordanlewis, @otan, @RaduBerinde, and @rohany)
docs/RFCS/20200331_enums.md, line 355 at r9 (raw file):
Previously, rohany (Rohan Yadav) wrote…
I considered introducing some way of referring to a type by ID in the syntax, but I think that this is more complicated and can lead to unknown side effects (for example, we can't support the postgres builtin '@' because of ambiguity with our column references in distsql). As a result of planning, all types present in a query will know their IDs, and all nodes in the cluster will be able to resolve IDs into type descriptors. There isn't a need to serialize metadata in protos when all nodes will already have this information.
However, I could potentially see a strategy like that needed for types created in the same transaction. To do something like this the semacontext would have to collect all user defined types resolved during planning, and then serialize those as part of the distsql processor specs.
I haven't yet prototyped this part, so I can't say which will work better.
I'm glad it's on your radar.
docs/RFCS/20200331_enums.md, line 363 at r9 (raw file):
Previously, rohany (Rohan Yadav) wrote…
More complicated than what? The reg builtins do the same thing (`crdb_internal.create_reg...(oid, name)`)
if you had dedicated syntax it could be more compact. but maybe that's premature optimization.
Will do.
Agreed, I'll add a note about this in the RFC.
Release note: None
❌ The GitHub CI (Cockroach) build has failed on 84a68eb5. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.
Thanks for all the comments! bors r+
Build succeeded
Closes #47225
Release note: None