Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rfcs: proposal to make schemas more pg-compatible #21456

Merged
merged 1 commit into from Feb 19, 2018

Conversation

Projects
None yet
6 participants
@knz
Copy link
Member

commented Jan 15, 2018

Do not let the size of the RFC suggest this is a lot of work. The challenge with this RFC will be to educate users / modify the docs. The technical changes are relatively minor.

@knz knz requested review from jordanlewis, vivekmenezes and awoods187 Jan 15, 2018

@knz knz requested a review from cockroachdb/rfc-prs as a code owner Jan 15, 2018

@cockroach-teamcity

This comment has been minimized.

Copy link
Member

commented Jan 15, 2018

This change is Reviewable

@jordanlewis

This comment has been minimized.

Copy link
Member

commented Jan 15, 2018

Thanks! This is an excellent overview of the problem.

I like the strategy you propose for fixing the problem with compatibility: adding public as the single available physical schema, and the other code-level fallout of that.

I don't like the idea of globally replacing our SQL "database" concept with "catalog". Users aren't getting confused by our use of the word database instead of catalog, and it's already known in the community that database and catalog are used interchangably. See the StackOverflow post I linked in an inline comment. Furthermore, I think this naming change goes beyond what's required for compatibility, which is perhaps a stronger reason that we shouldn't lump it in with the compatibility work. I think the terminology we have right now is completely sufficient and unambiguous.

In a world where we did have user schemas, it would still be okay to have database be the name of record for a schema namespace.

So, I'm suggesting that the scope of this RFC be narrowed to exactly what's necessary to achieve virtual-catalog level compatibility with ORMs, tools and apps.


Reviewed 1 of 1 files at r1.
Review status: all files reviewed at latest revision, 7 unresolved discussions, all commit checks successful.


docs/RFCS/pg_virtual_namespacing.md, line 39 at r1 (raw file):

| Postgres concept | CockroachDB concept |
|------------------|---------------------|
| database         | cluster             |

I think this is actually incorrect. I think in Postgres Database == Catalog, and a Postgres installation is also known as a cluster. See the SO post I linked below.


docs/RFCS/pg_virtual_namespacing.md, line 132 at r1 (raw file):

- a CockroachDB cluster contains multiple *catalogs*. Every cluster
  starts with at least the `system` catalog. More catalogs can be
  created with `CREATE CATALOG`.

I don't know if we really need to make this change. Database is often synonymous with catalog in the relational world. I think making this change will add to the confusion rather than diminish it. Check out this answer: https://stackoverflow.com/a/17943883/73632


docs/RFCS/pg_virtual_namespacing.md, line 140 at r1 (raw file):

- each catalog contains one *physical schema* called `public`,
  and some additional *virtual schemas*, currently including `pg_catalog`, `information_schema` and `crdb_internal`.
  - a future version of CockroachDB may support multiple physical schemas per catalog besides `public`.

👍


docs/RFCS/pg_virtual_namespacing.md, line 145 at r1 (raw file):

  - the `public` schema of different catalogs can contain the same table name, but they will designate different tables.
    For example, two applications can use separate catalogs `myapp1` and `myapp2` and define their own `customers` table,
	and the same name "`customers`" will refer to different tables.

stray tab


docs/RFCS/pg_virtual_namespacing.md, line 151 at r1 (raw file):

- the session variable `current_catalog` designates the current
  catalog, which is used in queries to resolve

Again -1 on this change. Let's leave the database terminology alone I think.


docs/RFCS/pg_virtual_namespacing.md, line 171 at r1 (raw file):

  catalog `myapp2`.

  - As a specific CocroachDB extension, schemas in `search_path` can

CockroachDB*


docs/RFCS/pg_virtual_namespacing.md, line 174 at r1 (raw file):

    be prefixed with a catalog name.
	
	For example, with `search_path = myapp2.public, myapp1.public,

s/tabs/spaces/


Comments from Reviewable

@knz knz force-pushed the knz:20180115-schema branch 2 times, most recently from f7ab800 to e7a10ba Jan 16, 2018

@knz

This comment has been minimized.

Copy link
Member Author

commented Jan 16, 2018

@bdarnell

This comment has been minimized.

Copy link
Member

commented Jan 16, 2018

Review status: 0 of 1 files reviewed at latest revision, 8 unresolved discussions, some commit checks failed.


docs/RFCS/pg_virtual_namespacing.md, line 237 at r2 (raw file):

  the schema position other than `public`, `pg_catalog`,
  `crdb_internal`, `information_schema`) as a catalog name. This
  should only be implemented if deemed necessary by user acceptance

I think this is necessary. It's our current "fully qualified" format and it's used pretty widely. Removing it will need additional planning. For example, we'd need to go through a release cycle with some sort of metrics/logging of this feature's usage so that users can be confident that their apps won't blow up when upgrading to the version which removes it. I'd also consider this a semver-major change, so the release that removes it would need to be (at least) 3.0.


Comments from Reviewable

@knz

This comment has been minimized.

Copy link
Member Author

commented Jan 16, 2018

It's our current "fully qualified" format and it's used pretty widely.

An alternative I just figured out would be to also support the db name as schema alias for public in each database.

I do not think that cross-db queries are common (we actually probably don't support them correctly). If we "just" support a single alias next to public the name resolution code can be kept simple and automatically substitute public for the db name.

@bdarnell

This comment has been minimized.

Copy link
Member

commented Jan 16, 2018

But what if there is no current database? That's the most common case for database.table qualified names.

@knz

This comment has been minimized.

Copy link
Member Author

commented Jan 16, 2018

I.e. the user enters select * from foo.kv, we normalize the name as foo.foo.kv and we then simplify to foo.public.kv

@knz

This comment has been minimized.

Copy link
Member Author

commented Jan 16, 2018

How so there is no current database? Does that even work?

@knz

This comment has been minimized.

Copy link
Member Author

commented Jan 16, 2018

Hmm also the "no current db" case is specifically the one where we most likely need this backward compatibility right?
Clients that do have the current db set are unlikely to qualify their names?

In that case we can have a normalization rule conditional on whether database is set, as follows:

  • database set: X.Y normalized to database.X.Y, possibly support X as alias for public in case X = the current value of database
  • database not set: X.Y normalized to X.public.Y in all cases
@bdarnell

This comment has been minimized.

Copy link
Member

commented Jan 16, 2018

How so there is no current database? Does that even work?

It's the default when you run cockroach sql without a --database flag.

Hmm also the "no current db" case is specifically the one where we most likely need this backward compatibility right?
Clients that do have the current db set are unlikely to qualify their names?

That's probably true, although I wouldn't be surprised if there are some frameworks that both set a current database and redundantly use the qualified form.

I do not think that cross-db queries are common (we actually probably don't support them correctly).

Why wouldn't we support them correctly? In the absence of multiple schemas per database, people may be using multiple databases as a substitute. I don't think this would be common, but I don't think it's so rare that we can break them.

@petermattis

This comment has been minimized.

Copy link
Contributor

commented Jan 17, 2018

I do not think that cross-db queries are common (we actually probably don't support them correctly).

Why wouldn't we support them correctly? In the absence of multiple schemas per database, people may be using multiple databases as a substitute. I don't think this would be common, but I don't think it's so rare that we can break them.

I haven't fully followed this conversation, but wanted to pipe in here. Apologies if this is already understood. Historically some of our load generators have connected to one database and then accessed another. For example, the query string would connect to the photos database but the queries would reference test.kv. Yes, this could be seen as a bug in the load generator, but I doubt we can break that behavior without serious pain.


Review status: 0 of 1 files reviewed at latest revision, 1 unresolved discussion, some commit checks failed.


Comments from Reviewable

@nvanbenschoten

This comment has been minimized.

Copy link
Member

commented Jan 21, 2018

Thanks for putting this together @knz!

How so there is no current database? Does that even work?

Postgres has a default catalog called "postgres". Have you given any thought to the creation of a default writable (i.e. not system) catalog in Cockroach and setting that to the current database whenever one is not specified? It would allow us to eliminate the entire "no current db" state, which would bring us closer to Postgres. It would also simplify the user story for new devs testing out Cockroach by avoiding the CREATE DATABASE test; SET DATABASE = test; step. Of course, it would also add extra backward compatibility complications.


Review status: 0 of 1 files reviewed at latest revision, 6 unresolved discussions, some commit checks failed.


docs/RFCS/pg_virtual_namespacing.md, line 44 at r2 (raw file):

| table            | table               |

The distinction between cluster  and catalog is as follows:

nit: random double spaces all over the place.


docs/RFCS/pg_virtual_namespacing.md, line 102 at r2 (raw file):

| def        | information_schema | tables    |
| def        | information_schema | columns   |
| def        | information_schema | ...       |

For reference, def stands for "default" and is what MySQL lists in the information_schema for any reference to "Catalog". This is because in MySQL "Catalogs" and "Schemas" are both equivalent and given the name "Database". The design of our object hierarchy was heavily influenced by MySQL, which is why we made similar choices along the way.


docs/RFCS/pg_virtual_namespacing.md, line 135 at r2 (raw file):

- each catalog contains one *physical schema* called `public`,
  and some additional *virtual schemas*, currently including `pg_catalog`, `information_schema` and `crdb_internal`.
  - a future version of CockroachDB may support multiple physical schemas per catalog besides `public`.

To be explicit about this we should probably add support for CREATE SCHEMA in our grammar and throw a descriptive error.


docs/RFCS/pg_virtual_namespacing.md, line 186 at r2 (raw file):
@jordan above you said:

I don't like the idea of globally replacing our SQL "database" concept with "catalog".

On that note, are you supportive of making "catalog" synonymous with "database" globally and adding these aliases? I've personally gone back and forth. On one hand, it improves our PG compatibility. On the other, it introduces more room for users to get confused and leads to more users needing to ask questions like that stack overflow question in the first place.


docs/RFCS/pg_virtual_namespacing.md, line 195 at r2 (raw file):

  as previously. The previous filler string "`def`" disappears. The
  string "`public`" is now used as filler for the "Schema" column for
  rows that point to actual table data.

👍


Comments from Reviewable

@jordanlewis

This comment has been minimized.

Copy link
Member

commented Jan 22, 2018

I like your idea, @nvanbenschoten. The story for Postgres is slightly more complicated because psql actually connects you by default to a catalog named after the current user. It doesn't automatically create that catalog, though - the user must do it with the external createdb command.

Regardless, I would be in support of a change that removes the "no current db" state, and changes cockroach sql to connect by default to a default database cockroach or maybe default.

What are the backward compatibility complications you hint at? Are you talking about cross-database references?


Review status: 0 of 1 files reviewed at latest revision, 6 unresolved discussions, some commit checks failed.


docs/RFCS/pg_virtual_namespacing.md, line 186 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

@jordan above you said:

I don't like the idea of globally replacing our SQL "database" concept with "catalog".

On that note, are you supportive of making "catalog" synonymous with "database" globally and adding these aliases? I've personally gone back and forth. On one hand, it improves our PG compatibility. On the other, it introduces more room for users to get confused and leads to more users needing to ask questions like that stack overflow question in the first place.

I think that adding these aliases will definitely lead to more confusion. That being said, @knz, I would suggest we make the discussion of exposing the catalog==database concept as syntax to a separate work item, distinct from whether we add the compatibility-focused features. That way if there's some disagreement about this aliasing it doesn't have to hold up the compatibility wins we get by changing our virtual tables to behave more like Postgres's.


Comments from Reviewable

@knz

This comment has been minimized.

Copy link
Member Author

commented Jan 26, 2018

Hi all, thanks for the many suggestions and feedback.
I have started to create a prototype, link in the latest change below.

I think the creation of a default database is a very good way forward. Adopted.
I also bumped into compatibility with previously created views, but there is a solution, see below.

The prototype can now successfully boot a cockroachdb node and answer queries using the new rules, but the following are still not supported:

  • search path (I haven't gotten to it yet)
  • oid lookups (I'll need help from jordan/nathan on these)
  • creation of default database
  • view migration

Review status: 0 of 1 files reviewed at latest revision, 10 unresolved discussions, some commit checks pending.


docs/RFCS/pg_virtual_namespacing.md, line 39 at r1 (raw file):

Previously, jordanlewis (Jordan Lewis) wrote…

I think this is actually incorrect. I think in Postgres Database == Catalog, and a Postgres installation is also known as a cluster. See the SO post I linked below.

Done.


docs/RFCS/pg_virtual_namespacing.md, line 132 at r1 (raw file):

Previously, jordanlewis (Jordan Lewis) wrote…

I don't know if we really need to make this change. Database is often synonymous with catalog in the relational world. I think making this change will add to the confusion rather than diminish it. Check out this answer: https://stackoverflow.com/a/17943883/73632

Clarified.


docs/RFCS/pg_virtual_namespacing.md, line 171 at r1 (raw file):

Previously, jordanlewis (Jordan Lewis) wrote…

CockroachDB*

Done.


docs/RFCS/pg_virtual_namespacing.md, line 174 at r1 (raw file):

Previously, jordanlewis (Jordan Lewis) wrote…

s/tabs/spaces/

Done.


docs/RFCS/pg_virtual_namespacing.md, line 102 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

For reference, def stands for "default" and is what MySQL lists in the information_schema for any reference to "Catalog". This is because in MySQL "Catalogs" and "Schemas" are both equivalent and given the name "Database". The design of our object hierarchy was heavily influenced by MySQL, which is why we made similar choices along the way.

Done.


docs/RFCS/pg_virtual_namespacing.md, line 135 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

To be explicit about this we should probably add support for CREATE SCHEMA in our grammar and throw a descriptive error.

Done.


docs/RFCS/pg_virtual_namespacing.md, line 186 at r2 (raw file):

Previously, jordanlewis (Jordan Lewis) wrote…

I think that adding these aliases will definitely lead to more confusion. That being said, @knz, I would suggest we make the discussion of exposing the catalog==database concept as syntax to a separate work item, distinct from whether we add the compatibility-focused features. That way if there's some disagreement about this aliasing it doesn't have to hold up the compatibility wins we get by changing our virtual tables to behave more like Postgres's.

Clarified (below).


docs/RFCS/pg_virtual_namespacing.md, line 237 at r2 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

I think this is necessary. It's our current "fully qualified" format and it's used pretty widely. Removing it will need additional planning. For example, we'd need to go through a release cycle with some sort of metrics/logging of this feature's usage so that users can be confident that their apps won't blow up when upgrading to the version which removes it. I'd also consider this a semver-major change, so the release that removes it would need to be (at least) 3.0.

Updated in light of experiments. PTAL.


Comments from Reviewable

@petermattis

This comment has been minimized.

Copy link
Contributor

commented Jan 26, 2018

I find the description of catalog/database, schema and table a bit confusing. The current approach in cockroach is what I would describe as a 2-level directory hierarchy. We have the root level which contains databases and the second level (databases) contain tables (e.g. /system/leases). Postgres has a 3-level hierarchy (/system/public/leases). We need to figure out how to evolve our 2-level hierarchy into a 3-level hierarchy, perhaps allowing both for backward compatibility. My understanding is also that schemas, at least some of them, can be "linked" into multiple catalogs/databases. In particular, the virtual schemas pg_catalog and information_schema appear in every catalog. Do tables ever exist in multiple schemas? Do user created schemas ever exist in multiple catalogs? I think the answer to both of these questions is no.

I'm wondering if we can allow tables to exist (for name lookup purposes) both directly within a catalog/database and also within a schema. I think this can be accomplished with a tweak to the name lookup rules. Similar to Postgres, CREATE TABLE foo would be shorthand for CREATE TABLE public.foo. SHOW TABLES FROM database would be shorthand for SHOW TABLES FROM database.public. When we encounter a name x.y, we would have to check to see if x is a schema name within the current database and thus translate that to <current database>.x.y or, if that fails, x is a database name in which case we would translate it to x.public.y. A table name like x.y.z would be interpreted as <database>.<schema>.<table>. Note that this seems like a minor generalization of the Postgres name lookup which treats x.y as <schema>.<table> and x.y.z as <database>.<schema>.<table>. (I'm getting this from https://www.postgresql.org/docs/10/static/ddl-schemas.html).

I might very well be misunderstanding the problem here. I need to give this RFC another thorough read.


Review status: 0 of 1 files reviewed at latest revision, 8 unresolved discussions, all commit checks successful.


docs/RFCS/pg_virtual_namespacing.md, line 128 at r3 (raw file):

| Word, before   | What is being designated                                   | Word, after          | Visible to users? |
|----------------|------------------------------------------------------------|----------------------|-------------------|
| Database       | A KV prefix for multiple tables                            | KV Database          | No                |

This isn't accurate. The Database is not prefixed on the keys. Keys are only prefixed by their Table ID.


docs/RFCS/pg_virtual_namespacing.md, line 133 at r3 (raw file):

| Schema         | A namespace container for virtual tables                   | (Virtual) Schema     | Yes               |
| (didn't exist) | A virtual namespace container for all tables in a catalog  | Schema               | Yes               |
| Table          | a KV prefix for row data                  | (KV) Table, still unambiguous for now | Yes               |

Any mention of KV above is a bit confused. The KV structure of keys knows nothing about Databases, Schemas or Tables. In SQL, keys that are part of a table are prefixed by their Table ID. To find the Database for a key we need to lookup the TableDescriptor and access the ParentID field.


Comments from Reviewable

@knz

This comment has been minimized.

Copy link
Member Author

commented Jan 26, 2018

@petermattis

This comment has been minimized.

Copy link
Contributor

commented Jan 26, 2018

What about 1) you let me investigate how to make cockroachdb compatible with what pg does first

Isn't that what this whole RFC is about?

  1. validate that client apps that need this support work well

Can you clarify which support you are referring to? I'm anxious about backwards compatibility with our current semantics.

@petermattis

This comment has been minimized.

Copy link
Contributor

commented Jan 26, 2018

Review status: 0 of 1 files reviewed at latest revision, 10 unresolved discussions, all commit checks successful.


docs/RFCS/pg_virtual_namespacing.md, line 233 at r3 (raw file):

  - The SQL catalog name resolves to a KV database descriptor.
  - The SQL catalog logical prefix maps to a KV database prefix.
  - SQL tables map to KV as usual.

The SQL to KV mapping could be extended without too much difficulty to support an arbitrary number of levels. The key bits are in system.namespace which is a mapping from <ParentID,name> to <ID>. Note that even currently <ParentID> is not always a Database ID as we also have an entries for the "root namespace ID" (think of this as the root directory). Allowing <ID> to also be a Schema ID would be possible.


docs/RFCS/pg_virtual_namespacing.md, line 251 at r3 (raw file):

  For compatibility with CockroachDB v1.x. a name of the form `foo.x`
  when the current database is `foo` will be understood as
  `foo.public.x`.

Treating foo.x as foo.public.x, regardless of the current database is challenging due to the current signature of NormalizableTableName.NormalizeWithDatabaseName. I'm not seeing a reason we can't pass in more than the current database name there. Specifically, we could pass in an interface that allows lookups of database.table names.

The downside is additional name lookup complexity, but I think that is fairly contained. Another downside is mild confusion with tables named public. Consider a table named public in the database foo, schema public (i.e. foo.public.public). If the user specifies foo.public, we'll translate that to foo.public.public which might not be what they intended (perhaps the query was actually an error). I don't think this is a significant problem, but perhaps there are similar ones lurking around.

The upside to this approach is no need to migrate existing views, no need to force sessions to always have a database, and we're backward compatible with existing apps.


Comments from Reviewable

@knz knz force-pushed the knz:20180115-schema branch from 3f86458 to 0911db7 Jan 27, 2018

@knz

This comment has been minimized.

Copy link
Member Author

commented Jan 27, 2018

In the light of my code experiments and further thinking stimulated by Peter, I have more or less entirely rewritten this RFC.
It still makes the same proposal, but it is much less confused about terminology and mechanism.

A detailed section about expected behavior is now also provided, so that there is no ambiguity about what the end-result should comply to. Special focus is given to the information_schema tables and the contract they have with the name resolution rules. These details are necessary to understand why arbitrarily deep name hierarchies are just not a good idea.

Nevertheless, Peter your other solution from earlier is interesting. I have copied it into the RFC in the section "Detailed solution", together with your additional comments from a later review. I think it needs further consideration. It does not have my preference yet though, because of the pollution it implies in the logical schema name position of FQNs. That's too much painting ourselves into a corner for my taste, even considering the other benefits.


Review status: 0 of 1 files reviewed at latest revision, 11 unresolved discussions, some commit checks pending.


docs/RFCS/pg_virtual_namespacing.md, line 27 at r3 (raw file):

I'm wondering if we can allow tables to exist (for name lookup purposes) both directly within a catalog/database and also within a schema. I think this can be accomplished with a tweak to the name lookup rules. Similar to Postgres, CREATE TABLE foo would be shorthand for CREATE TABLE public.foo. SHOW TABLES FROM database would be shorthand for SHOW TABLES FROM database.public. When we encounter a name x.y, we would have to check to see if x is a schema name within the current database and thus translate that to .x.y or, if that fails, x is a database name in which case we would translate it to x.public.y. A table name like x.y.z would be interpreted as ..

. Note that this seems like a minor generalization of the Postgres


docs/RFCS/pg_virtual_namespacing.md, line 128 at r3 (raw file):

Previously, petermattis (Peter Mattis) wrote…

This isn't accurate. The Database is not prefixed on the keys. Keys are only prefixed by their Table ID.

Indeed. I was confused. Rewritten.


docs/RFCS/pg_virtual_namespacing.md, line 133 at r3 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Any mention of KV above is a bit confused. The KV structure of keys knows nothing about Databases, Schemas or Tables. In SQL, keys that are part of a table are prefixed by their Table ID. To find the Database for a key we need to lookup the TableDescriptor and access the ParentID field.

Indeed. Fixed.


docs/RFCS/pg_virtual_namespacing.md, line 233 at r3 (raw file):

Previously, petermattis (Peter Mattis) wrote…

The SQL to KV mapping could be extended without too much difficulty to support an arbitrary number of levels. The key bits are in system.namespace which is a mapping from <ParentID,name> to <ID>. Note that even currently <ParentID> is not always a Database ID as we also have an entries for the "root namespace ID" (think of this as the root directory). Allowing <ID> to also be a Schema ID would be possible.

I gave this some thought and, with sufficient research, this idea had to be rejected. The complete argument why it is unsound requires diving into naming semantics, which I was more than happy to do. There is now a detailed background section in the RFC that explains why it's not possible. I do acknowledge your proposal in the "Alternatives" section at the end though.


docs/RFCS/pg_virtual_namespacing.md, line 251 at r3 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Treating foo.x as foo.public.x, regardless of the current database is challenging due to the current signature of NormalizableTableName.NormalizeWithDatabaseName. I'm not seeing a reason we can't pass in more than the current database name there. Specifically, we could pass in an interface that allows lookups of database.table names.

The downside is additional name lookup complexity, but I think that is fairly contained. Another downside is mild confusion with tables named public. Consider a table named public in the database foo, schema public (i.e. foo.public.public). If the user specifies foo.public, we'll translate that to foo.public.public which might not be what they intended (perhaps the query was actually an error). I don't think this is a significant problem, but perhaps there are similar ones lurking around.

The upside to this approach is no need to migrate existing views, no need to force sessions to always have a database, and we're backward compatible with existing apps.

I do not understand the first technical part. Or maybe I do not see any complexity there.

The rest of your comment is very useful though, and I copied it into the RFC for further discussion.


Comments from Reviewable

@knz knz force-pushed the knz:20180115-schema branch from 0911db7 to 9dc15e7 Jan 27, 2018

@knz

This comment has been minimized.

Copy link
Member Author

commented Jan 27, 2018

Review status: 0 of 1 files reviewed at latest revision, 11 unresolved discussions, some commit checks pending.


docs/RFCS/pg_virtual_namespacing.md, line 687 at r4 (raw file):

> is a schema name within the current database

- knz: or one of the virtual schema names

@petermattis Also, in other words for every name in the SQL query entered as x.y, we'd need 4 (currently) string comparisons before we know how to qualify, and later potentially also look up a schema descriptor. Just these string comparisons make me a bit uncomfortable already.


Comments from Reviewable

@knz knz force-pushed the knz:20180115-schema branch 3 times, most recently from 9abe85b to bc67a0e Jan 31, 2018

@knz

This comment has been minimized.

Copy link
Member Author

commented Jan 31, 2018

All right I have integrated Peter's proposal.

We need two different algorithms depending on whether we're accessing an existing object or creating a new object. I hadn't properly spelled that out in the previous draft and it's now detailed separately.


Review status: 0 of 1 files reviewed at latest revision, 13 unresolved discussions.


docs/RFCS/pg_virtual_namespacing.md, line 128 at r3 (raw file):

Previously, petermattis (Peter Mattis) wrote…

This is better, though you're still mention KV which doesn't seem accurate. This comment also applies to the use of Table below. Perhaps your definition of KV is different than mine.

Updated.


docs/RFCS/pg_virtual_namespacing.md, line 233 at r3 (raw file):

Previously, petermattis (Peter Mattis) wrote…

I'm guessing that you're rejecting the idea of arbitrary number of levels. That wasn't a serious suggestion, just to point out that our system.namespace and system.descriptor tables could support an arbitrary number of levels. Moving from the 2-level hierarchy we have now to a 3-level hierarchy should be possible. And it should be possible to support both a 2-level and a 3-level hierarchy simultaneously. This comment is an aside, though. I think this RFC should stay focused on having a single public schema in each catalog and we can worry later about adding multiple physical schema support.

Yes thanks.


docs/RFCS/pg_virtual_namespacing.md, line 251 at r3 (raw file):

Previously, petermattis (Peter Mattis) wrote…

I do not understand the first technical part. Or maybe I do not see any complexity there.

Currently, if we see the table name x we "normalize" the name be prepending the current database name (e.g. foo.x). If we see the table name foo.x we again "normalize" the name but leave it unchanged since a database is already specified. It seems to me that you're attempting to stick with this "normalization" approach. My point was that doing so is restrictive. Instead, we could take a table name like foo.x and think of it more like finding a binary in ${PATH}. We first check to see if <current-db>.foo.x exists. Failing that, we check for foo.public.x. There is no ambiguity in what happens here, but the "normalization" involves checking in system.namespace (which is quick). If we see a table name like x we have to iterate over the search path looking for <current-db>.public.x, <current-db>.pg_catalog.x, <current-db>.crdb_internal.x, etc.

Also, note that looking for foo.public.x can be controlled by a cluster setting. Something like sql.strict_name_resolution (default false).

Noted. Changed accordingly.


docs/RFCS/pg_virtual_namespacing.md, line 177 at r4 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Postgres doesn't appear to have these commands. Since this RFC is currently based on a narrow goal of postgres compatibility, we should probably avoid introducing the CATALOG keyword except where needed for compatibility.

Removed.


docs/RFCS/pg_virtual_namespacing.md, line 201 at r4 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Related to my comment above, I may have missed your objection to my proposal (you added a lot of text to this RFC and it is overwhelming me). If a schema isn't specified for a name we already have to iterate over search_path to find the schema the name resides in. Why can't we do something similar here? If S.T is specified we first check to see if database.S.T exists. If it doesn't, we check to see if S.public.T exists. I think this is actually more correct as your rules as currently stated are checking for S.public.T if S == database which seems problematic. Consider database = 'test' and S = 'test'. Your rules will resolve the name test.kv to test.public.kv which would differ from Postgres behavior.

Yep, understood.


docs/RFCS/pg_virtual_namespacing.md, line 247 at r4 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

This rule seems subtle and I'm not sure the benefit of it outweighs the confusion it may cause. Is this improving our compatibility with Postgres?

Not needed with Peter's suggestion. Removed.


docs/RFCS/pg_virtual_namespacing.md, line 539 at r4 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

This brings up the question about whether we should begin to implement more of the system info functions once our object hierarchy mirrors Postgres. Doing so seems like a clear win to me in terms of compatibility.

I agree! But we really implement many of those already. Which are those that are missing?


docs/RFCS/pg_virtual_namespacing.md, line 687 at r4 (raw file):

Previously, petermattis (Peter Mattis) wrote…

What makes you uncomfortable about the string comparisons? Performance? These string comparisons are cheap relative to all of the other work to plan and execute a query.

In the usual case, there wouldn't be any additional work. x.y is looked up as <current database>.x.y which is what we have to do with either your preferred approach or my small alteration. FWIW, the lookup here potentially involves a lot more than 4 string comparisons as we have to do a binary search in the SystemConfig.

As explained offline: it's not performance, it's the story about how to explain this. The real simplicity of your proposal is that it is an additional rule and only covers cases that would otherwise cause an error in postgres. I had missed that. Changed the RFC to reflect.


Comments from Reviewable

@knz

This comment has been minimized.

Copy link
Member Author

commented Jan 31, 2018

@nvanbenschoten @justinj it would be swell if you could check if the rules I propose here are a strict superset of what pg does (i.e. it resolves everything pg would resolve in the same way, and resolves more things that would otherwise fail in pg).


Review status: 0 of 1 files reviewed at latest revision, 14 unresolved discussions.


docs/RFCS/pg_virtual_namespacing.md, line 1 at r4 (raw file):

Show diff


Comments from Reviewable

@petermattis

This comment has been minimized.

Copy link
Contributor

commented Jan 31, 2018

:lgtm: with a question below about the search_path default.


Review status: 0 of 1 files reviewed at latest revision, 14 unresolved discussions, all commit checks successful.


docs/RFCS/pg_virtual_namespacing.md, line 198 at r5 (raw file):

- Algorithm B: resolving the name for a column name
   - `SELECT <here> FROM ...` (i.e. names in scalar expressions that don't fall the patterns above)

s/fall the/fall into the/g


docs/RFCS/pg_virtual_namespacing.md, line 239 at r5 (raw file):

        otherwise if N is a composite name of the form C.S, use that and go to step 3.2
   3.2. if the object C.S.T exists, then keep C and S and go to step 4.
   3.3. if no value N in `search_path` makes N.T / C.N.T exist as per the rule above, then fail with a name resolution error.

I think we might be missing a case here. What happens if search_path is empty (which is the current default)? Cockroach currently resolves T to C.T (assuming C is the current database). Should the new rule be to resolve that as C.public.T even though the search_path is empty? Or do you plan to change the default search_path and if an application has override the default (e.g. to set it empty) they will break? That might be ok, but I don't see mention of this in the doc.


Comments from Reviewable

@knz

This comment has been minimized.

Copy link
Member Author

commented Jan 31, 2018

Review status: 0 of 1 files reviewed at latest revision, 14 unresolved discussions, all commit checks successful.


docs/RFCS/pg_virtual_namespacing.md, line 198 at r5 (raw file):

Previously, petermattis (Peter Mattis) wrote…

s/fall the/fall into the/g

Done.


docs/RFCS/pg_virtual_namespacing.md, line 239 at r5 (raw file):

Previously, petermattis (Peter Mattis) wrote…

I think we might be missing a case here. What happens if search_path is empty (which is the current default)? Cockroach currently resolves T to C.T (assuming C is the current database). Should the new rule be to resolve that as C.public.T even though the search_path is empty? Or do you plan to change the default search_path and if an application has override the default (e.g. to set it empty) they will break? That might be ok, but I don't see mention of this in the doc.

I had a note about this and I think I mistakenly deleted it. Adding it back.

tldr: the search_path cannot be empty. Even if it's not obvious we already do not allow it to be empty. pg mandates a non-empty search path ("if the user specifies an empty search path, lookup rules will operate as if it contains pg_catalog only") and we have code to do that. Spelling it out.


Comments from Reviewable

@knz knz force-pushed the knz:20180115-schema branch from bc67a0e to c999079 Jan 31, 2018

@petermattis

This comment has been minimized.

Copy link
Contributor

commented Feb 1, 2018

:shipit:


Review status: 0 of 1 files reviewed at latest revision, 8 unresolved discussions, all commit checks successful.


Comments from Reviewable

@knz

This comment has been minimized.

Copy link
Member Author

commented Feb 5, 2018

There's an oversight in the RFC, table patterns may need to normalize using the current db and search path too. Looking into it.

@knz

This comment has been minimized.

Copy link
Member Author

commented Feb 16, 2018

Will update the RFC with the learnings from #22371/#22753:

  • the algorithm for new (target) names is different than what I had understood and wrote down (it only uses the first item in the search path).
  • there was a case missing due to a weird cockroachdb extension (e.g. select * from "".pg_catalog.pg_class)

Also I'll outline the implementation.

@knz knz force-pushed the knz:20180115-schema branch from c999079 to 095dd02 Feb 19, 2018

@knz knz force-pushed the knz:20180115-schema branch from 095dd02 to 92c8603 Feb 19, 2018

@knz

This comment has been minimized.

Copy link
Member Author

commented Feb 19, 2018

Note that I chose against supporting database prefixes in search_path in the way that initially was suggested. The reason for this is that a period is a valid character in a SQL identifier. Trying to interpret periods would be complicated.

@knz

This comment has been minimized.

Copy link
Member Author

commented Feb 19, 2018

RFC updated to reflect the implementation, Merging

@knz knz merged commit 0286431 into cockroachdb:master Feb 19, 2018

3 checks passed

GitHub CI (Cockroach) TeamCity build finished
Details
code-review/reviewable Review complete: 0 of 0 LGTMs obtained (and 1 stale)
Details
license/cla Contributor License Agreement is signed.
Details

@knz knz deleted the knz:20180115-schema branch Feb 19, 2018

@petermattis

This comment has been minimized.

Copy link
Contributor

commented Feb 19, 2018

Review status: 0 of 1 files reviewed at latest revision, 10 unresolved discussions, all commit checks successful.


docs/RFCS/20180219_pg_virtual_namespacing.md, line 157 at r7 (raw file):

There are 4 relevant separate algorithms for name resolution, depending
on where in the SQL syntax the name resolution occurs:

There is a level of rigor to this updated section that is at the same time awe inspiring and disconcerting. The detail is counter to the goal of concise high-level descriptions in RFCs. Is this level of detail captured in the comments in the code? If yes then there is a significant duplication here, and if no, this detail should be captured in comments. Perhaps I should be viewing this as detailed code documentation for future engineers, though my response would be that this detail would be better captured in a tech-note or detailed comment. I guess my high-level question is who is the expected reader?


docs/RFCS/20180219_pg_virtual_namespacing.md, line 190 at r7 (raw file):

## Outline of the implementation

The generic, reuisable algorithms are implemented in

nit: s/reuisable/reusable/g


Comments from Reviewable

@knz

This comment has been minimized.

Copy link
Member Author

commented Feb 20, 2018

Review status: 0 of 1 files reviewed at latest revision, 10 unresolved discussions, all commit checks successful.


docs/RFCS/20180219_pg_virtual_namespacing.md, line 157 at r7 (raw file):

Is this level of detail captured in the comments in the code?

yes, insofar that the comments describe at a high level what the code does...

If yes then there is a significant duplication here

... but not so much that the comments repeat what the code does. This RFC states what the code should do in human language; Go code is no substitute for a clear English-language specification.

and if no, this detail should be captured in comments.
Perhaps I should be viewing this as detailed code documentation for future engineers, though my response would be that this detail would be better captured in a tech-note or detailed comment.

Yes I think a tech note is justified. I am going to observe how further bug fixes / performance work will refine this work and produce a tech note with the result.

I guess my high-level question is who is the expected reader?

Mostly 1) people who will later ask "why is this code so complex?" and 2) people who will embark on rewriting the code without bothering to look at the existing implementation. Their reviewer will then have materials to evaluate whether the new code is up to spec.


docs/RFCS/20180219_pg_virtual_namespacing.md, line 190 at r7 (raw file):

Previously, petermattis (Peter Mattis) wrote…

nit: s/reuisable/reusable/g

Thanks #22829.


Comments from Reviewable

@petermattis

This comment has been minimized.

Copy link
Contributor

commented Feb 20, 2018

Review status: 0 of 1 files reviewed at latest revision, 10 unresolved discussions, all commit checks successful.


docs/RFCS/20180219_pg_virtual_namespacing.md, line 157 at r7 (raw file):

... but not so much that the comments repeat what the code does. This RFC states what the code should do in human language; Go code is no substitute for a clear English-language specification.

My complaint wasn't about the duplication of this detail with the Go code, but the duplication of this detail with the comments. I agree the human language description is valuable for understanding, but I feel this belongs in detailed comments, not this RFC. My contention, perhaps mistaken, is that RFCs are read once and then forgotten while code is read many many (many!) times.

Mostly 1) people who will later ask "why is this code so complex?" and 2) people who will embark on rewriting the code without bothering to look at the existing implementation. Their reviewer will then have materials to evaluate whether the new code is up to spec.

Are we actually seeing these concerns happen in practice? Regardless, detailed commentary in the code seems a better defense. The RFC can easily be overlooked when code is rewritten. The comments have to be either ignored or deleted.


Comments from Reviewable

@knz

This comment has been minimized.

Copy link
Member Author

commented Feb 20, 2018

Review status: 0 of 1 files reviewed at latest revision, 10 unresolved discussions, all commit checks successful.


docs/RFCS/20180219_pg_virtual_namespacing.md, line 157 at r7 (raw file):

I agree the human language description is valuable for understanding, but I feel this belongs in detailed comments, not this RFC.

I tend to agree that comments could be more extensive. I am open to extending the comments by including some of the details from the RFC into them. I'll see what I can. However it is not mutually exclusive, because...

My contention, perhaps mistaken, is that RFCs are read once and then forgotten while code is read many many (many!) times.

.. I think your contention is mistaken. Code is indeed read many times, but RFCs provide a time-contextual "storyline" around the changes in the code.

For example, one cannot form a clear high-level picture of name resolution in SQL by looking at the 100s separate places where it happens in CockroachDB, nor by trying to aggregate the comments that accompany these 100s of places. The RFCs / tech notes can do that.

I do not claim here to favor RFCs over tech notes to provide these rationales. Yet we have already numerous times in the team observed that RFCs are being re-read "after the fact"! For various purposes, including but not limited to:

a) dig into history
b) provide discussion anchors when discussing new improvements
c) providing background for doc work
d) answering the "why" for design choices
e) understanding the order in which things were layered
f) ensure common terminology is reused over time

Maybe you could make a point that a well-populated tech note repository could satisfy these various needs better than RFCs. But we're not there yet, and I'm happy that RFCs play this role in the interim instead of not being able to do those things at all.


Comments from Reviewable

@knz

This comment has been minimized.

Copy link
Member Author

commented Feb 20, 2018

Review status: 0 of 1 files reviewed at latest revision, 10 unresolved discussions, all commit checks successful.


docs/RFCS/20180219_pg_virtual_namespacing.md, line 157 at r7 (raw file):

Previously, knz (kena) wrote…

I agree the human language description is valuable for understanding, but I feel this belongs in detailed comments, not this RFC.

I tend to agree that comments could be more extensive. I am open to extending the comments by including some of the details from the RFC into them. I'll see what I can. However it is not mutually exclusive, because...

My contention, perhaps mistaken, is that RFCs are read once and then forgotten while code is read many many (many!) times.

.. I think your contention is mistaken. Code is indeed read many times, but RFCs provide a time-contextual "storyline" around the changes in the code.

For example, one cannot form a clear high-level picture of name resolution in SQL by looking at the 100s separate places where it happens in CockroachDB, nor by trying to aggregate the comments that accompany these 100s of places. The RFCs / tech notes can do that.

I do not claim here to favor RFCs over tech notes to provide these rationales. Yet we have already numerous times in the team observed that RFCs are being re-read "after the fact"! For various purposes, including but not limited to:

a) dig into history
b) provide discussion anchors when discussing new improvements
c) providing background for doc work
d) answering the "why" for design choices
e) understanding the order in which things were layered
f) ensure common terminology is reused over time

Maybe you could make a point that a well-populated tech note repository could satisfy these various needs better than RFCs. But we're not there yet, and I'm happy that RFCs play this role in the interim instead of not being able to do those things at all.

I forgot to mention that if a new hire or a new contributor or a 3rd party comes to us and says "how can I get an intro to this area of the product" having a RFC with a clear section "how to explain this to newcomers" is 1000x better than comments 1) scattered throughout the code 2) that have been written with an already experienced audience in mind.


Comments from Reviewable

@petermattis

This comment has been minimized.

Copy link
Contributor

commented Feb 20, 2018

Review status: 0 of 1 files reviewed at latest revision, 10 unresolved discussions, all commit checks successful.


docs/RFCS/20180219_pg_virtual_namespacing.md, line 157 at r7 (raw file):

.. I think your contention is mistaken. Code is indeed read many times, but RFCs provide a time-contextual "storyline" around the changes in the code.

I exaggerated for effect. Here is less of an exaggeration: code and comments are read at least an order of magnitude more than RFCs.

I forgot to mention that if a new hire or a new contributor or a 3rd party comes to us and says "how can I get an intro to this area of the product" having a RFC with a clear section "how to explain this to newcomers" is 1000x better than comments 1) scattered throughout the code 2) that have been written with an already experienced audience in mind.

Certainly, all new contributors (new hires or 3rd parties) should read through the RFCs in areas they are interacting with, yet they need to do so with the understanding that RFCs are out of date almost as soon as they are written. Fighting against that is a losing battle.


Comments from Reviewable

@knz

This comment has been minimized.

Copy link
Member Author

commented Feb 20, 2018

Review status: 0 of 1 files reviewed at latest revision, 10 unresolved discussions, all commit checks successful.


docs/RFCS/20180219_pg_virtual_namespacing.md, line 157 at r7 (raw file):

Previously, petermattis (Peter Mattis) wrote…

.. I think your contention is mistaken. Code is indeed read many times, but RFCs provide a time-contextual "storyline" around the changes in the code.

I exaggerated for effect. Here is less of an exaggeration: code and comments are read at least an order of magnitude more than RFCs.

I forgot to mention that if a new hire or a new contributor or a 3rd party comes to us and says "how can I get an intro to this area of the product" having a RFC with a clear section "how to explain this to newcomers" is 1000x better than comments 1) scattered throughout the code 2) that have been written with an already experienced audience in mind.

Certainly, all new contributors (new hires or 3rd parties) should read through the RFCs in areas they are interacting with, yet they need to do so with the understanding that RFCs are out of date almost as soon as they are written. Fighting against that is a losing battle.

So more comments it is then.


Comments from Reviewable

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.