
sql: Add builtin function to return locality of current node #37310

Closed
andy-kimball opened this issue May 4, 2019 · 21 comments · Fixed by #37369
Labels
C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)

Comments


andy-kimball commented May 4, 2019

Suggested feature

There is currently no easy way to programmatically inspect the locality of the current node (it appears in crdb_internal.gossip_nodes, but it's hard to work with). I think we should add some kind of builtin locality function that returns the tiers of the locality in a way that can easily be manipulated in SQL. Here's a strawman implementation that returns an ARRAY of (key, value) TUPLEs:

	// Note: localityTupleType is assumed to be a TUPLE(STRING, STRING)
	// type holding one (key, value) locality tier.
	"locality": makeBuiltin(
		tree.FunctionProperties{Category: categorySystemInfo},
		tree.Overload{
			Types:      tree.ArgTypes{},
			ReturnType: tree.FixedReturnType(types.MakeArray(localityTupleType)),
			Fn: func(ctx *tree.EvalContext, args tree.Datums) (tree.Datum, error) {
				// Build one (key, value) tuple per locality tier, preserving order.
				arr := tree.NewDArray(localityTupleType)
				arr.Array = make(tree.Datums, len(ctx.Locality.Tiers))
				for i := range ctx.Locality.Tiers {
					tier := &ctx.Locality.Tiers[i]
					arr.Array[i] = tree.NewDTuple(
						localityTupleType, tree.NewDString(tier.Key), tree.NewDString(tier.Value))
				}
				return arr, nil
			},
			Info: "Returns the locality of the current node.",
		},
	),

An example use case

When building geo-distributed applications, this would enable a very nice way to automatically assign the partition key, as illustrated below. This is a credit card charges table that is partitioned on region. The DEFAULT expression for the region column automatically inserts the region value from the current node's locality (e.g. cloud=gce,region=us-east1,zone=us-east1-b).

CREATE TABLE charges (
    region STRING NOT NULL DEFAULT (locality()[2]).value,
    id UUID NOT NULL DEFAULT gen_random_uuid(),
    amount DECIMAL NOT NULL,
    currency STRING NOT NULL,
    last4 STRING NOT NULL,
    account_id INT8 NOT NULL,
    created_at TIMESTAMP NOT NULL,
    updated_at TIMESTAMP NOT NULL,
    CONSTRAINT "primary" PRIMARY KEY (region ASC, id ASC),
    FAMILY "primary" (region, id, amount, currency, last4, account_id, created_at, updated_at)
) PARTITION BY LIST (region) (
    PARTITION us_east1 VALUES IN (('us-east1')),
    PARTITION us_west1 VALUES IN (('us-west1')),
    PARTITION europe_west2 VALUES IN (('europe-west2'))
)
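For illustration, here's a minimal Go sketch (not CockroachDB code; `Tier` and `parseLocality` are hypothetical names) of how a node could split a `--locality` flag value into the ordered tiers that the strawman's `(locality()[2]).value` expression would index into:

```go
package main

import (
	"fmt"
	"strings"
)

// Tier is one key=value pair from the --locality flag.
type Tier struct {
	Key, Value string
}

// parseLocality splits a flag value such as
// "cloud=gce,region=us-east1,zone=us-east1-b" into ordered tiers.
func parseLocality(flag string) []Tier {
	var tiers []Tier
	for _, part := range strings.Split(flag, ",") {
		if kv := strings.SplitN(part, "=", 2); len(kv) == 2 {
			tiers = append(tiers, Tier{Key: kv[0], Value: kv[1]})
		}
	}
	return tiers
}

func main() {
	tiers := parseLocality("cloud=gce,region=us-east1,zone=us-east1-b")
	// SQL arrays are 1-indexed, so (locality()[2]).value corresponds
	// to tiers[1] here: the region tier.
	fmt.Println(tiers[1].Value) // us-east1
}
```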

There are undoubtedly other use cases for this as well.

Notes

We need to take care to always evaluate this on the gateway node so that it's not shipped via DistSQL to other nodes (where each node could evaluate it to a different value).

@andy-kimball (Contributor, Author):

CC @awoods187


andy-kimball commented May 5, 2019

Another possible idea is determining the type at SQL analysis time. For example, say the locality is:

cloud=gce,region=us-east1,zone=us-east1-b

Then we'd create a tuple datum like this with type TUPLE[STRING AS <key1>, STRING AS <key2>, ...]:

('gce' AS cloud, 'us-east1' AS region, 'us-east1-b' AS zone)

This would allow more natural SQL expressions like this:

SELECT (locality()).cloud, (locality()).region, (locality()).zone;
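To illustrate the appeal of analysis-time typing: once the labels are fixed at type-checking time, a locality access is as statically typed as a struct field access. A minimal Go analogy (the `Locality` struct is purely illustrative):

```go
package main

import "fmt"

// Locality is an illustrative stand-in for the proposed analysis-time
// tuple type TUPLE[STRING AS cloud, STRING AS region, STRING AS zone].
type Locality struct {
	Cloud, Region, Zone string
}

func main() {
	loc := Locality{Cloud: "gce", Region: "us-east1", Zone: "us-east1-b"}
	// Analogue of: SELECT (locality()).cloud, (locality()).region, (locality()).zone
	fmt.Println(loc.Cloud, loc.Region, loc.Zone)
}
```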


andy-kimball commented May 6, 2019

@knz, do you have an opinion on how this function should work? I think the second idea is more usable. However, it might set a new precedent, since we'd be deriving a more specific return type during type checking based on the sema context. That context would now need to contain a new Locality field, which means that the results of SQL analysis could differ based on the node's locality.

I think I'll create a PR with this approach, and let people react there. I'd like to use this function in a demo I'm building for a talk.

andy-kimball added a commit to andy-kimball/cockroach that referenced this issue May 7, 2019
There is currently no easy way to programmatically inspect the locality of the
current node. It appears in crdb_internal.gossip_nodes, but it's hard to work
with because it's JSON, not constant-foldable, and keyed by node id.

This commit adds a new locality builtin function that:

  Returns the hierarchical location of the current node as a tuple
  of labeled values, ordered from most inclusive to least inclusive.

  For example: `region=east,datacenter=us-east-1`.

When building geo-distributed applications, this enables a very nice way to
automatically assign the partition key, as illustrated below:

  CREATE TABLE charges (
    region STRING NOT NULL DEFAULT (locality()).region,
    id UUID NOT NULL DEFAULT gen_random_uuid(),
    ...

The DEFAULT expression for the region column automatically inserts the region
value from the current node's locality.

It also enables queries like this one, which only touch rows in the current region
(which is necessary to avoid cross-region hops):

  SELECT * FROM charges WHERE region = (locality()).region AND id = $1

The locality is constant, so the optimizer is able to fold column access to
a constant value, which can then be used to select an optimal index.

Resolves cockroachdb#37310

Release note (sql change): Adds a new locality builtin function that returns
the hierarchical location of the current node as a tuple of labeled values,
ordered from most inclusive to least inclusive.

knz commented May 8, 2019

Whoa, there is a problem here: you are defining a dynamic type. Even with full knowledge of the SQL schema (the declarative context), there is no way to predict the type signature of the result when looking at the SQL text on the client.

In fact the type signature and the validity of the SQL query as a whole may be different depending on the node where you happen to be connected. Think about this: if the client connects through a load balancer, they'll get 100% query failure on some nodes. This is not great design!

pg labeled tuples were not meant to be used as associative arrays (which is what you are doing here).

My recommendation is either to introduce an actual associative array type, or to define the result of the function as a two-dimensional array. That way the client can use unnest() on it and operate on its results using relational operators.


andy-kimball commented May 8, 2019

I initially had thoughts similar to yours, which is why I first did it exactly as you suggest (I have commits with both implementations). However, the more I thought about it, the more I saw the value in the other approach. Below are my arguments; you can decide if they're convincing.

This type is known at analysis time, so it is not a dynamic type by the traditional definition (a type not known until runtime). I'm saying that the locality would be considered part of the declarative context, just as the types of the placeholders are. That is how the user thinks about locality as well, since they use it in ALTER TABLE, ALTER PARTITION, etc.

As for the argument that depending on which nodes the clients connect to, they could get different results:

  1. It's almost certainly an application bug to have different locality fields on different nodes. So you're talking about an edge case where a user has deliberately (or mistakenly) used different fields on different nodes.
  2. If different nodes do have different fields, then they can get different results, regardless.
-- If locality is "region=east" on one node and "region=east,datacenter=east-1"
-- on another node, then the following statement will return different results
-- depending on which node happens to run it:
SELECT locality()[2] IS NULL

Why is giving different results preferable to sometimes raising an error and sometimes not?

Ultimately, we took a shortcut with localities. To be consistent with how SQL schema works, we really should have had a statement like (don't get hung up on syntax, just want to communicate the idea):

CREATE LOCALITY (cloud, region, datacenter)

Then we wouldn't even be having this discussion. We're treating locality at node startup as if it's dynamic, but it's really not. We should be erroring if you try to start a node with locality fields that don't match the other nodes'. But now we've got it in our heads that there's some kind of dynamism here, when there really isn't.

@andy-kimball (Contributor, Author):

@RaduBerinde, @jordanlewis, see here for more discussion.

@RaduBerinde (Member):

Continuing the discussion from #37369.

(Can you provide examples of how to turn the json into a table? I think it's important to retain the ability to go from blurb to table.)

root@:26257/tpcc> select * from json_each_text('{"k1": "v1", "k2": "v2"}');                                                                  
  key | value  
+-----+-------+
  k1  | v1     
  k2  | v2     
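For readers unfamiliar with json_each_text, the expansion it performs can be sketched in a few lines of Go (`jsonEachText` is a hypothetical helper; the real builtin also handles non-string values, which this sketch does not):

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// jsonEachText expands a JSON object of string values into ordered
// (key, value) rows, the way json_each_text renders them as text.
func jsonEachText(blob string) ([][2]string, error) {
	var obj map[string]string
	if err := json.Unmarshal([]byte(blob), &obj); err != nil {
		return nil, err
	}
	rows := make([][2]string, 0, len(obj))
	for k, v := range obj {
		rows = append(rows, [2]string{k, v})
	}
	// Go maps iterate in random order; sort by key for stable output.
	sort.Slice(rows, func(i, j int) bool { return rows[i][0] < rows[j][0] })
	return rows, nil
}

func main() {
	rows, _ := jsonEachText(`{"k1": "v1", "k2": "v2"}`)
	for _, r := range rows {
		fmt.Printf("%s | %s\n", r[0], r[1])
	}
}
```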


knz commented May 8, 2019

So it seems to me that JSON is the superior solution so far. The syntax locality()->'cloud' reads nicer to me than (locality()).cloud.


knz commented May 8, 2019

I think GitHub had a glitch; several comments (including my last) just disappeared. Also, I see Andy's comment is reportedly posted "6 hours from now".

@awoods187 (Contributor):

I like @andy-kimball's point about the CREATE LOCALITY statement. As an aside, I do wonder if we should go back and do that work separately. I think we should be responding with errors when localities do not agree.

I also agree with @knz that JSON is much easier to read.


andy-kimball commented May 8, 2019

The JSON solution looks OK, but I find that any time we use JSON I have to go to the docs to remember how to access fields. The JSON and relational data models are a rough fit for each other.

The more I work with constraints and localities, the more I think we've missed the boat on making them intuitive to use. Each person who's tried using the new locality-sensitive optimizer feature (Andy W, Rich, Roko, Nate) has been confused by some aspect of how to set it up. One reason is the strange syntax:

CONFIGURE ZONE USING lease_preferences='[[+cloud=aws,-region=us_east], [+cloud=gce,+region=us_west]]'

instead of something that feels more natural with SQL, such as:

CONFIGURE ZONE USING LEASE PREFERENCES ON cloud='aws' AND region<>'us_east', FALLBACK TO cloud='gce' AND region='us_west'

I'd expect to use strongly-typed localities, so that I get an error if cloud is not a valid locality field. I'd expect ZONEs and LOCALITYs to be first-class schema objects that are versioned just like any other schema objects (with changes rolled out across the cluster in the same way).

I know I'm pulling a bunch of extra stuff into what seems like a simple issue. But it's context to help others understand why I think we should at least consider more strongly-typed handling of LOCALITY. If we really want geo-distributed use cases to be a centerpiece of our strategy, I think we need to recognize that we've made some usability mistakes so far and consider rectifying them (in a future release) with better syntax and stronger typing.


knz commented May 8, 2019

I know I'm pulling in a bunch of extra stuff into what seems like a simple issue.

I appreciate the issue has multiple aspects, and that using and creating zone configs are intimately related. However, please consider:

  • the issue that sparked my rejection of labeled tuples has nothing to do with the type of the values, and everything to do with the fact that the set of store attributes is variable per store and per node

    • on that specific topic I disagree with your assertion that mixing and matching multiple sets of attributes is necessarily an application bug.
    • even if we agreed, an application bug should not have any influence on whether the type of a SQL query is the same from the perspective of every node. I still stand by that point. Having the set of local attributes on the node change the types derived in a query is just wrong.
  • for context, there is no commitment to the current zone configuration facilities and we can certainly work on improvements:

    • the original mechanism was CONFIGURE ZONE '....yaml....' with just a single yaml string.
    • I split the single yaml strings into multiple parameters a year ago, so as to add the ability to copy a parameter from the parent zone, or reset an already set parameter; at that time that was the only required improvement;
    • then Ridwan added yet a 3rd inheritance mechanism, to ensure the zones can cascade;
    • At no point yet have we really engaged the discussion of typing (nor taken any decision on that topic), and I personally would welcome any additional/new effort in that direction.
    • for additional context, the reason why that conversation was delayed/postponed is that we simply did not have enough experience early on to design this properly. Arguably, with the additional/repeated experience of customers and sales engineers, now is a better time to do this design work than, say, one or two years ago.
  • I'd recommend that you stay mindful of the amount of effort required to have a proper design discussion (it doesn't need to be long and involved, but it shouldn't be too casual either). If we need to make roadmap adjustments, I would recommend we talk about that before committing the time to re-design the zone config mechanisms.

  • regardless of the outcomes on the point above, I appreciate that you're looking for a solution that gives you(us) separate types for the various attributes.

    • in that context, tuples are still not good enough because of the problems I have identified before
    • however, given your additional concerns, I would concede that JSON is better than arrays because different fields in JSON can have different types.

I find that anytime we use JSON I've got to go to the docs to remember how to access fields.

I think that's lack of familiarity. It's actually pretty simple and regular.
Furthermore, if we find usage patterns on these values that would benefit from a syntax shorthand, we can (later) define built-in functions or operators to do that.

Just don't pull locality labels in the type system. That's a bad idea all around.


andy-kimball commented May 8, 2019

Your answer seems to be mixing locality with attributes. My understanding is that they're distinct and are passed separately on the command line:

cockroach start --locality=region=us,datacenter=us-east --attrs=ram:64gb

I agree with you 100% that arbitrary attributes should not become part of type system. The proposed locality function is only about the key/value pairs passed via the locality flag, not the attrs flag. My assertion is that there's no useful scenario where users would want to pass different locality keys to different nodes, and that it should be an error to do so. Can you show a plausible scenario where that is not desirable behavior?


knz commented May 8, 2019

I'd say that it's possible/desirable for certain nodes to have more keys than the common set. For example, suppose there are 3 regions, and each region initially has just 1 server; later on, additional servers are added in one of the regions. I expect that the nodes in that region will get additional attributes to distinguish them within the region, while the other nodes will remain with just a region attribute.


knz commented May 8, 2019

Meanwhile, the argument still stands: regardless of "mistakes" made, the same query should type-check identically at a given timestamp regardless of the node where you issue it.

@andy-kimball (Contributor, Author):

This is why I brought up the idea of CREATE LOCALITY (cloud, region, datacenter), because it resolves all the issues that you're highlighting.

Given that any sort of statically declared, strongly-typed solution like this is one or more releases away, I'm going to propose that we sidestep some of these questions by implementing only the function that Radu described:

locality-value('region')

That's easy to understand and optimize, and will fit nicely with whatever we decide to do in the future.
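A rough sketch of the proposed lookup semantics, in Go (`localityValue` is an illustrative name, not the actual implementation):

```go
package main

import (
	"fmt"
	"strings"
)

// localityValue returns the value of the given locality key, with
// ok=false standing in for the NULL that SQL would return when the
// key is absent from this node's locality.
func localityValue(tiers []string, key string) (value string, ok bool) {
	for _, tier := range tiers {
		if kv := strings.SplitN(tier, "=", 2); len(kv) == 2 && kv[0] == key {
			return kv[1], true
		}
	}
	return "", false
}

func main() {
	tiers := []string{"cloud=gce", "region=us-east1", "zone=us-east1-b"}
	if v, ok := localityValue(tiers, "region"); ok {
		fmt.Println(v) // us-east1
	}
}
```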

@RaduBerinde (Member):

I haven't seen a convincing argument against JSON. locality-value('region') would just be locality()->'region'. But JSON also gives you an easy way to inspect the entire thing or do other stuff with it. It would still be a legitimate solution even if the keys are fixed.


andy-kimball commented May 8, 2019

Here are my reasons for not wanting to add locality as a JSON function right now:

  1. It introduces a backwards-compatibility issue if we decide to have a strongly-typed locality in the future. We don't yet know what we're doing along those lines, and I'd rather reserve locality until then.

  2. The expressions you give above are subtly different:

locality-value('region') -- statically typed as <string>
locality()->'region' -- statically typed as <json>

Users will need to know to use ->> rather than ->. If you made this mistake (and I originally made it as well), I suspect almost every user will make it too. This makes proper usage of this function unintuitive. Most users will probably get a type error and then write (locality()->'region')::string, because they won't know about the undiscoverable ->> operator.

  3. Making this JSON means that we use YAML to declare the localities/constraints and then JSON to view/query them, all embedded in relational statements. Three data models. Three syntaxes. It just feels like we're taking the expedient route here, not a strongly-typed route consistent with other SQL syntax/semantics. We're requiring a lot from our users (knowledge of 3 sets of syntax/operators/usage) just to interact with these features. No wonder people are getting confused.

@RaduBerinde (Member):

I see, thanks. The yaml / json thing is pretty bad indeed. I see the value in holding off on this.

We may have some of the same backward compatibility issues with locality-value('x') (in that we might want to remove it once we have strongly-typed locality). Maybe we should make it crdb_internal.locality_value() for that reason?

@andy-kimball (Contributor, Author):

Yes, I like the suggestion to put the new function in the crdb_internal schema.

andy-kimball added a commit to andy-kimball/cockroach that referenced this issue May 9, 2019
There is currently no easy way to programmatically inspect the locality of the
current node. It appears in crdb_internal.gossip_nodes, but it's hard to work
with because it's JSON, not constant-foldable, and keyed by node id.

This commit adds a new crdb_internal.locality_value builtin function that returns
the value of the locality key given as its argument.

When building geo-distributed applications, this enables a very nice way to
automatically assign the partition key, as illustrated below:

  CREATE TABLE charges (
    region STRING NOT NULL DEFAULT crdb_internal.locality_value('region'),
    id UUID NOT NULL DEFAULT gen_random_uuid(),
    ...

The DEFAULT expression for the region column automatically inserts the region
value from the current node's locality.

It also enables queries like this one, which only touch rows in the current region
(which is necessary to avoid cross-region hops):

  SELECT * FROM charges WHERE region = crdb_internal.locality_value('region') AND id = $1

The optimizer folds the function to a constant value, which can then be used
to select an optimal index.

Resolves cockroachdb#37310

Release note (sql change): Adds a new locality_value builtin function that
returns the value of the locality key given as its argument.
@awoods187 added the C-enhancement label on May 9, 2019
craig bot pushed a commit that referenced this issue May 15, 2019
37369: sql: Add crdb_internal.locality_value builtin function r=andy-kimball a=andy-kimball

(Commit message repeated verbatim from the commit above.)

37513: opt: maintain a set of grouping cols instead of a list r=ridwanmsharif a=ridwanmsharif

Fixes #37317 and possibly also fixes #37444.
This commit changes the semantics of the grouping columns stored
in scope to only now save the columns if they're unique.

Release note: None

Co-authored-by: Andrew Kimball <andyk@cockroachlabs.com>
Co-authored-by: Ridwan Sharif <ridwan@cockroachlabs.com>
@craig bot closed this as completed in #37369 on May 15, 2019
5 participants