
WIP RFC: Expressive ZoneConfig#7660

Merged
d4l3k merged 1 commit into cockroachdb:master from d4l3k:zoneconfig-rfc
Aug 15, 2016

Conversation

@d4l3k
Contributor

@d4l3k d4l3k commented Jul 6, 2016

Very much WIP. RFCified on request of @petermattis.

See #4868.


@petermattis
Collaborator

Review status: 0 of 1 files reviewed at latest revision, 1 unresolved discussion, all commit checks successful.


docs/RFCS/expressive_zone_config.md, line 67 [r1] (raw file):

  * not on `hdd`
  * not in `us-.*`
* Diversity (replicas distributed to minimize risk)

We should sketch out how we're going to configure hierarchy. You mention above encoding the hierarchy in the attribute, but we currently allow multiple attributes to be specified at the node level. How do we know which one contains the diversity hierarchy?

Per our discussion today, I'm wondering if we should always attempt to maximize diversity and leave the constraints for controlling which nodes are candidates. More specifically, a zone config would specify a number of replicas and an expression with positive and negative constraints which would control which stores are candidates for the replicas. Stores would be configured with a diversity label. After determining the candidate stores for a replica the allocator would make a best effort to maximize replica diversity at all levels of the diversity hierarchy for the candidate set. So if the candidate set included 3 racks there would be an effort made to place the replicas on different racks. If the candidate set included 4 different data centers there would be an effort made to place the replicas in different data centers. If you want all of the replicas in a particular data center, you specify a positive constraint on the candidates.


Comments from Reviewable

@d4l3k
Contributor Author

d4l3k commented Jul 7, 2016

Review status: 0 of 1 files reviewed at latest revision, 1 unresolved discussion, all commit checks successful.


docs/RFCS/expressive_zone_config.md, line 67 [r1] (raw file):

Previously, petermattis (Peter Mattis) wrote…

We should sketch out how we're going to configure hierarchy. You mention above encoding the hierarchy in the attribute, but we currently allow multiple attributes to be specified at the node level. How do we know which one contains the diversity hierarchy?

Per our discussion today, I'm wondering if we should always attempt to maximize diversity and leave the constraints for controlling which nodes are candidates. More specifically, a zone config would specify a number of replicas and an expression with positive and negative constraints which would control which stores are candidates for the replicas. Stores would be configured with a diversity label. After determining the candidate stores for a replica the allocator would make a best effort to maximize replica diversity at all levels of the diversity hierarchy for the candidate set. So if the candidate set included 3 racks there would be an effort made to place the replicas on different racks. If the candidate set included 4 different data centers there would be an effort made to place the replicas in different data centers. If you want all of the replicas in a particular data center, you specify a positive constraint on the candidates.

I was kind of thinking to leave that up to the users. I feel the average user won't care about how their data is being replicated. However, certain deployments might need more control.

Tag per tier

If we allow them to have lots of tags `[cloud-gce datacenter-us-west-1 rack-12 ssd]`, they can choose to replicate however they want.

The only issue with this format is that it makes it hard to have lots of control with a simple protobuf. There's not a good way to say "unique (dc + rack)". Though we could aim for maximum diversity of the remaining tags using something like Hamming distance.
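As a rough illustration of the Hamming-distance idea (a hypothetical sketch, not allocator code), diversity between two tag sets could be scored by counting the tags that appear in exactly one of them:

```go
package main

import "fmt"

// tagDistance counts tags present in exactly one of the two sets, a
// Hamming-style distance over tag sets. A larger distance means the two
// stores are more "diverse".
func tagDistance(a, b []string) int {
	seen := map[string]int{}
	for _, t := range a {
		seen[t] |= 1
	}
	for _, t := range b {
		seen[t] |= 2
	}
	d := 0
	for _, m := range seen {
		if m != 3 { // tag was not present in both sets
			d++
		}
	}
	return d
}

func main() {
	s1 := []string{"cloud-gce", "datacenter-us-west-1", "rack-12", "ssd"}
	s2 := []string{"cloud-gce", "datacenter-us-west-1", "rack-7", "ssd"}
	s3 := []string{"cloud-aws", "datacenter-us-east-1", "rack-12", "hdd"}
	fmt.Println(tagDistance(s1, s2)) // → 2 (differ only in rack)
	fmt.Println(tagDistance(s1, s3)) // → 6 (differ in cloud, datacenter, disk)
}
```

The allocator would then prefer candidates that maximize this score against the replicas already placed.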

Simple protobuf

```yaml
num_replicas: 3
candidates: rack-.*
constraints: [+datacenter-us-west-1 +ssd]
```

SQL

```
num_replicas: 3
constraints: DISTINCT "rack-.*" AND "datacenter-us-west-1" AND "ssd"
```

Single tag

If we use [cloud-gce-datacenter-us-west-1-rack-12 ssd] it makes it simpler to define candidates with just regex. The harder question is how do you determine the maximum amount of diversity to use? With the previous tag structure and DISTINCT "datacenter-.*" you could make it so there was one replica per datacenter. With this, there could be lots of candidates in each datacenter. Adding capture groups could fix that problem, but likely would be rather confusing and potentially hard to implement.

Simple protobuf

Capture Groups

```yaml
num_replicas: 3
candidates: .*-(datacenter-us-west-1)-rack-.*
constraints: [+ssd]
```

SQL

```
num_replicas: 3
constraints: DISTINCT ".*-(datacenter-us-west-1)-rack-.*" AND "ssd"
```


@bdarnell
Contributor

bdarnell commented Jul 7, 2016

Review status: 0 of 1 files reviewed at latest revision, 3 unresolved discussions, all commit checks successful.


docs/RFCS/expressive_zone_config.md, line 112 [r1] (raw file):

  * `"ssd" AND NOT ("us-.*" OR "canada-.*")`

This format would require it to be heavily regex based with capture groups and

We don't have to support regexes if we allow larger numbers of partially-redundant attributes (e.g. ("country:us", "region:us-west", "az:us-west-1") instead of just "us-west-1").
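Concretely, the partially-redundant style might look like this on a store (hypothetical spelling of the attribute list):

```yaml
# Each level of the hierarchy is its own attribute, so constraints can
# match "country:us" or "az:us-west-1" exactly, with no regexes needed.
attrs: [country:us, region:us-west, az:us-west-1, ssd]
```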


docs/RFCS/expressive_zone_config.md, line 145 [r1] (raw file):

## Per replica attrs

Do we want to keep the ability to specify per replica attributes?

Probably not. Per-replica attributes are useful now as a kind of pseudo-diversity constraint, but with real diversity constraints there's little reason to use them. In particular, it wouldn't make sense to use per-replica attributes for things like hardware type; they're only useful for location. (It doesn't make sense to say e.g. "two ssd replicas and one hdd", because we generally assume homogeneous replicas. If the hdd node is a follower it will be chronically behind, and if it's a leader it will slow down the rest of the group (and we don't have any way to ensure that the SSD nodes become leader). However, non-homogeneous replicas seem to be a common request from users, so we may want to leave room to support this in the future.)



@d4l3k
Contributor Author

d4l3k commented Jul 7, 2016

Review status: 0 of 1 files reviewed at latest revision, 3 unresolved discussions, all commit checks successful.


docs/RFCS/expressive_zone_config.md, line 112 [r1] (raw file):

Previously, bdarnell (Ben Darnell) wrote…

We don't have to support regexes if we allow larger numbers of partially-redundant attributes (e.g. ("country:us", "region:us-west", "az:us-west-1") instead of just "us-west-1").

That's definitely true. Regexes might still be useful for doing things like `DISTINCT "rack-.*"`, so that users don't have to manually specify every single rack.

It's also probably not a problem if the allocator is refactored to be more resource efficient.


docs/RFCS/expressive_zone_config.md, line 145 [r1] (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Probably not. Per-replica attributes are useful now as a kind of pseudo-diversity constraint, but with real diversity constraints there's little reason to use them. In particular, it wouldn't make sense to use per-replica attributes for things like hardware type; it's only useful for location. (it doesn't make sense to say e.g. "two ssd replicas and one hdd", because we generally assume homogenous replicas. If the hdd node is a follower it will be chronically behind, and if it's a leader (we don't have any way to ensure that the SSDs become leader) it will slow down the rest of the group. However, non-homogenous replicas seem to be a common request from users so we may want to leave room to support this in the future).

Do we want to add support for different tiers of constraints? It would help constraints fail gracefully.


@bdarnell
Contributor

bdarnell commented Jul 8, 2016

Review status: 0 of 1 files reviewed at latest revision, 3 unresolved discussions, all commit checks successful.


docs/RFCS/expressive_zone_config.md, line 112 [r1] (raw file):

Previously, d4l3k (Tristan Rice) wrote…

That's definitely true. Still might be useful for doing things like DISTINCT "rack-.*" and that way they don't have to manually specify every single rack.

It's also probably not a problem if the allocator is refactored to be more resource efficient.

Maybe for "distinct" constraints we want to make these labels into KV pairs.

docs/RFCS/expressive_zone_config.md, line 145 [r1] (raw file):

Previously, d4l3k (Tristan Rice) wrote…

Do we want to add support in for having different tiers of constraints? It would help constraints fail gracefully.

I think different tiers of constraints will eventually be needed. E.g. keeping data in a particular country could be a legal requirement, while spreading data across distinct racks is nice to have.


@petermattis
Collaborator

Review status: 0 of 1 files reviewed at latest revision, 3 unresolved discussions, all commit checks successful.


docs/RFCS/expressive_zone_config.md, line 112 [r1] (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Maybe for "distinct" constraints we want to make these labels into KV pairs.

What are the semantics of `DISTINCT`? I think we want to always be targeting the maximum diversity given other constraints but to relax the diversity as needed to satisfy an allocation. Given that, how is `DISTINCT "rack.*"` different from `"rack.*"`?


@d4l3k
Contributor Author

d4l3k commented Jul 11, 2016

Review status: 0 of 1 files reviewed at latest revision, 3 unresolved discussions, all commit checks successful.


docs/RFCS/expressive_zone_config.md, line 112 [r1] (raw file):

Previously, petermattis (Peter Mattis) wrote…

What are the semantics of DISTINCT? I think we want to always be targeting the maximum diversity given other constraints but to relax the diversity as needed to satisfy an allocation. Given that, how is DISTINCT "rack.*" different from "rack.*"?

That makes sense. We probably don't need `DISTINCT` if we're always targeting maximum diversity.

docs/RFCS/expressive_zone_config.md, line 145 [r1] (raw file):

Previously, bdarnell (Ben Darnell) wrote…

I think different tiers of constraints will eventually be needed. E.g. keeping data in a particular country could be a legal requirement, while spreading data across distinct racks is nice to have.

Does that mean we should have "hard" constraints? What happens if we say `"us-.*"` and there's no space available in that cluster? In some cases it would be better to balance to a different datacenter, but due to legal reasons that could be a huge problem.


@bdarnell
Contributor

Review status: 0 of 1 files reviewed at latest revision, 3 unresolved discussions, all commit checks successful.


docs/RFCS/expressive_zone_config.md, line 145 [r1] (raw file):

Previously, d4l3k (Tristan Rice) wrote…

Does that mean we should have "hard" constraints? What happens if we say "us-.*" and there's no space available in that cluster? In some cases it would be better to balance to a different datacenter, but due to legal reasons that could be a huge problem.

I think we'll eventually want hard constraints. Not necessary in the first implementation, but it's something to keep in mind for the future.


@petermattis
Collaborator

Review status: 0 of 1 files reviewed at latest revision, 3 unresolved discussions, all commit checks successful.


docs/RFCS/expressive_zone_config.md, line 145 [r1] (raw file):

Previously, bdarnell (Ben Darnell) wrote…

I think we'll eventually want hard constraints. Not necessary in the first implementation, but it's something to keep in mind for the future.

Agreed about eventually wanting hard constraints. I could envision this working like search queries: `foo` by itself tries to have `foo` in the attribute set, `+foo` requires it, and `-foo` prohibits it.
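A minimal sketch of the search-query style (hypothetical `matches` helper, assuming the `+`/`-` prefixes described above; not the eventual implementation):

```go
package main

import "fmt"

// matches reports whether a store's attribute set satisfies a constraint
// list in the search-query style: "+attr" is required, "-attr" is
// prohibited, and a bare "attr" is only a preference (scored, not shown here).
func matches(attrs []string, constraints []string) bool {
	set := map[string]bool{}
	for _, a := range attrs {
		set[a] = true
	}
	for _, c := range constraints {
		switch {
		case len(c) > 0 && c[0] == '+':
			if !set[c[1:]] {
				return false // required attribute missing
			}
		case len(c) > 0 && c[0] == '-':
			if set[c[1:]] {
				return false // prohibited attribute present
			}
		}
	}
	return true
}

func main() {
	attrs := []string{"ssd", "country-us", "rack-12"}
	fmt.Println(matches(attrs, []string{"+country-us", "-hdd"})) // → true
	fmt.Println(matches(attrs, []string{"+country-eu"}))         // → false
}
```

Bare (unprefixed) attributes would only contribute to a preference score during candidate ranking, which this sketch omits.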


@d4l3k
Contributor Author

d4l3k commented Jul 12, 2016

Review status: 0 of 1 files reviewed at latest revision, 3 unresolved discussions, some commit checks pending.


docs/RFCS/expressive_zone_config.md, line 145 [r1] (raw file):

Previously, petermattis (Peter Mattis) wrote…

Agreed about eventually wanting hard constraints. I could envision this similar to search queries. foo by itself tries to have foo in the attribute set. +foo requires and -foo prohibits.

Added to RFC.


@d4l3k
Contributor Author

d4l3k commented Jul 12, 2016

Review status: 0 of 1 files reviewed at latest revision, 3 unresolved discussions, some commit checks pending.


docs/RFCS/expressive_zone_config.md, line 67 [r1] (raw file):

Previously, d4l3k (Tristan Rice) wrote…

I was kind of thinking to leave that up to the users. I feel the average user won't care about how their data is being replicated. However, certain deployments might need more control.

Tag per tier

If we allow them to have lots of tags [cloud-gce datacenter-us-west-1 rack-12 ssd] they can chose how to replicate however they want.

The only issue with this format, is it makes it hard to have lots of control with a simple protobuf. There's not a good way to say "unique (dc + rack)". Though, we could aim for maximum diversity of the remaining tags using something like hamming distance.

Simple protobuf

num_replicas: 3
candidates: rack-.*
constraints: [+datacenter-us-west-1 +ssd]

SQL

num_replicas: 3
constraints: DISTINCT "rack-.*" AND "datacenter-us-west-1" AND "ssd"

Single tag

If we use [cloud-gce-datacenter-us-west-1-rack-12 ssd] it makes it simpler to define candidates with just regex. The harder question is how do you determine the maximum amount of diversity to use? With the previous tag structure and DISTINCT "datacenter-.*" you could make it so there was one replica per datacenter. With this, there could be lots of candidates in each datacenter. Adding capture groups could fix that problem, but likely would be rather confusing and potentially hard to implement.

Simple protobuf

Capture Groups

num_replicas: 3
candidates: .*-(datacenter-us-west-1)-rack-.*
constraints: [+ssd]

SQL

num_replicas: 3
constraints: DISTINCT ".*-(datacenter-us-west-1)-rack-.*" AND "ssd"
The more I think about this, the more I think tag per tier (cloud, datacenter, rack) is the right way to go. Since we're targeting maximum diversity there's no need to have explicit unique statements and it's much easier to implement.


@d4l3k d4l3k force-pushed the zoneconfig-rfc branch 2 times, most recently from 667a3b7 to 40f19f5 on July 12, 2016 19:29
@petermattis
Collaborator

Review status: 0 of 1 files reviewed at latest revision, 5 unresolved discussions, all commit checks successful.


docs/RFCS/expressive_zone_config.md, line 132 [r2] (raw file):

constraint category should have a separate tag format.

For instance `[cloud-gce, datacenter-us-west-1, country-us, region-west, rack-12, ssd]`.

Is the ordering important here?

It isn't clear to me how we'll be measuring diversity. If I only have the tags ssd and hdd, we wouldn't want to be maximizing the diversity across these different tags. I think it is worth fleshing out what will be done here.


docs/RFCS/expressive_zone_config.md, line 150 [r2] (raw file):

Require data to be stored in the US on different racks.
```yaml
constraints: [+country-us, rack-.*]
```

Is `rack-.*` necessary? Without it, we'll still prefer different racks.



@d4l3k
Contributor Author

d4l3k commented Jul 13, 2016

Review status: 0 of 1 files reviewed at latest revision, 5 unresolved discussions, all commit checks successful.


docs/RFCS/expressive_zone_config.md, line 132 [r2] (raw file):

Previously, petermattis (Peter Mattis) wrote…

Is the ordering important here?

It isn't clear to me how we'll be measuring diversity. If I only have the tags ssd and hdd, we wouldn't want to be maximizing the diversity across these different tags. I think it is worth fleshing out what will be done here.

I was thinking that we would only maximize diversity based on the constraints set. That way you could choose how you wanted them to be split. It would also be less ambiguous than maximizing diversity based on tag sets.

I could imagine users doing strange things with extra tags that we wouldn't necessarily want to maximize diversity of. Especially things like disk type, since that would limit our ability to improve the allocator. I could imagine us improving the allocator to put "hot" replicas on faster machines with SSDs automatically.

If we wanted to always maximize diversity, we should probably change the tag format to be something along the lines of

```yaml
tags:
  disk: ssd
  datacenter: us-west-1
  cloud: gce
  country: us
  region: west
  rack: 12
```

That way we can maximize based on each category instead of just doing set comparison.
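With KV tags, per-category comparison could replace flat set comparison. A hypothetical sketch (not a settled design):

```go
package main

import "fmt"

// categoryDiversity counts the categories that both stores define but with
// different values. Categories present on only one store are ignored in
// this sketch, and unrelated extra tags never skew the score.
func categoryDiversity(a, b map[string]string) int {
	d := 0
	for k, va := range a {
		if vb, ok := b[k]; ok && vb != va {
			d++
		}
	}
	return d
}

func main() {
	a := map[string]string{"cloud": "gce", "datacenter": "us-west-1", "rack": "12"}
	b := map[string]string{"cloud": "gce", "datacenter": "us-west-1", "rack": "7"}
	fmt.Println(categoryDiversity(a, b)) // → 1 (only the rack differs)
}
```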


docs/RFCS/expressive_zone_config.md, line 150 [r2] (raw file):

Previously, petermattis (Peter Mattis) wrote…

Is rack-.* necessary? Without it, we'll still prefer different racks.

See previous comment.

If we only maximize diversity in constraints, rack-.* would be required.



@petermattis
Collaborator

Review status: 0 of 1 files reviewed at latest revision, 5 unresolved discussions, all commit checks successful.


docs/RFCS/expressive_zone_config.md, line 132 [r2] (raw file):

Previously, d4l3k (Tristan Rice) wrote…

I was thinking that we would only maximize diversity based on the constraints set. That way you could choose how you wanted them to be split. It would also be less ambiguous than maximizing diversity based on tag sets.

I could imagine users doing strange things with extra tags that we wouldn't necessarily want to maximize diversity of. Especially things like disk type, since that would limit our ability to improve the allocator. I could imagine us improving the allocator to put "hot" replicas on faster machines with SSDs automatically.

If we wanted to always maximize diversity, we should probably change the tag format to be something along the lines of

tags:
  disk: ssd
  datacenter: us-west-1
  cloud: gce
  country: us
  region: west
  rack: 12

That way we can maximize based on each category instead of just doing set comparison.

Agreed that we'll have some attributes we don't want to maximize diversity across. This suggests a separation of `attributes` (what we have now at the node/store level) and `locality` (rack, datacenter, region, country). For example, `--locality=us/west/us-west-1/12`. I think there is a hierarchy for locality (hence the use of `/` for a separator), though it's worth thinking this through more. Each locality element (i.e. `us`, `west`, etc) would be an attribute that could be used for constraints, but we'd always maximize diversity across the locality hierarchy for the nodes matching a set of constraints.
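Something like the following flag syntax could express that separation (hypothetical spelling; `--attrs` stays a flat list while `--locality` is an ordered hierarchy):

```shell
./cockroach start --attrs=ssd --locality=us/west/us-west-1/12
```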

docs/RFCS/expressive_zone_config.md, line 150 [r2] (raw file):

Previously, d4l3k (Tristan Rice) wrote…

See previous comment.

If we only maximize diversity in constraints, rack-.* would be required.

Hmm. If we don't specify any constraints (which will be the default), then we won't maximize diversity across racks, datacenters, etc. That's unfortunate.


@d4l3k
Contributor Author

d4l3k commented Jul 13, 2016

Review status: 0 of 1 files reviewed at latest revision, 5 unresolved discussions, all commit checks successful.


docs/RFCS/expressive_zone_config.md, line 132 [r2] (raw file):

Previously, petermattis (Peter Mattis) wrote…

Agreed that we'll have some attributes we don't want to maximize diversity across. This suggests a separation of attributes (what we have now at the node/store level) and locality (rack, datacenter, region, country). For example, --locality=us/west/us-west-1/12. I think there is a hierarchy for locality (hence the use of / for a separator), though it's worth thinking this through more. Each locality element (i.e. us, west, etc) would be an attribute that could be used for constraints, but we'd always maximize diversity across the locality hierarchy for the nodes matching a set of constraints.

I feel like the `/` hierarchy would just make it hard to filter by. Instead of `/`s we could just define a set of locality fields (`datacenter`, `country`, etc) that can be used. Constraints for those would then be done with a special prefix `country:us`, `datacenter:us-.*`.

It's interesting that Cassandra only has datacenter and rack locality settings.

https://docs.datastax.com/en/cassandra/2.0/cassandra/architecture/architectureDataDistributeReplication_c.html
http://docs.datastax.com/en/cassandra/2.0/cassandra/architecture/architectureSnitchGossipPF_c.html


docs/RFCS/expressive_zone_config.md, line 150 [r2] (raw file):

Previously, petermattis (Peter Mattis) wrote…

Hmm. If we don't specify any constraints (which will be the default), then we won't maximize diversity across racks, datacenters, etc. That's unfortunate.

I feel that if a user doesn't configure the zone config, he's unlikely to also specify attributes on the command line. We could also add warnings on the command line if there are attributes but no constraints.

I was thinking a simple visualization/web page for attributes and constraints might be useful (but probably unnecessary).



@petermattis
Collaborator

Review status: 0 of 1 files reviewed at latest revision, 5 unresolved discussions, all commit checks successful.


docs/RFCS/expressive_zone_config.md, line 132 [r2] (raw file):

Previously, d4l3k (Tristan Rice) wrote…

I feel like the / hierarchy would just make it hard to filter by. Instead of /s we could just define a set of locality fields (datacenter, country, etc) that can be used. Constraints for those would then be done with a special prefix country:us, datacenter:us-.*.

It's interesting that Cassandra only has datacenter and rack locality settings.

https://docs.datastax.com/en/cassandra/2.0/cassandra/architecture/architectureDataDistributeReplication_c.html
http://docs.datastax.com/en/cassandra/2.0/cassandra/architecture/architectureSnitchGossipPF_c.html

How would the hierarchy make filtering harder? We might just be misunderstanding each other here.

I'm mildly anxious about a fixed set of locality fields.


docs/RFCS/expressive_zone_config.md, line 150 [r2] (raw file):

Previously, d4l3k (Tristan Rice) wrote…

I feel that if a user doesn't configure the zone config he's unlikely to also specify attributes on the command line. We could also add warnings on the command line if there's attributes but no constraints.

I was thinking a simple visualization/web page for attributes and constraints might be useful (but probably unnecessary).

The user and the administrator might be different people. I can see an administrator configuring the locality for a node while the user uses the default zone config (which doesn't specify any constraints).


@d4l3k
Contributor Author

d4l3k commented Jul 13, 2016

Review status: 0 of 1 files reviewed at latest revision, 5 unresolved discussions, all commit checks successful.


docs/RFCS/expressive_zone_config.md, line 132 [r2] (raw file):

Previously, petermattis (Peter Mattis) wrote…

How would the hierarchy make filtering harder? We might just be misunderstanding each other here.

I'm mildly anxious about a fixed set of locality fields.

I was just saying that writing constraints using `/` without having fixed locality fields would change how you write constraints.

Say you have `us/west/us-west-1/12`. If you want to constrain to `us-west-1`, it'd be `us/west/us-west-1/.*`, but how would you express "us west or eur west"? Using a regex (`us/west/us-west-1/.*|eur/west/eur-west-2/.*`) wouldn't work, since it could be satisfied by two replicas in the same locality, e.g. both `us/west/us-west-1/12` and `us/west/us-west-1/5`.

You could do it with `[us/west/us-west-1, eur/west/eur-west-2]`, do prefix matches, and then try to maximize diversity based off the unmentioned tags. That just adds more complexity, since we now have to do positive and negative constraints on two different formats instead of one.


docs/RFCS/expressive_zone_config.md, line 150 [r2] (raw file):

Previously, petermattis (Peter Mattis) wrote…

The user and the administrator might be different people. I can see an administrator configuring the locality for a node while the user uses the default zone config (which doesn't specify any constraints).

Could happen. I'd think that if you were specifying locality you'd change the default zone config.


@petermattis
Collaborator

Review status: 0 of 1 files reviewed at latest revision, 5 unresolved discussions, some commit checks failed.


docs/RFCS/expressive_zone_config.md, line 132 [r2] (raw file):

Previously, d4l3k (Tristan Rice) wrote…

I was just saying that writing constraints using / without having fixed locality fields would change how you write constraints.

Say you have us/west/us-west-1/12, if you want to constrain to us-west-1 it'd be us/west/us-west-1/.*, but how would you go about doing us west or eur west? Using regexes(us/west/us-west-1/.*|eur/west/eur-west-2/.*) wouldn't work since it might match both us/west/us-west-1/12 and us/west/us-west-1/5.

You could do it with [us/west/us-west-1, eur/west/eur-west-2], do prefix matches, and then try to maximize diversity based off the unmentioned tags. Just adds more complexity since we now have to do positive and negative constraints on two different formats instead of one.

I was thinking that each of the components of the locality could be matched separately. We separate attributes using a `,`, but would use a `/` to separate components of the locality to indicate the hierarchy.

docs/RFCS/expressive_zone_config.md, line 150 [r2] (raw file):

Previously, d4l3k (Tristan Rice) wrote…

Could happen. I'd think that if you were specifying locality you'd change the default zone config.

As I mentioned, the user and the administrator might be different people. I'd agree if you were singular, but it is plural here.


@d4l3k
Contributor Author

d4l3k commented Jul 14, 2016

Review status: 0 of 1 files reviewed at latest revision, 5 unresolved discussions, some commit checks failed.


docs/RFCS/expressive_zone_config.md, line 132 [r2] (raw file):

Previously, petermattis (Peter Mattis) wrote…

I was thinking that each of the components of the locality could be matched separately. We separate attributes using a ,, but would use a / to separate components of the locality to indicate the hierarchy.

Oh, so instead of trying to do prefix matches, just use `[us-west-1, eur-west2]` as a locality constraint, and otherwise maximize diversity for locality?

Are we worried at all about duplicate names between different tiers? Like, what if a small setup only has two racks and they use `us/west/west` and `us/west/east`? Might be a stretch.


docs/RFCS/expressive_zone_config.md, line 150 [r2] (raw file):

Previously, petermattis (Peter Mattis) wrote…

As I mentioned, the user and the administrator might be different people. I'd agree if you were singular, but it is plural here.

Fair enough.


@petermattis
Collaborator

Review status: 0 of 1 files reviewed at latest revision, 5 unresolved discussions, some commit checks failed.


docs/RFCS/expressive_zone_config.md, line 132 [r2] (raw file):

Previously, d4l3k (Tristan Rice) wrote…

Oh, instead of trying to do prefix matches just do [us-west-1, eur-west2] as a locality constraint? Otherwise maximize diversity for locality?

Are we worried at all about having something duplicate names between different tiers? Like what if small setup only has two racks and they do us/west/west and us/west/east. Might be a stretch.

I think it's time to hash out the remaining details in a whiteboard session. Let's find some time next week. @bdarnell, @BramGruneir and @cuongdo should be involved.


@d4l3k
Contributor Author

d4l3k commented Jul 25, 2016

Brief notes I wrote down after the meeting. Let me know if I missed anything.

https://docs.google.com/document/d/1IeNHe3qysz2DEoOjBFMooFRUV7-Se8_J8SBpfW6hQZ0/edit?usp=sharing

@petermattis
Collaborator

Looks complete to me. Let's trim down the portions of the RFC which have fallen into disfavor (e.g. SQL based constraint system) and flesh out the locality tag idea, either with an explicit hierarchy or some sort of key=value approach.

@d4l3k
Contributor Author

d4l3k commented Jul 28, 2016

Updated the RFC to talk about locality as KV vs hierarchical tags as well as stripping out the SQL stuff.

@petermattis
Collaborator

Review status: 0 of 1 files reviewed at latest revision, 9 unresolved discussions, all commit checks successful.


docs/RFCS/expressive_zone_config.md, line 40 [r4] (raw file):

> replica selection in the way that they are today, otherwise it is as if there
> are num_replicas empty attr entries. No need to actually address the issues
> around negative or diversity constraints at this point.

While working near the allocator code today, I discovered that we only respect the attributes for ZoneConfig.ReplicaAttrs[0]. See the use of Zone.ReplicaAttrs in replicate_queue.go.


docs/RFCS/expressive_zone_config.md, line 203 [r4] (raw file):

A potential pitfall of this method is if a user forgets to add a specific
locality field. If one node has 5 elements and the other has 6, it's unclear
how to handle that.

This feels like a significant pitfall. Seems very easy for this to be fouled up by the administrator.


docs/RFCS/expressive_zone_config.md, line 205 [r4] (raw file):

how to handle that.

### KV locality

Having read the write-ups of both locality options, I'm leaning in favor of this one.


docs/RFCS/expressive_zone_config.md, line 230 [r4] (raw file):

`datacenter=us-{1,2,3},rack={{1,2,3},{4,5,6},{7,8,9}}` there's no way to
distinguish the primary diversity factor automatically and we have to pick
either maximizing diversity of datacenter or rack.

This paragraph seems repetitive with the one above. And for this configuration, it seems like trying to diversify both levels simultaneously will lead to the desired outcome. I'm not sure what that algorithm looks like, but it seems feasible.



@bdarnell
Contributor

Review status: 0 of 1 files reviewed at latest revision, 10 unresolved discussions, all commit checks successful.


docs/RFCS/expressive_zone_config.md, line 205 [r4] (raw file):

Previously, petermattis (Peter Mattis) wrote…

Having read the write-ups of both locality options, I'm leaning in favor of this one.

Me too.

docs/RFCS/expressive_zone_config.md, line 225 [r4] (raw file):

diversity on both levels. In this situation, the best thing to do would be to
have 3 replicas
`datacenter=us-1,rack=1`, `datacenter=us-2,rack=2` and `datacenter=us-3,rack=3`.

When the cluster is empty, we can choose good placements like this one, but as it fills up we may have cases where the allocator has to choose between putting a second replica on the same rack and a second replica in the same datacenter. I think we need an explicit ordering here to ensure the proper prioritization.


Comments from Reviewable

@petermattis
Copy link
Copy Markdown
Collaborator

Review status: 0 of 1 files reviewed at latest revision, 10 unresolved discussions, all commit checks successful.


docs/RFCS/expressive_zone_config.md, line 225 [r4] (raw file):

Previously, bdarnell (Ben Darnell) wrote…

When the cluster is empty, we can choose good placements like this one, but as it fills up we may have cases where the allocator has to choose between putting a second replica on the same rack and a second replica in the same datacenter. I think we need an explicit ordering here to ensure the proper prioritization.

This is where hierarchy, perhaps implicit in the ordering of the locality tags, can come in. If we view the rack names as being `1`, `2` and `3` then you are correct that we need to consider datacenter first. But we could also consider the rack names as having an implicit prefix of all of the preceding tags: `us-1/1`, `us-1/2`, `us-1/3`, `us-2/1`, `us-2/2`, `us-2/3`, `us-3/1`, `us-3/2`, `us-3/3`. With this point of view, placing two replicas in the same rack is equivalent to placing them in the same datacenter. So if we diversify across datacenters we'll get diversity across racks.
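The prefix view described above can be sketched in code. This is an illustrative Go sketch, not CockroachDB's allocator API; the `Tier` type and `diversityScore` function are hypothetical names. The key property is that a mismatch at an outer tier makes every inner tier count as diverse too, because each tier is implicitly prefixed by all preceding tags.

```go
package main

import "fmt"

// Tier is one level of a node's locality, e.g. {"datacenter", "us-1"}.
// Both the type and diversityScore below are illustrative, not
// CockroachDB's actual API.
type Tier struct {
	Key, Value string
}

// diversityScore counts the locality levels at which two nodes differ,
// comparing from the outermost tier inward. Because each tier is
// implicitly prefixed by all preceding tiers, two nodes that diverge at
// the datacenter level automatically count as diverse at the rack level
// as well, even if their bare rack names happen to match.
func diversityScore(a, b []Tier) int {
	for i := 0; i < len(a) && i < len(b); i++ {
		if a[i] != b[i] {
			// All remaining levels inherit the mismatched prefix.
			return len(a) - i
		}
	}
	return 0
}

func main() {
	n1 := []Tier{{"datacenter", "us-1"}, {"rack", "1"}}
	n2 := []Tier{{"datacenter", "us-1"}, {"rack", "2"}}
	n3 := []Tier{{"datacenter", "us-2"}, {"rack", "1"}}

	fmt.Println(diversityScore(n1, n2)) // same datacenter, different rack
	fmt.Println(diversityScore(n1, n3)) // different datacenter: diverse at both levels
}
```

Note that `n1` and `n3` share the bare rack name `1`, yet still score as fully diverse, which is exactly the "implicit prefix" behavior.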

Comments from Reviewable

@d4l3k
Copy link
Copy Markdown
Contributor Author

d4l3k commented Aug 1, 2016

Review status: 0 of 1 files reviewed at latest revision, 8 unresolved discussions, all commit checks successful.


docs/RFCS/expressive_zone_config.md, line 40 [r4] (raw file):

Previously, petermattis (Peter Mattis) wrote…

While working near the allocator code today, I discovered that we only respect the attributes for ZoneConfig.ReplicaAttrs[0]. See the use of Zone.ReplicaAttrs in replicate_queue.go.

Filed an issue for docs #8182.

At least, this makes it easier to update the ZoneConfig since we don't have to worry about backwards compatibility and can just convert ReplicaAttrs[0] into constraints.


docs/RFCS/expressive_zone_config.md, line 225 [r4] (raw file):

Previously, petermattis (Peter Mattis) wrote…

This is where hierarchy, perhaps implicit in the ordering of the locality tags, can come in. If we view the rack names as being 1, 2 and 3 then you are correct that we need to consider datacenter first. But we could also consider the rack names as having an implicit prefix of all of the preceding tags: us-1/1, us-1/2, us-1/3, us-2/1, us-2/2, us-2/3, us-3/1, us-3/2, us-3/3. With this point of view, placing two replicas in the same rack is equivalent to placing them in the same datacenter. So if we diversify across datacenters we'll get diversity across racks.

Doesn't the implicit ordering of locality tags have the same pitfalls as hierarchy tags? What happens if someone has one server with `datacenter=us-1,rack=1` and another with `rack=2,datacenter=us-2`? Might be better to explicitly specify the hierarchy in the ZoneConfig and have a sane default as a fallback.

Added Ben's example to the RFC.


docs/RFCS/expressive_zone_config.md, line 230 [r4] (raw file):

Previously, petermattis (Peter Mattis) wrote…

This paragraph seems repetitive with the one above. And for this configuration, it seems like trying to diverse both levels simultaneously will lead to the desired outcome. I'm not sure what that algorithm looks like, but it seems feasible.

Yeah, I was trying to think of an example where not having an explicit ordering was a significant pitfall. Removed.

Comments from Reviewable

@d4l3k
Copy link
Copy Markdown
Contributor Author

d4l3k commented Aug 3, 2016

I've updated the RFC to clear up most of the issues. It now uses the order of tags for hierarchy and will warn the users via logs + admin UI about any issues with replication.

The only unresolved question is about weak negative constraints, but I don't see any need to include those initially as there isn't a strong use case.

@bdarnell
Copy link
Copy Markdown
Contributor

bdarnell commented Aug 3, 2016

Review status: 0 of 1 files reviewed at latest revision, 9 unresolved discussions, all commit checks successful.


docs/RFCS/expressive_zone_config.md, line 143 [r5] (raw file):

We would use the order in which the tags were defined as a hierarchy for which
levels to prioritize diversity on. If the order of tags on one node doesn't
match those on others, we would warn the user.

How would we discover this mismatch? Would we try to pick a "winning" order in the meantime or would each node just use its own order for its rebalancing decisions? The latter could result in thrashing so it may be worth trying to converge (via gossip?) on a single order even if the configurations are inconsistent.

Also consider when a new attribute is introduced (or an old one removed). This would be rolled out gradually, so not all nodes would have the same set of attributes. I suggest that locality attributes should be ignored unless every candidate node has a value for them.
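The suggestion above, ignoring locality attributes that not every candidate defines, can be sketched as a small filter. This is a hypothetical illustration, not CockroachDB's allocator code; `commonTiers` and its parameters are made-up names. It drops any tier key that at least one candidate is missing, so an attribute that is only half rolled out cannot skew rebalancing.

```go
package main

import "fmt"

// commonTiers returns the locality tier keys, in the given order, that
// every candidate node defines. A tier missing from any candidate is
// dropped entirely, so a partially rolled-out attribute is ignored
// rather than applied inconsistently. Illustrative names only.
func commonTiers(candidates []map[string]string, order []string) []string {
	var out []string
	for _, key := range order {
		presentEverywhere := true
		for _, node := range candidates {
			if _, ok := node[key]; !ok {
				presentEverywhere = false
				break
			}
		}
		if presentEverywhere {
			out = append(out, key)
		}
	}
	return out
}

func main() {
	candidates := []map[string]string{
		{"datacenter": "us-1", "rack": "1", "shelf": "a"}, // restarted with a new "shelf" tier
		{"datacenter": "us-2", "rack": "4"},               // not yet restarted with it
	}
	fmt.Println(commonTiers(candidates, []string{"datacenter", "rack", "shelf"}))
}
```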


Comments from Reviewable

@d4l3k
Copy link
Copy Markdown
Contributor Author

d4l3k commented Aug 3, 2016

Review status: 0 of 1 files reviewed at latest revision, 9 unresolved discussions, all commit checks successful.


docs/RFCS/expressive_zone_config.md, line 143 [r5] (raw file):

Previously, bdarnell (Ben Darnell) wrote…

How would we discover this mismatch? Would we try to pick a "winning" order in the meantime or would each node just use its own order for its rebalancing decisions? The latter could result in thrashing so it may be worth trying to converge (via gossip?) on a single order even if the configurations are inconsistent.

Also consider when a new attribute is introduced (or an old one removed). This would be rolled out gradually, so not all nodes would have the same set of attributes. I suggest that locality attributes should be ignored unless every candidate node has a value for them.

The locality tags for each node/store will be gossiped as part of the store descriptor. It's fairly simple for the allocator to find the most popular order and just use that. Alternatively, we could sort alphabetically or by frequency if there's no inferable hierarchy, to ensure consistent allocations.

I don't think one misconfigured node should stop all writes in the case that someone has a hard constraint to be in a certain country.

In any case, there definitely will be warning messages about locality mismatches in the log and admin UI.


Comments from Reviewable

@petermattis
Copy link
Copy Markdown
Collaborator

:lgtm:


Review status: 0 of 1 files reviewed at latest revision, 9 unresolved discussions, all commit checks successful.


docs/RFCS/expressive_zone_config.md, line 143 [r5] (raw file):

Previously, d4l3k (Tristan Rice) wrote…

The locality tags for each node/store will be gossiped as part of the store descriptor. It's fairly simple for the allocator to find the most popular order and just use that. Alternatively, we could just sort alphabetically/frequency if there's no inferable hierarchy to ensure consistent allocations.

I don't think one misconfigured node should stop all writes in the case that someone has a hard constraint to be in a certain country.

In any case, there definitely will be warning messages about locality mismatches in the log and admin UI.

Sorting the tags doesn't sound right. I think we should use the majority order (as inferred from gossip) and put big warning tags on the admin UI and monitoring signals that alert about such a misconfiguration.

Btw, a debugging tool we'll want is to see the stores matching a constraint. We'll want to show this information (in a compact form) when viewing a zone config in the admin UI and when setting a zone config.


Comments from Reviewable

@d4l3k
Copy link
Copy Markdown
Contributor Author

d4l3k commented Aug 15, 2016

Review status: 0 of 1 files reviewed at latest revision, 9 unresolved discussions, all commit checks successful.


docs/RFCS/expressive_zone_config.md, line 143 [r5] (raw file):

Previously, petermattis (Peter Mattis) wrote…

Sorting the tags doesn't sound right. I think we should use the majority order (as inferred from gossip) and put big warning tags on the admin UI and monitoring signals that alert about such a misconfiguration.

Btw, a debugging tool we'll want is to see the stores matching a constraint. We'll want to show this information (in a compact form) when viewing a zone config in the admin UI and when setting a zone config.

Added to RFC.

Comments from Reviewable

@d4l3k
Copy link
Copy Markdown
Contributor Author

d4l3k commented Aug 15, 2016

Any final comments before merging and beginning the implementation?

@petermattis
Copy link
Copy Markdown
Collaborator

:lgtm:


Review status: 0 of 1 files reviewed at latest revision, 7 unresolved discussions, some commit checks failed.


Comments from Reviewable

@bdarnell
Copy link
Copy Markdown
Contributor

LGTM


Review status: 0 of 1 files reviewed at latest revision, 8 unresolved discussions, some commit checks pending.


docs/RFCS/expressive_zone_config.md, line 162 [r6] (raw file):

Prefer SSDs and do not put on in-memory stores.

```yaml
constraints: [ssd, -mem]
```

Does this syntax work in yaml? The `-` and `+` characters might be special; we should make sure this syntax will work as intended.


Comments from Reviewable

@d4l3k
Copy link
Copy Markdown
Contributor Author

d4l3k commented Aug 15, 2016

Review status: 0 of 1 files reviewed at latest revision, 8 unresolved discussions, some commit checks failed.


docs/RFCS/expressive_zone_config.md, line 162 [r6] (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Does this syntax work in yaml? The - and + characters might be special; we should make sure this syntax will work has intended.

Yeah.

```shell
$ ruby -ryaml -e "puts YAML.load('[ssd, +ssd, -ssd]').inspect"
["ssd", "+ssd", "-ssd"]
```

In the expanded array format you need quotes around `-`.

```yaml
---
- ssd
- +ssd
- "-ssd"
```
Comments from Reviewable

@d4l3k
Copy link
Copy Markdown
Contributor Author

d4l3k commented Aug 15, 2016

Review status: 0 of 1 files reviewed at latest revision, 8 unresolved discussions, some commit checks failed.


docs/RFCS/expressive_zone_config.md, line 162 [r6] (raw file):

Previously, d4l3k (Tristan Rice) wrote…

Yeah.

```shell
$ ruby -ryaml -e "puts YAML.load('[ssd, +ssd, -ssd]').inspect"
["ssd", "+ssd", "-ssd"]
```

In the expanded array format you need quotes around `-`.

```yaml
---
- ssd
- +ssd
- "-ssd"
```

Actually, you totally don't. That's just the recommended way of formatting it.

Comments from Reviewable

@d4l3k d4l3k merged commit b742c6e into cockroachdb:master Aug 15, 2016
@d4l3k d4l3k deleted the zoneconfig-rfc branch August 15, 2016 20:23