
Zone config updates #1027

Merged: 3 commits merged into gh-pages on Jan 26, 2017

Conversation

@jseldess (Contributor) commented Jan 25, 2017

This PR updates the zone config docs to explain locality and constraints. It also adds examples and moves replica recommendations to our recommended production settings doc.



@jseldess (Contributor Author)

@bdarnell and @petermattis, in case you want to review as well.

@jseldess force-pushed the zone-updates branch 2 times, most recently from 19b90f2 to 767af37 on January 26, 2017 at 04:03
@BramGruneir (Member)

Reviewed 4 of 4 files at r1.
Review status: all files reviewed at latest revision, 11 unresolved discussions.


configure-replication-zones.md, line 47 at r1 (raw file):

`ttlseconds` | The number of seconds overwritten values will be retained before garbage collection. Smaller values can save disk space if values are frequently overwritten; larger values increase the range allowed for `AS OF SYSTEM TIME` queries, also known as [Time Travel Queries](select.html#select-historical-data-time-travel).<br><br>It is not recommended to set this below `600` (10 minutes); doing so will cause problems for long-running queries. Also, since all versions of a row are stored in a single range that never splits, it is not recommended to set this so high that all the changes to a row in that time period could add up to more than 64MiB; such oversized ranges could contribute to the server running out of memory or other problems.<br><br>**Default:** `86400` (24 hours)
`num_replicas` | The number of replicas in the zone.<br><br>**Default:** `3` 
`constraints` | A comma-separated list of positive, required, and/or prohibited constraints influencing the location of replicas. See [Replica Constraints](#replication-constraints) for more details.<br><br>**Default:** No constraints, with CockroachDB locating each replica on a unique rack, if possible.

Unique node is still correct here. If localities are provided, there will be a preference for more diversity.
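To make the table above concrete, here is a minimal sketch (not taken from the PR) of a YAML file setting all three fields, applied with the `-f -` stdin form used later on this page; the table name `bank.accounts` is just a placeholder:

~~~ shell
# Hypothetical example: override the GC window, replication factor, and
# constraints for a single table in one command.
$ cat << EOF | cockroach zone set bank.accounts -f -
ttlseconds: 86400
num_replicas: 3
constraints: [ssd]
EOF
~~~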


configure-replication-zones.md, line 67 at r1 (raw file):

#### Constraints in Replication Zones

The node- and store-level descriptive attributes mentioned above can be used as the following types of constraints in replication zones to influence the location of replicas. However, note the following general guidance: 

node- ?


configure-replication-zones.md, line 74 at r1 (raw file):

Constraint Type | Description | Syntax
----------------|-------------|-------
**Positive** | Replicas will be placed on nodes/stores with as many matching attributes as possible. When there are no matching nodes/stores with capacity, replicas will be added wherever there is capacity.<br><br>For example, `constraints: [ssd, datacenter=us-west-1a]` would cause replicas to be located on different `ssd` drives in the `us-west-1a` datacenter. When there's not sufficient capacity for a new replica on an `ssd` drive in the datacenter, the replica would get added on an available `ssd` drive elsewhere. When there's not sufficient capacity for a new replica on any `ssd` drive, the replica would get added on any other drive with capacity. | `[ssd]`

There is no implied order to positive constraints, so in this case, the preference would be:

  1. nodes that have both ssd and us-west-1a
  2. nodes with one of ssd or us-west-1a
  3. any other available node

If locality info is provided:

  1. nodes that have both ssd and us-west-1a
  2. nodes with one of ssd or us-west-1a
  3. the most diverse available node

Also, you mentioned "added". This would also apply to rebalancing, so I don't know if "added" is technically correct. This is true for all sections here. But then again, when rebalancing, we are adding a replica and removing one. Ugh.
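As a sketch of that preference order (the attribute and datacenter names are the ones from the example row; the database name `db1` is made up):

~~~ shell
# Positive constraints: nodes matching both attributes are preferred, then
# nodes matching one, then any node (or the most diverse node) with capacity.
$ cat << EOF | cockroach zone set db1 -f -
num_replicas: 3
constraints: [ssd, datacenter=us-west-1a]
EOF
~~~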


configure-replication-zones.md, line 75 at r1 (raw file):

----------------|-------------|-------
**Positive** | Replicas will be placed on nodes/stores with as many matching attributes as possible. When there are no matching nodes/stores with capacity, replicas will be added wherever there is capacity.<br><br>For example, `constraints: [ssd, datacenter=us-west-1a]` would cause replicas to be located on different `ssd` drives in the `us-west-1a` datacenter. When there's not sufficient capacity for a new replica on an `ssd` drive in the datacenter, the replica would get added on an available `ssd` drive elsewhere. When there's not sufficient capacity for a new replica on any `ssd` drive, the replica would get added on any other drive with capacity. | `[ssd]`
**Required** | Replicas **must** be placed on nodes/stores with matching attributes. When there are no matching nodes/stores with capacity, new replicas will not be added.<br><br>For example, `constraints: [+datacenter=us-west-1a]` would force replicas to be located on different racks in the `us-west-1a` datacenter. When there's not sufficient capacity for a new replica on a unique rack in the datacenter, the replica would get added on a rack already storing a replica. When there's not sufficient capacity for a new replica on any rack in the datacenter, the replica would not get added. | `[+ssd]`

I don't think you want racks here. Use nodes or machines.

If and only if locality information is provided, then it will try to spread the nodes to the next available locality level.
So, let's say you have locality levels of (in order):
area, datacenter, sector, floor, rack...

If you set datacenter to be a required constraint and there is locality information for sectors, then it will diversify at the sector level.

If you only have 3 sectors, but 5 replicas, it would actually try to diversify at the next level (floor) as well, when placing a 2nd replica in a sector.

Also, this section can be simplified to state that only nodes with the required attributes or locality are considered.
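A rough sketch of that behavior, assuming nodes are started with `--locality` (the `sector` tier and the `db1.orders` table are invented for illustration):

~~~ shell
# Nodes advertise their locality at startup (other required flags elided).
$ cockroach start --locality=datacenter=us-west-1a,sector=1 <other flags>
$ cockroach start --locality=datacenter=us-west-1a,sector=2 <other flags>

# The required constraint limits candidates to us-west-1a; with locality
# info present, the allocator then diversifies across the remaining tiers.
$ echo 'constraints: [+datacenter=us-west-1a]' | cockroach zone set db1.orders -f -
~~~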


configure-replication-zones.md, line 76 at r1 (raw file):

**Positive** | Replicas will be placed on nodes/stores with as many matching attributes as possible. When there are no matching nodes/stores with capacity, replicas will be added wherever there is capacity.<br><br>For example, `constraints: [ssd, datacenter=us-west-1a]` would cause replicas to be located on different `ssd` drives in the `us-west-1a` datacenter. When there's not sufficient capacity for a new replica on an `ssd` drive in the datacenter, the replica would get added on an available `ssd` drive elsewhere. When there's not sufficient capacity for a new replica on any `ssd` drive, the replica would get added on any other drive with capacity. | `[ssd]`
**Required** | Replicas **must** be placed on nodes/stores with matching attributes. When there are no matching nodes/stores with capacity, new replicas will not be added.<br><br>For example, `constraints: [+datacenter=us-west-1a]` would force replicas to be located on different racks in the `us-west-1a` datacenter. When there's not sufficient capacity for a new replica on a unique rack in the datacenter, the replica would get added on a rack already storing a replica. When there's not sufficient capacity for a new replica on any rack in the datacenter, the replica would not get added. | `[+ssd]`
**Prohibited** | Replicas **must not** be placed on nodes/stores with matching attributes. When there are no alternate nodes/stores with capacity, new replicas will not be added.<br><br>For example, `constraints: [-mem, -us-west-1a]` would force replicas to be located on-disk on different racks outside the `us-west-1a` datacenter. When there's not sufficient on-disk capacity on a unique rack outside the datacenter, the replica would get added on a rack already storing a replica. When there's sufficient capacity for a new replica only in the `us-west-1a` datacenter, the replica would not get added. | `[-ssd]`

Again, no racks here. That's very specific.

I think the language in this needs to be updated. If you have a prohibited constraint, when looking for where to place a replica, it will simply ignore all nodes that are prohibited and proceed accordingly.
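For example, a sketch of that simpler framing (the `db1.archive` table name is a placeholder):

~~~ shell
# Prohibited constraints: nodes/stores tagged `mem` or located in us-west-1a
# are excluded from consideration entirely; placement proceeds over the rest.
$ echo 'constraints: [-mem, -us-west-1a]' | cockroach zone set db1.archive -f -
~~~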


configure-replication-zones.md, line 173 at r1 (raw file):

### Edit the Default Replication Zone

To edit the default replication zone, create a YAML file with changes, and use the `cockroach zone set .default -f <file.yaml>` command with appropriate flags:

Please add a comment that adjusting the default config is dangerous and should be avoided. Maybe make this the final example?
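For reference, a minimal sketch of the flow described in that sentence (the file name is arbitrary):

~~~ shell
# Write only the values you want to change, then apply them to the default zone.
$ cat > default_zone.yaml << EOF
num_replicas: 5
EOF
$ cockroach zone set .default -f default_zone.yaml
~~~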


configure-replication-zones.md, line 199 at r1 (raw file):

~~~ shell
$ echo 'num_replicas: 5' | cockroach zone set .default -f -
~~~

Please mention how values not mentioned remain the same, so you can change the num_replicas without affecting the other values.
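One way to show that, assuming the `cockroach zone get` subcommand is available to print the resulting config:

~~~ shell
# Only num_replicas is overridden; ttlseconds and constraints keep their
# previous values, which the follow-up get should confirm.
$ echo 'num_replicas: 5' | cockroach zone set .default -f -
$ cockroach zone get .default
~~~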


configure-replication-zones.md, line 256 at r1 (raw file):

~~~
num_replicas: 7
constraints: []
~~~

How about a section on how to write a YAML file first, then the specific commands to update a table, a db, or the default.
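A sketch of what that structure could look like: one YAML file reused against the three targets (the database and table names are placeholders):

~~~ shell
$ cat > zone.yaml << EOF
num_replicas: 7
constraints: []
EOF
$ cockroach zone set .default -f zone.yaml       # the cluster-wide default
$ cockroach zone set db1 -f zone.yaml            # a specific database
$ cockroach zone set db1.customers -f zone.yaml  # a specific table
~~~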


recommended-production-settings.md, line 14 at r1 (raw file):

## Cluster Topology

When running a cluster with more than one node, each replica will be on a different node and a majority of replicas must remain available for the cluster to make progress. Therefore:

Perhaps elaborate on majority here with some examples. Ah, I see you mention it later, but don't use majority and instead use quorum. Please pick one or the other.


recommended-production-settings.md, line 20 at r1 (raw file):

- Run one node per machine. Since CockroachDB replicates across nodes, running more than one node per machine increases the risk of data unavailability if a machine fails.

- If a machine has multiple disks or SSDs, it's better to run one node with multiple `--store` flags instead of one node per disk, because this informs CockroachDB about the relationship between the stores and ensures that data will be replicated across different machines instead of being assigned to different disks of the same machine. For more details about stores, see [Start a Node](start-a-node.html).

I think there's also an advantage for RocksDB here. Something about shared memory. I don't remember if we actually use it or not. Suffice it to say, it is significantly more efficient as well.
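For illustration, a sketch of the recommended shape (the store paths are made up, and other required flags are elided):

~~~ shell
# One node with one store per disk, instead of one node per disk.
$ cockroach start --store=/mnt/disk1 --store=/mnt/disk2 <other flags>
~~~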


start-a-node.md, line 45 at r1 (raw file):

`--max-sql-memory` | The total size for storage of temporary data for SQL clients, including prepared queries and intermediate data rows during query execution. This can be in any bytes-based unit, for example:<br><br>`--max-sql-memory=1000000000 ----> 1000000000 bytes`<br>`--max-sql-memory=1GB ----> 1000000000 bytes`<br>`--max-sql-memory=1GiB ----> 1073741824 bytes`<br><br>**Default:** 25% of total system memory (excluding swap), or 512MiB if the memory size cannot be determined
`--port`<br>`-p` | The port to bind to for internal and client communication. <br><br>**Env Variable:** `COCKROACH_PORT`<br>**Default:** `26257`
`--raft-tick-interval` | CockroachDB uses the [Raft consensus algorithm](https://raft.github.io/) to replicate data consistently according to your [replication zone configuration](configure-replication-zones.html). For each replica group, an elected leader heartbeats its followers and keeps their logs replicated. When followers fail to receive heartbeats, a new leader is elected. <br><br>This flag is factored into defining the interval at which replica leaders heartbeat followers. It is not recommended to change the default, but if changed, it must be changed identically on all nodes in the cluster.<br><br>**Default:** 200ms 

And I think every machine must be stopped and restarted with the new value.
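A sketch combining the flags from this table (values are illustrative only; per the comment above, `--raft-tick-interval` is best left at its default, and any change would have to be applied identically on every node, with each node restarted):

~~~ shell
$ cockroach start --port=26257 --max-sql-memory=1GiB <other flags>
~~~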




- If a machine has multiple disks or SSDs, it's better to run one node with multiple `--store` flags instead of one node per disk, because this informs CockroachDB about the relationship between the stores and ensures that data will be replicated across different machines instead of being assigned to different disks of the same machine. For more details about stores, see [Start a Node](start-a-node.html).

- Configurations with odd numbers of replicas are more robust than those with even numbers. Clusters of three and four nodes can each tolerate one node failure and still reach a quorum (2/3 and 3/4 respectively), so the fourth replica doesn't add any extra fault-tolerance. To survive two simultaneous failures, you must have five replicas.

- When replicating across datacenters, you should use datacenters on a single continent to ensure performance (cross-continent scenarios will be better supported in the future).

@a-robinson (Contributor) commented on recommended-production-settings.md, line 24 at r1:

Should we recommend adding locality labels as a best practice in these cases?

@jseldess (Contributor Author)

Review status: all files reviewed at latest revision, 12 unresolved discussions.


configure-replication-zones.md, line 47 at r1 (raw file):

Previously, BramGruneir (Bram Gruneir) wrote…

Unique node is still correct here. If localities are provided, there will be a preference for more diversity.

Done.


configure-replication-zones.md, line 67 at r1 (raw file):

Previously, BramGruneir (Bram Gruneir) wrote…

node- ?

Yeah, when two hyphenated words share a suffix, that's actually a way to save space instead of writing out node-level and store-level. But for clarity, I'll change it to the latter.


configure-replication-zones.md, line 74 at r1 (raw file):

Previously, BramGruneir (Bram Gruneir) wrote…

There is no implied order to positive constraints, so in this case, the preference would be:

  1. nodes that have both ssd and us-west-1a
  2. nodes with one of ssd or us-west-1a
  3. any other available node

If locality info is provided:

  1. nodes that have both ssd and us-west-1a
  2. nodes with one of ssd or us-west-1a
  3. the most diverse available node

Also, you mentioned "added". This would also apply to rebalancing, so I don't know if "added" is technically correct. This is true for all sections here. But then again, when rebalancing, we are adding a replica and removing one. Ugh.

I've made a few updates, in the overall intro and the intro to Replica Constraints, to clarify that zone configs apply both when replicas are first added and when they are rebalanced. Sometime soon, I'll want to add another page explaining rebalancing and link to that.

But I still need to find a way to simplify these definitions while conveying the complexity of various scenarios. Not sure how to do that. Will keep working and push additional commits, but it'd be great to get input from @bdarnell as well.


configure-replication-zones.md, line 173 at r1 (raw file):
Just chatted with @bdarnell about this. He isn't aware of any reason why changing the default zone config would be dangerous. To quote him:

If it's just that it may affect a large amount of data, then we may want a warning that reconfiguring large tables or databases may impact performance, but i don't see any reason to warn on the default zone specifically.

But even on that point, it's not clear it's worth adding a warning unless we know that this is true. Ben didn't know if we've tested this yet.


configure-replication-zones.md, line 199 at r1 (raw file):

Previously, BramGruneir (Bram Gruneir) wrote…

Please mention how values not mentioned remain the same, so you can change the num_replicas without affecting the other values.

I've worked that into all of the basic examples, i.e., create a YAML file defining only the values you want to change (other values will not be affected).


configure-replication-zones.md, line 256 at r1 (raw file):

Previously, BramGruneir (Bram Gruneir) wrote…

How about a section on how to write a yaml file first, then the specific commands to update a table, a db, or the default.

We define the YAML format further up the page, and provide plenty of examples. I don't think we need to go too much further into explaining how to write one of these, unless we get questions.


recommended-production-settings.md, line 14 at r1 (raw file):

Previously, BramGruneir (Bram Gruneir) wrote…

Perhaps elaborate on majority here with some examples. Ah, I see you mention it later, but don't use majority and instead use quorum. Please pick one or the other.

Done.


recommended-production-settings.md, line 20 at r1 (raw file):

Previously, BramGruneir (Bram Gruneir) wrote…

I think there's also an advantage for rocksdb here. Something about shared memory. I don't remember if we actually use it or not. Suffice to say, it is significantly more efficient as well.

Made a minor language change here. Let me know if you'd suggest more.


start-a-node.md, line 45 at r1 (raw file):

Previously, BramGruneir (Bram Gruneir) wrote…

And I think every machine must be stopped and restarted with the new value.

Done.



@bdarnell (Member)

LGTM


Review status: 1 of 4 files reviewed at latest revision, 15 unresolved discussions, some commit checks failed.


configure-replication-zones.md, line 61 at r2 (raw file):

Attribute Type | Description
---------------|------------
**Node Locality** | Using the `--locality` flag, you can assign arbitrary key-value pairs that describe the locality of the node. Locality might include cloud provider, country, region, datacenter, rack, etc.<br><br>CockroachDB attempts to spread replicas evenly across the cluster based on locality, with the order determining the priority. The keys themselves and the order of key-value pairs must be the same on all nodes, for example:<br><br>`--locality=cloud=gce,country=us,region=east,datacenter=us-east-1,rack=10`<br>`--locality=cloud=aws,country=us,region=west,datacenter=us-west-1,rack=12`

"cloud" probably shouldn't be in the locality list (or if it is, it belongs after "region"). Two different cloud providers' "east" regions are closer together than other regions of the same provider. I think "cloud" probably makes more sense as a non-locality attribute.

I'd also simplify this a bit - it's going to be rare for anyone to use more than 1 or 2 levels here (until we get to much larger deployments). I'd trim the example to "region" and "datacenter".
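A sketch of the trimmed example suggested here, using just `region` and `datacenter` with the same key order on every node:

~~~ shell
$ cockroach start --locality=region=east,datacenter=us-east-1 <other flags>
$ cockroach start --locality=region=west,datacenter=us-west-1 <other flags>
~~~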


configure-replication-zones.md, line 62 at r2 (raw file):

---------------|------------
**Node Locality** | Using the `--locality` flag, you can assign arbitrary key-value pairs that describe the locality of the node. Locality might include cloud provider, country, region, datacenter, rack, etc.<br><br>CockroachDB attempts to spread replicas evenly across the cluster based on locality, with the order determining the priority. The keys themselves and the order of key-value pairs must be the same on all nodes, for example:<br><br>`--locality=cloud=gce,country=us,region=east,datacenter=us-east-1,rack=10`<br>`--locality=cloud=aws,country=us,region=west,datacenter=us-west-1,rack=12`
**Node Capability** | Using the `--attrs` flag, you can specify node capability, which might include specialized hardware or number of cores, for example:<br><br>`--attrs=gpu:x16c`

We don't currently support GPUs, so this is an odd example. How about ram:64gb?
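For example, the suggested replacement attribute at node startup (other flags elided):

~~~ shell
$ cockroach start --attrs=ram:64gb <other flags>
~~~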


recommended-production-settings.md, line 20 at r2 (raw file):

- Run one node per machine. Since CockroachDB replicates across nodes, running more than one node per machine increases the risk of data unavailability if a machine fails.

- If a machine has multiple disks or SSDs, it's more efficient to run one node with multiple `--store` flags than one node per disk, because this informs CockroachDB about the relationship between the stores and ensures that data will be replicated across different machines instead of being assigned to different disks of the same machine. For more details about stores, see [Start a Node](start-a-node.html).

I think "better" is more appropriate than "more efficient" here. The problem with running one node per disk isn't that it's less efficient (is it? we haven't measured), it's the one from the previous bullet - an increased risk of data loss due to the failure of a node that held multiple replicas. It might make more sense to combine these bullets.



@jseldess (Contributor Author)

Review status: 1 of 4 files reviewed at latest revision, 15 unresolved discussions.


configure-replication-zones.md, line 74 at r1 (raw file):

Previously, jseldess wrote…

I've made a few updates, in the overall intro and the intro to Replica Constraints, to clarify that zone configs apply both when replicas are first added and when they are rebalanced. Sometime soon, I'll want to add another page explaining rebalancing and link to that.

But I still need to find a way to simplify these definitions while conveying the complexity of various scenarios. Not sure how to do that. Will keep working and push additional commits, but it'd be great to get input from @bdarnell as well.

I've decided to revise and simplify the descriptions of constraint types for now. I want to take more time to work with @BramGruneir to get additional details right.


configure-replication-zones.md, line 75 at r1 (raw file):

Previously, BramGruneir (Bram Gruneir) wrote…

I don't think you want racks here. Use nodes or machines.

If and only if locality information is provided, then it will try to spread the nodes to the next available locality level.
So, let's say you have locality levels of (in order):
area, datacenter, sector, floor, rack...

If you set datacenter to be a required constraint and there is locality information for sectors, then it will diversify at the sector level.

If you only have 3 sectors, but 5 replicas, it would actually try to diversify at the next level (floor) as well, when placing a 2nd replica in a sector.

Also, this section can be simplified to state that only nodes with the required attributes or locality are considered.

Same as above: Simplifying for now. Will add additional details later.


configure-replication-zones.md, line 76 at r1 (raw file):

Previously, BramGruneir (Bram Gruneir) wrote…

Again, no racks here. That's very specific.

I think the language in this needs to be updated. If you have a prohibited constraint, when looking for where to place a replica, it will simply ignore all nodes that are prohibited and proceed accordingly.

Same as above.


configure-replication-zones.md, line 61 at r2 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

"cloud" probably shouldn't be in the locality list (or if it is, it belongs after "region"). Two different cloud providers' "east" regions are closer together than other regions of the same provider. I think "cloud" probably makes more sense as a non-locality attribute.

I'd also simplify this a bit - it's going to be rare for anyone to use more than 1 or 2 levels here (until we get to much larger deployments). I'd trim the example to "region" and "datacenter".

Done.


configure-replication-zones.md, line 62 at r2 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

We don't currently support GPUs, so this is an odd example. How about ram:64gb?

Done.


recommended-production-settings.md, line 24 at r1 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

Should we recommend adding locality labels as a best practice in these cases?

Added that suggestion with a link to the new "Even Replication Across Datacenters" example.


recommended-production-settings.md, line 20 at r2 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

I think "better" is more appropriate than "more efficient" here. The problem with running one node per disk isn't that it's less efficient (is it? we haven't measured), it's the one from the previous bullet - an increased risk of data loss due to the failure of a node that held multiple replicas. It might make more sense to combine these bullets.

Done.



@jseldess merged commit 64559b7 into gh-pages on Jan 26, 2017
@jseldess deleted the zone-updates branch on January 26, 2017 at 22:31