Topology pattterns #4235

Merged: @lnhsingh merged 1 commit into master from topology-pattterns on Mar 25, 2019

Conversation

@lnhsingh (Contributor) opened this pull request:

Closes #2935.

@cockroach-teamcity (Member):

This change is Reviewable

@lnhsingh (Contributor Author) left a comment:

Nice to have: illustrations of the cluster patterns

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


v2.2/cluster-topology-patterns.md, line 183 at r1 (raw file):

#### Performance expectations

<!-- Add -->

Need help adding performance expectations for this cluster pattern


v2.2/cluster-topology-patterns.md, line 227 at r1 (raw file):

#### Performance expectations

<!-- Add -->

Need help adding performance expectations for this cluster pattern


v2.2/cluster-topology-patterns.md, line 231 at r1 (raw file):

#### Application expectations

<!-- Add -->

Need help adding app expectations for this cluster pattern


v2.2/cluster-topology-patterns.md, line 270 at r1 (raw file):

#### Performance expectations

<!-- Add -->

Need help adding performance expectations for this cluster pattern


v2.2/cluster-topology-patterns.md, line 274 at r1 (raw file):

#### Application expectations

<!-- Add -->

Need help adding app expectations for this cluster pattern


v2.2/cluster-topology-patterns.md, line 431 at r1 (raw file):

## Anti-patterns

_Do we want to add a section for bad patterns (i.e., two datacenters, even # of replicas)?_

Thoughts here? If yes, what would be helpful?

@lnhsingh changed the title from "[WIP] Topology pattterns" to "Topology pattterns" on Feb 21, 2019
@rolandcrosby left a comment:

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @lhirata and @rolandcrosby)

a discussion (no related file):
Sorry this has taken me so long. Overall I think we need to provide more context on when you would want to use one of these patterns, rather than just showing the diagrams and their characteristics.



v2.2/cluster-topology-patterns.md, line 431 at r1 (raw file):

Previously, lhirata wrote…

Thoughts here? If yes, what would be helpful?

this is a good place to mention that you almost certainly don't want a replication factor equal to the number of nodes in your cluster
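
The replication factor being discussed is a per-zone setting; a minimal sketch of how it could be inspected and adjusted for the cluster-wide `default` zone, assuming the v2.1+ `CONFIGURE ZONE` syntax:

~~~ sql
-- Inspect the current replication factor (num_replicas) for the default zone.
SHOW ZONE CONFIGURATION FOR RANGE default;

-- Explicitly keep the default of 3 replicas per range rather than
-- raising it to match the node count.
ALTER RANGE default CONFIGURE ZONE USING num_replicas = 3;
~~~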


v2.2/cluster-topology-patterns.md, line 7 at r2 (raw file):

---

This is page covers common cluster topology patterns with setup examples and performance considerations.

minor: "This page"


v2.2/cluster-topology-patterns.md, line 11 at r2 (raw file):

## Considerations

When selecting a pattern for your cluster, the following must be taken into consideration:

probably should be more explicit about the fact that these are all tradeoffs that you need to balance - also worth adding a bullet about replication factor


v2.2/cluster-topology-patterns.md, line 35 at r2 (raw file):

- `App` is an application that accesses CockroachDB
- `HA-Proxy` is a software based load balancer

minor: change to "HAProxy" throughout
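
Related: CockroachDB can also generate a starter HAProxy config, which the page could show; a sketch assuming an insecure cluster reachable on `localhost:26257`:

~~~ shell
# Generate an haproxy.cfg that load balances across all nodes in the cluster.
cockroach gen haproxy --insecure --host=localhost --port=26257

# Run HAProxy with the generated config.
haproxy -f haproxy.cfg
~~~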


v2.2/cluster-topology-patterns.md, line 38 at r2 (raw file):

- `1`, `2`, and `3` each represents a CockroachDB node
- The nodes are all running in a single datacenter

other things to mention in the bullet points for each configuration:

  • all CockroachDB nodes are expected to be able to communicate with each other
  • using default replication factor of 3

v2.2/cluster-topology-patterns.md, line 138 at r2 (raw file):

- Each region defines an availability zone, and three or more regions are recommended.
- Can survive a single datacenter failure

can you add details of how/why this can survive a single datacenter failure? i.e. what combination of zone configs, localities, and replication factors causes this to be the case?


v2.2/cluster-topology-patterns.md, line 151 at r2 (raw file):

         Clients
           |
          GSLB

what's this stand for?


v2.2/cluster-topology-patterns.md, line 163 at r2 (raw file):

  West---Central ---East
    \               /
     \ CockroachDB /  

not sure what this is supposed to represent


v2.2/cluster-topology-patterns.md, line 193 at r2 (raw file):

### High-Performance

Some applications have high-performance requirements. In the diagram below, `NJ` and `NY` depict two separate datacenters that are connected by a high bandwidth low-latency network:

should make it clear that each datacenter here is 3+ nodes


v2.2/cluster-topology-patterns.md, line 214 at r2 (raw file):

- `NJ` and `NY` have the performance characteristics of the [local topology](#single-local-datacenter-clusters), but the benefit of Zero RPO and near Zero RTO disaster recovery SLA.
- `CA` and `NV` have been set up with a network capability

what does this mean?


v2.2/cluster-topology-patterns.md, line 274 at r2 (raw file):

## Partitioned clusters

Are these really different topologies or just special cases of the above topologies?

@lnhsingh (Contributor Author) left a comment:

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @lhirata and @rolandcrosby)


v2.2/cluster-topology-patterns.md, line 151 at r2 (raw file):

Previously, rolandcrosby (Roland Crosby) wrote…

what's this stand for?

I assumed "Global Server Load Balancer" but not sure. Will double check with SE


v2.2/cluster-topology-patterns.md, line 193 at r2 (raw file):

Previously, rolandcrosby (Roland Crosby) wrote…

should make it clear that each datacenter here is 3+ nodes

Done.

@lnhsingh changed the title from "Topology pattterns" to "(WIP) Topology pattterns" on Mar 4, 2019
@lnhsingh changed the title from "(WIP) Topology pattterns" to "Topology pattterns" on Mar 7, 2019
@jseldess (Contributor) left a comment:

@lhirata, as discussed, I made some direct edits to the single-region patterns. The multi-region patterns still need a bunch of work to clarify what we're talking about. Please discuss with Roko and Roland. If you need support, I'm happy to chip in more, especially since I just finished a geo-partitioning demo and so can somewhat easily reuse some of that content.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jseldess, @lhirata, and @rolandcrosby)


v19.1/cluster-topology-patterns.md, line 95 at r4 (raw file):

In this example, the cluster has an asymmetrical setup where `Central` is closer to the `West` than the `East`. This configuration will provide better write latency to the write workloads in the `West` and `Central` because there is a lower latency (versus writing in the `East`). This is assuming you are not using zone configurations.

<img src="{{ 'images/v19.1/topology-patterns/basic-multi-region.png' | relative_url }}" alt="Basic pattern for multi-region" style="border:1px solid #eee;max-width:100%" />

This image doesn't give a clear sense of 3 datacenters in each region and it's missing load balancers. Without those details, the text below is a lot harder to follow since it doesn't map to the image.


v19.1/cluster-topology-patterns.md, line 119 at r4 (raw file):

If all of the nodes for a preferred locality are down, then the app will try databases in other localities. The cluster can withstand a datacenter failure. In general, multi-regions can help protect against natural disaster.

**Performance expectations**

We need to do more to explain and show how having a range's replicas spread across 3 regions impacts read and write performance. The slides I have for the geo-partitioning demo can help here, I think.

Without doing this, the partitioning example below is less clear.


v19.1/cluster-topology-patterns.md, line 143 at r4 (raw file):

- All CockroachDB nodes communicate with each other
- Tables are [partitioned](partitioning.html) at row-level by locality.
- Rows with the `West` partition have their leaseholder in the `West` datacenter.

Why are we saying that only the leaseholder is in the west datacenter, etc.? We're talking about all data.


v19.1/cluster-topology-patterns.md, line 150 at r4 (raw file):

~~~
--loc=Region=East
~~~

These would probably be like: `--locality=region=us-east1,datacenter=us-east1-a`
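
For reference, a full start command with that flag would look roughly like this (addresses and the store path are placeholders):

~~~ shell
# Start a node in the us-east1 region, us-east1-a datacenter/AZ.
cockroach start \
--insecure \
--store=node1 \
--locality=region=us-east1,datacenter=us-east1-a \
--join=<node1 address>,<node2 address>,<node3 address>
~~~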


v19.1/cluster-topology-patterns.md, line 162 at r4 (raw file):

- Reads respond in a few milliseconds.
- Writes respond in 60ms.

Why are writes expected to be so high? If data is fully partitioned by locality, all reads and writes should be 2-4 milliseconds.


v19.1/cluster-topology-patterns.md, line 98 at r5 (raw file):

- A software-based load balancer directs traffic to any of the regions' nodes at random.
- Every region has 3 datacenters.
- Similar to the [single-datacenter](#single-region-clusters) topology, more regions can be added dynamically

I don't understand what this and the following bullet are saying. For this pattern, I think it's very important for us to emphasize setting --locality the right way (with region and datacenter per node) and what that gives you: Without any special replication controls, CockroachDB will spread each range across the 3 regions.

@lnhsingh requested a review from rkruze on March 12, 2019 at 22:37
@lnhsingh (Contributor Author) left a comment:

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jseldess, @lhirata, @rkruze, and @rolandcrosby)


v19.1/cluster-topology-patterns.md, line 95 at r4 (raw file):

Previously, jseldess (Jesse Seldess) wrote…

This image doesn't give a clear sense of 3 datacenters in each region and it's missing load balancers. Without those details, the text below is a lot harder to follow since it doesn't map to the image.

Done.


v19.1/cluster-topology-patterns.md, line 143 at r4 (raw file):

Previously, jseldess (Jesse Seldess) wrote…

Why are we saying that only the leaseholder is in the west datacenter, etc.? We're talking about all data.

Tried to clarify. Lmk if this is still confusing or incorrect.


v19.1/cluster-topology-patterns.md, line 150 at r4 (raw file):

Previously, jseldess (Jesse Seldess) wrote…

These would probably be like: --locality=region=us-east1,datacenter=us-east1-a

Done.


v19.1/cluster-topology-patterns.md, line 162 at r4 (raw file):

Previously, jseldess (Jesse Seldess) wrote…

Why are writes expected to be so high? If data is fully partitioned by locality, all reads and writes should be 2-4 milliseconds.

Done.


v19.1/cluster-topology-patterns.md, line 98 at r5 (raw file):

Previously, jseldess (Jesse Seldess) wrote…

I don't understand what this and the following bullet are saying. For this pattern, I think it's very important for us to emphasize setting --locality the right way (with region and datacenter per node) and what that gives you: Without any special replication controls, CockroachDB will spread each range across the 3 regions.

Done.


v19.1/cluster-topology-patterns.md, line 149 at r6 (raw file):

- Rows with the `region=us-east` partition have their leaseholder constrained to a `us-east-b` datacenter.

**Availability expectations**

Are these availability expectations correct? What happens if you were to lose a whole region?

@lnhsingh (Contributor Author) commented on Mar 12, 2019

@jseldess (Contributor) left a comment:

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jseldess, @lhirata, @rkruze, and @rolandcrosby)

a discussion (no related file):

Previously, rolandcrosby (Roland Crosby) wrote…

Sorry this has taken me so long. Overall I think we need to provide more context on when you would want to use one of these patterns, rather than just showing the diagrams and their characteristics.

Getting close, @lhirata. Please see my comments for another round.



v19.1/cluster-topology-patterns.md, line 149 at r6 (raw file):

Previously, lhirata wrote…

Are these availability expectations correct? What happens if you were to lose a whole region?

As mentioned above, you want 3 datacenters in each region. With that configuration, your bullets are no longer correct. Please replace with this:

  • The cluster as a whole can withstand a regional failure because system-level ranges have their replicas balanced across regions. However, because user data is partitioning and pinned to specific regions, region-specific data will be unavailable during a regional failure.
  • Within a region, partitions pinned to the region will remain available as long as 2/3 datacenters are up.

@roko or @bdarnell, I'd like your confirmation that I've gotten this right.


v19.1/cluster-topology-patterns.md, line 165 at r6 (raw file):

- Do not deploy to 2 datacenters. A cluster across 2 datacenters is not protected against datacenter failure and can lead to a [split-brain scenario](https://en.wikipedia.org/wiki/Split-brain_(computing)). For CockroachDB to work from a resiliency standpoint, it is best practice to deploy your cluster across 3 or more datacenters.
- Do not deploy to regions with high network latency (e.g., `us-west`, `asia`, and `europe`) without using partitioning.

Make partitioning a link to https://www.cockroachlabs.com/docs/v19.1/partitioning.html.


v19.1/cluster-topology-patterns.md, line 39 at r8 (raw file):

- The 3 nodes are all running in a single datacenter.
- The cluster is using the default replication factor of 3 (represented by 3 blocks of the same color). Each range (e.g., `r1`) has 3 replicas, with each replica on a different node.
- All CockroachDB nodes communicate with each other

I still don't think this is necessary to say. It's never going to change, regardless of the pattern. I'd remove this here and from other patterns.


v19.1/cluster-topology-patterns.md, line 93 at r8 (raw file):

- The cluster is using a replication factor of 5 (represented by 5 blocks of the same color). Each range (e.g., `r1`) has 5 replicas, with each replica on a different node.

I don't think we should focus on rep factor of 5 here. With this setup (3 regions, 2 datacenters per region), with a rep factor of 3, you're already tolerant to an entire region failure. You don't get much more from rep factor of 5 in most cases, except more write latency.

So let's reduce this to rep factor of 3 and update the diagram accordingly.


v19.1/cluster-topology-patterns.md, line 95 at r8 (raw file):

- The cluster is using a replication factor of 5 (represented by 5 blocks of the same color). Each range (e.g., `r1`) has 5 replicas, with each replica on a different node.
- All CockroachDB nodes communicate with each other
- Similar to the [single-datacenter](#single-region-clusters) topology, more regions can be added dynamically.

This doesn't mean much to me. I'd remove.


v19.1/cluster-topology-patterns.md, line 101 at r8 (raw file):
If we reduce the rep factor to 3, we'll need to update this to something like:

The cluster can withstand a regional failure because, with --locality specified on each node as shown above, the cluster balances each range across all 3 regions; with one region down, each range still has a majority of its replicas (2/3).


v19.1/cluster-topology-patterns.md, line 105 at r8 (raw file):

**Performance expectations**

- The latency numbers (e.g., `60ms`) in the first diagram represent network round-trip from one datacenter to another.

I think you mean from one region to another?


v19.1/cluster-topology-patterns.md, line 106 at r8 (raw file):

- The latency numbers (e.g., `60ms`) in the first diagram represent network round-trip from one datacenter to another.
- [Follow-the-workload](demo-follow-the-workload.html) will increase the speed for reads.

I still don't think these last 2 bullets are detailed enough. Here's my suggestion:

  • For reads, if the gateway node (the node the app connects to) is in the region containing the leaseholder replica of the relevant range, latency should be around 2ms. If the gateway node is in a region that does not contain the leaseholder, the cluster will route the request to the node with the leaseholder in another region, that node will retrieve the data, and then the cluster will return the data to the gateway node. In this case, the network round-trips from one region to another will add latency. In some cases, follow-the-workload will increase the speed for reads by moving the leaseholder closer to the application.
  • For writes, because a majority of replicas are always required to agree before a write is committed, latencies will be as fast as the slowest quorum between 2 regions.
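
One way to sanity-check these expectations is to look up where a range's leaseholder currently sits; a sketch assuming a hypothetical `users` table:

~~~ sql
-- Shows each range of the table with its replica node IDs and current leaseholder;
-- combined with each node's --locality, this tells you which region serves reads.
SHOW EXPERIMENTAL_RANGES FROM TABLE users;
~~~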

v19.1/cluster-topology-patterns.md, line 126 at r8 (raw file):

- A client connects to geographically close `app` server via `GSLB`.
- Inside each region, an `app` server connects to one of the CockroachDB nodes within the region through a software-based `load balancer`.
- Every region has 3 nodes across 2 datacenters (e.g., `us-west-a`, `us-west-b`). Note that most cloud providers only have 2 datacenters per region. Each node is started with the `--locality` flag to identify which region it is in:

Oh, this is not ideal. You want 3 nodes across 3 datacenters. That's most resilient and what we should show here. You'll have to adjust the diagram. Also the availability expectations below.


v19.1/cluster-topology-patterns.md, line 138 at r8 (raw file):

- The cluster is using a replication factor of 3 (represented by the 3 blocks of the same color). Each range (e.g., `r1`) has a prefix (`w-` for West, `c-` for Central, `e-` for East), which denotes the partition that is replicated.
- Leaseholders are denoted by a dashed line. Using [zone configurations](configure-replication-zones.html), leaseholders can be pinned (represented by the `x`) to a datacenter close to the users.

I don't think we should focus on pinning leaseholders. With partitioning and zone configs, you pin entire ranges, including the leaseholder replica. This is a useful feature, but not necessary in combination with partitioning, so let's remove this bullet. Also remove the pinned leaseholder aspect from the diagram.

Let's also remove the next bullet and the "However..." sentence and just follow this with "Tables are partitioned...".


v19.1/cluster-topology-patterns.md, line 143 at r8 (raw file):

However, to make the cluster more performant, you need to add [partitions](partitioning.html) (an enterprise-only feature). In this example:

- Tables are [partitioned](partitioning.html) at the row level by locality.

We probably should add an example, like we have for locality.
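
Something along these lines might work, assuming a hypothetical `users` table whose primary key begins with a `region` column (partition columns must be a prefix of the primary key):

~~~ sql
-- Partition the table at the row level by the value of its region column.
ALTER TABLE users PARTITION BY LIST (region) (
    PARTITION us_west VALUES IN ('us-west'),
    PARTITION us_central VALUES IN ('us-central'),
    PARTITION us_east VALUES IN ('us-east')
);
~~~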


v19.1/cluster-topology-patterns.md, line 145 at r8 (raw file):

- Tables are [partitioned](partitioning.html) at the row level by locality.
- Partition replicas are distributed among the 3 nodes within each region.
- Rows with the `region=us-west` partition have their leaseholder constrained to a `us-west-b` datacenter.

Maybe we should combine this and the next 2 bullets and add examples like we do for locality.

Alternatively, we can link to this example for more insight into the specific commands required: https://www.cockroachlabs.com/docs/dev/partitioning.html#define-table-partitions-by-list.

I'll also be adding a geo-partitioning tutorial very soon.
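
And for the pinning itself, the zone-config side would look roughly like this, reusing the hypothetical `us_west` partition from the sketch above:

~~~ sql
-- Constrain all replicas of the us_west partition (leaseholder included)
-- to nodes started with --locality=region=us-west,...
ALTER PARTITION us_west OF TABLE users
    CONFIGURE ZONE USING constraints = '[+region=us-west]';
~~~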

@lnhsingh (Contributor Author) left a comment:

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jseldess, @lhirata, @rkruze, and @rolandcrosby)


v19.1/cluster-topology-patterns.md, line 149 at r6 (raw file):

Previously, jseldess (Jesse Seldess) wrote…

As mentioned above, you want 3 datacenters in each region. With that configuration, your bullets are no longer correct. Please replace with this:

  • The cluster as a whole can withstand a regional failure because system-level ranges have their replicas balanced across regions. However, because user data is partitioning and pinned to specific regions, region-specific data will be unavailable during a regional failure.
  • Within a region, partitions pinned to the region will remain available as long as 2/3 datacenters are up.

@roko or @bdarnell, I'd like your confirmation that I've gotten this right.

Updated.


v19.1/cluster-topology-patterns.md, line 165 at r6 (raw file):

Previously, jseldess (Jesse Seldess) wrote…

Make partitioning a link to https://www.cockroachlabs.com/docs/v19.1/partitioning.html.

Done.


v19.1/cluster-topology-patterns.md, line 39 at r8 (raw file):

Previously, jseldess (Jesse Seldess) wrote…

I still don't think this is necessary to say. It's never going to change, regardless of the pattern. I'd remove this here and from other patterns.

Done.


v19.1/cluster-topology-patterns.md, line 93 at r8 (raw file):

Previously, jseldess (Jesse Seldess) wrote…

I don't think we should focus on rep factor of 5 here. With this setup (3 regions, 2 datacenters per region), with a rep factor of 3, you're already tolerant to an entire region failure. You don't get much more from rep factor of 5 in most cases, except more write latency.

So let's reduce this to rep factor of 3 and update the diagram accordingly.

Done.


v19.1/cluster-topology-patterns.md, line 95 at r8 (raw file):

Previously, jseldess (Jesse Seldess) wrote…

This doesn't mean much to me. I'd remove.

Done.


v19.1/cluster-topology-patterns.md, line 101 at r8 (raw file):

Previously, jseldess (Jesse Seldess) wrote…

If we reduce the rep factor to 3, we'll need to update this to something like:

The cluster can withstand a regional failure because, with --locality specified on each node as shown above, the cluster balances each range across all 3 regions; with one region down, each range still has a majority of its replicas (2/3).

Done.


v19.1/cluster-topology-patterns.md, line 105 at r8 (raw file):

Previously, jseldess (Jesse Seldess) wrote…

I think you mean from one region to another?

Done.


v19.1/cluster-topology-patterns.md, line 106 at r8 (raw file):

Previously, jseldess (Jesse Seldess) wrote…

I still don't think these last 2 bullets are detailed enough. Here's my suggestion:

  • For reads, if the gateway node (the node the app connects to) is in the region containing the leaseholder replica of the relevant range, latency should be around 2ms. If the gateway node is in a region that does not contain the leaseholder, the cluster will route the request to the node with the leaseholder in another region, that node will retrieve the data, and then the cluster will return the data to the gateway node. In this case, the network round-trips from one region to another will add latency. In some cases, follow-the-workload will increase the speed for reads by moving the leaseholder closer to the application.
  • For writes, because a majority of replicas are always required to agree before a write is committed, latencies will be as fast as the slowest quorum between 2 regions.

Done.


v19.1/cluster-topology-patterns.md, line 126 at r8 (raw file):

Previously, jseldess (Jesse Seldess) wrote…

Oh, this is not ideal. You want 3 nodes across 3 datacenters. That's most resilient and what we should show here. You'll have to adjust the diagram. Also the availability expectations below.

Since most cloud providers only have 2 datacenters per region, do we assume that they are using multiple cloud providers to have 3+ datacenters per region? Does this need to be represented in the diagram or is us-west-a/b/c enough?


v19.1/cluster-topology-patterns.md, line 138 at r8 (raw file):

Previously, jseldess (Jesse Seldess) wrote…

I don't think we should focus on pinning leaseholders. With partitioning and zone configs, you pin entire ranges, including the leaseholder replica. This is a useful feature, but not necessary in combination with partitioning, so let's remove this bullet. Also remove the pinned leaseholder aspect from the diagram.

Let's also remove the next bullet and the "However..." sentence and just follow this with "Tables are partitioned...".

Done.


v19.1/cluster-topology-patterns.md, line 143 at r8 (raw file):

Previously, jseldess (Jesse Seldess) wrote…

We probably should add an example, like we have for locality.

Done, but can you please double check that I created the example correctly


v19.1/cluster-topology-patterns.md, line 145 at r8 (raw file):

Previously, jseldess (Jesse Seldess) wrote…

Maybe we should combine this and the next 2 bullets and add examples like we do for locality.

Alternatively, we can link to this example for more insight into the specific commands required: https://www.cockroachlabs.com/docs/dev/partitioning.html#define-table-partitions-by-list.

I'll also be adding a geo-partitioning tutorial very soon.

Done, but let me know if you had something else in mind.

@bdarnell (Contributor) left a comment:

cockroachdb/cockroach#12768 has just resurfaced and unfortunately has some major implications for availability and the ability to survive region/AZ failures. We should probably hold off on this until we figure out how we want to message that.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jseldess, @lhirata, @rkruze, and @rolandcrosby)


v19.1/cluster-topology-patterns.md, line 149 at r6 (raw file):

Previously, lhirata wrote…

Updated.

👍

"user data is partitioned", not "partitioning"


v19.1/cluster-topology-patterns.md, line 93 at r8 (raw file):

Previously, lhirata wrote…

Done.

  1. A replication factor of 5 doesn't necessarily increase write latency. It does reduce write throughput, though.
  2. Amazon talks about Aurora fault tolerance in terms of "AZ+1" failures - an entire availability zone goes down, plus one node in another AZ. A replication factor of 5 can be useful here for region+1 failures (or the failure of two nodes independently).

v19.1/cluster-topology-patterns.md, line 81 at r9 (raw file):

- Leaseholders are denoted by a dashed line.
- 6 Nodes are spread across 3 regions (`us-west`, `us-central`, `us-east`) within a country (`us`).
- Every region has 2 nodes across 2 datacenters (e.g., `us-west-a`, `us-west-b`). Note that most cloud providers only have 2 datacenters per region. Each node is started with the `--locality` flag to identify which region it is in:

Use the words "availability zone" or "AZ" here - "datacenter" is correct too, so you may want to use both, but the cloud providers use the term AZ instead of datacenter, so that's what people may be looking for. And there are usually three AZs per region for the major clouds (this is important, since it allows you to have a 3-node quorum spread across 3 AZs in a single-region deployment).


v19.1/cluster-topology-patterns.md, line 183 at r9 (raw file):

Anti-patterns are commonly used patterns that are ineffective or risky. Consider the following when choosing a cluster pattern:

- Do not deploy to 2 datacenters. A cluster across 2 datacenters is not protected against datacenter failure and can lead to a [split-brain scenario](https://en.wikipedia.org/wiki/Split-brain_(computing)). For CockroachDB to work from a resiliency standpoint, it is best practice to deploy your cluster across 3 or more datacenters.

CockroachDB is immune to split-brain scenarios. Deploying across two datacenters is safe, but it's not necessarily helpful - losing either one of the datacenters can knock out the whole cluster. In order to survive the failure of a datacenter, you need at least three of them.

@lnhsingh (Contributor Author) left a comment:

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @bdarnell, @jseldess, @lhirata, @rkruze, and @rolandcrosby)


v19.1/cluster-topology-patterns.md, line 149 at r6 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

👍

"user data is partitioned", not "partitioning"

Done.


v19.1/cluster-topology-patterns.md, line 81 at r9 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Use the words "availability zone" or "AZ" here - "datacenter" is correct too so you may want to use both, but the cloud providers use the term AZ instead of datacenter so that's what people may be looking for. And there are usually three AZs per region for the major clouds (this is important, since allows you have a 3-node quorum spread across 3 AZs in a single-region deployment)

Done.


v19.1/cluster-topology-patterns.md, line 183 at r9 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

CockroachDB is immune to split-brain scenarios. Deploying across two datacenters is safe, but it's not necessarily helpful - losing either one of the datacenters can knock out the whole cluster. In order to survive the failure of a datacenter, you need at least three of them.

Done.

@jseldess (Contributor) left a comment:

There are still a few very minor fixes. I'll make those myself and then merge so we can get this v1 in front of users.

We will need to follow this with a v2 soon after, in Q1, based on conversations I had with Robert, Ben, and Nate. Will discuss that with you offline.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @bdarnell, @jseldess, @lhirata, @rkruze, and @rolandcrosby)


v19.1/cluster-topology-patterns.md, line 126 at r8 (raw file):

Previously, lhirata wrote…

Since most cloud providers only have 2 datacenters per region, do we assume that they are using multiple cloud providers to have 3+ datacenters per region? Does this need to be represented in the diagram or is us-west-a/b/c enough?

For now, let's just take out that sentence, "Note that most cloud providers have 3 (not 2) availability zones (i.e., datacenters) per region." We need to do a v2 of this anyway.


v19.1/cluster-topology-patterns.md, line 143 at r8 (raw file):

Previously, lhirata wrote…

Done, but can you please double check that I created the example correctly

Yes, this looks good, but the zone config command should be later.

@bdarnell (Contributor) left a comment:

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @bdarnell, @jseldess, @lhirata, @rkruze, and @rolandcrosby)


v19.1/cluster-topology-patterns.md, line 126 at r8 (raw file):

Previously, jseldess (Jesse Seldess) wrote…

For now, let's just take out that sentence, "Note that most cloud providers have 3 (not 2) availability zones (i.e., datacenters) per region." We need to do a v2 of this anyway.

I mentioned this in another thread, but most cloud regions have three AZs, not two.

@jseldess (Contributor) left a comment:

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @bdarnell, @jseldess, @lhirata, @rkruze, and @rolandcrosby)


v19.1/cluster-topology-patterns.md, line 126 at r8 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

I mentioned this in another thread, but most cloud regions have three AZs, not two.

Yep, updated.

@lnhsingh (Contributor Author) left a comment:

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @bdarnell, @jseldess, @lhirata, @rkruze, and @rolandcrosby)


v19.1/cluster-topology-patterns.md, line 126 at r8 (raw file):

Previously, jseldess (Jesse Seldess) wrote…

Yep, updated.

Removed for now.


v19.1/cluster-topology-patterns.md, line 143 at r8 (raw file):

Previously, jseldess (Jesse Seldess) wrote…

Yes, this looks good, but the zone config command should be later.

What do you mean by later?

Commit messages:

- Closes #2935.
- Edits based on feedback. Moving back to WIP
- Edits
- Edits from working session with Roko/Roland, & Jesse review
- Fix replica/range explanation
- Fix broken link
- Edits based on Jesse's feedback.
- Edits based on feedback
- Revisions
- Minor edits based on feedback + update basic-multi-region-layout.png
- Update sidebar + link from Production Checklist
@lnhsingh merged commit b17d914 into master on Mar 25, 2019
@lnhsingh deleted the topology-pattterns branch on March 25, 2019 at 17:39