---
title: Cluster Topology Patterns
summary: Common cluster topology patterns with setup examples and performance considerations.
toc: true
---
This page covers common cluster topology patterns with setup examples, as well as the benefits and trade-offs of each pattern. Before you select a candidate pattern for your cluster, use the following broad patterns as a starting point and weigh their trade-offs.
When selecting a pattern for your cluster, take the following into consideration:

- The function of a CockroachDB leaseholder
- The impact of the leaseholder on read and write activity
- Whether the leaseholders are local to readers and writers within the datacenter
- The `--locality` flag must be set properly on each node to enable follow-the-workload
- Leaseholder migration among the datacenters can be minimized by using partitioning, an Enterprise feature
- Whether the application is designed to use the partitioning feature
{{site.data.alerts.callout_info}} This page does not factor in hardware differences. {{site.data.alerts.end}}
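As an illustration of the `--locality` point above, a node's startup command might look like the following. The hostnames, region, and datacenter names here are hypothetical, not prescribed values:

```shell
# Hypothetical example: a node declares its locality at startup so that
# follow-the-workload and partitioning can place leaseholders sensibly.
cockroach start \
  --locality=region=us-east,datacenter=us-east-1 \
  --store=node1 \
  --listen-addr=node1.example.com:26257 \
  --join=node1.example.com:26257,node2.example.com:26257,node3.example.com:26257
```

The locality tiers are ordered from most to least inclusive (e.g., region, then datacenter), and every node should use the same tier keys.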
This first example is of a single datacenter cluster, i.e., a local deployment. This pattern is a common starting point for smaller organizations that may not have the resources (or need) to worry about a datacenter failure, but still want to take advantage of CockroachDB's high availability. The cluster is self-hosted, with each node on a different machine within the same datacenter. The network latency among the nodes is expected to be uniform, around 1ms.
For the diagram above:
Configuration
- `App` is an application that accesses CockroachDB
- `Load Balancer` is a software-based load balancer
- The 3 nodes are all running in a single datacenter
- All CockroachDB nodes communicate with each other
- The cluster uses the default replication factor of 3 (represented by `r1`, `r2`, `r3`)
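A deployment like this might be started as follows. This is an illustrative sketch using an insecure local cluster with arbitrary ports; in the diagram's deployment, each node would run on its own machine:

```shell
# Illustrative 3-node cluster (insecure mode, for demonstration only).
# Each node lists the same --join addresses so they form one cluster.
cockroach start --insecure --store=node1 --listen-addr=localhost:26257 \
  --http-addr=localhost:8080 \
  --join=localhost:26257,localhost:26258,localhost:26259 --background
cockroach start --insecure --store=node2 --listen-addr=localhost:26258 \
  --http-addr=localhost:8081 \
  --join=localhost:26257,localhost:26258,localhost:26259 --background
cockroach start --insecure --store=node3 --listen-addr=localhost:26259 \
  --http-addr=localhost:8082 \
  --join=localhost:26257,localhost:26258,localhost:26259 --background

# One-time initialization of the new cluster.
cockroach init --insecure --host=localhost:26257
```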
Availability expectations
- The cluster can survive 1 node failure because a majority of replicas (2/3) remains available. It will not survive a datacenter failure.
Performance expectations
- The network latency among the nodes is expected to be the same, sub-millisecond.
While the basic local deployment takes advantage of CockroachDB's high availability, shares the load, and spreads capacity, scaling out from 3 nodes to 4 or 5 nodes has additional benefits:

- There is more room to increase the replication factor, which increases resiliency against the failure of more than one node.
- Because there are more nodes, you can increase throughput, add storage, etc.
There are no constraints on node increments.
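For example, after scaling out to 5 nodes, the replication factor could be raised so the cluster tolerates two simultaneous node failures. This is a sketch against a hypothetical local cluster; the host address is illustrative:

```shell
# Raise the default replication factor from 3 to 5 via a zone configuration.
# With 5 replicas, a majority (3/5) survives the loss of any 2 nodes.
cockroach sql --insecure --host=localhost:26257 \
  -e "ALTER RANGE default CONFIGURE ZONE USING num_replicas = 5;"
```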
Once an organization begins to grow, a datacenter outage isn't acceptable and a cluster needs to be available all of the time. This is where a single-region cluster with multiple datacenters is useful. For example, an organization can do a cloud deployment across multiple datacenters within the same geographical region.
For the diagram above:
Configuration
- `App` is an application that accesses CockroachDB
- `Load Balancer` is a software-based load balancer
- The 3 nodes are each in a different datacenter, all located in the `us-east` region
- All CockroachDB nodes communicate with each other
- The cluster uses the default replication factor of 3 (represented by `r1`, `r2`, `r3`)
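For replicas to be diversified across the three datacenters, each node needs to declare which datacenter it is in. A sketch of the relevant startup flags, with hypothetical datacenter names:

```shell
# Illustrative: one node per datacenter, all in the us-east region.
# With distinct datacenter tiers, CockroachDB spreads the 3 replicas
# of each range across the 3 datacenters.
cockroach start --locality=region=us-east,datacenter=us-east-1 ...  # datacenter 1
cockroach start --locality=region=us-east,datacenter=us-east-2 ...  # datacenter 2
cockroach start --locality=region=us-east,datacenter=us-east-3 ...  # datacenter 3
```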
Availability expectations
- The cluster can withstand a datacenter failure.
Performance expectations
- The network latency among the nodes is expected to be the same, sub-millisecond.
For even more resiliency, use a multi-region cluster. A multi-region cluster comprises multiple datacenters in different regions (e.g., `East`, `West`), each with multiple nodes. CockroachDB automatically tries to diversify replica placement across localities (i.e., place a replica in each region). With this setup, many organizations also transition to using different cloud providers (one provider per region).
In this example, the cluster has an asymmetrical setup, where `Central` is closer to `West` than to `East`. This configuration provides better write latency for write workloads in `West` and `Central` because of the lower latency between them (versus writing in `East`). This assumes you are not using zone configurations.
For this example:
Configuration
- Nodes are spread across 3 regions within a country (`West`, `East`, `Central`)
- A software-based load balancer directs traffic to any of the regions' nodes at random
- Every region has 3 datacenters
- All CockroachDB nodes communicate with each other
- Similar to the local topology, more regions can be added dynamically
- A homogenous configuration among the regions for simplified operations is recommended
- For sophisticated workloads, each region can have a different node count and node specification. This heterogeneous configuration can better handle region-specific concurrency and load characteristics.
When locality is enabled, the load balancer should be set up to prefer the database nodes within the same locality as the app servers:
- The `West` app servers should connect to the `West` CockroachDB servers
- The `Central` app servers should connect to the `Central` CockroachDB servers
- The `East` app servers should connect to the `East` CockroachDB servers
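One way to express this routing is one load balancer configuration per region, listing only that region's nodes. A hypothetical HAProxy fragment for the `West` region (hostnames are illustrative):

```shell
# Hypothetical haproxy.cfg fragment for the West region's load balancer:
# West app servers connect here and are routed only to West nodes.
cat > haproxy-west.cfg <<'EOF'
listen cockroach-west
    bind :26257
    mode tcp
    balance roundrobin
    server west-node1 west-node1.example.com:26257 check
    server west-node2 west-node2.example.com:26257 check
    server west-node3 west-node3.example.com:26257 check
EOF
```

Failover to other localities when all local nodes are down would be handled at the app or GSLB layer, per the availability expectations below.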
Availability expectations
- If all of the nodes in a preferred locality are down, the app will try databases in other localities.
- The cluster can withstand a datacenter failure.
- In general, a multi-region deployment can help protect against natural disasters.
Performance expectations
- The latency numbers (e.g., `60ms`) in the first diagram represent the network round-trip from one datacenter to another.
- Follow-the-workload keeps performance fast where the load is, so you do not pay cross-country latency on reads.
- Write latencies will not be faster than the slowest round-trip required to achieve quorum between two regions.
While the basic pattern for a multi-region cluster can help protect against regional failures, there will be high latency due to cross-country roundtrips. This is not ideal for organizations who have users spread out across the country. For any multi-region cluster, partitioning should be used to keep data close to the users who access it.
This setup uses a modern multi-tier architecture, which is simplified to global server load balancer (`GSLB`), `App`, and `Load Balancer` layers in the diagram below:
Configuration
- Nodes are spread across 3 regions within a country (`West`, `East`, `Central`)
- A client connects to a geographically close app server via `GSLB`
- Inside each region, an app server connects to one of the CockroachDB nodes within its geography through a software-based load balancer
- Every region has 3 datacenters
- All CockroachDB nodes communicate with each other
- Tables are partitioned at the row level by locality:
  - Rows in the `West` partition have their leaseholder in the `West` datacenter
  - Rows in the `Central` partition have their leaseholder in the `Central` datacenter
  - Rows in the `East` partition have their leaseholder in the `East` datacenter
- Replicas are evenly distributed among the three datacenters
- Abbreviated startup flag for each datacenter: `--loc=Region=East`, `--loc=Region=Central`, `--loc=Region=West`
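Row-level partitioning like this can be sketched as follows. The table, column, and locality names here are hypothetical, and partitioning requires an Enterprise license; the partition column must also be a prefix of the table's primary key:

```shell
# Hypothetical schema: partition a "users" table by its "region" column,
# then pin each partition's replicas (and hence its leaseholder, via
# follow-the-workload) to the matching locality.
cockroach sql --insecure --host=localhost:26257 -e "
  ALTER TABLE users PARTITION BY LIST (region) (
    PARTITION west VALUES IN ('west'),
    PARTITION central VALUES IN ('central'),
    PARTITION east VALUES IN ('east')
  );
  ALTER PARTITION west OF TABLE users
    CONFIGURE ZONE USING constraints = '[+region=West]';
  ALTER PARTITION central OF TABLE users
    CONFIGURE ZONE USING constraints = '[+region=Central]';
  ALTER PARTITION east OF TABLE users
    CONFIGURE ZONE USING constraints = '[+region=East]';
"
```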
Availability expectations
- Can survive a single datacenter failure, since a majority of the replicas will remain available.
Performance expectations
- Reads respond in a few milliseconds.
- Writes respond in 60ms.
- Symmetrical latency between datacenters.
Application expectations
- `West` `App` servers connect to the `West` CockroachDB nodes.
- `Central` `App` servers connect to the `Central` CockroachDB nodes.
- `East` `App` servers connect to the `East` CockroachDB nodes.