
The rebalance operation #24

Closed
andybons opened this issue Jun 3, 2014 · 2 comments

@andybons
Contributor

andybons commented Jun 3, 2014

No description provided.

@embark
Contributor

embark commented Dec 3, 2014

From Spencer on the google group:

Every node (a node is a physical or virtual machine) which is added to the system is added with a set of "attributes". Attributes are simply arbitrary strings. In addition, each store (a store is a device: e.g. spinny disk or flash storage drive) has a set of attributes. Attributes are chosen by system administrators to have useful meanings. For example, a very common node attribute would be the datacenter name (e.g. "sjc1b", "iad1"). Common store attributes might be "hdd" or "ssd". These attributes allow a node/store combination to advertise themselves for allocation. In other words, a particular raft range might need a new replica in datacenter "iad1" requiring an "ssd" storage device.
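
As a rough sketch of what that matching amounts to (the names below are illustrative, not the actual CockroachDB types), a store qualifies for an allocation when its combined node + store attribute set contains every required attribute:

  // Illustrative sketch only; not the actual CockroachDB code.
  package main

  import "fmt"

  // matches reports whether a store advertising the given attributes
  // (its node's attributes plus its own) satisfies a replica's requirements.
  func matches(storeAttrs, required []string) bool {
    have := make(map[string]bool, len(storeAttrs))
    for _, a := range storeAttrs {
      have[a] = true
    }
    for _, r := range required {
      if !have[r] {
        return false
      }
    }
    return true
  }

  func main() {
    attrs := []string{"iad1", "ssd"} // node: datacenter "iad1", store: "ssd"
    fmt.Println(matches(attrs, []string{"iad1", "ssd"}))  // true
    fmt.Println(matches(attrs, []string{"sjc1b", "ssd"})) // false
  }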

Cockroach allows the monolithic key space to be divided into virtual configuration "zones". Each zone is specified by a proto.ZoneConfig message and covers a prefix of the key space. At cluster bootstrap time, a default ZoneConfig is established for the empty key prefix, meaning it applies to the entire key space. The ZoneConfig indicates how many replicas the zone should have and, for each replica, which attributes a matching store must advertise. From cockroach help set-zone:

The zone config format has the following YAML schema:

  replicas:
    - [comma-separated attribute list]
    - ...
  range_min_bytes: <size-in-bytes>
  range_max_bytes: <size-in-bytes>

For example:

  replicas:
    - [us-east-1a, ssd]
    - [us-east-1b, ssd]
    - [us-west-1b, ssd]
  range_min_bytes: 8388608
  range_max_bytes: 67108864

In this example, the zone will have three replicas: replica 1 must come from an SSD store on a node with the "us-east-1a" attribute, replica 2 from an SSD store on a node with "us-east-1b", and so on.
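
For reference, a rough Go rendering of what such a zone config carries (field names here are approximations; the authoritative definition is the proto.ZoneConfig message in proto/config.proto):

  // Approximation of proto.ZoneConfig for illustration only.
  package example

  type Attributes struct {
    Attrs []string // e.g. ["us-east-1a", "ssd"]
  }

  type ZoneConfig struct {
    Replicas      []Attributes // one required-attribute list per replica
    RangeMinBytes int64        // e.g. 8388608 (8 MiB)
    RangeMaxBytes int64        // e.g. 67108864 (64 MiB)
  }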

Each store takes its parent node's attributes and its own attributes, concatenates them, sorts them, and then advertises itself on the gossip network using its sorted attribute list as the name of the gossip "min" group and a StoreCapacity message as value. The gossip network allows simple key/value information to be propagated, but also provides for "min" and "max" groups. These groups limit their size to some preset limit and only propagate the minimum values or maximum values, depending on the group type. This allows a cluster with 10,000 storage devices to only propagate updated capacity information for (say) the 50 least-utilized stores and the 50 most-utilized stores, using a min group and a max group.
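
A minimal sketch of how a store might derive the name of its capacity group (the key format and helper name here are assumptions, not the actual gossip API):

  package main

  import (
    "fmt"
    "sort"
    "strings"
  )

  // capacityGroupKey builds the gossip group name a store advertises under:
  // the sorted union of its node's attributes and its own attributes.
  // (Hypothetical helper; the real key format may differ.)
  func capacityGroupKey(nodeAttrs, storeAttrs []string) string {
    all := append(append([]string{}, nodeAttrs...), storeAttrs...)
    sort.Strings(all)
    return "capacity:" + strings.Join(all, ",")
  }

  func main() {
    fmt.Println(capacityGroupKey([]string{"us-west-1b"}, []string{"ssd"}))
    // prints: capacity:ssd,us-west-1b
  }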

When a range needs to rebalance a replica, it consults its zone config for that replica's index, gets the list of required attributes, then trawls the capacity groups which are advertised on the gossip network. If, using the zone config above, the range were to rebalance the third replica, it would look in the gossip network for a capacity min group matching attributes {us-west-1b, ssd}. All capacity groups which are supersets of the required attributes should be considered. We currently do a weighted-random selection from amongst the available stores, weighted by available capacity (these amounts are measured as percentages).
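
The weighted-random step could look roughly like the following; the candidate type and its fields are assumptions for illustration, and storage/allocator.go is the real implementation:

  package main

  import (
    "fmt"
    "math/rand"
  )

  // candidate pairs a store ID with its available-capacity percentage as
  // gossiped in StoreCapacity (names and fields are assumptions).
  type candidate struct {
    storeID      int
    percentAvail float64 // 0-100
  }

  // pickWeighted chooses one candidate at random, weighted by available
  // capacity. Returns -1 if there are no usable candidates.
  func pickWeighted(cands []candidate, rng *rand.Rand) int {
    var total float64
    for _, c := range cands {
      total += c.percentAvail
    }
    if total == 0 {
      return -1
    }
    r := rng.Float64() * total
    for _, c := range cands {
      if r < c.percentAvail {
        return c.storeID
      }
      r -= c.percentAvail
    }
    return cands[len(cands)-1].storeID
  }

  func main() {
    rng := rand.New(rand.NewSource(1))
    cands := []candidate{{1, 80}, {2, 40}, {3, 10}}
    fmt.Println(pickWeighted(cands, rng)) // store 1 is chosen most often
  }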

A decent amount of the code for all of this is there and tested, though the whole end-to-end process has not been tested. Here's where the existing code is:

proto/config.proto -- ZoneConfig
server/zone_cli.go -- command-line utilities for setting, getting & listing zone configs
server/zone.go -- REST API for setting, getting & listing zone configs
storage/allocator.go -- does random weighted selection
storage/store.go -- gossips capacity periodically

Currently, the storage.allocator type requires a StoreFinder implementation. There's only one for the unit tests at the moment. This will be the code which trawls the available capacity groups on the gossip network for attribute matches. Right now, we only consider capacity. I think for the beta this is more than adequate, so I don't think we should focus on machine load or other measures, which will likely become important future work.
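
A sketch of the shape a gossip-backed StoreFinder could take (the interface and field names below are assumptions, not the code in storage/allocator.go):

  // Illustrative shapes only; the real types live in the storage package.
  package example

  // StoreCapacity mirrors the kind of capacity information gossiped by a store.
  type StoreCapacity struct {
    Capacity  int64
    Available int64
  }

  // StoreDescriptor describes one candidate store: its identity, its combined
  // node + store attributes, and its last-gossiped capacity.
  type StoreDescriptor struct {
    StoreID  int32
    Attrs    []string
    Capacity StoreCapacity
  }

  // StoreFinder returns the stores whose advertised attributes are a superset
  // of the required attributes, by consulting the capacity groups in gossip.
  type StoreFinder interface {
    FindStores(required []string) []StoreDescriptor
  }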

What needs to be done for beta:

  • Change from weighted random selection to choosing a set of N candidates at random from the gossiped stores that match the required attributes, then picking the one with the least utilization (see the sketch after this list).
  • Provide a StoreFinder implementation using storage.Store which finds candidates for allocator.
  • Store finder probably should provide next-closest matches if the requested attributes yield no matches.
  • Gossip node load to aid in rebalancing decisions.
  • Periodic check from each store to determine whether the current load / capacity puts the store above average for the system (this would consult current load vs. the max group for node load in gossip). If the node is overloaded, it needs to initiate a rebalance.
  • Each store should be allowed at most one active rebalance.
  • Stores need some means of selecting a range replica to rebalance. This could probably just be randomly chosen for now. These kinds of rebalances are optional.
  • Need a method which takes a proto.RangeDescriptor and a proto.ZoneConfig and returns whether or not rebalancing is required due to some mismatch (e.g. wrong number of replicas, mismatched attributes). If rebalancing is required, the index of the first replica to rebalance should be returned. These kinds of rebalances are mandatory.
  • A more end-to-end unit test with 100 stores advertising capacities on the gossip network, verifying that the allocator works as expected.
  • Work to integrate with Raft/Range -- this is TBD
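
A minimal sketch of the candidate-sampling idea from the first item above (all names and the utilization field are illustrative assumptions):

  package main

  import (
    "fmt"
    "math/rand"
  )

  type store struct {
    id           int
    fractionUsed float64 // used / capacity
  }

  // allocateTarget samples up to n matching stores at random and returns
  // the one with the lowest utilization; returns nil if none match.
  func allocateTarget(matching []store, n int, rng *rand.Rand) *store {
    if len(matching) == 0 {
      return nil
    }
    // Shuffle and take the first n as the random candidate set.
    rng.Shuffle(len(matching), func(i, j int) {
      matching[i], matching[j] = matching[j], matching[i]
    })
    if n > len(matching) {
      n = len(matching)
    }
    best := &matching[0]
    for i := 1; i < n; i++ {
      if matching[i].fractionUsed < best.fractionUsed {
        best = &matching[i]
      }
    }
    return best
  }

  func main() {
    rng := rand.New(rand.NewSource(42))
    stores := []store{{1, 0.9}, {2, 0.3}, {3, 0.5}, {4, 0.7}}
    if s := allocateTarget(stores, 3, rng); s != nil {
      fmt.Printf("rebalance to store %d (%.0f%% full)\n", s.id, s.fractionUsed*100)
    }
  }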

More advanced stuff to consider:

  • Figure out how to capture correlated risk factors. Perhaps Pete could remember exactly what Google provided. My recollection is you had a hierarchical description of risk. From top down it looked like: datacenter, power distribution unit, rack, machine, device. My feeling is that only the datacenter would be something you'd add to node attributes (remember that anything you add to the attributes list further fragments the gossip groups, so you don't want to go hog wild). Maybe we could add these items to the StoreCapacity message and somehow fold them into the weighted selection.

Followup discussion:
Spencer: I think perhaps the right solution [for the beta] is to gossip StoreCapacity for each and every store. We'll need to make sure that we only update gossip when the capacity has changed by some reasonable increment, so it's not re-gossiped on every write. Fast-changing information like node load, however, could still be gossiped using min / max groups so that each node can determine whether it's suffering relative to the other worst sufferers in the cluster, before initiating a rebalance.
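
A tiny sketch of the "only re-gossip on a meaningful change" check; the 1% threshold and all names are assumptions:

  package example

  import "math"

  // regossipThreshold is an assumed value; tune as needed.
  const regossipThreshold = 0.01 // re-gossip on a >= 1% swing in available capacity

  // shouldRegossip reports whether the available-capacity fraction has moved
  // far enough from the last gossiped value to justify another gossip update.
  func shouldRegossip(lastGossiped, current float64) bool {
    return math.Abs(current-lastGossiped) >= regossipThreshold
  }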

@andybons andybons modified the milestone: v0.1 (Beta) Jan 13, 2015
@tbg
Member

tbg commented Jun 24, 2015

Closing in favor of #620 and #768.
