
The rebalance operation #24

Closed
andybons opened this issue Jun 3, 2014 · 2 comments

@andybons
Contributor

andybons commented Jun 3, 2014

No description provided.

@embark
Contributor

embark commented Dec 3, 2014

From Spencer on the google group:

Every node (a node is a physical or virtual machine) which is added to the system is added with a set of "attributes". Attributes are simply arbitrary strings. In addition, each store (a store is a device: e.g. spinny disk or flash storage drive) has a set of attributes. Attributes are chosen by system administrators to have useful meanings. For example, a very common node attribute would be the datacenter name (e.g. "sjc1b", "iad1"). Common store attributes might be "hdd" or "ssd". These attributes allow a node/store combination to advertise themselves for allocation. In other words, a particular raft range might need a new replica in datacenter "iad1" requiring an "ssd" storage device.
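
As a rough sketch of what that matching amounts to (the names below are illustrative, not the actual CockroachDB types), a store qualifies for an allocation when its combined node + store attribute set contains every required attribute:

  // Illustrative sketch only; not the actual CockroachDB code.
  package main

  import "fmt"

  // matches reports whether a store advertising the given attributes
  // (its node's attributes plus its own) satisfies a replica's requirements.
  func matches(storeAttrs, required []string) bool {
    have := make(map[string]bool, len(storeAttrs))
    for _, a := range storeAttrs {
      have[a] = true
    }
    for _, r := range required {
      if !have[r] {
        return false
      }
    }
    return true
  }

  func main() {
    attrs := []string{"iad1", "ssd"} // node: datacenter "iad1", store: "ssd"
    fmt.Println(matches(attrs, []string{"iad1", "ssd"}))  // true
    fmt.Println(matches(attrs, []string{"sjc1b", "ssd"})) // false
  }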

Cockroach allows the monolithic key space to be divided into virtual configuration "zones". Each zone is specified by a proto.ZoneConfig message and covers a prefix of the key space. At cluster bootstrap time, a default ZoneConfig is established for the empty key prefix, meaning it applies to the entire key space. The ZoneConfig indicates how many replicas the zone should have and, for each replica, which attributes a matching store must advertise. From cockroach help set-zone:

The zone config format has the following YAML schema:

  replicas:
    - [comma-separated attribute list]
    - ...
  range_min_bytes: <size-in-bytes>
  range_max_bytes: <size-in-bytes>

For example:

  replicas:
    - [us-east-1a, ssd]
    - [us-east-1b, ssd]
    - [us-west-1b, ssd]
  range_min_bytes: 8388608
  range_max_bytes: 67108864

In this example, the zone will have three replicas: replica 1 must come from an SSD store on a node with the "us-east-1a" attribute, replica 2 from an SSD store on a node with "us-east-1b", and so on.
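
For reference, a rough Go rendering of what such a zone config carries (field names here are approximations; the authoritative definition is the proto.ZoneConfig message in proto/config.proto):

  // Approximation of proto.ZoneConfig for illustration only.
  package example

  type Attributes struct {
    Attrs []string // e.g. ["us-east-1a", "ssd"]
  }

  type ZoneConfig struct {
    Replicas      []Attributes // one required-attribute list per replica
    RangeMinBytes int64        // e.g. 8388608 (8 MiB)
    RangeMaxBytes int64        // e.g. 67108864 (64 MiB)
  }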

Each store takes its parent node's attributes and its own attributes, concatenates them, sorts them, and then advertises itself on the gossip network using its sorted attribute list as the name of the gossip "min" group and a StoreCapacity message as value. The gossip network allows simple key/value information to be propagated, but also provides for "min" and "max" groups. These groups limit their size to some preset limit and only propagate the minimum values or maximum values, depending on the group type. This allows a cluster with 10,000 storage devices to only propagate updated capacity information for (say) the 50 least-utilized stores and the 50 most-utilized stores, using a min group and a max group.
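
A minimal sketch of how a store might derive the name of its capacity group (the key format and helper name here are assumptions, not the actual gossip API):

  package main

  import (
    "fmt"
    "sort"
    "strings"
  )

  // capacityGroupKey builds the gossip group name a store advertises under:
  // the sorted union of its node's attributes and its own attributes.
  // (Hypothetical helper; the real key format may differ.)
  func capacityGroupKey(nodeAttrs, storeAttrs []string) string {
    all := append(append([]string{}, nodeAttrs...), storeAttrs...)
    sort.Strings(all)
    return "capacity:" + strings.Join(all, ",")
  }

  func main() {
    fmt.Println(capacityGroupKey([]string{"us-west-1b"}, []string{"ssd"}))
    // prints: capacity:ssd,us-west-1b
  }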

When a range needs to rebalance a replica, it consults its zone config for that replica's index, gets the list of required attributes, then trawls the capacity groups which are advertised on the gossip network. If, using the zone config above, the range were to rebalance the third replica, it would look in the gossip network for a capacity min group matching attributes {us-west-1b, ssd}. All capacity groups which are supersets of the required attributes should be considered. We currently do a weighted-random selection from amongst the available stores, weighted by available capacity (these amounts are measured as percentages).
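
The weighted-random step could look roughly like the following; the candidate type and its fields are assumptions for illustration, and storage/allocator.go is the real implementation:

  package main

  import (
    "fmt"
    "math/rand"
  )

  // candidate pairs a store ID with its available-capacity percentage as
  // gossiped in StoreCapacity (names and fields are assumptions).
  type candidate struct {
    storeID      int
    percentAvail float64 // 0-100
  }

  // pickWeighted chooses one candidate at random, weighted by available
  // capacity. Returns -1 if there are no usable candidates.
  func pickWeighted(cands []candidate, rng *rand.Rand) int {
    var total float64
    for _, c := range cands {
      total += c.percentAvail
    }
    if total == 0 {
      return -1
    }
    r := rng.Float64() * total
    for _, c := range cands {
      if r < c.percentAvail {
        return c.storeID
      }
      r -= c.percentAvail
    }
    return cands[len(cands)-1].storeID
  }

  func main() {
    rng := rand.New(rand.NewSource(1))
    cands := []candidate{{1, 80}, {2, 40}, {3, 10}}
    fmt.Println(pickWeighted(cands, rng)) // store 1 is chosen most often
  }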

A decent amount of the code for all of this is there and tested, though the whole end-to-end process has not been tested. Here's where the existing code is:

proto/config.proto -- ZoneConfig
server/zone_cli.go -- command-line utilities for setting, getting & listing zone configs
server/zone.go -- REST API for setting, getting & listing zone configs
storage/allocator.go -- does random weighted selection
storage/store.go -- gossips capacity periodically

Currently, the storage.allocator type requires a StoreFinder implementation. There's only one for the unit tests at the moment. This will be the code which trawls the available capacity groups on the gossip network for attribute matches. Right now, we only consider capacity. I think for the beta this is more than adequate, so I don't think we should focus on machine load or other measures, which will likely become important future work.
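
A sketch of the shape a gossip-backed StoreFinder could take (the interface and field names below are assumptions, not the code in storage/allocator.go):

  // Illustrative shapes only; the real types live in the storage package.
  package example

  // StoreCapacity mirrors the kind of capacity information gossiped by a store.
  type StoreCapacity struct {
    Capacity  int64
    Available int64
  }

  // StoreDescriptor describes one candidate store: its identity, its combined
  // node + store attributes, and its last-gossiped capacity.
  type StoreDescriptor struct {
    StoreID  int32
    Attrs    []string
    Capacity StoreCapacity
  }

  // StoreFinder returns the stores whose advertised attributes are a superset
  // of the required attributes, by consulting the capacity groups in gossip.
  type StoreFinder interface {
    FindStores(required []string) []StoreDescriptor
  }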

What needs to be done for beta:

  • Change from weighted random selection to choosing a set of N candidates at random from the gossiped stores that match the required attributes, then picking the one with the least utilization (see the sketch after this list).
  • Provide a StoreFinder implementation using storage.Store which finds candidates for allocator.
  • Store finder probably should provide next-closest matches if the requested attributes yield no matches.
  • Gossip node load to aid in rebalancing decisions.
  • Periodic check from each store to determine whether the current load / capacity puts the store above average for the system (this would consult current load vs. the max group for node load in gossip). If the node is overloaded, it needs to initiate a rebalance.
  • Each store should be allowed at most one active rebalance.
  • Stores need some means of selecting a range replica to rebalance. This could probably just be randomly chosen for now. These kinds of rebalances are optional.
  • Need a method which takes a proto.RangeDescriptor and a proto.ZoneConfig and returns whether or not rebalancing is required due to some mismatch (e.g. wrong number of replicas, mismatched attributes). If rebalancing is required, the index of the first replica to rebalance should be returned. These kinds of rebalances are mandatory.
  • A more end-to-end unit test with 100 stores advertising capacities on the gossip network, verifying that the allocator works as expected.
  • Work to integrate with Raft/Range -- this is TBD
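
A minimal sketch of the candidate-sampling idea from the first item above (all names and the utilization field are illustrative assumptions):

  package main

  import (
    "fmt"
    "math/rand"
  )

  type store struct {
    id           int
    fractionUsed float64 // used / capacity
  }

  // allocateTarget samples up to n matching stores at random and returns
  // the one with the lowest utilization; returns nil if none match.
  func allocateTarget(matching []store, n int, rng *rand.Rand) *store {
    if len(matching) == 0 {
      return nil
    }
    // Shuffle and take the first n as the random candidate set.
    rng.Shuffle(len(matching), func(i, j int) {
      matching[i], matching[j] = matching[j], matching[i]
    })
    if n > len(matching) {
      n = len(matching)
    }
    best := &matching[0]
    for i := 1; i < n; i++ {
      if matching[i].fractionUsed < best.fractionUsed {
        best = &matching[i]
      }
    }
    return best
  }

  func main() {
    rng := rand.New(rand.NewSource(42))
    stores := []store{{1, 0.9}, {2, 0.3}, {3, 0.5}, {4, 0.7}}
    if s := allocateTarget(stores, 3, rng); s != nil {
      fmt.Printf("rebalance to store %d (%.0f%% full)\n", s.id, s.fractionUsed*100)
    }
  }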

More advanced stuff to consider:

  • Figure out how to capture correlated risk factors. Perhaps Pete could remember exactly what Google provided. My recollection is you had a hierarchical description of risk. From top down it looked like: datacenter, power distribution unit, rack, machine, device. My feeling is that only the datacenter would be something you'd add to node attributes (remember that anything you add to the attributes list further fragments the gossip groups, so you don't want to go hog wild). Maybe we could add these items to the StoreCapacity message and somehow fold them into the weighted selection.

Followup discussion:
Spencer: I think perhaps the right solution [for the beta] is to gossip StoreCapacity for each and every store. We'll need to make sure that we only update gossip when the capacity has changed by some reasonable increment, so it's not re-gossiped on every write. Fast-changing information like node load, however, could still be gossiped using min / max groups so that each node can determine whether it's suffering relative to the other worst sufferers in the cluster, before initiating a rebalance.
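
A tiny sketch of the "only re-gossip on a meaningful change" check; the 1% threshold and all names are assumptions:

  package example

  import "math"

  // regossipThreshold is an assumed value; tune as needed.
  const regossipThreshold = 0.01 // re-gossip on a >= 1% swing in available capacity

  // shouldRegossip reports whether the available-capacity fraction has moved
  // far enough from the last gossiped value to justify another gossip update.
  func shouldRegossip(lastGossiped, current float64) bool {
    return math.Abs(current-lastGossiped) >= regossipThreshold
  }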

@andybons andybons modified the milestone: v0.1 (Beta) Jan 13, 2015
@tbg
Member

tbg commented Jun 24, 2015

Closing in favor of #620 and #768.
