The rebalance operation #24
From Spencer on the Google group:

Every node (a node is a physical or virtual machine) added to the system is added with a set of "attributes". Attributes are simply arbitrary strings. In addition, each store (a store is a device, e.g. a spinning disk or flash storage drive) has its own set of attributes. Attributes are chosen by system administrators to have useful meanings. For example, a very common node attribute would be the datacenter name (e.g. "sjc1b", "iad1"). Common store attributes might be "hdd" or "ssd". These attributes allow a node/store combination to advertise itself for allocation. In other words, a particular raft range might need a new replica in datacenter "iad1" requiring an "ssd" storage device.

Cockroach allows the monolithic key space to be divided into virtual configuration "zones". Each zone is specified by a proto.ZoneConfig message and covers a prefix of the key space. At cluster bootstrap time, a default ZoneConfig is established for the empty key prefix, meaning it applies to the entire key space. The ZoneConfig indicates how many replicas the zone has and, for each replica, which attributes a store must match.

From cockroach help set-zone:
In this example, the zone will have three replicas, and requires replica 1 to be from an SSD store on a node with "us-east-1a" as an attribute, replica 2 to be SSD with "us-east-1b", etc.

Each store takes its parent node's attributes and its own attributes, concatenates them, sorts them, and then advertises itself on the gossip network using its sorted attribute list as the name of the gossip "min" group and a StoreCapacity message as the value. The gossip network allows simple key/value information to be propagated, but also provides for "min" and "max" groups. These groups cap their size at a preset limit and only propagate the minimum or maximum values, depending on the group type. This allows a cluster with 10,000 storage devices to propagate updated capacity information for only (say) the 50 least-utilized stores and the 50 most-utilized stores, using a min group and a max group.

When a range needs to rebalance a replica, it consults its zone config for that replica's index, gets the list of required attributes, then trawls through the capacity groups advertised on the gossip network. If, using the zone config above, the range were to rebalance the third replica, it would look in the gossip network for a capacity min group matching attributes {us-west-1b, ssd}. All capacity groups whose attributes are a superset of the required attributes should be considered. We currently do a weighted-random selection from among the available stores, weighted by available capacity (these amounts are measured as percentages).

A decent amount of the code for all of this is there and tested, though the whole end-to-end process has not been tested. Here's where the existing code is:

proto/config.proto -- ZoneConfig

Currently, the storage.allocator class requires a StoreFinder implementation. There's only one, for the unit tests, at the moment. This will be the code which trawls the available capacity groups on the gossip network for attribute matches.
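The matching and selection steps described above can be sketched roughly as follows. This is an illustration in Python, not the Cockroach code: the `Store` class, the comma-joined group name, and the function names are all hypothetical, and the real gossip min group is approximated here by simply taking the least-utilized stores from a list.

```python
import heapq
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Store:
    """Hypothetical stand-in for a store plus its gossiped capacity."""
    store_id: int
    node_attrs: tuple      # e.g. ("us-west-1b",)
    store_attrs: tuple     # e.g. ("ssd",)
    available_pct: float   # fraction of capacity still free, 0.0 to 1.0

    def combined_attrs(self):
        # Node and store attributes concatenated, then sorted.
        return tuple(sorted(self.node_attrs + self.store_attrs))


def group_name(store):
    """Name of the capacity group a store advertises under (format assumed)."""
    return ",".join(store.combined_attrs())


def min_group(stores, limit):
    """Approximate a size-limited min group: the `limit` least-utilized stores."""
    return heapq.nsmallest(limit, stores, key=lambda s: 1.0 - s.available_pct)


def matching_stores(stores, required):
    """Keep stores whose combined attributes are a superset of `required`."""
    need = set(required)
    return [s for s in stores if need <= set(s.combined_attrs())]


def pick_store(stores, required, rng=random):
    """Weighted-random choice among matching stores, weighted by free capacity."""
    candidates = [s for s in matching_stores(stores, required)
                  if s.available_pct > 0]
    if not candidates:
        return None  # no advertised store satisfies the required attributes
    weights = [s.available_pct for s in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]
```

With this sketch, rebalancing the third replica would amount to calling `pick_store(stores, ["us-west-1b", "ssd"])`. A store advertising extra attributes (say an additional rack label) still qualifies, because only the superset relation is checked.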
Right now, we only consider capacity. I think this is more than adequate for the beta, so I don't think we should focus on machine load or other measures, though these will likely become important future work.

What needs to be done for beta:
More advanced stuff to consider:
Follow-up discussion: