Voldemort Rebalancing

Voldemort Rebalancing project aims to provide capability to add/delete nodes, move data around in a running voldemort cluster without downtime and with minimal impact on online cluster performance. Few example scenarios where it can be used are

Dynamic addition of new nodes to the cluster.
Deletion of nodes from cluster.
Load balancing of data inside a cluster.

Requirements

These are the requirements for Voldemort rebalancing project.

No downtime
No functional impact on client
Maintain data consistency guarantees while rebalancing.
Minimal and tunable performance impact on the cluster.
push button user interface

Rebalancing Process

These are the main assumptions/design choices

Only one rebalancing process at one time on the cluster should be running.
Partitions to key mapping is not changed during rebalancing instead partitions ownerships (node to partition mapping) is changed and entire partitions are migrated across nodes.
User need to create and specify the targetCluster.xml (the final cluster configuration he desires)

Rebalancing Terminology

The Rebalancing process has three participants during rebalancing.

Rebalance Controller : Initiates rebalancing on the cluster by providing a target cluster.xml, can be started on any node which can make socket calls to all nodes in the cluster.
Stealer Nodes: The node which get new partitions.
- When adding new nodes the new nodes will be the stealer nodes
- When doing data load balancing, some nodes from the cluster itself will act as stealer nodes.
Donor Nodes: The nodes from where the data is copied.

How to start rebalancing

Step 0: Make sure cluster is not doing Rebalancing, User is responsible for making sure only one rebalancing controller is trying to rebalance at a given time
Step 1 : Create a target cluster.xml.
- If adding new nodes, add the new nodes to cluster.xml with intended partition ownerships.
- If doing data load balancing just change the intended partition ownerships.
Step 2 : Check all the nodes in the cluster are up and running at the right ports as specified in the target cluster.xml.
- If adding new nodes to the cluster, start them with an cluster.xml with empty partition list.
Step 3: Start a rebalance Controller.

bin/voldemort-rebalance.sh --url bootstrapURL --cluster targetCluster.xml  --parallelism maxParallelRebalancing --no-delete
Arguments:
bootstrapUrl: bootstrap voldemort server url.(should point to a node in the cluster and not the new nodes)
targetCluster.xml: The final desired cluster configuration.
maxParallelRebalancing: maximum parallel transfers to start
--no-delete: If set, do not delete data from original nodes after transfering.

Step 4 : monitor the output on RebalanceController shell, this should output the rebalancing steps, success/failed information.

Rebalance lifecycle

Rebalance Controller lifeCycle.

Rebalance Controller job is to compute what changes need to be made in the cluster, initiates the rebalancing and then wait for their completion.

Step 0 : Get the current cluster state from the cluster using bootstrap URL.
Step 1: Compare the targetCluster.xml provided and add all new nodes to currentCluster.xml
Step 2: Propagate the currentCluster.xml to all nodes
Step 1: Create a Rebalancing Cluster plan based on the desired target cluster.xml and the current state of the cluster.
Step 2: In parallel (upto maxParallelRebalancing number)
- Poll the Queue (RebalancingNodePlan task queue) from Rebalancing Cluster plan and get rebalancing info for one stealer node.
  1. Pick one donorNode from the RebalancingNodePlan list.
  2. Get the current latest cluster.xml from the cluster.
  3. Create an new updated cluster.xml with partition ownership change from donorNode to stealerNode for desired partitions.
  4. send a rebalance request to the stealer node and save the returned rebalance async Id.
  5. Commit this new cluster.xml changes on the cluster, if failed revert back all nodes to the current cluster.xml.
  6. wait for completion of the rebalance process using the rebalance Async Id.
  7. Mark success or failure accordingly.
The RebalancingController is done when the entire plan is attempted.

StealerNode Lifecycle

The stealerNode is responsible for doing actual rebalancing and handling all failure scenarios.

Step 1: Receives a rebalanceNode request through the AdminClient.
Step 2: Check current local state that is not doing rebalancing for any other partition set.
Step 3: Fail if is already doing rebalancing with error message
Step 4: Starts the rebalancing process as an async operation and return the async Id back to caller.
Step 5: Monitor the async fetch operation internally and keep updating progress/success or failures.

DonorNode lifecycle

The donorNode has no knowledge for rebalancing at all and keeps behaving normally.

Data consistency during rebalancing

Rebalancing process has to maintain data consistency guarantees during rebalancing, We are doing it through a proxy based design.
Once rebalancing starts StealerNode is the new master for all rebalancing partitions, All the clients talk directly to stealerNode for all the requests for these partitions. Stealer Node internally make proxy calls to the original donor node to return correct data back to the client.The process steps are

Client request stealerNode for key ‘k1’ belonging to partition ‘p1’ which is currently being migrated/rebalanced.
StealerNode looks at the key and understands that this key is part of a rebalancing partition
StealerNode makes a proxy call to donorNode and get the List of values as returned by the donorNode.
- StealerNode makes getAll call to donorNode if client made a getAll() call.
StealerNode does local put for all (key,value) pairs ignoring all ObsoleteVersionException
StealerNode now should have all the versions from the original node and now does normal local get/put/getAll operations.
- In this scheme all new puts goes only to stealerNode as intended, a proxy get() call is still made to maintain version correctness.

Client Bootstrapping

Voldemort client currently bootstrap from the bootstrap URL at the start time and use the returned cluster/stores metadata for all subsequent operation. Rebalancing mandates that the cluster/store metadata can change during rebalancing and we needed a mechanism to tell client that they should rebootstrap if they have old metadata.

Client Side Routing bootstrapping: This is the most common way of using Voldemort, where the client is smart, boostrap at the start and then uses the correct routing to query Voldemort in a single hop.
1. Voldemort server now throws an InvalidMetadataException if it sees a request from a client for a partition which do not belong to them.
2. On seeing the InvalidMetadataException the client is supposed to re-bootstrap from the bootstrap url which will have the correct cluster metadata.

ServerSide routing bootstrapping: The other way of using Voldemort is to make calls to any server with enable_routing flag in request set to true and server re-routes the query to correct node. So client gets the values in multiple (2) hops.
1. ServerSide routing was being handled by RoutedStore class initialized on the VoldemortServer
2. We have extended this class to have similar feature to catch InvalidMetadataException and bootstrap as needed.
3. Clients using Serverside Routing (python/ruby) should not need any change at all as this layer will hide that detail from them.

Rebalancing Failures Handling

If the rebalance Client log shows no exception/warning. Everything went smoothly and you have a new rebalanced cluster. The trouble spot is when bad things happened and how to correct them if needed. These are the failure scenarios we have considered in the design.

RebalancingController dies: Rebalancing controller can timeout or die while rebalancing is happening. Since the actual process is still running as Async operation it will not affect it.
- Restarting RebalanceController with same targetCluster.xml will make sure we will restart the remaining left rebalancing transfers.
- The currentCluster configuration is read from the cluster each time before starting new rebalancing so attempt to give already rebalanced cluster.xml would be a no-op.
StealerNode/DonorNode dies : Stealer Node persist rebalancing state before starting actual data transfers, If DonorNode fails or it fails itself during rebalancing, It will wait for some time and attempt again until succeed or maximum number of attempts tried (tunable settings are in VoldemortConfig)
Proxy calls to DonorNode timeouts: If DonorNode dies mid-flight during rebalancing and the client make calls to stealer node about keys in currently rebalancing partitions, StealerNode will receive a timeoutException/UnreachablestoreException from donorNode and propagate to the client as ProxyUnreachableException.
- StealerNode will continue attempting rebalancing from donorNode untill maximum number of attempts tries fails.

Voldemort Rebalancing

Requirements

Rebalancing Process

Rebalancing Terminology

How to start rebalancing

Rebalance lifecycle

Rebalance Controller lifeCycle.

StealerNode Lifecycle

DonorNode lifecycle

Data consistency during rebalancing

Client Bootstrapping

Rebalancing Failures Handling

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Clone this wiki locally