Changes:
-- Join checks against current ring_size rather than application variable.
-- Join requires joining node to not already be part of a cluster.
-- Down fails if the cluster is in legacy gossip mode.
-- riak_core_ring:ready_members returns nodes guaranteed safe for requests.
Add the ability to manually trigger vnode handoff, regardless of inactivity. Change the vnode manager to use this feature to force pending ownership transfers, thereby allowing the cluster to rebalance under load. These forced transfers are throttled by forcing only the first N pending transfers scheduled by the claimant, where N is the application variable riak_core/forced_ownership_handoff (default 8). Additional handoff can still occur due to inactivity timeouts, but all handoff ultimately remains limited by the handoff_concurrency setting.
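The throttling rule above can be modeled in a few lines. This is a minimal Python sketch, not the riak_core code: the function name, its parameters, and the way running handoffs are counted are all assumptions made for illustration. Only the default of 8 for the forced limit comes from the description above.

```python
def select_forced_transfers(pending, running, handoff_concurrency, forced_limit=8):
    """Pick which pending ownership transfers to force right now.

    pending: transfers in the order the claimant scheduled them.
    running: number of handoffs already in progress (assumed bookkeeping).
    handoff_concurrency: upper bound on simultaneous handoffs.
    forced_limit: models riak_core/forced_ownership_handoff (default 8).
    """
    # Only the first N scheduled transfers are candidates for forcing.
    candidates = pending[:forced_limit]
    # Forced transfers still count against the overall concurrency limit.
    free_slots = max(handoff_concurrency - running, 0)
    return candidates[:free_slots]
```

With twenty pending transfers, one handoff running, and a concurrency of 4, the sketch forces only the first three scheduled transfers.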
Add a preliminary version of a vnode manager which will someday replace much of the pid-tracking functionality currently in riak_core_vnode_master. The current vnode manager provides an easy interface for finding vnode mod/idx/pid information and informs vnodes about ring changes as appropriate. Move the vnode forward-on-ownership-change logic out of the request critical path, taking advantage of the vnode manager's ring notification feature.
Change riak_core_ring to have two different versioned records corresponding to the new and old ring data structures, and add upgrade/downgrade functions that convert between the two formats. Add a legacy gossip mode that uses the old ring reconciliation logic as well as the old gossip/claim procedure. The legacy mode uses the old logic but encapsulates its data inside the new ring format (using upgrade/downgrade) in order to minimize code duplication. This mode is enabled by setting the application environment variable riak_core/legacy_gossip to true.
Add member metadata to the new ring format and the related get_member_meta and update_member_meta accessors.
Add support for rolling upgrades and mixed-gossip hybrid clusters. The appropriate ring format is negotiated through member metadata when possible, falling back to RPC queries when necessary. The cluster gossip protocol is determined as follows:
-- If all nodes support the new membership protocol and are not running in legacy mode, the new protocol is used.
-- If an old node or legacy-mode node joins the cluster, the entire cluster downgrades to legacy mode.
-- If the old/legacy nodes leave the cluster, the new nodes return to the new protocol.
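The three protocol-selection rules above reduce to a single predicate over the current member list. The Python sketch below is only a model of that decision: the Member record and its field names are hypothetical, and it does not show how capabilities are actually discovered (member metadata or RPC).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Member:
    name: str
    supports_new_protocol: bool   # False models an old node
    legacy_gossip: bool = False   # models riak_core/legacy_gossip = true


def cluster_gossip_mode(members):
    """Return "new" only if every member can and may use the new protocol."""
    if all(m.supports_new_protocol and not m.legacy_gossip for m in members):
        return "new"
    # A single old or legacy-mode member downgrades the whole cluster.
    return "legacy"
```

Because the mode is recomputed from the current membership, removing the last old or legacy-mode member returns the cluster to the new protocol without extra state.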
Add 'joining' member status to the new cluster membership model and implementation. When a node joins a cluster, it comes in with status 'joining' rather than 'valid'. The claimant then moves the node from 'joining' to 'valid' after it ensures all cluster members have learned of the new node joining the cluster. This change guarantees that all 'valid' members vote on ring ready consensus under various failure scenarios.
Add 'down' member status to the new cluster membership model and implementation. The state is designed to allow a user to mark a down node as 'down' in order to allow the rest of the cluster to converge. A vote from a 'down' node is not necessary for ring ready consensus, and therefore a 'down' node's ring state may become outdated. If a 'down' node gossips to another node that believes it to be down (such as after coming back online), the other node tells the 'down' node to rejoin the cluster, thereby making its state current. Nodes do not gossip to 'down' nodes, and 'down' node ownership is not changed during a rebalance.
Incorporate minor changes and bug fixes:
-- Fix next merging bug in riak_core_ring:remove_node and model.
-- Fix bug with nodes moving from 'leaving' to 'exiting' while having pending indices.
-- Fix negative random seed bug in join/membership model.
-- Change core vnode so that a vnode does not shut down if a completed handoff was to a node that is now known to be 'invalid'.
-- Change update_ring to remove tuples from next for all invalid nodes, even for completed transfers.
-- Change leave in model to be a local transition like in the implementation.
-- Handle "neither claimant valid" case in reconcile_ring.
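The 'joining' and 'down' rules described above can be illustrated with a small Python model. All function names and data shapes here are hypothetical. In particular, the promotion check requires every listed member to have seen the join, which is an assumption: the description above does not say whether 'down' members are excluded from that check.

```python
def gossip_targets(statuses, me):
    """Nodes this node may gossip to: everyone except itself and 'down' members."""
    return sorted(n for n, s in statuses.items() if n != me and s != "down")


def on_gossip_received(statuses, sender):
    """Decide how to answer incoming gossip from `sender`."""
    if statuses.get(sender) == "down":
        # The sender's ring may be outdated; ask it to rejoin the cluster.
        return "rejoin"
    return "reconcile"


def promote_joining(statuses, aware):
    """Claimant step: move 'joining' nodes to 'valid' once all members know them.

    aware maps each joining node to the set of members that have seen its join.
    """
    members = set(statuses)
    return {
        node: "valid" if status == "joining" and members <= aware.get(node, set()) else status
        for node, status in statuses.items()
    }
```

For example, with statuses {"a": "valid", "b": "down", "c": "joining"}, node a gossips only to c, and gossip arriving from b is answered with a rejoin request.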
Add pending ring percent as a field in member_status that displays a node's ring ownership after all pending ownership transfers have completed. Fix riak_core_ring:ring_ready_info so that only nodes considered for ring convergence are checked.
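As an illustration of what the pending percentage means, the Python sketch below computes current and pending ownership from a list of partition owners plus a list of scheduled transfers. The data shapes and the function name are assumptions for this example and do not mirror the riak_core ring record.

```python
from collections import Counter


def ring_percentages(owners, pending_transfers):
    """Return {node: (current_percent, pending_percent)}.

    owners: owning node for each partition, by position in the ring.
    pending_transfers: (position, new_owner) pairs not yet completed.
    """
    # Ownership as it will look once every pending transfer has completed.
    future = list(owners)
    for position, new_owner in pending_transfers:
        future[position] = new_owner

    size = len(owners)
    current, pending = Counter(owners), Counter(future)
    nodes = sorted(set(current) | set(pending))
    return {n: (100 * current[n] / size, 100 * pending[n] / size) for n in nodes}
```

A node that has just joined owns nothing yet, so its current percentage is 0 while its pending percentage already reflects the transfers scheduled to it.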
Change gossip to use ring_trans rather than set_my_ring in order to prevent a data race with concurrent ring changes based on join/leave/remove commands. Change the refresh_ring logic to be guarded by cluster name, avoiding the case where a stale refresh cast arrives at a node after it has already shut down and been restarted. Since the restarted node will have a new cluster name, the stale cast can be detected and ignored.
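The cluster-name guard can be pictured as a tag carried by each refresh message and checked on receipt. This Python sketch is a model only: the Node class, its fields, and the method name are hypothetical and do not reflect the gen_server messages used in riak_core.

```python
class Node:
    def __init__(self, cluster_name):
        # In this model a restarted node comes back with a new cluster name.
        self.cluster_name = cluster_name
        self.refreshes = 0

    def handle_refresh_cast(self, cast_cluster_name):
        """Apply a refresh only if it was addressed to the current cluster name."""
        if cast_cluster_name != self.cluster_name:
            # The cast was sent before the restart; ignore it.
            return False
        self.refreshes += 1
        return True
```

A cast tagged with the pre-restart name is dropped, while one tagged with the current name is applied.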
Fix existing riak_core tests to work with the new cluster membership code, and add several new tests that cover the new reconciliation logic. Update riak_core_ring:rename_node to support the new members and seen fields.
Change recursive_gossip to be done in two parts. The initial ring change uses random_recursive_gossip to start the recursive gossip at a random starting node; the reconciliation logic then continues to use the fixed recursive_gossip logic to propagate the gossip forward. This change decreases gossip hot spots when using recursive_gossip.
Consolidate code for random gossip into new riak_core_gossip:random_gossip, and update all locations that manually implement random gossip to use this function. Add new deterministic gossip code, riak_core_gossip:recursive_gossip, that sends a node's ring to its child vertices in a tree decomposition of the cluster members list. Change the "gossip on ring changed" code to use recursive_gossip. Continue using random_gossip for periodic (gossip_interval) gossip.
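One way to picture a tree decomposition of the members list is to lay the list out as a complete tree in list order, so that each node's children follow from its position. The Python sketch below does this with an assumed fanout of 2 and assumes member names are unique. The layout, the fanout, and both function names are illustrative assumptions, not the riak_core_gossip implementation.

```python
from collections import deque


def child_vertices(members, node, fanout=2):
    """Children of `node` when `members` is read as a complete tree in list order."""
    first = fanout * members.index(node) + 1
    return members[first:first + fanout]


def propagation_order(members, fanout=2):
    """Order in which nodes receive the ring when gossip starts at the first member."""
    order, queue = [], deque(members[:1])
    while queue:
        node = queue.popleft()
        order.append(node)
        # Each node forwards the ring only to its own children.
        queue.extend(child_vertices(members, node, fanout))
    return order
```

In this layout every member is reached exactly once and no node sends to more than `fanout` peers, which is the property that makes deterministic propagation cheaper than having every node gossip at random after a ring change.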
Merge vclocks on all ring reconciliation paths, revert to old ring ready behavior, and have all_members/1 use the private get_members/1.