Calls to calculate stats are run with spawn_link so that the calculating process exits if the cache exits. As a side effect, the cache crashes if a spawned process crashes. This change makes the cache trap exits from spawned processes and inform awaiting callers of the error, without crashing the cache or affecting other registered mods.
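The intended behaviour can be illustrated with a small sketch. It is written in Python rather than Erlang and is not the riak_core implementation: `StatCache`, `register`, `get_stats`, and the mod names are all hypothetical, and a thread pool stands in for spawned processes. It only shows the idea that a failed calculation becomes an error reply to the caller while the cache keeps serving other mods.

```python
from concurrent.futures import ThreadPoolExecutor


class StatCache:
    """Toy stand-in for the stat cache (hypothetical, not riak_core's API)."""

    def __init__(self):
        self._pool = ThreadPoolExecutor(max_workers=4)
        self._mods = {}  # mod name -> zero-argument calculation function

    def register(self, mod, calc_fn):
        self._mods[mod] = calc_fn

    def get_stats(self, mod):
        # Run the calculation in a separate worker, loosely mirroring
        # a spawned calculating process.
        future = self._pool.submit(self._mods[mod])
        try:
            return ("ok", future.result())
        except Exception as exc:
            # Analogue of handling an exit signal: the failure is turned
            # into an error reply instead of taking the cache down.
            return ("error", repr(exc))


def good_stats():
    return {"gets": 10}


def bad_stats():
    raise RuntimeError("calculation failed")


cache = StatCache()
cache.register("kv", good_stats)
cache.register("broken", bad_stats)

broken_reply = cache.get_stats("broken")  # caller is informed of the error
kv_reply = cache.get_stats("kv")          # other mods are unaffected
print(broken_reply)
print(kv_reply)
```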
Change riak_core_claimant to only force a ring update of a stalled ring if the ring is "ready" and has therefore already converged across the cluster. Without this change, ring convergence always appears to be stalled while a node is offline, so the forced update is triggered over and over.
Change the handoff sender to use RPC to query the handoff_ip from the handoff listener rather than directly issuing a gen_server call. The call approach crashes against older nodes that do not expect a handoff_ip message.
Claimant transitions are triggered by ring update events. However, if a ring update event is somehow missed and the ring reaches a steady state, there will be no more update events and the claimant will stall. Likewise, when an older node joins the cluster, there is a race between when the capabilities system negotiates that the claimant should perform automatic joining and when the claimant transition occurs. Thus, the claimant may not properly auto-join an older node in a mixed cluster. This commit resolves both cases by having the claimant periodically check for a stalled ring and force an update if the ring is stalled. A stall is detected by checking whether running the claimant logic against the current ring would generate a ring with a different cluster state than the current ring. If so, the ring is considered stalled and the update is forced.
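The stall check described above can be sketched roughly as follows. This is a Python illustration under stated assumptions, not the riak_core code: `Ring`, `run_claimant`, `is_stalled`, and `periodic_check` are hypothetical names, the ring is reduced to a tuple of (node, status) pairs, and the "claimant logic" is simplified to promoting joining nodes.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Ring:
    # Hypothetical, heavily simplified ring state.
    members: tuple  # tuple of (node, status) pairs
    version: int = 0


def run_claimant(ring):
    """Hypothetical claimant logic: promote any 'joining' node to 'valid'."""
    members = tuple(
        (node, "valid" if status == "joining" else status)
        for node, status in ring.members
    )
    return replace(ring, members=members)


def is_stalled(ring):
    # The ring is stalled if the claimant would still change the
    # cluster state when run against the current ring.
    return run_claimant(ring).members != ring.members


def periodic_check(ring):
    """One tick of the periodic check; returns the ring to use afterwards."""
    if is_stalled(ring):
        # Force an update: apply the claimant result and bump the version.
        return replace(run_claimant(ring), version=ring.version + 1)
    return ring


stalled = Ring(members=(("node1", "valid"), ("node2", "joining")))
updated = periodic_check(stalled)   # update is forced
settled = periodic_check(updated)   # nothing left to do, ring unchanged
print(updated)
print(settled is updated)
```

Comparing the claimant's would-be result with the current ring means the check needs no record of which update events were missed; it only asks whether work remains.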
Reintroduce gen_server to riak_core_stat rather than spawn per update

Due to review, also make stat cache ets table protected.
Add infinity timeout to stat calculation call
Doc the TTL param
Make error response more meaningful
Expose staleness
Calculate cache timestamp as early as possible
Add unit test for the cache
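As a rough illustration of the TTL, staleness, and early-timestamp items, here is a minimal Python sketch. It is not the riak_core_stat cache: `TTLStatCache`, its `get` method, and the injectable `clock` are hypothetical, and reporting staleness as an age in seconds is an assumption about what "expose staleness" means.

```python
import time


class TTLStatCache:
    """Hypothetical TTL cache that reports how stale each answer is."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds  # how long a cached value stays fresh
        self._clock = clock      # injectable so tests can control time
        self._entries = {}       # mod -> (timestamp, stats)

    def get(self, mod, calc_fn):
        now = self._clock()
        entry = self._entries.get(mod)
        if entry is not None and now - entry[0] < self._ttl:
            ts, stats = entry
            return stats, now - ts  # cached value plus its age
        # Take the timestamp before the calculation starts, so the time
        # spent calculating counts towards the entry's age.
        ts = self._clock()
        stats = calc_fn()
        self._entries[mod] = (ts, stats)
        return stats, self._clock() - ts


fake_now = [100.0]
calls = []


def calc():
    calls.append(1)
    return {"puts": 3}


cache = TTLStatCache(ttl_seconds=5, clock=lambda: fake_now[0])
first = cache.get("kv", calc)    # calculated
fake_now[0] = 103.0
second = cache.get("kv", calc)   # served from cache, 3 seconds old
fake_now[0] = 106.0
third = cache.get("kv", calc)    # TTL expired, recalculated
print(first, second, third, len(calls))
```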
Treat older nodes joining the cluster the same as new nodes joining with the "auto-join" property, since older nodes do not support staged joins.
The vnode rolling start built into the vnode manager causes a race condition in the `all_nodes` call where a vnode is started under the sup before the vnode manager sees it. Thus the children under the sup and the vnodes tracked by the manager do not agree.