Change riak_core_ring_manager and riak_core_app so that the ring manager is responsible for loading the ring file from disk, rather than starting with an empty ring and relying on the riak_core app to load the ring later. This avoids a race condition in which the ring manager writes the empty ring to disk before the riak_core app loads the prior ring. Note: Riak previously relied upon starting with a fresh ring in order to ensure secondary vnodes were started in case any had fallback data that needed to be handed off. Starting secondaries has long since been moved to riak_core_vnode_manager, which periodically starts secondary vnodes over time, so there is no longer any need to start with a fresh ring. This commit therefore always loads a saved ring when the ring manager starts, rather than starting with a fresh ring.
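The start-up order described above can be illustrated with a minimal, language-neutral sketch in Python. It is not the Erlang implementation: `init_ring`, `fresh_ring`, the JSON file format, and the fallback to a fresh ring when no file exists are all assumptions made for illustration.

```python
import json
import os

def fresh_ring(node):
    """Hypothetical stand-in for creating a brand-new ring owned by `node`."""
    return {"owner": node, "members": [node]}

def init_ring(ring_path, node):
    """Return the saved ring if one exists on disk, otherwise a fresh ring.

    The load happens inside the manager's own start-up, so there is no
    window in which an empty ring could be written over the saved one.
    """
    if os.path.exists(ring_path):
        with open(ring_path) as f:
            return json.load(f)
    # Assumption: with no saved ring (first boot), fall back to a fresh one.
    return fresh_ring(node)
```

The point of the design is ordering: the component that writes the ring is also the one that reads it first, so the two steps cannot race.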
When the stat cache crashes, we must re-register stat mods with the cache so that it works when restarted. Delete stats before register. This ensures that a restarted riak_core_stat will not leave any orphaned folsom stats. Folsom needs some work to handle crashing owners better: some tables in folsom are owned by the creating process, and some by folsom. If riak_core_stat crashes, folsom can be left in an inconsistent state. This change cleans up at start time.
Calls to calculate stats are spawn_linked to ensure that the calculating process exits if the cache exits, so the cache will also crash if a spawned process exits. This is bad. This change traps exits from spawned processes and informs waiting callers of the error, without crashing the cache or affecting other registered mods.
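The isolation goal can be sketched in Python under stated assumptions. The real change is in Erlang, where trapping exits is done with `process_flag(trap_exit, true)`; the sketch below uses threads and exceptions as an analogy only, and `calculate_all` and the `("ok", ...)` / `("error", ...)` result shape are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def calculate_all(stat_mods):
    """Run each stat module's producer in a worker and collect the results.

    A failing worker yields ("error", reason) for its own module only; the
    caller (standing in for the cache) keeps running and the other modules
    are unaffected.
    """
    results = {}
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(fn) for name, fn in stat_mods.items()}
        for name, fut in futures.items():
            try:
                results[name] = ("ok", fut.result())
            except Exception as exc:
                # The worker died: report the error instead of propagating it.
                results[name] = ("error", repr(exc))
    return results
```

A caller waiting on a failed module receives the error tuple for that module, while every other module still returns its value.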
Change riak_core_claimant to only force a ring update of a stalled ring if the ring is "ready" and has therefore already converged across the cluster. Without this change, ring convergence always appears to be stalled if a node is offline and therefore the force update happens over and over.
Change the handoff sender to use RPC to query the handoff_ip from the handoff listener rather than directly issuing a gen_server call. The call approach crashes against older nodes that do not expect a handoff_ip message.
Claimant transitions are triggered by ring update events. However, if a ring update event is somehow missed, and the ring reaches a steady state, then there will be no more update events and the claimant will stall. Likewise, when an older node joins the cluster, there is a race between when the capabilities system negotiates that the claimant should perform automatic joining and when the claimant transition occurs. Thus, the claimant may not properly auto join an older node in a mixed cluster. This commit resolves both cases by having the claimant periodically check for a stalled ring and trigger a force update if the ring is stalled. A stall is detected by checking whether running the claimant logic against the current ring would generate a ring that has a different cluster state than the current ring. If so, the ring is considered stalled and the update is forced.
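The stall check can be sketched as follows. This is a language-neutral illustration in Python, not the riak_core code: the ring is modeled as a plain dict, and `run_claimant`, `is_stalled`, `tick`, and the "joining"/"valid" member statuses are hypothetical names chosen for the example.

```python
def run_claimant(ring):
    """Hypothetical claimant logic: promote every "joining" member to "valid"."""
    state = {node: ("valid" if status == "joining" else status)
             for node, status in ring["cluster_state"].items()}
    return {**ring, "cluster_state": state}

def is_stalled(ring, claimant=run_claimant):
    """A ring is stalled if re-running the claimant would change cluster state."""
    return claimant(ring)["cluster_state"] != ring["cluster_state"]

def tick(ring, force_update, claimant=run_claimant):
    """One periodic check: force an update only when a stall is detected."""
    if is_stalled(ring, claimant):
        return force_update(ring)
    return ring
```

Because the check compares the would-be result with the current ring, a ring that has already converged produces no forced update, and a missed event is recovered on the next tick.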
Reintroduce gen_server to riak_core_stat rather than spawn per update. Due to review, also make stat cache ets table protected. Add infinity timeout to stat calculation call. Document the TTL param. Make error response more meaningful. Expose staleness. Calculate cache timestamp as early as possible. Add unit test for the cache.
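A TTL cache that exposes staleness might look like the sketch below. It is a Python illustration under assumptions, not the riak_core_stat cache: `StatCache`, its `get` signature, returning the timestamp alongside the value, and reading "as early as possible" as "before the calculation runs" are all interpretations made for the example.

```python
import time

class StatCache:
    """Hypothetical TTL cache: entries older than `ttl_seconds` are recalculated."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self.entries = {}  # name -> (timestamp, value)

    def get(self, name, calculate):
        """Return (value, timestamp) so callers can judge how stale the value is."""
        now = self.clock()
        entry = self.entries.get(name)
        if entry is not None and now - entry[0] < self.ttl:
            return entry[1], entry[0]
        # The timestamp is taken before the calculation, so a slow
        # calculation does not make the entry look fresher than it is.
        value = calculate()
        self.entries[name] = (now, value)
        return value, now
```

Returning the timestamp with the value is one simple way to let a caller decide whether a cached stat is recent enough for its purpose.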