There are a few places I didn't touch, as it was unclear whether the values needed to be monotonic or not: specifically core_claimant, core_gossip, core_ring and core_ring_manager.
The capability system caches prior probes of legacy app vars when dealing with legacy nodes. Prior to this commit, the logic was simple: if there were any cached results, no probes were performed. Unfortunately, this could lead to a race condition. If capabilities were probed before all applications (e.g. riak_core, riak_kv) had started and registered their capabilities, the cache would contain only partial results, yet no probes would be performed for the newly registered capabilities. This commit makes the logic more fine-grained, checking for cached results of each individual capability. This change does nothing for non-legacy nodes; all nodes that support the capability system natively already handled delayed registration.
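The fine-grained check can be sketched roughly as follows. This is not the actual riak_core code; the function and cache shape are hypothetical, but it shows the idea: probe only the capabilities missing from the cache, rather than skipping all probes whenever the cache is non-empty.

```erlang
%% Hypothetical sketch: probe_app_var/2 is an assumed helper that
%% performs the legacy app-var probe for a single capability.
probe_missing_capabilities(Node, Capabilities, Cache) ->
    Missing = [Cap || Cap <- Capabilities,
                      not dict:is_key({Node, Cap}, Cache)],
    lists:foldl(fun(Cap, CacheAcc) ->
                        Result = probe_app_var(Node, Cap),
                        dict:store({Node, Cap}, Result, CacheAcc)
                end, Cache, Missing).
```

With the old all-or-nothing check, a capability registered after the first probe would never be probed; here it simply shows up in `Missing` on the next pass.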
Changed the forced-update logic in riak_core_claimant to not perform a forced update when we have pending staged joins and no auto-joining nodes. Forcing a ring update because of staged joins will not actually change the ring, because staged joins do not transition until committed. This was a false-positive detection of a stalled ring.
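The condition amounts to a small predicate; this is a hypothetical sketch, not the claimant's actual code:

```erlang
%% Hypothetical sketch: only force a ring update when auto-joining
%% nodes exist; pending staged joins alone cannot change the ring
%% until they are committed, so forcing would be a no-op.
should_force_update(StagedJoins, AutoJoins) ->
    (AutoJoins =/= []) orelse (StagedJoins =:= []).
```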
All stat mods depend on folsom, yet they were not linked to it. This change brings folsom under supervision of a core stat sup, which also supervises the riak stat subsystem. Now when folsom exits, everyone gets to restart clean and recover.

riak_core_sup
|
+- riak_core_stat_sup (rest_for_one)
   +- folsom_sup
   +- riak_core_stats_sup (one_for_one)
      +- riak_*_stat
      +- riak_stat_cache

riak_core_stats_sup will start and supervise gen_server stat mods at registration time, and will restart them should the sup crash.
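A sketch of what riak_core_stat_sup's init/1 might look like (the exact child specs are assumptions, not the committed code). The key point is the rest_for_one strategy: if folsom_sup dies, riak_core_stats_sup and everything under it are restarted after it, so the stat mods recover against a fresh folsom.

```erlang
%% Hypothetical sketch of a rest_for_one supervisor: children are
%% restarted in order, and a crash in folsom_sup also restarts
%% riak_core_stats_sup, which follows it in the child list.
init([]) ->
    Children =
        [{folsom_sup, {folsom_sup, start_link, []},
          permanent, infinity, supervisor, [folsom_sup]},
         {riak_core_stats_sup, {riak_core_stats_sup, start_link, []},
          permanent, infinity, supervisor, [riak_core_stats_sup]}],
    {ok, {{rest_for_one, 10, 10}, Children}}.
```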
Change riak_core_ring_manager and riak_core_app so that the ring manager is responsible for loading the ring file from disk, rather than starting with an initially empty ring and relying on the riak_core app to load the ring later. This avoids a race condition in which the ring manager writes the empty ring to disk before the riak_core app loads the prior ring. Note: Riak previously relied on starting with a fresh ring in order to ensure secondary vnodes were started in case any had fallback data that needed to be handed off. The act of starting secondaries has long since moved to the riak_core_vnode_manager, which periodically starts secondary vnodes over time, so there is no longer any need to start with a fresh ring. This commit therefore always loads a saved ring when the ring_manager starts, rather than starting with a fresh ring.
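The shape of the fix can be sketched as below. This is an illustrative assumption about the init logic, not the committed code, though `find_latest_ringfile/0`, `read_ringfile/1`, and `riak_core_ring:fresh/0` are real riak_core functions; the state tuple is simplified.

```erlang
%% Hypothetical sketch: load the prior ring inside the ring manager's
%% own init, so nothing can write an empty ring to disk before the
%% saved ring has been read.
init([Mode]) ->
    Ring = case riak_core_ring_manager:find_latest_ringfile() of
               {ok, RingFile} ->
                   {ok, Saved} =
                       riak_core_ring_manager:read_ringfile(RingFile),
                   Saved;
               {error, not_found} ->
                   riak_core_ring:fresh()
           end,
    {ok, {Mode, Ring}}.
```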
When the stat cache crashes, we must re-register stat mods with the cache so that it works when restarted. Delete stats before registering: this ensures that a restarted riak_core_stat will not leave any orphaned folsom stats. Folsom needs some work to handle crashing owners better. Some tables in folsom are owned by the creating process, and some by folsom itself; if riak_core_stat crashes, folsom can be left inconsistent. This cleans up at start time.
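The delete-before-register step might look roughly like this. `is_app_metric/2` and `register_stat/2` are hypothetical helpers; `folsom_metrics:get_metrics/0` and `folsom_metrics:delete_metric/1` are real folsom calls.

```erlang
%% Hypothetical sketch: on (re)start, remove any folsom metrics this
%% app previously created before registering fresh ones, so a restart
%% never leaves orphaned folsom entries behind.
register_stats(App, Stats) ->
    [folsom_metrics:delete_metric(Name)
        || Name <- folsom_metrics:get_metrics(),
           is_app_metric(App, Name)],
    [register_stat(App, Stat) || Stat <- Stats],
    ok.
```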
Since calls to calculate stats are spawn_linked to ensure that the calculating processes exit if the cache exits, the cache will crash if a spawned process exits abnormally. This is bad. This change traps exits from spawned processes and informs awaiting callers of the error, without crashing the cache or affecting other registered mods.
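The pattern in a gen_server looks roughly like this (a sketch, not the cache's actual code; `reply_to_awaiting/3` is a hypothetical helper). With `trap_exit` set, the abnormal exit of a linked calculation process arrives as an `{'EXIT', Pid, Reason}` message instead of killing the cache.

```erlang
%% Hypothetical sketch: trap exits so a crashed calculation process
%% becomes a message the cache can handle gracefully.
init(Args) ->
    process_flag(trap_exit, true),
    {ok, Args}.

handle_info({'EXIT', _Pid, normal}, State) ->
    %% Normal completion; the result was already delivered.
    {noreply, State};
handle_info({'EXIT', Pid, Reason}, State) ->
    %% Tell every caller waiting on Pid's result about the error,
    %% leaving the cache and other registered mods untouched.
    State1 = reply_to_awaiting(Pid, {error, Reason}, State),
    {noreply, State1}.
```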