Ensure legacy nodes are probed when new capabilities registered #219

Merged
merged 1 commit into from Aug 3, 2012

@jtuple

The capability system caches prior probes of legacy app vars when dealing
with legacy nodes. Prior to this commit, the logic was simple. If there
were any cached results, no probes were performed. Unfortunately, this
could lead to a race condition. If capabilities were probed before all
applications (e.g. riak_core, riak_kv) had started and registered their
capabilities, the cache would only include some results, and no probes
would be performed for the newly registered capabilities. This commit
makes things more fine-grained, checking for cached results of individual
capabilities.

This change does nothing for non-legacy nodes. All nodes that support
the capability system natively already worked with delayed registration.
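
The difference between the old and new cache checks can be modeled in a few lines. This is only a sketch in Python, not the riak_core code (which is Erlang); the function names, the dict-based cache, and the capability names `cap_a` and `cap_b` are all hypothetical and chosen for illustration.

```python
# Hypothetical model of the legacy-probe cache; not riak_core's actual API.

def probe_legacy_old(registered, cache, probe):
    """Pre-fix behavior: any cached result at all suppresses further probes."""
    if cache:
        return cache
    for cap in registered:
        cache[cap] = probe(cap)
    return cache


def probe_legacy_new(registered, cache, probe):
    """Post-fix behavior: probe each capability that has no cached result."""
    for cap in registered:
        if cap not in cache:
            cache[cap] = probe(cap)
    return cache


def simulate(probe_fn):
    """Probe once before riak_kv registers and once after; return what was probed."""
    probed = []

    def probe(cap):
        probed.append(cap)
        return "legacy-value"  # placeholder for the legacy app var's value

    cache = {}
    core_cap = ("riak_core", "cap_a")  # hypothetical capability names
    kv_cap = ("riak_kv", "cap_b")

    probe_fn([core_cap], cache, probe)          # only riak_core has registered
    probe_fn([core_cap, kv_cap], cache, probe)  # riak_kv registers later
    return probed
```

With `simulate(probe_legacy_old)` the late `riak_kv` capability is never probed, because the cache is already non-empty. With `simulate(probe_legacy_new)` it is probed on the second pass.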

@jtuple jtuple Ensure legacy nodes are probed when new capabilities registered
2007c4a
@jtuple

Very easy to verify using the rolling_capabilities test added in basho/riak_test#13

Test will fail on 1.2 branch, and pass on this branch.

@jonmeredith

+1 on code review, waiting for confirmation on riak_test run.

@Vagabond

riak_test run passed.

@jonmeredith

+1 merge

@jtuple jtuple merged commit 6e5ab7c into 1.2 Aug 3, 2012
@rzezeski

Isn't there also a race between a capability being added and the relevant application being started on the legacy node being probed? That is, query_capability treats undefined as resolved, but it could be undefined for two reasons: it really is undefined, or the application isn't started. If this race does exist, do we just not care about falling back to the default?

@jtuple

Ryan, yes, but that's very rare. The default is always safe, so a highly unlikely race is no big deal. This particular "race" that was fixed was hit 0% of the time a few weeks back, and 100% of the time since stats changes slowed down riak_kv start-up. Falling back to the defaults 100% of the time during a rolling upgrade was painful enough to block.

For your scenario, you would need to be restarting a legacy node concurrently with updating/restarting a 1.2 node. People generally don't restart more than 1 node at a time in practice. Also, in your case, the race only lasts until the application is loaded, which happens very early in the boot cycle. The fixed race that deals with registration is worse, because capabilities aren't registered until after the app has fully started up, launched all supervisors, and registered itself with riak_core.
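
The remaining race discussed above comes down to two different situations producing the same probe result. Here is a rough Python model of that ambiguity, assuming (as the thread describes) that an undefined result resolves to the default. The `UNDEFINED` sentinel, the function names, and the app var name `some_var` are hypothetical and do not come from riak_core.

```python
# Hypothetical model; riak_core itself is Erlang and its real API differs.
UNDEFINED = object()  # stands in for Erlang's `undefined`


def query_legacy_var(loaded_apps, app, var):
    """Read an app var from a legacy node, modeled as {app: {var: value}}."""
    env = loaded_apps.get(app)
    if env is None:
        return UNDEFINED            # application not loaded yet
    return env.get(var, UNDEFINED)  # loaded, but the var may really be unset


def resolve(probed, default):
    """Both kinds of UNDEFINED are indistinguishable and fall back to the default."""
    return default if probed is UNDEFINED else probed


not_loaded = {}                       # riak_kv has not been loaded
loaded_unset = {"riak_kv": {}}        # riak_kv loaded, var not set
loaded_set = {"riak_kv": {"some_var": "v1"}}
```

Resolving a probe against `not_loaded` and against `loaded_unset` gives the same default, which is why the race is harmless as long as the default is always safe.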

@seancribbs seancribbs deleted the jdb-fix-legacy-capabilities branch Apr 1, 2015