Skip to content

HTTPS clone URL

Subversion checkout URL

You can clone with
or
.
Download ZIP

Loading…

Ensure legacy nodes are probed when new capabilities registered #219

Merged
merged 1 commit into from

4 participants

@jtuple

The capability system caches prior probes of legacy app vars when dealing
with legacy nodes. Prior to this commit, the logic was simple. If there
were any cached results, no probes were performed. Unfortunately, this
could lead to a race condition. If capabilities were probed before all
applications (eg. riak_core, riak_kv) had started and registered their
capabilities, the cache would only include some results, and no probes
would be performed for the newly registered capabilities. This commit
makes things more fine-grained, checking for cached results of individual
capabilities.

This change does nothing for non-legacy nodes. All nodes that support
the capability system natively already worked with delayed registration.

@jtuple jtuple Ensure legacy nodes are probed when new capabilities registered
The capability system caches prior probes of legacy app vars when dealing
with legacy nodes. Prior to this commit, the logic was simple. If there
were any cached results, no probes were performed. Unfortunately, this
could lead to a race condition. If capabilities were probed before all
applications (eg. riak_core, riak_kv) had started and registered their
capabilities, the cache would only include some results, and no probes
would be performed for the newly registered capabilities. This commit
makes things more fine-grained, checking for cached results of individual
capabilities.

This change does nothing for non-legacy nodes. All nodes that support
the capability system natively already worked with delayed registration.
2007c4a
@jtuple

Very easy to verify using the rolling_capabilities test added in basho/riak_test#13

Test will fail on 1.2 branch, and pass on this branch.

@jonmeredith
Owner

+1 on code review, waiting for confirmation on riak_test run.

@Vagabond

riak_test run passed.

@jonmeredith
Owner

+1 merge

@jtuple jtuple merged commit 6e5ab7c into 1.2
@rzezeski

Isn't there also a race between a capability being added and the relevant application being started on the legacy node being probed? That is, query_capability treats undefined as resolved but it could be undefined for two reasons: it really is undefined or the application isn't started. If this race does exist do we just not care about falling back to the default?

@jtuple

Ryan, yes, but that's very rare. The default is always safe, so highly unlikely race is no big deal. This particular "race" that was fixed was hit 0% of the time a few weeks back, and 100% of the time since stats changes slowed down riak_kv start-up. Falling back to the defaults 100% of the time during a rolling upgrade was painful enough to block.

For your scenario, you would need to be restarting a legacy node concurrently with updating/restarting a 1.2 node. People general don't ever restart more than 1 node at a time in practice. Also, in your case, the race is only until the application is loaded. Which happens very early in the boot cycle. The fixed race that deals with registration is worse off, because capabilities aren't registered until after the app has fully started up, launched all supervisors, and registered itself with riak_core.

@seancribbs seancribbs deleted the jdb-fix-legacy-capabilities branch
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Commits on Aug 3, 2012
  1. @jtuple

    Ensure legacy nodes are probed when new capabilities registered

    jtuple authored
    The capability system caches prior probes of legacy app vars when dealing
    with legacy nodes. Prior to this commit, the logic was simple. If there
    were any cached results, no probes were performed. Unfortunately, this
    could lead to a race condition. If capabilities were probed before all
    applications (eg. riak_core, riak_kv) had started and registered their
    capabilities, the cache would only include some results, and no probes
    would be performed for the newly registered capabilities. This commit
    makes things more fine-grained, checking for cached results of individual
    capabilities.
    
    This change does nothing for non-legacy nodes. All nodes that support
    the capability system natively already worked with delayed registration.
This page is out of date. Refresh to see the latest.
Showing with 23 additions and 11 deletions.
  1. +23 −11 src/riak_core_capability.erl
View
34 src/riak_core_capability.erl
@@ -205,11 +205,12 @@ init_state(Registered) ->
handle_call({register, Capability, Info}, _From, State) ->
State2 = register_capability(node(), Capability, Info, State),
- State3 = renegotiate_capabilities(State2),
- publish_supported(State3),
- update_local_cache(State3),
- save_registered(State3#state.registered),
- {reply, ok, State3};
+ State3 = update_supported(State2),
+ State4 = renegotiate_capabilities(State3),
+ publish_supported(State4),
+ update_local_cache(State4),
+ save_registered(State4#state.registered),
+ {reply, ok, State4};
handle_call({ring_changed, Ring}, _From, State) ->
State2 = update_supported(Ring, State),
@@ -265,6 +266,10 @@ reload(State) ->
save_registered(State3#state.registered),
State3.
+update_supported(State) ->
+ {ok, Ring} = riak_core_ring_manager:get_raw_ring(),
+ update_supported(Ring, State).
+
%% Update this node's view of cluster capabilities based on a received ring
update_supported(Ring, State) ->
AllSupported = get_supported_from_ring(Ring),
@@ -278,11 +283,7 @@ update_supported(Ring, State) ->
{[], []} ->
add_node(Node, Supported, StateAcc);
{[], _} ->
- %% While the ring has no knowledge of Node's
- %% supported modes, the local view has prior
- %% knowledge. Do nothing and continue to use
- %% the existing modes.
- StateAcc;
+ add_node(Node, Supported, StateAcc);
{Same, Same} ->
StateAcc;
{_, _} ->
@@ -552,15 +553,26 @@ get_supported_from_ring(Ring) ->
%% Determine capabilities of legacy nodes based on app.config settings and
%% the provided app-var -> mode mapping associated with capabilities when
%% registered.
-query_capabilities(Node, #state{registered=Registered}) ->
+query_capabilities(Node, State=#state{registered=Registered}) ->
+ %% Only query results we do not already have local knowledge of
+ Known = dict:from_list(get_supported(Node, State)),
lists:mapfoldl(fun({Capability, Info}, ResolvedAcc) ->
{Resv, Cap} = query_capability(Node,
+ Known,
Capability,
Info#capability.default,
Info#capability.legacy),
{Cap, ResolvedAcc and Resv}
end, true, Registered).
+query_capability(Node, Known, Capability, DefaultSup, LegacyVar) ->
+ case dict:find(Capability, Known) of
+ {ok, Supported} ->
+ {true, {Capability, Supported}};
+ error ->
+ query_capability(Node, Capability, DefaultSup, LegacyVar)
+ end.
+
query_capability(_, Capability, DefaultSup, undefined) ->
Default = {Capability, [DefaultSup]},
{true, Default};
Something went wrong with that request. Please try again.