Ring Resizing #301

Closed
wants to merge 30 commits

2 participants

@jrwest

Ring Resizing enables riak_core applications to change the number of partitions (running vnodes) during normal operations, under load. Previously, a cluster was limited to always having ring_creation_size partitions. To change the number of partitions, a separate cluster had to be stood up alongside the original and the data migrated between the two by external means.

The intended purpose of this feature is to support users who create a cluster with either too few or too many partitions and need to change that without the hassle described above. It is not intended as a scaling feature that adds or removes concurrent processing ability; Riak already supports that via adding/removing nodes. However, since the number of partitions ultimately limits the number of nodes in the cluster, ring resizing can be used to increase capacity in that regard. In short, the feature is intended for infrequent use in specific scenarios.
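
For context on how the operation is driven: resizing is staged, planned, and committed like any other cluster operation. A sketch of the console workflow (the exact riak-admin arguments here are an assumption based on the staged-clustering commands this builds on, not taken from this page):

    riak-admin cluster resize-ring 128    # stage a resize to 128 partitions
    riak-admin cluster plan               # review the scheduled resize operation
    riak-admin cluster commit             # begin resize transfers under load

An in-flight resize can likewise be aborted by staging and committing an abort, as described in the commits below.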

This feature is a work-in-progress that will undergo several more iterations. Some details on that work as well as issues with the current implementation are discussed below.

For a more detailed description of the implementation, see here.

Goals

Besides the obvious goal that the cluster should safely transition between the old and new rings -- ensuring requests for unmodified keys return the same data and new writes are reflected in both rings -- the implementation was designed with the following goals in mind:

  • During the operation requests for existing keys should not return not_found
  • The minimum number of transfers to ensure safety should be made
  • The implementation should be as ignorant as possible of the N values used to store keys in Riak
  • The operation should be cancelable and support forceful replacement of nodes in the case that SHTF

The Rest of the Story

There are some caveats and future work that are worth mentioning here.

  • Rehashing: When the ring expands, some existing partitions do not move to new owners, and in that case none of their data is removed. In the future, a "rehashing" operation will be submitted during the cleanup cycle that removes keys no longer owned by these existing partitions.
  • Repl: Replication does not support clusters of different ring sizes. The only approach that currently works is to disable realtime until both clusters have been resized and then re-establish the repl link, at which point a full-sync will probably be necessary.
  • Perf. Improvements: There is a known optimization for determining whether keys should be transferred that is not included here because it still has some kinks to work out; a different optimization is now included instead (see [1] for a microbenchmark). Performance improvements are also expected thanks to the many improvements coming to handoff/gossip in general.
  • Memory & Storage Considerations: As with the old method of resizing (using two clusters), to safely transition to the new ring, in the worst case the entire data set is copied (an ugly consequence of not cleaning up data until the new ring is installed). Clusters should have the available disk space (and memory, if using Bitcask) to account for this.

[1] https://gist.github.com/jrwest/4940f009b2adb928ef59

jrwest added some commits
@jrwest jrwest add support for resizable ring
Allows the ring to be resized as a cluster operation similar to join/leave.
Only expansion is allowed for now, but the restriction is temporary, and shrinking
requires only small modifications.

When the operation is staged/committed, the claimant calculates a resized ring,
running it through claim. It then schedules a "resize operation" for each
partition, via the next list. Each partition is scheduled for the operation because
during resize, each index potentially has data to send to others. Per partition,
the resize operation consists of several "resize transfers", each of which must be
completed to complete the entire operation. The first of these resize transfers is
scheduled by the claimant. Subsequent transfers are scheduled by the vnode that
owns the partition. When all of the transfers for a partition have completed,
the operation is marked as such in the next list. Once the operation has completed
for *all* partitions the claimant installs the resized ring.

The handoff subsystem has been modified to add a new transfer type: resize_transfer.
In addition, the handoff sender now supports applying a function to keys that are
not sent (due to the handoff filter). The function operates on the unsent key and
a per-sender defined accumulator. Resize transfers are triggered like hinted and
ownership handoff by the vnode manager and vnode. The primary difference between
those transfers and a resize_transfer is that the latter includes a filter
that only sends keys between the source and target partitions if, for both, the
key hashes to the same position in the preflist. Resize transfers also include an
unsent keys function and accumulator that determines which partitions the key
would be destined for. The vnode uses this information to schedule its subsequent
resize transfers. A partition which has not been transferred to yet and is in the
returned list will be scheduled. No more resize transfers will be scheduled when
the returned list contains no partitions or only partitions that have already been
scheduled.

One major difference between typical transfers and a resize operation is that the partition
does not delete its data (or unregister) after completing. Some partitions will
remain running, but a portion of their data should be removed. Other partitions will no
longer live on their original owner and should have all of their data deleted. This is
not addressed in this commit.

During resizing, how vnodes forward has been modified as well. Like regular handoff,
forwarding is performed both explicitly (when the vnode returns {forward,...} from
handle_handoff_command) and implicitly (after a transfer has completed). The difference
is that since the operation consists of multiple transfers, messages can be implicitly
and explicitly forwarded at the same time. Messages whose "hash" maps to a portion of
the keyspace already transferred as part of the operation will be implicitly forwarded.
Messages whose "hash" maps to a portion of the keyspace not yet transferred can be
explicitly forwarded. To determine the hash of a request, a new vnode callback has
been added: request_hash. This callback is optional unless the application supports
dynamic ring. The callback can also return undefined instead of a hash, in which case
the message will be handled locally and not forwarded implicitly. Special care must also
be taken to handle messages on a vnode forwarded from another, because primaries take part
in forwarding during resize. Coverage requests are not forwarded during a resize.
4146c5a
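
To make the filter described in the commit above concrete, here is a minimal, self-contained sketch of the invariant it enforces (a toy module written for this review; the names and the sorted-list ring representation are illustrative, not the actual riak_core implementation):

    -module(resize_filter_sketch).
    -export([should_send/3]).

    %% Toy stand-in for a chash ring: a sorted, non-empty list of partition
    %% indexes. A key's preflist starts at the first index greater than the
    %% key hash, wrapping around at the top of the ring.
    position(KH, Idx, Ring) ->
        Owner = case [I || I <- Ring, I > KH] of
                    []      -> hd(Ring);   %% wrapped past the last partition
                    [H | _] -> H
                end,
        index_of(Idx, rotate_to(Owner, Ring)).

    rotate_to(Owner, Ring) ->
        {Before, After} = lists:splitwith(fun(I) -> I =/= Owner end, Ring),
        After ++ Before.

    index_of(X, [X | _]) -> 0;
    index_of(X, [_ | T]) -> 1 + index_of(X, T).

    %% The resize_transfer filter invariant: a key is sent from SrcIdx (a
    %% partition in the old ring) to TgtIdx (a partition in the new ring)
    %% only if it occupies the same preflist position relative to both.
    should_send(KH, {SrcIdx, OldRing}, {TgtIdx, NewRing}) ->
        position(KH, SrcIdx, OldRing) =:= position(KH, TgtIdx, NewRing).

For example, with OldRing = [0,4,8,12] and NewRing = [0,2,4,6,8,10,12,14], should_send(5, {8, OldRing}, {6, NewRing}) returns true (the key sits at position 0 in both preflists), while should_send(5, {12, OldRing}, {6, NewRing}) returns false.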
@jrwest jrwest begin cleanup after resized ring
After a ring resizes there are several cleanup operations that must
be performed. In some cases a partition no longer exists (if the ring
shrinks) or its owner has been moved to another node. In these cases the
vnode must be unregistered and all its data deleted. In other cases, where
the vnode does continue to exist and has not moved, some data must be removed
but the vnode continues to run. This commit addresses the former.

After completion, the resized ring will have deletions scheduled for the
appropriate indexes. The deletions are scheduled like any other ring transition.
Although this is considered part of the resize operation, at this time the
future ring has been installed and the vnodes targeted for cleanup appear as fallbacks.
After each deletion is complete, the vnode unregisters. After all deletions are
complete the full resize operation is considered complete.
3ca6c71
@jrwest jrwest ability to shrink ring
Not many changes were necessary, with the exception that when the ring shrinks
some indexes disappear, which needs to be handled specially. Note: this only works
assuming NewRingSize > MaxN, where MaxN is the largest N value used on any bucket
in the cluster
03b10ca
@jrwest jrwest prevent staging resize when not all nodes support it d10de0d
@jrwest jrwest update console output to account for changing ring size dd0cfb2
@jrwest jrwest reduce impact on gossip during resize using 'set_only' 275feb8
@jrwest jrwest support for when SHTF during ring resize
Allow the ring resize operation to be aborted. The operation is staged like
any other cluster op. The abort will only be staged/performed if the
operation is in the process of performing resize transfers (the resized ring
has not yet been installed). Otherwise, there is nothing to abort, so
the operation is ignored.

Support force-replace during resize. Works like force-replace during
typical cluster operations. Transfers are rescheduled on the replacing
node under the assumption that the replacing node may have been rebuilt
from a recent backup.

force-remove is not supported during a resize because it may end up causing too
many transfers to be rescheduled on the remaining nodes, overloading
them. Instead, a force-remove is performed by first aborting
the resize operation.
bc19fd3
@jrwest jrwest move mgmt of new vnode proxies during resize to ring handler
also add stopping of extra proxies abandoned during an abort
of a resize or a completed shrinking of the ring
1f0f7f5
@jrwest jrwest honor forced_ownership_handoff for resize operations
ensures that at most forced_ownership_handoff vnodes will be performing
resize transfers across the cluster at once

also, the previous code was just wrong
7ea88c4
@jrwest jrwest use object_info/1 to hash keys for filtering during resize
using riak_core_util:chash_key was not flexible enough for other
applications (e.g. riak_search). This makes object_info/1 a required
callback for applications implementing dynamic ring.
4aecde1
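
For illustration, a hedged sketch of what the two dynamic-ring callbacks discussed in this PR might look like in an application's vnode module (the request shape and the {Bucket, Hash} return are assumptions inferred from the surrounding discussion, not copied from riak_kv):

    -module(dynring_callbacks_sketch).
    -export([object_info/1, request_hash/1]).

    %% Hash an object's key for resize-transfer filtering. The return
    %% shape is assumed, not taken from riak_kv.
    object_info({Bucket, Key}) ->
        {Bucket, riak_core_util:chash_key({Bucket, Key})}.

    %% Hash an incoming request so it can be forwarded during resize.
    %% Returning undefined forces the request to be handled locally.
    request_hash({get_req, Bucket, Key}) ->   %% hypothetical request shape
        riak_core_util:chash_key({Bucket, Key});
    request_hash(_Other) ->
        undefined.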
@jrwest jrwest re-implement forwarding during ring resizing
all requests have the opportunity to be handled locally before being
forwarded, to ensure no data loss during an abort. This means that ring
resizing only uses explicit forwarding via handle_handoff_command.
57e7dc5
@jrwest jrwest add ring resizing impl. doc 1299cf5
jrwest added some commits
@jrwest jrwest improve performance and fix several bugs in future index calculation
* make future index a constant-time calculation (it was O(RingSize * FutureRingSize) before)
* for no good reason the notsent acc was a linked list instead of a set
* during shrinking, the n-value for a preflist could be, depending on ownership assignment and
  order of transfers, implicitly grown (e.g. N=3 -> N=5), leaving behind unreachable data. This
  is because primaries involved in shrinking may transfer data out after receiving data in. In
  this case it is necessary to identify keys that are owned by the source partition in the *future*
  ring. This is done by detecting two conditions: the first is when the position of the source in
  the current preflist is greater than or equal to the size of the new ring (an impossible N value
  in the new ring); the second is an optional N-value threshold, which solves a specific case when
  halving the ring where the first condition is not triggered until data has already been copied
  where it shouldn't be
e4d4671
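
A toy reconstruction of the constant-time calculation described in the commit above, using a small hash space so the arithmetic is easy to follow (the simplified signature is illustrative; the real function also handles the invalid-index and N-value-threshold cases shown in the diffs below):

    -module(future_index_sketch).
    -export([future_index/4]).

    %% CHashInt: key hash as an integer. OrigDist: position of the source
    %% partition in the key's current preflist. RingTop: size of the hash
    %% space. FutureSize: number of partitions in the resized ring.
    future_index(CHashInt, OrigDist, RingTop, FutureSize) ->
        NextInc = RingTop div FutureSize,
        %% head of the key's preflist in the future ring
        NextOwner = ((CHashInt div NextInc) + 1) * NextInc,
        %% same preflist position as the source, wrapped around the ring
        (NextOwner + NextInc * OrigDist) rem RingTop.

For example, with a toy hash space of 16 and a future ring of 8 partitions (NextInc = 2), a key hashing to 5 whose source partition sits at preflist position 1 maps to future_index(5, 1, 16, 8) =:= 8.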
@jrwest jrwest fix race condition on node-startup with proxy shutdown check
the module is not yet registered with riak_core, but proxy_sup can have childspecs
for the app's vnode supervisors before this happens
5f3a640
@jrwest jrwest remove unneeded change to riak_core_ring:all_next_owners
its implementation became a bit misleading with these changes and was only
needed to support one part of the codebase (starting extra vnode proxies
when the ring expands)
1f85e23
@jrwest jrwest add eqc test for riak_core_ring:future_index 8842397
@jrwest jrwest stop-gap prevention of ring resizing for unsupported applications
* see comment in riak_core_claimant on why this is a hack
* search will not, initially, be supported
* control may break as ring status evolves to provide more observability during
  the operation. Failures would be transient, but it's easier to require disabling
  it than to require that it not be accessed during the operation. In the future, control
  will be aware of the operation and this can be dropped altogether
8441aff
@jrwest jrwest generalize how vnode builds bucket->nval map for a resize_transfer 7f6a56c
@jrwest jrwest referenced this pull request in basho/riak_test
Merged

Ring Resizing Test #274

jrwest added some commits
@jrwest jrwest pass transfer type to Mod:handoff_starting 2820171
@jrwest jrwest remove debug log that snuck into previous commit 8cd077b
@jrwest jrwest don't crash plan display on resize abort f27a7c9
@jrwest jrwest prevent passing of commands to vnode mod after post-resize delete
unlike vnodes that are marked as deleted during normal ownership transfer,
the vnodes that are deleted during the cleanup process do not forward. As
a result, they may receive messages between the time they unregister and when
they are shut down.
134e3ac
@jrwest jrwest cleanup straggling resize abort metadata 7fc1611
src/riak_core_claimant.erl
@@ -436,6 +488,10 @@ filter_changes(Changes, Ring) ->
end, Changes).
%% @private
+filter_changes_pred(_, {resize, _}, _, _) ->
+ true;
+filter_changes_pred(_, abort_resize, _, _) ->
+ true;
@jtuple
jtuple added a note

Why do we need the two special cases for resize and abort_resize here? Does the "not invalid" catch-all case on line 502 not work? Will we ever have invalid members involved in a transfer?

@jrwest
jrwest added a note

we don't need them. artifact of some original work when I was less familiar w/ the claimant. will remove.

src/riak_core_claimant.erl
((14 lines not shown))
end.
%% @private
+maybe_compute_resize(Orig, MbResized) ->
+ OrigSize = riak_core_ring:num_partitions(Orig),
+ NewSize = riak_core_ring:num_partitions(MbResized),
+
+ case OrigSize =/= NewSize of
+ false -> MbResized;
+ true -> compute_resize(Orig, MbResized)
+ end.
+
+%% @private
+compute_resize(Orig, Resized) ->
+ %% work with the resized and balanced ring, the
+ %% original ring will be reset when changes are scheduled
+ CState0 = riak_core_ring:future_ring(Resized),
+
@jtuple
jtuple added a note

So, we have Orig, Resized and CState0. I believe this corresponds to existing/pre-resized ring, resized-but-not-yet-balanced ring, and balanced ring respectively, right? Perhaps a comment here would be helpful. Specifically, noting the difference between Resized and future_ring(Resized).

@jrwest
jrwest added a note

yes that is correct. will add a function comment with a bit more detail about the arguments and a body comment about the use of future_ring(Resized)

src/riak_core_claimant.erl
((53 lines not shown))
+
+ riak_core_ring:set_pending_resize(CState1, Orig).
+
+%% @private
+schedule_first_resize_transfer(smaller, {Idx,_}=IdxOwner, none, Resized) ->
+ %% partition no longer exists in new ring, use first successor
+ Target = hd(riak_core_ring:preflist(<<Idx:160/integer>>, Resized)),
+ riak_core_ring:schedule_resize_transfer(Resized, IdxOwner, Target);
+schedule_first_resize_transfer(_Type,{Idx, Owner}=IdxOwner, Owner, Resized) ->
+ %% partition is not being moved during resizing, use first predecessor
+ Target = hd(chash:predecessors(Idx-1, riak_core_ring:chash(Resized))),
+ riak_core_ring:schedule_resize_transfer(Resized, IdxOwner, Target);
+schedule_first_resize_transfer(_,{Idx, _Owner}=IdxOwner, NextOwner, Resized) ->
+ %% index is being moved, schedule this resize transfer first
+ riak_core_ring:schedule_resize_transfer(Resized, IdxOwner, {Idx, NextOwner}).
+
@jtuple
jtuple added a note

I believe we talked about adding some comments to the schedule_first_resize_transfer function here about why successor vs predecessor vs moving/non-moving and how shrinking vs expanding fits into this.

@jtuple jtuple commented on the diff
src/riak_core_console.erl
((8 lines not shown))
ok;
+ {true, _} ->
+ io:format("Ring is resizing. see riak-admin ring-status for transfer details.~n");
@jtuple
jtuple added a note

A shame we can't print more details inline here in the output rather than punting to another command. Something to consider changing in a future release. Works for now.

@jrwest
jrwest added a note

agreed. i originally played with printing the resize transfers but since we don't know all of them up front it was kind of funky. Printing that every partition will do a resize operation wasn't very helpful either. hoping further use will highlight what will be good to print here.

@jrwest
jrwest added a note

also to be fair, ring status could have more useful information too. Right now it's at the resize operation (next list) level, not the resize transfer level. more visibility w/o needing to riak attach will be something I want to focus on for the next release.

@jtuple jtuple commented on the diff
src/riak_core_handoff_manager.erl
@@ -240,13 +249,13 @@ handle_cast({status_update, ModSrcTgt, StatsUpdate}, State=#state{handoffs=HS})
{noreply, State#state{handoffs=HS2}}
end;
-handle_cast({send_handoff, Mod, {Src, Target}, ReqOrigin,
+handle_cast({send_handoff, Type, Mod, {Src, Target}, ReqOrigin,
@jtuple
jtuple added a note

As we discussed, changing the message signature here breaks repairs when there are mixed clusters (e.g. during an upgrade/downgrade). This is because xfer/3 does a direct cross-node cast that uses the signature from the old code. Could potentially fix by supporting a shim match here that just calls the other, but likely not worth it for 1.4.0.

@jtuple jtuple commented on the diff
src/riak_core_handoff_manager.erl
@@ -73,12 +74,20 @@ start_link() ->
init([]) ->
{ok, #state{excl=ordsets:new(), handoffs=[]}}.
-add_outbound(Module,Idx,Node,VnodePid) ->
+-spec add_outbound(ho_type(),atom(),integer(),term(),pid()) -> {ok, pid()} | {error, term()}.
+add_outbound(HOType,Module,Idx,Node,VnodePid) ->
+ add_outbound(HOType,Module,Idx,Idx,Node,VnodePid).
+
+-spec add_outbound(ho_type(),atom(),integer(),integer(),term(),pid()) ->
+ {ok, pid()} | {error, term()}.
+add_outbound(HOType,Module,SrcIdx,TargetIdx,Node,VnodePid) ->
case application:get_env(riak_core, disable_outbound_handoff) of
@jtuple
jtuple added a note

Changing the add_outbound function signature breaks the use in this eunit test. Not sure if that test even matters (it's conditionally guarded after all). But, figured I'd note the issue.

@jrwest
jrwest added a note

changed the test so if you add the define it compiles but the test fails anyways because it requires riak_core_handoff_receiver/sender_sup:

ring-resizing/riak_core $ ./rebar skip_deps=true eunit suite=riak_core_handoff_manager
==> riak_core (eunit)
src/riak_core_handoff_manager.erl:612: function add_outbound/4 undefined
ring-resizing/riak_core $ ./rebar skip_deps=true eunit suite=riak_core_handoff_manager
==> riak_core (eunit)
Compiled src/riak_core_handoff_manager.erl
riak_core_handoff_manager:603: handoff_test_...*skipped*
module 'riak_core_handoff_manager'
*unexpected termination of test process*
::{noproc,{gen_server,call,
                      [riak_core_handoff_receiver_sup,count_children,
                       infinity]}}
....
src/riak_core_ring.erl
((52 lines not shown))
+ %% this case, return undefined. The position may also be >= an optional N value for
+ %% the key, if this is true undefined is also returned
+ case check_invalid_future_index(OrigDist, NextCount, NValCheck) of
+ true -> undefined;
+ false ->
+ %% Determine the partition (head of preflist) that will own the key in the future ring
+ FuturePos = ((CHashInt div NextInc) + 1),
+ NextOwner = FuturePos * NextInc,
+
+ %% Determine the partition that the key should be transferred to (has same position
+ %% in future preflist as source partition does in current preflist)
+ RingTop = trunc(math:pow(2,160)-1),
+ case NextOwner + (NextInc * OrigDist) of
+ FutureIndex when FutureIndex >= RingTop -> FutureIndex - RingTop;
+ FutureIndex -> FutureIndex
+ end
@jtuple
jtuple added a note

Isn't this entire case statement the same as the more concise:

(NextOwner + (NextInc * OrigDist)) rem RingTop

No need to change or anything, just took me a second to realize it was just a simple wrap-around calculation.

@jrwest
jrwest added a note

it should be what you have above. I actually noticed that when I was reviewing the large ring work but forgot to change it. Went ahead and made the change since riak_core_ring_eqc:prop_future_index() still passes.

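For the record, a minimal sketch of the equivalence under discussion (a hypothetical helper, not code from the diff):

    wrap(X, RingTop) when X >= RingTop -> X - RingTop;
    wrap(X, _RingTop) -> X.

    %% wrap(X, RingTop) =:= X rem RingTop for any 0 =< X < 2 * RingTop,
    %% which covers the sums produced by the case statement above.
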
src/riak_core_ring.erl
((64 lines not shown))
+ case NextOwner + (NextInc * OrigDist) of
+ FutureIndex when FutureIndex >= RingTop -> FutureIndex - RingTop;
+ FutureIndex -> FutureIndex
+ end
+ end.
+
+check_invalid_future_index(OrigDist, NextCount, NValCheck) ->
+ OverRingSize = OrigDist >= NextCount,
+ OverNVal = case NValCheck of
+ undefined -> false;
+ _ -> OrigDist >= NValCheck
+ end,
+ OverRingSize orelse OverNVal.
+
+-spec is_future_index(chash:index(), integer(), integer(), chstate()) -> boolean().
+is_future_index(CHashKey, OrigIdx, TargetIdx, State) ->
@jtuple
jtuple added a note

Short edoc comment would be great here.

src/riak_core_ring.erl
@@ -701,6 +860,263 @@ pending_changes(State) ->
set_pending_changes(State, Transfers) ->
State?CHSTATE{next=Transfers}.
+%% @doc Given a ring, `Resizing', that has been resized (and presumably rebalanced)
+%% schedule a resize transition for `Orig'.
+-spec set_pending_resize(chstate(), chstate()) -> chstate().
+set_pending_resize(Resizing, Orig) ->
+ %% all existing indexes must transfer data when the ring is being resized
+ Next = [{Idx, Owner, '$resize', [], awaiting} ||
+ {Idx, Owner} <- riak_core_ring:all_owners(Orig)],
+ %% Resizing is assumed to have a modified chring, we need to put back
+ %% the original chring to not install the resized on pre-emptively. The
@jtuple
jtuple added a note

"resized (on) pre-emptively" -> "resized (one) pre-emptively"

src/riak_core_ring.erl
((16 lines not shown))
+ set_resized_ring(set_pending_changes(ResetRing, Next), FutureCHash).
+
+-spec maybe_abort_resize(chstate()) -> {boolean(), chstate()}.
+maybe_abort_resize(State0) ->
+ Resizing = is_resizing(State0),
+ PostResize = is_post_resize(State0),
+ PendingAbort = is_resize_aborted(State0),
+ case PendingAbort andalso Resizing andalso not PostResize of
+ true ->
+ State1 = State0?CHSTATE{next=[]},
+ State2 = clear_all_resize_transfers(State1),
+ State3 = remove_meta('$resized_ring_abort', State2),
+ {true, remove_meta('$resized_ring', State3)};
+ false ->
+ {false, State0}
+ end.
@jtuple
jtuple added a note

Random style note, but using State rather than State0 here would be great. I know we use State0 when we want to do minimal changes for diffs, but since this is all new code, there's really no minimal change benefit here.

@jrwest
jrwest added a note

will change. wasn't aware we have a preference. I can never decide so I tend to go back and forth depending on nearby code. Now its easy :).

@jtuple jtuple commented on the diff
src/riak_core_ring.erl
((39 lines not shown))
+ {integer(), term()},
+ integer() | {integer(), term()}) -> chstate().
+schedule_resize_transfer(State, Source, TargetIdx) when is_integer(TargetIdx) ->
+ TargetNode = index_owner(future_ring(State), TargetIdx),
+ schedule_resize_transfer(State, Source, {TargetIdx, TargetNode});
+schedule_resize_transfer(State, Source, Source) ->
+ State;
+schedule_resize_transfer(State, Source, Target) ->
+ Transfers = resize_transfers(State, Source),
+ case lists:keymember(Target, 1, Transfers) of
+ true -> State;
+ false ->
+ Transfers1 = lists:keystore(Target, 1, Transfers,
+ {Target, ordsets:new(), awaiting}),
+ set_resize_transfers(State, Source, Transfers1)
+ end.
@jtuple
jtuple added a note

My notes say we talked about commenting this function wrt to the conditional check-before-adding logic.

src/riak_core_ring.erl
((42 lines not shown))
+ TargetNode = index_owner(future_ring(State), TargetIdx),
+ schedule_resize_transfer(State, Source, {TargetIdx, TargetNode});
+schedule_resize_transfer(State, Source, Source) ->
+ State;
+schedule_resize_transfer(State, Source, Target) ->
+ Transfers = resize_transfers(State, Source),
+ case lists:keymember(Target, 1, Transfers) of
+ true -> State;
+ false ->
+ Transfers1 = lists:keystore(Target, 1, Transfers,
+ {Target, ordsets:new(), awaiting}),
+ set_resize_transfers(State, Source, Transfers1)
+ end.
+
+-spec reschedule_resize_transfers(chstate(), [integer()], term(), term()) -> chstate().
+reschedule_resize_transfers(State=?CHSTATE{next=Next}, Node, NewNode) ->
@jtuple
jtuple added a note

This entire reschedule_resize_transfers stuff is dense and confusing. We talked about it a bit and about adding some comments + possibly rewriting to make it cleaner. Especially the distinction between rescheduling requests originating FROM the node that is being replaced vs rescheduling transfers being sent TO the node being replaced.

@jrwest
jrwest added a note

while addressing this, noticed the spec above is for the totally wrong function (facepalm).

@jtuple jtuple commented on the diff
src/riak_core_ring.erl
((96 lines not shown))
+ NewTransfers = [{F(Target), ordsets:new(), awaiting} || {Target, _, _} <- Transfers],
+ set_resize_transfers(clear_resize_transfers(OldSource, State), NewSource, NewTransfers).
+
+%% @doc returns the first awaiting resize_transfer for a {SourceIdx, SourceNode}
+%% pair. If all transfers for the pair are complete, undefined is returned
+-spec awaiting_resize_transfer(chstate(), {integer(), term()}, atom()) ->
+ {integer(), term()} | undefined.
+awaiting_resize_transfer(State, Source, Mod) ->
+ ResizeTransfers = resize_transfers(State, Source),
+ Awaiting = [{Target, Mods, Status} || {Target, Mods, Status} <- ResizeTransfers,
+ Status =/= complete,
+ not ordsets:is_element(Mod, Mods)],
+ case Awaiting of
+ [] -> undefined;
+ [{Target, _, _} | _] -> Target
+ end.
@jtuple
jtuple added a note

No big deal for now, but in the future maybe add an early exit here. Ie. stop after the first awaiting transfer is found rather than iterating over the entire list and returning the first result. Doesn't really matter, but adding a PR comment is cheap. So, here ya go.

@jrwest
jrwest added a note

agreed. going to leave it for now since this is always going to be a small list (not ring size dependent and always much smaller than ring size) but have a note to change along w/ future work.

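A sketch of the early-exit variant being suggested (a hypothetical rewrite with the same types as the quoted code):

    awaiting_resize_transfer(State, Source, Mod) ->
        first_awaiting(resize_transfers(State, Source), Mod).

    first_awaiting([], _Mod) ->
        undefined;
    first_awaiting([{Target, Mods, Status} | Rest], Mod) ->
        case Status =/= complete andalso not ordsets:is_element(Mod, Mods) of
            true  -> Target;   %% stop at the first awaiting transfer
            false -> first_awaiting(Rest, Mod)
        end.
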
src/riak_core_ring.erl
((98 lines not shown))
+
+%% @doc returns the first awaiting resize_transfer for a {SourceIdx, SourceNode}
+%% pair. If all transfers for the pair are complete, undefined is returned
+-spec awaiting_resize_transfer(chstate(), {integer(), term()}, atom()) ->
+ {integer(), term()} | undefined.
+awaiting_resize_transfer(State, Source, Mod) ->
+ ResizeTransfers = resize_transfers(State, Source),
+ Awaiting = [{Target, Mods, Status} || {Target, Mods, Status} <- ResizeTransfers,
+ Status =/= complete,
+ not ordsets:is_element(Mod, Mods)],
+ case Awaiting of
+ [] -> undefined;
+ [{Target, _, _} | _] -> Target
+ end.
+
+%% @doc return the status of a resize_transfer for `Source' (a index-node pair). undefined
@jtuple
jtuple added a note

"(a) index" -> "(an) index"

src/riak_core_ring.erl
((108 lines not shown))
+ case Awaiting of
+ [] -> undefined;
+ [{Target, _, _} | _] -> Target
+ end.
+
+%% @doc return the status of a resize_transfer for `Source' (a index-node pair). undefined
+%% is returned if no such transfer is scheduled. complete is returned if the transfer
+%% is marked as such or `Mod' is contained in the completed modules set. awaiting is
+%% returned otherwise
+-spec resize_transfer_status(chstate(), {integer(), term()}, {integer(), term()}, atom()) ->
+ awaiting | complete | undefined.
+resize_transfer_status(State, Source, Target, Mod) ->
+ ResizeTransfers = resize_transfers(State, Source),
+ %% change to use keyfind
+ StatusMods = [{Status, Mods} || {T, Mods, Status} <- ResizeTransfers,
+ T =:= Target],
@jtuple
jtuple added a note

Reminder that you have a note here to change to using keyfind. Perhaps after 1.4?

@jrwest
jrwest added a note

tentatively changed it to a real TODO at least. will try to just fix it up after other changes but if not at least it's more than a straggling comment.

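For reference, the keyfind variant the TODO points at might look something like this (a sketch; note that unlike the list comprehension, keyfind silently takes the first match if the same target were ever scheduled twice):

    resize_transfer_status(State, Source, Target, Mod) ->
        case lists:keyfind(Target, 1, resize_transfers(State, Source)) of
            {Target, _Mods, complete} -> complete;
            {Target, Mods, awaiting} ->
                case ordsets:is_element(Mod, Mods) of
                    true  -> complete;
                    false -> awaiting
                end;
            false -> undefined
        end.
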
src/riak_core_ring.erl
((113 lines not shown))
+%% @doc return the status of a resize_transfer for `Source' (a index-node pair). undefined
+%% is returned if no such transfer is scheduled. complete is returned if the transfer
+%% is marked as such or `Mod' is contained in the completed modules set. awaiting is
+%% returned otherwise
+-spec resize_transfer_status(chstate(), {integer(), term()}, {integer(), term()}, atom()) ->
+ awaiting | complete | undefined.
+resize_transfer_status(State, Source, Target, Mod) ->
+ ResizeTransfers = resize_transfers(State, Source),
+ %% change to use keyfind
+ StatusMods = [{Status, Mods} || {T, Mods, Status} <- ResizeTransfers,
+ T =:= Target],
+ IsComplete = case StatusMods of
+ [{complete, _}] -> true;
+ [{awaiting, Mods}] -> ordsets:is_element(Mod, Mods);
+ _ -> undefined
+ end,
@jtuple
jtuple added a note

Are we confident this will always return a single-element list in normal operation? Any weird corner case?

@jrwest
jrwest added a note

should always be a single element b/c of the check that you mentioned needs a comment in schedule_resize_transfer [1]. Originally was going to say let's just change it to pull the head of the list, but actually I think I'd prefer it crash, because if we have multiple resize transfers scheduled for the same target we have a problem, and that may be way less subtle than some other funky things I can imagine happening. Also, that's where the keyfind comment originated from fwiw (not really a smart use of an LC here on my part imo)

[1] #301 (comment)

src/riak_core_ring.erl
((195 lines not shown))
+ ({_, _, _, _, complete}) -> false
+ end,
+ Next).
+
+-spec complete_resize_transfers(chstate(), {integer(),term()}, atom()) -> [{integer(),term()}].
+complete_resize_transfers(State, Source, Mod) ->
+ [Target || {Target, Mods, Status} <- resize_transfers(State, Source),
+ Status =:= complete orelse ordsets:is_element(Mod, Mods)].
+
+-spec deletion_complete(chstate(), integer(), atom()) -> chstate().
+deletion_complete(State, Idx, Mod) ->
+ transfer_complete(State, Idx, Mod).
+
+resize_transfers(State, Source) ->
+ {ok, Transfers} = get_meta({resize, Source}, [], State),
+ Transfers.
@jtuple
jtuple added a note

Add a -spec to resize_transfers/2 to make dialyzer more useful, since metadata queries always return term().

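Something like the following might do, assuming the tuple shape visible in the quoted code (the element type is written inline since no named type is shown here):

    -spec resize_transfers(chstate(), {integer(), term()}) ->
              [{{integer(), term()}, ordsets:ordset(module()), awaiting | complete}].
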
src/riak_core_ring.erl
((225 lines not shown))
+ {ok, CHRing} -> {ok, CHRing};
+ _ -> undefined
+ end.
+
+-spec set_resized_ring(chstate(), chash:chash()) -> chstate().
+set_resized_ring(State, FutureCHash) ->
+ update_meta('$resized_ring', FutureCHash, State).
+
+cleanup_after_resize(State) ->
+ update_meta('$resized_ring', '$cleanup', State).
+
+
+-spec vnode_type(chstate(),integer()) -> primary |
+ {fallback, term()} |
+ new_primary |
+ resized_primary.
@jtuple
jtuple added a note

Think the spec might be off; don't think new_primary is something that we can ever return here.

@jrwest
jrwest added a note

heh, wrote the spec before the function then realized new_primary made no sense and forgot to remove. will do so.

@jrwest
jrwest added a note

ah wait, i see the problem. new_primary should be future_primary. sorry, commented before looking at the function.

@jtuple jtuple commented on the diff
src/riak_core_ring_handler.erl
@@ -36,6 +36,8 @@ init([]) ->
handle_event({ring_update, Ring}, State) ->
%% Make sure all vnodes are started...
ensure_vnodes_started(Ring),
+ maybe_start_vnode_proxies(Ring),
+ maybe_stop_vnode_proxies(Ring),
@jtuple
jtuple added a note

Any implications for normal Riak operation (ie. users that never use resizing) by adding this start/stop proxy code? There's nothing I could find after looking things over, but figured it was worth discussing.

@jrwest
jrwest added a note

I couldn't think of anything; it should basically be some extra useless work that has no effect. I don't love this, but I figure it will be better once it's merged with the changes in #319.

src/riak_core_vnode.erl
@@ -103,8 +105,9 @@ behaviour_info(_Other) ->
mod :: module(),
modstate :: term(),
forward :: node(),
@jtuple
jtuple added a note

forward can now be a list as well

src/riak_core_vnode.erl
((6 lines not shown))
handoff_pid :: pid(),
+ handoff_type :: hinted_handoff | ownership_transfer | resize_transfer,
@jtuple
jtuple added a note

So, there's no repair handoff type. Looking here, it looks like repair is flagged as an ownership_transfer. I guess I should test this. In any case, do we care about repair being a type?

@jrwest
jrwest added a note

hrm, i was under the impression this was not the code path taken by repair, which is why I don't have the type for the vnode state but do have it here. iirc, since repair operates on primaries, it just sends the fold request and the primaries continue to call handle_command instead of doing the handle_handoff_command and forwarding dance. I remember having a discussion w/ @rzezeski about this briefly a while back, but I could have totally misunderstood what he told me; need to look at the code as well.

@jtuple jtuple commented on the diff
src/riak_core_vnode.erl
@@ -241,23 +250,53 @@ continue(State, NewModState) ->
%% new owner, or simply because the ring has yet to converge on the new owner.
%% In the forwarding state, all vnode commands and coverage commands are
%% forwarded to the new owner for processing.
+%%
+%% The above becomes a bit more complicated when the vnode takes part in resizing
+%% the ring, since several transfers with a single vnode as the source are necessary
+%% to complete the operation. A vnode will remain in the handoff state, for, potentially,
+%% more than one transfer and may be in the handoff state despite there being no active
+%% transfers with this vnode as the source. During this time requests that can be forwarded
+%% to a partition for which the transfer has already completed, are forwarded. All other
+%% requests are passed to handle_handoff_command.
+forward_or_vnode_command(Sender, Request, State=#state{forward=Forward,
+ mod=Mod,
+ index=Index}) ->
+ Resizing = is_list(Forward),
@jtuple
jtuple added a note

The majority of forward_or_vnode_command/3 is more or less the same as the code in active. Compare lines 264-292 vs lines 414-437. Any way to refactor out and combine this logic into one place? Bonus points for it being a pure function that we can easily trace on, run unit/EQC tests against, etc.

src/riak_core_vnode.erl
((19 lines not shown))
+ false ->
+ RequestHash = undefined
+ end,
+ case {Forward, RequestHash} of
+ %% typical vnode operation, no forwarding set, handle request locally
+ {undefined, _} -> vnode_command(Sender, Request, State);
+ %% implicit forwarding after ownership_transfer/hinted_handoff
+ {F, _} when not is_list(F) -> vnode_forward({Index, Forward}, Sender, Request, State);
+ %% during resize we can't forward a request w/o request hash, always handle locally
+ {_, undefined} -> vnode_command(Sender, Request, State);
+ %% possible forwarding during ring resizing
+ {_, _} ->
+ %% during ring resizing if we have completed a transfer to the index that will
+ %% handle request in future ring we forward to it. Otherwise we delegate
+ %% to the local vnode like other requests during handoff
+ %% TODO: how to handle all resize ops completed case (need an extra check here)
@jtuple
jtuple added a note

We've talked about the above TODO and punting most likely. Just noting it officially in this PR. We did mention possibly modifying the TODO to be more clear about the specific problem: eg. N-value change during resize.

@jtuple jtuple commented on the diff
src/riak_core_vnode.erl
((44 lines not shown))
-vnode_command(Sender, Request, State=#state{index=Index,
- mod=Mod,
+vnode_command(_Sender, _Request, State=#state{modstate={deleted,_}}) ->
+ continue(State);
@jtuple
jtuple added a note

This "do nothing when deleted" case is new. Why do we need it? Does it impact anything else in Riak?

@jrwest
jrwest added a note

This just prevents some annoying log messages from occurring during the cleanup stage of resizing when vnodes are being shut down. This commit has more details (the messages mentioned are things like merge_callback that fired while the shutdown messages were being sent). If you remove it and run the riak_test it should still pass but you may see some (harmless) riak_kv_vnode crashes because {deleted, ModState} is passed to handle_command.

src/riak_core_vnode.erl
@@ -333,6 +385,13 @@ vnode_handoff_command(Sender, Request, State=#state{index=Index,
{stop, Reason, State#state{modstate=NewModState}}
end.
+vnode_forward(ForwardTo, Sender, Request, State) ->
+ lager:debug("Forwarding {~p,~p} -> ~p~n",
+ [State#state.index, node(), ForwardTo]),
+ riak_core_vnode_master:command(ForwardTo, Request, Sender,
+ riak_core_vnode_master:reg_name(State#state.mod)),
+ continue(State).
@jtuple
jtuple added a note

Since vnode_forward/4 now provides a single point for all forwarding -- traditional explicit, traditional implicit, and resize -- it seems like it would make sense to have a "forward type" parameter passed into this function that we could trace on if needed, as well as add to the lager:debug output. Thoughts?

@jrwest
jrwest added a note

agreed. i have to look at the call sites but i think it would be easy to add. will do.

@jtuple jtuple commented on the diff
src/riak_core_vnode.erl
@@ -341,14 +400,57 @@ active(?COVERAGE_REQ{keyspaces=KeySpaces,
sender=Sender}, State) ->
%% Coverage request handled in handoff and non-handoff. Will be forwarded if set.
vnode_coverage(Sender, Request, KeySpaces, State);
-active(?VNODE_REQ{sender=Sender, request=Request},
- State=#state{handoff_node=HN}) when HN =:= none ->
+active(?VNODE_REQ{sender=Sender, request={resize_forward, Request}}, State) ->
vnode_command(Sender, Request, State);
@jtuple
jtuple added a note

So, what's the purpose behind the resize_forward prefix rather than just forwarding the request directly (like non-resize forwarding does)? Is it to prevent multiple forwarding hops, eg: forward A -> forward B -> forward C? Is there an existing bug in normal Riak we should be concerned about (since it still just forwards requests w/o a prefix)?

@jrwest
jrwest added a note

yep it prevents that case exactly. there is not an existing bug in riak because only primaries forward and we only forward to non-primaries. With resizing, especially shrinking, we commonly forward from primary to primary.

@jrwest
jrwest added a note

but...perhaps adding something similar for normal forwarding could be useful for tracing?

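To illustrate the mechanism: wrapping the request marks it as having been forwarded once, so the receiving vnode unwraps it and handles it locally instead of re-entering the forwarding logic. A hedged sketch of the two sides (the send-side function is inferred from the clause quoted above and hypothetically named, not taken from the diff):

    %% receiving side (as in the diff above): unwrap and handle locally
    active(?VNODE_REQ{sender=Sender, request={resize_forward, Request}}, State) ->
        vnode_command(Sender, Request, State);

    %% sending side (inferred): wrap before casting to the target vnode,
    %% so the hop count can never exceed one
    forward_request(ForwardTo, Sender, Request, State) ->   %% hypothetical name
        riak_core_vnode_master:command(ForwardTo, {resize_forward, Request}, Sender,
                                       riak_core_vnode_master:reg_name(State#state.mod)).
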
@jtuple jtuple commented on the diff
src/riak_core_vnode.erl
@@ -356,14 +458,27 @@ active({send_manager_event, Event}, State) ->
State2 = start_manager_event_timer(Event, State),
continue(State2);
active({trigger_handoff, TargetNode}, State) ->
- maybe_handoff(State, TargetNode);
+ active({trigger_handoff, State#state.index, TargetNode}, State);
+active({trigger_handoff, TargetIdx, TargetNode}, State) ->
+ maybe_handoff(TargetIdx, TargetNode, State);
+active(trigger_delete, State=#state{mod=Mod,modstate=ModState,index=Idx}) ->
+ case mark_delete_complete(Idx, Mod) of
+ {ok, _NewRing} ->
+ {ok, NewModState} = Mod:delete(ModState),
+ lager:debug("~p ~p vnode deleted", [Idx, Mod]);
+ _ -> NewModState = ModState
+ end,
+ maybe_shutdown_pool(State),
+ riak_core_vnode_manager:unregister_vnode(Idx, Mod),
+ continue(State#state{modstate={deleted,NewModState}});
@jtuple
jtuple added a note

We talked about this a bit. Why the conditional delete but not conditional shutdown? Is this needed? Under what conditions? Do we care? Ie. not changing code is safer in case we forgot about some weird corner case that caused you to implement this in the first place.

@jrwest
jrwest added a note

so looking at the code again, i believe this was written to be defensive against "unpredictable" corner cases. The Mod:delete is not run if either the operation has already been marked as complete (in which case running it again would be fine) or the vnode type is not what we expected (in which case we may have hit some weird case where we shouldn't delete the data now). Initially i thought to get rid of the case clause, but I now think it should remain. I think it's still safe to shut down the pool and such because we will unregister the vnode and it can be restarted if needed, but at least the data is still there. alternatively, could differentiate between the wrong-vnode-type case and the already-marked-complete case to be extra clear/safe.

@jtuple jtuple commented on the diff
src/riak_core_vnode.erl
((22 lines not shown))
+ [].
+
+mark_delete_complete(Idx, Mod) ->
+ Result = riak_core_ring_manager:ring_trans(
+ fun(Ring, _) ->
+ Type = riak_core_ring:vnode_type(Ring, Idx),
+ {_, Next, Status} = riak_core_ring:next_owner(Ring, Idx),
+ case {Type, Next, Status} of
+ {resized_primary, '$delete', awaiting} ->
+ Ring3 = riak_core_ring:deletion_complete(Ring, Idx, Mod),
+ %% Use local ring optimization like mark_handoff_complete
+ {set_only, Ring3};
+ {{fallback, _}, '$delete', awaiting} ->
+ Ring3 = riak_core_ring:deletion_complete(Ring, Idx, Mod),
+ %% Use local ring optimization like mark_handoff_complete
+ {set_only, Ring3};
@jtuple
jtuple added a note

Curious why fallbacks would have data to delete. I imagine there's a good reason; might as well throw a comment in. Annoying that we have duplicate code in both case clauses, but there are plenty of bigger fish to fry.

@jrwest
jrwest added a note

this comment in the vnode manager explains why, i can copy it over. agree the duplication is unfortunate, perhaps more an issue with the return values of riak_core_ring:vnode_type which make it difficult to pattern match on multiple at once if fallback is one of them?

src/riak_core_vnode_manager.erl
((6 lines not shown))
Ready = riak_core_ring:ring_ready(Ring),
- case determine_handoff_target(Ready, Owner, NextOwner) of
+ IsResizing = riak_core_ring:is_resizing(Ring),
+ Me = node(),
+ case {Type, NextOwner, Ready, IsResizing} of
@jtuple
jtuple added a note

Suggest breaking out this case statement into a function similar to what determine_handoff used to be. Makes it possible to trace if needed + easy to write eunit tests against if we ever care.

@jrwest
jrwest added a note

i actually took determine_handoff out because I found it weird to follow the code at the time. in hindsight i think you are correct. will change.

@jtuple

Overall comment about using try/catch in tons of places, many of which catch directly on error:badarg. Think the approach works for now, but it would be great if we could change this in a future revision. I'm mainly worried about catching errors from things other than what we expect (esp. as the code evolves), as well as errors that don't match the specific error:badarg type crashing.

@jrwest

yea not a fan of the try/catch either. the reasoning behind it was to change as little of the existing semantics as possible. In a future version I'd like to move the error handling into the respective functions.

jrwest added some commits
@jrwest jrwest cleanup claimant code around ring resizing
addresses review comments:
  * removes unneeded clauses in filter_changes_pred/4
  * add comments in several places to clarify behaviour
91bb0d0
@jrwest jrwest cleanup riak_core_ring code around ring resizing
address review comments:
  * add comments and typespecs, fix typos
  * refactor reschedule_resize_transfers/3
  * refactor resize_transfer_status/4
  * simplify future_index/4 a bit
ea6df34
@jrwest jrwest fix guarded handoff_manager test that no longer compiles
NOTE: the test still fails, but that is unrelated to the interface changes
that broke compilation
520a8ab
@jrwest jrwest bring back determine_handoff_target in vnode mgr 3e02ca9
@jrwest jrwest clean ring resizing code in riak_core_vnode
address review comments:
  * fix forward typespec in state record
  * refactor small similarity btwn active & forward_or_vnode_command
  * pass forwarding type to vnode_forward for easier tracing
14024f1
@jrwest jrwest add single node restriction to ring resizing eb16f2b
@jrwest jrwest update ring resizing doc 2b627b7
@jtuple

Latest commits address my review concerns. Re-ran riak_test, manually played with code both before and after latest commits, etc. Going to +1 this for merge for 1.4 with the knowledge that we're shipping this feature "dark".

Overall, resizing works as intended. Both scaling up and down in ring size. Tested various resize operations while loading a cluster, shutting nodes down, marking nodes down, etc. All worked without any major issue.

There are a few minor issues that would be great to fix for a future release or at least are useful to note for the time being:

  1. If a node is offline and marked as down, Riak will keep trying to initiate resize transfers to that node and immediately fail, spamming the logs. For normal Riak handoff, marking a node as down forces the claimant to remove transfers scheduled to that node, so we never attempt to hand off to it; the behavior is therefore different for resizing. This is a fact of life given that resize doesn't use the claimant for scheduling transfers after the first. However, we can improve the situation in the future by consulting the node watcher and not attempting handoff to nodes that are offline (see the sketch after this list).

  2. A known limitation of this resize work is that we only delete entire vnodes. If a resize causes some (but not all) data owned by a vnode to be rehomed to a new vnode, we don't delete that data on the original vnode. This was a limitation we decided to allow for this iteration of ring resizing. However, a related problem is that the AAE trees for such vnodes will still include hashes for the old data. Exchanges with replicas that correctly don't have that data will trigger repairs for keys that shouldn't be owned. These repairs won't actually trigger new writes, and they will cause the AAE tree to realize it shouldn't own said keys; however, this will trigger additional, unnecessary read repairs for a period of time. An alternative approach is to clear all AAE trees after a resize, but that would require rebuilding all trees -- potentially more expensive than the extra read repairs. A smarter approach that actually cleans up moved data, both in the K/V backend and the AAE hashtree, should be a top priority for the next major release.

  3. Resizing with Bitcask while there are incoming writes is rather painful -- more so if performing other operations that involve folds (such as AAE tree building or full-sync replication). When there are incoming writes, Bitcask only allows a single fold to occur at a time. For normal Riak, there's only ever a single outbound handoff for a given vnode. With resize, however, it's possible for a vnode to have multiple outbound handoffs running concurrently, transferring different portions of its keyspace to different target vnodes. Bitcask will allow only one to happen at a time; all others will block. Resize therefore takes much longer to complete, while also (in practice) preventing any other folds from occurring during the entire resize operation.

  4. I'd love to see better insight/observability into how far along we are on a resize operation. member_status is completely useless, and ring_status provides only coarse-grained details (e.g. it gives no idea how many sub-transfers are outstanding for each awaiting index shown).

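As a sketch of the improvement suggested in (1), scheduling could consult the node watcher and skip unreachable targets (a hypothetical helper; riak_core_node_watcher:nodes/1 returns the nodes on which a given service is up):

    %% Only schedule a resize transfer if the target node is reachable.
    %% Service is the application's registered service, e.g. riak_kv.
    maybe_schedule_resize_transfer(State, Source, {TargetIdx, TargetNode}, Service) ->
        case lists:member(TargetNode, riak_core_node_watcher:nodes(Service)) of
            true  -> riak_core_ring:schedule_resize_transfer(State, Source,
                                                             {TargetIdx, TargetNode});
            false -> State   %% skip for now; retry on a later ring event
        end.
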
@jrwest jrwest referenced this pull request from a commit
@jrwest jrwest Merge #301 into staging-1.4
some changes were made as part of this merge to address conflicting
changes not necessarily found by git:

* the (awful) use of try/catch in parts of the ring resizing code was
modified because large ring changed the expected error that is caught
* riak_core_vnode_manager:check_precomputed_ring/3 was modified to take
5 arguments to handle the resizing case specially, which it was not
previously aware of

Conflicts:
	src/riak_core_handoff_manager.erl
	src/riak_core_handoff_sender.erl
	src/riak_core_ring_handler.erl
	src/riak_core_vnode.erl
	src/riak_core_vnode_manager.erl
bbfe069
@jrwest

merged w/ #329

@jrwest jrwest closed this
@jrwest jrwest referenced this pull request from a commit
@jrwest jrwest remove legacy forwarding code that snuck in as part of 632af2b
Handling of forwarding code was changed in #301. Changes made as part
of the commit mentioned above were most likely made prior to that PR's merge.

Remove the forwarding code as it is redundant (the decision to forward
is handled further up the call stack). It caused vnodes to crash during
resizing, however, since the check was not as comprehensive as its
caller's. The vnode would crash when attempting to forward to a
partition identified by `{Index, [{Index, Owner}]}`, which is invalid.
77db027
@seancribbs seancribbs deleted the jrw-dynamic-ring branch