False/bogus/needless sibling creation during handoff #1046

slfritchie · 2014-11-17T15:30:02Z

First, many thanks to @shino for being tenacious and putting together the necessary evil. I would've floundered without it.

Prerequisite: Set up Riak 1.4.x using Shino's recipe at https://gist.github.com/shino/f30eb66d8a53b8d71224 ... I used Riak 1.4.10.

My annotated log messages discussing the problem are at:
https://gist.github.com/slfritchie/442e09035c70c5ab6240 ... the first put is boring. The next two are intentionally simulating a read-update-put cycle which is written correctly but is 100% unlucky with regard to concurrent operation interleaving. The 2nd put is the winner and thus is boring.

The 3rd put is exciting because it causes the vector clock to be coordinated twice, as the vnode level and not at the FSM level. The put's 1st coordinator vnode is the vnode that is also handing off, so the vnode (riak_kv and riak_core interacting here, alas) forwards the op to the handoff destination vnode. The put op still has the [coord] options list, so the handoff receiver also acts as a coordinator. As far as Shino and I can tell, and probably @russelldb also, this 2nd coord put is bad. It creates an a new actor ID in the vclock (call it ID_c2), and all subsequent vnode ops that are forwarded to the vnode, by a single, sequential, correctly-written client, never have the ID_c2 actor in the vclock because the get ops are never sent to the handoff destination vnode, and the handoff destination vnode is the only one that has a record of the ID_c2 actor.

Thus each single, sequential write to this key will create a new sibling and will continue to create siblings until handoff is finished.

In an ideal world, the coord option would be removed from an op that's forwarded. Unfortunately, that's impossible with the current API between riak_core_vnode and riak_kv_vnode:handle_handoff_command(), specifically the {forward, ...} return tuple.

Russell tells me that the forwarded object should be the result of the 1st coordinator put, so it's quite likely that the current forward-the-original-object-without-modification is wrong for a second reason, and we haven't noticed yet (but future CRDT could be tripped up by this error).

Basho folk: see also the Eng chat room for the morning of 2014-11-17 Monday US time.

The text was updated successfully, but these errors were encountered:

jonmeredith · 2014-11-17T16:03:10Z

Could the handle_handoff_command on the handoff node just strip the coord from the put request?

slfritchie · 2014-11-17T16:05:35Z

@jonmeredith That code is twisted. I spent a long time before I realized the following thingies below, from internal chat.

AFAICT, in that step 2 above, the local vnode that's the coordinator vnode and frontier'izing and all that good stuff *and* it happens to be doing handoff but the handoff activity has nothing absolutely nothing to do with the BKey being worked on right now ... , riak_core_vnode:active() calls riak_core_vnode:handle_handoff_command(), which then calls riak_kv_vnode:handle_handoff_command(), which then returns a {forward, State} thingie, then core forwards the exact same op to the handoff destination.

and

I'm also irked that the local vnode put code path ends up calling a function called riak_kv_vnode:handle_handoff_command() when there hasn't been any handoff.

russelldb · 2014-11-17T16:06:15Z

I think it needs to do more than that. But it needs to modify the request, which means a small change to riak_core to allow that. I’m on with it.

Just change https://github.com/basho/riak_core/blob/1.4/src/riak_core_vnode.erl#L369 to allow either {forward, NewModState} or {forward, ModdedRequest, NewModState}. Backwards compatible too.

On 17 Nov 2014, at 16:03, Jon Meredith notifications@github.com wrote:

Could the handle_handoff_command on the handoff node just strip the coord from the put request?

—
Reply to this email directly or view it on GitHub.

slfritchie · 2014-11-17T16:06:17Z

Er, sorry ... that was intended to show that handle_handoff_command() doesn't just do stuff on a handoff receiver's execution path.

slfritchie · 2014-11-17T16:12:49Z

FWIW, I built Riak 1.3.2 from source, using the app.config patches from Shino's gist.

The sibling explosion is later, but it still happens. Instead of 10 seconds delay in visit_item() as he originally suggests, perhaps using 1 second could get the transfers to happen a bit more quickly and reveal the extra siblings?

A list of something like [1,2] is good. A list of [1,2,15,16,17] is bad. If using sh -c 'while [ 1 ] ; do env ERL_LIBS=deps escript ./handoff_and_siblings.erl ; /bin/echo -n . ; done' to be punishing, then (real result) [1,2,110,125] is also quite bad. 8-)

{ok, C} = riak:local_client().
lists:usort([begin K = list_to_binary("k-" ++ integer_to_list(X)), length(riak_object:get_contents(element(2,C:get(<<"b">>, K)))) end || X <- lists:seq(1,100) ]).

jonmeredith · 2014-11-17T16:16:16Z

At https://github.com/basho/riak_kv/blob/1.4/src/riak_kv_vnode.erl#L872 can
we do something like this? Or am I misunderstanding the problem?

handle_handoff_command(Req=?KV_PUT_REQ{}, Sender, State) ->
Req1 = strip_coord(Req),
{noreply, NewState} = handle_command(Req1, Sender, State),
{forward, NewState};

strip_coord(?KV_PUT_REQ{options= Options} = Req) ->
Req?KV_PUT_REQ{options = proplists:delete(coord, Options)}.

On Mon, Nov 17, 2014 at 9:06 AM, Russell Brown notifications@github.com
wrote:

I think it needs to do more than that. But it needs to modify the request,
which means a small change to riak_core to allow that. I’m on with it.

Just change
https://github.com/basho/riak_core/blob/1.4/src/riak_core_vnode.erl#L369
to allow either {forward, NewModState} or {forward, ModdedRequest, NewModState}. Backwards compatible too.

On 17 Nov 2014, at 16:03, Jon Meredith notifications@github.com wrote:

Could the handle_handoff_command on the handoff node just strip the
coord from the put request?

—
Reply to this email directly or view it on GitHub.

—
Reply to this email directly or view it on GitHub
#1046 (comment).

Jon Meredith
Chief Architect
Basho Technologies, Inc.
jmeredith@basho.com

slfritchie · 2014-11-17T16:17:11Z

Can't do that -- riak_kv_vnode:handle_handoff_command() is called in the the 1st coordinating vnode's execution path. Madness!

russelldb · 2014-11-17T16:19:25Z

Since vnode1 is handing off, handle_handoff_command is called. This makes sense to me.
However, the return from vnode1 of {forward, NewModState} just tells riak_core_vnode to forward the original request on.

Removing coord is not enough, anyway. The request is a put request, you need to generate a frontier on it. It must be coordinated somewhere.

jonmeredith · 2014-11-17T16:36:48Z

Thanks for the clarification, adding support for returning a {forward, NewReq, NewModState} definitely seems like the answer.

kellymclaughlin · 2014-11-17T16:45:10Z

Wow, nice work on this diagnosis @shino, @slfritchie, and @russelldb!

shino · 2014-11-18T02:13:25Z

I want to ask one questions for necessary condition.
Is a pair of interleaved read-modify-writes necessary to generate siblings at forwarded (target) vnode?

russelldb · 2014-11-18T09:36:55Z

@shino I don't think you need any interleaving.

Node D has been added to the system and claims partition X from node A
Node A is now "handing off" partition X
Client reads key K
Node D's partition X will not but contacted for the read until handoff is over
Client modifies K and sends to Riak
Node A vnode X coordinates the write (stores clock [{ax, 1}]
Node A vnode X forwards the write to Node D vnode X
Node D, X coordinates the write (stores clock [{dx, 1}])
Client reads key K
Node D's partition X will not but contacted for the read until handoff is over
Client modifies and writes K
Node A, X overwrites local value with clock [{ax, 2}]
Node A, X forwards to Node D, X
Node D, X coordindates write.
Incoming clock is [{ax, 1}], local clock is [{dx, 1}], result? 2 sibling values and clock [{ax,1}, {dx, 2}]

Now repeat, D,X again handles incoming [{ax, 2}] which conflicts with [{ax,1}, {dx, 2}], another sibling, (now 3) and so on for each new write to A,X that is forwarded.

The client never sees the [{dx, n}] entry, so it's update never dominate them.

Does it sound right to you?

The fix works by

coordinating on Node A, X first
Modifying the request forwarded to include the result of A,X write, and strip coord from options, so that D, X only merges/stores the received object as a replica.

shino · 2014-11-18T12:27:40Z

Does it sound right to you?

Perfect explanation, thank you!

andrewjstone · 2014-11-20T04:51:16Z

Finally took the time to read through this issue. Great work Scott, Russell
and Shino!

On Tue, Nov 18, 2014 at 7:27 AM, Shunichi Shinohara <
notifications@github.com> wrote:

Does it sound right to you?

Perfect explanation, thank you!

—
Reply to this email directly or view it on GitHub
#1046 (comment).

slfritchie · 2014-11-20T05:00:12Z

Fixed by #1047, closing.

…s during handoff Forward port of this issue from 1.4 to 2.0: basho/riak_kv#1046 Original fix is: 76cbdc2 There is a matching PR for riak_kv

requests during handoff Forward port of this issue from 1.4 to 2.0: #1046 Original fix is: d3c8c4b There is a matching PR for riak_core

c.f. basho/riak_kv#1047 basho/riak_kv#1046

russelldb mentioned this issue Nov 17, 2014

Allow the vnode to change/provide the request to be forwarded basho/riak_core#651

Merged

slfritchie mentioned this issue Nov 20, 2014

Do not forward coordinating put requests as this leads to siblings #1047

Merged

slfritchie closed this as completed Nov 20, 2014

shino mentioned this issue Nov 21, 2014

RiakCS 1.5.2 shows sibling explosion while doing rapid updates during join/leave operations basho/riak_cs#1005

Closed

gordonguthrie pushed a commit that referenced this issue Dec 15, 2014

Fixes sibling explosion bug caused by forwarding coordination

a0fb2bf

requests during handoff Forward port of this issue from 1.4 to 2.0: #1046 Original fix is: d3c8c4b There is a matching PR for riak_core

This was referenced Dec 15, 2014

Fixes sibling explosion bug caused by forwarding coordination requests d... basho/riak_core#672

Merged

Fixes sibling explosion bug caused by forwarding coordination #1062

Merged

seancribbs added a commit to basho/riak_test that referenced this issue Dec 30, 2014

Fix intercepts to work with new do_put return value.

a90a988

c.f. basho/riak_kv#1047 basho/riak_kv#1046

seancribbs mentioned this issue Dec 30, 2014

Fix intercepts to work with new do_put return value. basho/riak_test#721

Merged

seancribbs added a commit to basho/riak_test that referenced this issue Jan 8, 2015

Fix intercepts to work with new do_put return value.

a4bf8b2

c.f. basho/riak_kv#1047 basho/riak_kv#1046

seancribbs added a commit to basho/riak_test that referenced this issue Feb 12, 2015

Fix intercepts to work with new do_put return value.

bd09a03

c.f. basho/riak_kv#1047 basho/riak_kv#1046

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

False/bogus/needless sibling creation during handoff #1046

False/bogus/needless sibling creation during handoff #1046

slfritchie commented Nov 17, 2014

jonmeredith commented Nov 17, 2014

slfritchie commented Nov 17, 2014

russelldb commented Nov 17, 2014

slfritchie commented Nov 17, 2014

slfritchie commented Nov 17, 2014

jonmeredith commented Nov 17, 2014

slfritchie commented Nov 17, 2014

russelldb commented Nov 17, 2014

jonmeredith commented Nov 17, 2014

kellymclaughlin commented Nov 17, 2014

shino commented Nov 18, 2014

russelldb commented Nov 18, 2014

shino commented Nov 18, 2014

andrewjstone commented Nov 20, 2014

slfritchie commented Nov 20, 2014

False/bogus/needless sibling creation during handoff #1046

False/bogus/needless sibling creation during handoff #1046

Comments

slfritchie commented Nov 17, 2014

jonmeredith commented Nov 17, 2014

slfritchie commented Nov 17, 2014

russelldb commented Nov 17, 2014

slfritchie commented Nov 17, 2014

slfritchie commented Nov 17, 2014

jonmeredith commented Nov 17, 2014

slfritchie commented Nov 17, 2014

russelldb commented Nov 17, 2014

jonmeredith commented Nov 17, 2014

kellymclaughlin commented Nov 17, 2014

shino commented Nov 18, 2014

russelldb commented Nov 18, 2014

shino commented Nov 18, 2014

andrewjstone commented Nov 20, 2014

slfritchie commented Nov 20, 2014