Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

False/bogus/needless sibling creation during handoff #1046

Closed
slfritchie opened this issue Nov 17, 2014 · 15 comments
Closed

False/bogus/needless sibling creation during handoff #1046

slfritchie opened this issue Nov 17, 2014 · 15 comments

Comments

@slfritchie
Copy link
Contributor

First, many thanks to @shino for being tenacious and putting together the necessary evil. I would've floundered without it.

Prerequisite: Set up Riak 1.4.x using Shino's recipe at https://gist.github.com/shino/f30eb66d8a53b8d71224 ... I used Riak 1.4.10.

My annotated log messages discussing the problem are at:
https://gist.github.com/slfritchie/442e09035c70c5ab6240 ... the first put is boring. The next two are intentionally simulating a read-update-put cycle which is written correctly but is 100% unlucky with regard to concurrent operation interleaving. The 2nd put is the winner and thus is boring.

The 3rd put is exciting because it causes the vector clock to be coordinated twice, as the vnode level and not at the FSM level. The put's 1st coordinator vnode is the vnode that is also handing off, so the vnode (riak_kv and riak_core interacting here, alas) forwards the op to the handoff destination vnode. The put op still has the [coord] options list, so the handoff receiver also acts as a coordinator. As far as Shino and I can tell, and probably @russelldb also, this 2nd coord put is bad. It creates an a new actor ID in the vclock (call it ID_c2), and all subsequent vnode ops that are forwarded to the vnode, by a single, sequential, correctly-written client, never have the ID_c2 actor in the vclock because the get ops are never sent to the handoff destination vnode, and the handoff destination vnode is the only one that has a record of the ID_c2 actor.

Thus each single, sequential write to this key will create a new sibling and will continue to create siblings until handoff is finished.

In an ideal world, the coord option would be removed from an op that's forwarded. Unfortunately, that's impossible with the current API between riak_core_vnode and riak_kv_vnode:handle_handoff_command(), specifically the {forward, ...} return tuple.

Russell tells me that the forwarded object should be the result of the 1st coordinator put, so it's quite likely that the current forward-the-original-object-without-modification is wrong for a second reason, and we haven't noticed yet (but future CRDT could be tripped up by this error).

Basho folk: see also the Eng chat room for the morning of 2014-11-17 Monday US time.

@jonmeredith
Copy link
Contributor

Could the handle_handoff_command on the handoff node just strip the coord from the put request?

@slfritchie
Copy link
Contributor Author

@jonmeredith That code is twisted. I spent a long time before I realized the following thingies below, from internal chat.

AFAICT, in that step 2 above, the local vnode that's the coordinator vnode and frontier'izing and all that good stuff *and* it happens to be doing handoff but the handoff activity has nothing absolutely nothing to do with the BKey being worked on right now ... , riak_core_vnode:active() calls riak_core_vnode:handle_handoff_command(), which then calls riak_kv_vnode:handle_handoff_command(), which then returns a {forward, State} thingie, then core forwards the exact same op to the handoff destination.

and

I'm also irked that the local vnode put code path ends up calling a function called riak_kv_vnode:handle_handoff_command() when there hasn't been any handoff.

@russelldb
Copy link
Member

I think it needs to do more than that. But it needs to modify the request, which means a small change to riak_core to allow that. I’m on with it.

Just change https://github.com/basho/riak_core/blob/1.4/src/riak_core_vnode.erl#L369 to allow either {forward, NewModState} or {forward, ModdedRequest, NewModState}. Backwards compatible too.

On 17 Nov 2014, at 16:03, Jon Meredith notifications@github.com wrote:

Could the handle_handoff_command on the handoff node just strip the coord from the put request?


Reply to this email directly or view it on GitHub.

@slfritchie
Copy link
Contributor Author

Er, sorry ... that was intended to show that handle_handoff_command() doesn't just do stuff on a handoff receiver's execution path.

@slfritchie
Copy link
Contributor Author

FWIW, I built Riak 1.3.2 from source, using the app.config patches from Shino's gist.

The sibling explosion is later, but it still happens. Instead of 10 seconds delay in visit_item() as he originally suggests, perhaps using 1 second could get the transfers to happen a bit more quickly and reveal the extra siblings?

A list of something like [1,2] is good. A list of [1,2,15,16,17] is bad. If using sh -c 'while [ 1 ] ; do env ERL_LIBS=deps escript ./handoff_and_siblings.erl ; /bin/echo -n . ; done' to be punishing, then (real result) [1,2,110,125] is also quite bad. 8-)

{ok, C} = riak:local_client().
lists:usort([begin K = list_to_binary("k-" ++ integer_to_list(X)), length(riak_object:get_contents(element(2,C:get(<<"b">>, K)))) end || X <- lists:seq(1,100) ]).

@jonmeredith
Copy link
Contributor

At https://github.com/basho/riak_kv/blob/1.4/src/riak_kv_vnode.erl#L872 can
we do something like this? Or am I misunderstanding the problem?

handle_handoff_command(Req=?KV_PUT_REQ{}, Sender, State) ->
Req1 = strip_coord(Req),
{noreply, NewState} = handle_command(Req1, Sender, State),
{forward, NewState};

strip_coord(?KV_PUT_REQ{options= Options} = Req) ->
Req?KV_PUT_REQ{options = proplists:delete(coord, Options)}.

On Mon, Nov 17, 2014 at 9:06 AM, Russell Brown notifications@github.com
wrote:

I think it needs to do more than that. But it needs to modify the request,
which means a small change to riak_core to allow that. I’m on with it.

Just change
https://github.com/basho/riak_core/blob/1.4/src/riak_core_vnode.erl#L369
to allow either {forward, NewModState} or {forward, ModdedRequest, NewModState}. Backwards compatible too.

On 17 Nov 2014, at 16:03, Jon Meredith notifications@github.com wrote:

Could the handle_handoff_command on the handoff node just strip the
coord from the put request?


Reply to this email directly or view it on GitHub.


Reply to this email directly or view it on GitHub
#1046 (comment).

Jon Meredith
Chief Architect
Basho Technologies, Inc.
jmeredith@basho.com

@slfritchie
Copy link
Contributor Author

Can't do that -- riak_kv_vnode:handle_handoff_command() is called in the the 1st coordinating vnode's execution path. Madness!

@russelldb
Copy link
Member

Since vnode1 is handing off, handle_handoff_command is called. This makes sense to me.
However, the return from vnode1 of {forward, NewModState} just tells riak_core_vnode to forward the original request on.

Removing coord is not enough, anyway. The request is a put request, you need to generate a frontier on it. It must be coordinated somewhere.

@jonmeredith
Copy link
Contributor

Thanks for the clarification, adding support for returning a {forward, NewReq, NewModState} definitely seems like the answer.

@kellymclaughlin
Copy link
Contributor

Wow, nice work on this diagnosis @shino, @slfritchie, and @russelldb!

@shino
Copy link

shino commented Nov 18, 2014

I want to ask one questions for necessary condition.
Is a pair of interleaved read-modify-writes necessary to generate siblings at forwarded (target) vnode?

@russelldb
Copy link
Member

@shino I don't think you need any interleaving.

Node D has been added to the system and claims partition X from node A
Node A is now "handing off" partition X
Client reads key K
Node D's partition X will not but contacted for the read until handoff is over
Client modifies K and sends to Riak
Node A vnode X coordinates the write (stores clock [{ax, 1}]
Node A vnode X forwards the write to Node D vnode X
Node D, X coordinates the write (stores clock [{dx, 1}])
Client reads key K
Node D's partition X will not but contacted for the read until handoff is over
Client modifies and writes K
Node A, X overwrites local value with clock [{ax, 2}]
Node A, X forwards to Node D, X
Node D, X coordindates write.
Incoming clock is [{ax, 1}], local clock is [{dx, 1}], result? 2 sibling values and clock [{ax,1}, {dx, 2}]

Now repeat, D,X again handles incoming [{ax, 2}] which conflicts with [{ax,1}, {dx, 2}], another sibling, (now 3) and so on for each new write to A,X that is forwarded.

The client never sees the [{dx, n}] entry, so it's update never dominate them.

Does it sound right to you?


The fix works by

  1. coordinating on Node A, X first
  2. Modifying the request forwarded to include the result of A,X write, and strip coord from options, so that D, X only merges/stores the received object as a replica.

@shino
Copy link

shino commented Nov 18, 2014

Does it sound right to you?

Perfect explanation, thank you!

@andrewjstone
Copy link
Contributor

Finally took the time to read through this issue. Great work Scott, Russell
and Shino!

On Tue, Nov 18, 2014 at 7:27 AM, Shunichi Shinohara <
notifications@github.com> wrote:

Does it sound right to you?

Perfect explanation, thank you!


Reply to this email directly or view it on GitHub
#1046 (comment).

@slfritchie
Copy link
Contributor Author

Fixed by #1047, closing.

gordonguthrie pushed a commit to basho/riak_core that referenced this issue Dec 15, 2014
…s during

handoff

Forward port of this issue from 1.4 to 2.0:
basho/riak_kv#1046

Original fix is:
76cbdc2

There is a matching PR for riak_kv
gordonguthrie pushed a commit that referenced this issue Dec 15, 2014
requests during handoff

Forward port of this issue from 1.4 to 2.0:
#1046

Original fix is:
d3c8c4b

There is a matching PR for riak_core
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants