Vnode nonblocking reply, First draft (3rd edition), ready for some review #299

Merged
merged 2 commits into master from gh274-vnode-noblocking-reply on May 2, 2013


@slfritchie

Vnode replies always go via reply(), and reply() always uses unreliable
messaging. (As opposed to the usual, and more reliable, send-and-pray
messaging.)

During handoff, all forwarding requests use unreliable vnode master
commands to avoid net_kernel blocking interference.
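Both paths boil down to erlang:send/3 with the noconnect and nosuspend options. A minimal standalone sketch (the module name here is illustrative; the function mirrors bang_unreliable/2 from this PR's riak_core_send_msg.erl):

```erlang
%% Minimal sketch of the unreliable-send primitive this PR uses.
%% noconnect: don't block in a synchronous net_kernel reconnect attempt.
%% nosuspend: don't suspend the sender on busy_dist_port.
%% The message may be dropped; the return value (ok | noconnect |
%% nosuspend) says whether it was accepted for transmission.
-module(unreliable_sketch).
-export([bang_unreliable/2]).

-spec bang_unreliable(pid() | atom() | {atom(), node()}, term()) -> term().
bang_unreliable(Dest, Msg) ->
    %% catch: a bad local destination fails silently instead of
    %% crashing the (vnode) caller.
    catch erlang:send(Dest, Msg, [noconnect, nosuspend]).
```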

My testing procedure:

  1. Use "make stagedevrel" on box A for nodes 1-4. Configure for
    multi-box use (i.e. change all occurrences of 127.0.0.1 in vm.args
    and app.config as appropriate).

  2. Use "make stage" on box B for node 5. Configure for multi-box
    use.

  3. Start all nodes, join them all together, and commit.

  4. Run basho_bench on box A, using this config:

    {mode, {rate, 50}}.
    {report_interval, 1}.
    {duration, 30}.
    {concurrent, 50}.

    {driver, basho_bench_driver_riakc_pb}.
    {riakc_pb_ips, [
    {{127,0,0,1}, 10017}, %% {Ip, Port}
    {{127,0,0,1}, [10027, 10037, 10047]} %% {Ip, Ports}
    ]}.
    {riakc_pb_replies, 1}.
    {operations, [{get, 1}, {update, 1}]}.
    {pb_timeout_general, 5000}.

    {key_generator, {int_to_bin, {uniform_int, 990000}}}.
    {value_generator, {fixed_bin, 10}}.

  5. On box B:

    sh -c 'for i in 1 2 3 4 5; do date; sleep 1; ( date; ifdown eth0; sleep 100; ifup eth0; date ) > /tmp/slf 2>&1; cat /tmp/slf; sleep 70; done'

  6. Check the results via:

    awk -F, '{if ($5 == 0) { print "ok"; } else { print "error" } }' tests/current/summary.csv | uniq -c | less

  7. There's a pretty darn obvious qualitative way to view the results
    of the pre-patch and post-patch behavior. The sampling rate used by
    the basho_bench config above, i.e. {report_interval, 1}, creates
    some noise in the step 6 results. So the best advice I can give
    for looking at the quantitative data is to look for ~100 seconds of
    ok stability after node 1's network interface has been shut off.

7a. Good results from step 6:

  1 error
 19 ok
  3 error
  2 ok
  4 error
  1 ok
 57 error (definitely down ifdown iteration  1)
  2 ok
  1 error
105 ok (about ~100 seconds of goodness after ifdown iteration  1)
  4 error
  2 ok
  4 error
  1 ok
 61 error (definitely down ifdown iteration  2)
  1 ok
  2 error
101 ok (about ~100 seconds of goodness after ifdown iteration  2)
  3 error
  2 ok
 55 error (definitely down ifdown iteration  3)
  2 ok
  2 error
112 ok (about ~100 seconds of goodness after ifdown iteration  3)
  4 error
  1 ok
 58 error (definitely down ifdown iteration  4)
  1 ok
  3 error
109 ok (about ~100 seconds of goodness after ifdown iteration  4)
  3 error
  2 ok
  4 error
  1 ok
  4 error
  1 ok
 52 error (definitely down ifdown iteration  5)
  1 ok
  4 error
188 ok  (about ~100 seconds of goodness after ifdown iteration  5)
        (end of interface flapping, ignore the quantity here)

7b. Bad results from step 6:

  1 error
 24 ok
  4 error (probably down ifdown iteration  1)
  1 ok
 10 error (definitely down ifdown iteration  1)
  1 ok
 46 error
  1 ok
  7 error
106 ok (about ~100 seconds of goodness after down ifdown iteration  1)
  3 error
  2 ok
  4 error
  1 ok
 61 error (definitely down ifdown iteration 2 ... but this is
           where it becomes much less obvious where ifdown and
           ifup events actually happened)
  2 ok
  5 error
  1 ok
 21 error
 75 ok
  4 error
  1 ok
  5 error
  1 ok
  4 error
  1 ok
 46 error (down ifdown iteration  3?)
  1 ok
 38 error
 75 ok
  3 error
  2 ok
  4 error
  1 ok
  5 error
  1 ok
 84 error (Huge problem ... Step 7a's results never had an error
           rate larger than 61 samples in a row.)
 75 ok
  3 error
  2 ok
  5 error
  1 ok
 57 error (probably down here ifdown iteration  5)
  1 ok
 32 error (if "57 error" was ifdown, then node 1's interface is
           now up, and this error rate is too high.)
974 ok (end of interface flapping, ignore the quantity here)
@slfritchie slfritchie First draft (3rd edition), ready for some review
Vnode replies always go via reply(), and reply() always uses unreliable
messaging. (As opposed to the usual, and more reliable, send-and-pray
messaging.)

During handoff, all forwarding requests use unreliable vnode master
commands to avoid net_kernel blocking interference.
5ffd522
@jonmeredith

Does this bound the largest message you can retrieve from a vnode to the size of the disterl buffer?

E.g. if the disterl buffer were 1MB and you tried to retrieve a 2MB object, what is the behavior of the VM?
1) Exceed the limit and queue that one object, then deny all subsequent sends until it transmits.
2) Enforce the limit and deny the send.

@slfritchie

@jonmeredith With "erl -sname foo +zdbbl 1024" on both sides, and the destination node's shell registered via register(x, self()), I see the first send always succeeding, both when sending 1MB and 5MB chunks. The second also succeeds, I assume because both nodes are otherwise idle and start transmitting the first chunk quickly enough for the second to fit in the buffer. The third, however, is out of luck.

Erlang R15B01 (erts-5.9.1) [source] [64-bit] [smp:6:6] [async-threads:0] [hipe] [kernel-poll:false]

Eshell V5.9.1  (abort with ^G)
(foo@spare1)1> M1 = list_to_binary(lists:duplicate(1024*1024, $x)).
<<"xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"...>>
(foo@spare1)2> [erlang:send({x, foo@big8}, {X, M1}, [noconnect, nosuspend]) || X <- lists:seq(1,10)].
[noconnect,noconnect,noconnect,noconnect,noconnect,
 noconnect,noconnect,noconnect,noconnect,noconnect]
(foo@spare1)3> net_adm:ping('foo@big8').
pong
(foo@spare1)4> [erlang:send({x, foo@big8}, {X, M1}, [noconnect, nosuspend]) || X <- lists:seq(1,10)].
[ok,ok,nosuspend,nosuspend,nosuspend,nosuspend,nosuspend,
 nosuspend,nosuspend,nosuspend]
(foo@spare1)5> M5 = list_to_binary(lists:duplicate(5*1024*1024, $x)).
<<"xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"...>>
(foo@spare1)6> [erlang:send({x, foo@big8}, {X, M5}, [noconnect, nosuspend]) || X <- lists:seq(1,10)].
[ok,ok,nosuspend,nosuspend,nosuspend,nosuspend,nosuspend,
 nosuspend,nosuspend,nosuspend]

If I do the sending a bit differently:

[begin erlang:send({x, foo@big8}, {X, M5}, [noconnect, nosuspend]), timer:sleep(2) end || X <- lists:seq(1,1000)].

... then I see that we indeed get what we expect: sending resumes when the buffer drains, and about two of the 5MB chunks get through each time:

Flush = fun(F) -> receive {N, Bin} -> io:format("seq ~p bytes ~p, ", [N, size(Bin)]), F(F); X -> io:format("unknown ~P, ", [X, 10]), F(F) after 0 -> io:format("\n") end end.

7> Flush(Flush).
seq 1 bytes 5242880, seq 2 bytes 5242880, seq 22 bytes 5242880, seq 23 bytes 5242880, seq 50 bytes 5242880, seq 51 bytes 5242880, seq 77 bytes 5242880, seq 78 bytes 5242880, seq 106 bytes 5242880, seq 107 bytes 5242880, seq 133 bytes 5242880, seq 134 bytes 5242880, seq 162 bytes 5242880, seq 163 bytes 5242880, seq 191 bytes 5242880, seq 192 bytes 5242880, seq 218 bytes 5242880, seq 219 bytes 5242880, seq 247 bytes 5242880, seq 248 bytes 5242880, seq 274 bytes 5242880, seq 275 bytes 5242880, seq 303 bytes 5242880, seq 304 bytes 5242880, seq 331 bytes 5242880, seq 332 bytes 5242880, seq 358 bytes 5242880, seq 359 bytes 5242880, seq 387 bytes 5242880, seq 388 bytes 5242880, seq 416 bytes 5242880, seq 417 bytes 5242880, seq 443 bytes 5242880, seq 444 bytes 5242880, seq 472 bytes 5242880, seq 473 bytes 5242880, seq 499 bytes 5242880, seq 500 bytes 5242880, seq 528 bytes 5242880, seq 529 bytes 5242880, seq 555 bytes 5242880, seq 556 bytes 5242880, seq 584 bytes 5242880, seq 585 bytes 5242880, seq 611 bytes 5242880, seq 612 bytes 5242880, seq 640 bytes 5242880, seq 641 bytes 5242880, seq 669 bytes 5242880, seq 670 bytes 5242880, seq 696 bytes 5242880, seq 697 bytes 5242880, seq 725 bytes 5242880, seq 726 bytes 5242880, seq 752 bytes 5242880, seq 753 bytes 5242880, seq 781 bytes 5242880, seq 782 bytes 5242880, seq 808 bytes 5242880, seq 809 bytes 5242880, seq 837 bytes 5242880, seq 838 bytes 5242880, seq 864 bytes 5242880, seq 865 bytes 5242880, seq 893 bytes 5242880, seq 894 bytes 5242880, seq 922 bytes 5242880, seq 923 bytes 5242880, seq 949 bytes 5242880, seq 950 bytes 5242880, seq 977 bytes 5242880, seq 978 bytes 5242880,
ok
@slfritchie

@jonmeredith So, this patch could solve 80% of the blocking problem by using only [noconnect] for the send. That takes care of the synchronous gen_server call to the net_kernel to reconnect. The remaining 20% chance of the vnode blocking (wild-ass guessing, it's likely less than that) due to busy_dist_port could be punted to a proxy-style reply sending, for example, #293.

I have a prototype for #293, FWIW.

@jtuple Any opinions?

@jrwest

I ran a scenario similar to the one described above with riak master as well as with this patch, and I believe I got similar results [1], but @slfritchie could you verify?

[1] https://gist.github.com/jrwest/5e152446a3684f35a476

@jrwest jrwest and 1 other commented on an outdated diff Apr 23, 2013
src/riak_core_send_msg.erl
+ bang_unreliable(To, {Tag, Reply}).
+
+cast_unreliable(Dest, Request) ->
+ bang_unreliable(Dest, {'$gen_cast', Request}).
+
+%% NOTE: We've peeked inside gen_fsm.erl's guts to see its internals.
+send_event_unreliable({global, _Name} = GlobalTo, Event) ->
+ erlang:error({unimplemented_send, GlobalTo, Event});
+send_event_unreliable({via, _Mod, _Name} = ViaTo, Event) ->
+ erlang:error({unimplemented_send, ViaTo, Event});
+send_event_unreliable(Name, Event) ->
+ bang_unreliable(Name, {'$gen_event', Event}),
+ ok.
+
+bang_unreliable(Dest, Msg) ->
+ catch erlang:send(Dest, Msg, [noconnect, nosuspend]).
@jrwest
jrwest Apr 23, 2013

Not sure about the catch here. It seems to change the semantics of the original implementation, irrespective of dropping messages due to noconnect/nosuspend. For example, before this change unregistered_local_process ! something would crash the vnode; here it will fail silently. The former may not necessarily be the better behaviour, but do we want to make that sort of change in this PR? It seems like we can remove the catch and still get what we want (from the documentation I don't see erlang:send/3 throwing errors any differently than !).

@slfritchie
slfritchie Apr 24, 2013

This isn't a general library, it's for use with riak_core, so I don't have a problem with the change in semantics.

I will add a commit that will change the return value of bang_unreliable/2 to always return Msg.

@jrwest
jrwest Apr 24, 2013

my concern was about the change of semantics w/in riak core for other applications outside of Riak. I agree this behaviour is better, I just wonder if we really need to address it here {shrug}.

@jrwest jrwest commented on the diff Apr 23, 2013
src/riak_core_send_msg.erl
+
+%% NOTE: We've peeked inside gen_server.erl's guts to see its internals.
+reply_unreliable({To, Tag}, Reply) ->
+ bang_unreliable(To, {Tag, Reply}).
+
+cast_unreliable(Dest, Request) ->
+ bang_unreliable(Dest, {'$gen_cast', Request}).
+
+%% NOTE: We've peeked inside gen_fsm.erl's guts to see its internals.
+send_event_unreliable({global, _Name} = GlobalTo, Event) ->
+ erlang:error({unimplemented_send, GlobalTo, Event});
+send_event_unreliable({via, _Mod, _Name} = ViaTo, Event) ->
+ erlang:error({unimplemented_send, ViaTo, Event});
+send_event_unreliable(Name, Event) ->
+ bang_unreliable(Name, {'$gen_event', Event}),
+ ok.
@jrwest
jrwest Apr 23, 2013

Why ok here but not for bang/cast_unreliable? Doesn't the caller lose the ability to process the return value in this case?

@slfritchie
slfritchie Apr 24, 2013

That's the spec for gen_event:notify() and sync_notify().

@jrwest jrwest commented on the diff Apr 23, 2013
src/riak_core_vnode_master.erl
-%% Send the command to an individual Index/Node combination
-command({Index, Pid}, Msg, Sender, _VMaster) when is_pid(Pid) ->
- gen_fsm:send_event(Pid, make_request(Msg, Sender, Index));
-command({Index,Node}, Msg, Sender, VMaster) ->
- proxy_cast({VMaster, Node}, make_request(Msg, Sender, Index)).
+command2(DestTuple, Msg, Sender, VMaster, How) when is_tuple(DestTuple) ->
+ %% Final case, tuple = single destination ... so make a list and
@jrwest
jrwest Apr 23, 2013

Straggling comment?

@slfritchie
slfritchie Apr 24, 2013

Nope, it's the last clause of command2().

@jrwest

Assuming @slfritchie confirms my test results are in line with his, this fix clearly addresses the issue. I left a few comments on the code. One concern I have: right now, if we start dropping a whole lot of messages, we are flying blind. I know adding stats may open a whole new can of worms, but it would be nice to have some way to observe when we are dropping loads of messages, even if it's not an outright counter or sliding window.
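One lightweight way to get that visibility would be to wrap the unreliable send and count the results that indicate a drop. A hedged sketch (all names here are hypothetical, and a real implementation would feed a stats library rather than the process dictionary):

```erlang
%% Hypothetical wrapper that counts sends dropped because of
%% noconnect/nosuspend (or a bad destination), so an operator can
%% observe message loss during partitions. The process-dictionary
%% counter keeps the sketch self-contained; a folsom/exometer-style
%% counter would be used in practice.
-module(drop_counter_sketch).
-export([send_counted/2, drops/0]).

send_counted(Dest, Msg) ->
    case catch erlang:send(Dest, Msg, [noconnect, nosuspend]) of
        ok ->
            ok;
        _NotSent ->
            %% noconnect, nosuspend, or {'EXIT', badarg} all mean the
            %% message did not go out; bump the drop counter.
            put(unreliable_drops, drops() + 1),
            dropped
    end.

drops() ->
    case get(unreliable_drops) of
        undefined -> 0;
        N -> N
    end.
```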

@slfritchie

Yup, Jordan's results agree with mine. Via chat, I've suggested a series of multi-minute outages while running:

# riak-admin top -interval 1 -sort msg_q -lines 4 2>&1 | egrep --line-buffered -v '          0 [a-zA-Z]|^$|Load:|processes.*code|runq.*atom.*ets|=======|--------|Current Function|riak@10\.' | awk '{printf("%s %s\n", system("echo -n `date`"), $0)}'

... which can highlight multi-minute bad behavior without this PR and never such bad behavior with it.

https://gist.github.com/slfritchie/41e4dd0e7790635e164b

@slfritchie

cc: @aphyr ... Kyle, my scatterbrained memory says that you were asking via IRC or Twitter about riak_core behavior that might be addressed (at least partially) by this PR?

@evanmcc
@slfritchie

@evanmcc I don't know if the magic '@' mentions will generate notifications to GitHub users outside of the organization ... and if he's paying attention to GitHub notifications. :-) If you could pester him in parallel, I'd appreciate it.

@ghost

Oh, it's OK. I see all @mentions of aphyr on the internet, whether the system is built to send notifications or not. ;-)

This looks great. I'd be happy to check out a branch and give it a shot with my partition test suite, whenever you're ready.

@slfritchie

Does this bound the largest message you can retrieve from a vnode to the size of the disterl buffer?

@jonmeredith Behavior 1: allow one message to exceed the limit, then deny sends until the buffer drains.

@slfritchie

my concern was about the change of semantics w/in riak core for other applications outside of Riak. I agree this behaviour is better, I just wonder if we really need to address it here {shrug}.

After looking for catch and try/catch use, I'm a bit surprised to see how few times riak_core_vnode is guarding against vnode callback module misbehavior. I hereby declare that lack of such defensiveness is silly, and my first act to fix it is to leave my catch in. @jonmeredith @jtuple Opinions?

@slfritchie

@kingsbury Kyle, this PR only addresses vnodes being blocked while replying, whether hosed by net_kernel sync connection attempts or suspended due to busy_dist_port. There are plenty of other places where riak_core messaging patterns could be interrupted in bad ways by inter-node messaging. But you're welcome to take it for a spin; it sounds like review is close to being done here.

@slfritchie

@jsmartin Can you try this patch in your special hardware demo environment to see if it fixes the problem? Feel free to pester me via email or internal chat.

@slfritchie

[4:37 PM] James Martin: @Scott @JonMeredith stuff looks good to me.. cluster timeouts are in-line with net_ticktime

@slfritchie slfritchie merged commit b2bb8a3 into master May 2, 2013

1 check passed

default: The Travis build passed
@ghost

I'll give this a shot tonight! :)

@jrwest jrwest referenced this pull request May 10, 2013
Merged

Backport of #299 to 1.3 #317

@seancribbs seancribbs deleted the gh274-vnode-noblocking-reply branch Apr 1, 2015