
Vnode nonblocking reply, First draft (3rd edition), ready for some review #299

Merged: slfritchie merged 2 commits into master from gh274-vnode-noblocking-reply on May 2, 2013

Conversation

@slfritchie (Contributor, Author)

Vnode replies always go via reply(), and reply() always uses unreliable
messaging. (As opposed to the usual, and more reliable, send-and-pray
messaging.)

During handoff, all forwarding requests use unreliable vnode master
commands to avoid net_kernel blocking interference.
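
For reference, the heart of the change is a send of roughly this shape
(a sketch mirroring the bang_unreliable/2 hunk reviewed below):

%% noconnect: skip the synchronous net_kernel connection attempt.
%% nosuspend: don't suspend on busy_dist_port; drop the message instead.
bang_unreliable(Dest, Msg) ->
    catch erlang:send(Dest, Msg, [noconnect, nosuspend]).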

My testing procedure:

  1. Use "make stagedevrel" on box A for nodes 1-4. Configure for
    multi-box use (i.e. change all 127.0.0.1 appearances in vm.args and
    app.config as appropriate).

  2. Use "make stage" on box B for node 5. Configure for multi-box
    use.

  3. Start all nodes, join them into a cluster, and commit the cluster plan.

  4. Run basho_bench on box A, using this config:

    {mode, {rate, 50}}.
    {report_interval, 1}.
    {duration, 30}.
    {concurrent, 50}.

    {driver, basho_bench_driver_riakc_pb}.
    {riakc_pb_ips, [
    {{127,0,0,1}, 10017}, %% {Ip, Port}
    {{127,0,0,1}, [10027, 10037, 10047]} %% {Ip, Ports}
    ]}.
    {riakc_pb_replies, 1}.
    {operations, [{get, 1}, {update, 1}]}.
    {pb_timeout_general, 5000}.

    {key_generator, {int_to_bin, {uniform_int, 990000}}}.
    {value_generator, {fixed_bin, 10}}.

  5. On box B:

    sh -c 'for i in 1 2 3 4 5; do date; sleep 1 ; ( date ; ifdown eth0 ; sleep 100 ; ifup eth0 ; date ) > /tmp/slf 2>&1 ; cat /tmp/slf; sleep 70; done'

  6. Check the results via:

    awk -F, '{if ($5 == 0) { print "ok"; } else { print "error" } }' tests/current/summary.csv | uniq -c | less

  7. There's a pretty darn obvious qualitative way to view the results
    of the pre-patch and post-patch behavior. The sampling rate used by
    the basho_bench config above, i.e., {report_interval, 1}, creates
    some noise in the step 6 results. So, the best advice that I can give
    for looking at the quantitative data is to look for ~100 seconds of ok
    stability after node 1's network interface has been shut off.

7a. Good results from step 6:

  1 error
 19 ok
  3 error
  2 ok
  4 error
  1 ok
 57 error (definitely down: ifdown iteration 1)
  2 ok
  1 error
105 ok (~100 seconds of goodness after ifdown iteration 1)
  4 error
  2 ok
  4 error
  1 ok
 61 error (definitely down: ifdown iteration 2)
  1 ok
  2 error
101 ok (~100 seconds of goodness after ifdown iteration 2)
  3 error
  2 ok
 55 error (definitely down: ifdown iteration 3)
  2 ok
  2 error
112 ok (~100 seconds of goodness after ifdown iteration 3)
  4 error
  1 ok
 58 error (definitely down: ifdown iteration 4)
  1 ok
  3 error
109 ok (~100 seconds of goodness after ifdown iteration 4)
  3 error
  2 ok
  4 error
  1 ok
  4 error
  1 ok
 52 error (definitely down: ifdown iteration 5)
  1 ok
  4 error
188 ok  (~100 seconds of goodness after ifdown iteration 5;
         end of interface flapping, ignore the quantity here)

7b. Bad results from step 6:

  1 error
 24 ok
  4 error (probably down: ifdown iteration 1)
  1 ok
 10 error (definitely down: ifdown iteration 1)
  1 ok
 46 error
  1 ok
  7 error
106 ok (~100 seconds of goodness after ifdown iteration 1)
  3 error
  2 ok
  4 error
  1 ok
 61 error (definitely down: ifdown iteration 2 ... but this is
           where it becomes much less obvious where ifdown and
           ifup events actually happened)
  2 ok
  5 error
  1 ok
 21 error
 75 ok
  4 error
  1 ok
  5 error
  1 ok
  4 error
  1 ok
 46 error (down: ifdown iteration 3?)
  1 ok
 38 error
 75 ok
  3 error
  2 ok
  4 error
  1 ok
  5 error
  1 ok
 84 error (Huge problem ... Step 7a's results never had an error
           rate larger than 61 samples in a row.)
 75 ok
  3 error
  2 ok
  5 error
  1 ok
 57 error (probably down: ifdown iteration 5)
  1 ok
 32 error (if "57 error" was ifdown, then node 1's interface is
           now up, and this error rate is too high.)
974 ok (end of interface flapping, ignore the quantity here)

@jonmeredith (Contributor)

Does this bound the largest message you can retrieve from a vnode to the size of the disterl buffer?

E.g., if the disterl buffer were 1MB and you tried to retrieve a 2MB object, what is the behavior of the VM?

  1. Exceed the limit and queue that one object, then deny all after until transmit.
  2. Enforce the limit and deny the send.

@slfritchie (Contributor, Author)

@jonmeredith With "erl -sname foo +zdbbl 1024" on both sides, and the destination node's shell registered via register(x, self()), I see the first send always succeed (once the nodes are connected), both when sending 1MB and 5MB chunks. The second also succeeds; I assume that because both nodes are otherwise idle, transmission of the first chunk starts quickly enough for the second to fit into the buffer. The third, however, is out of luck.

Erlang R15B01 (erts-5.9.1) [source] [64-bit] [smp:6:6] [async-threads:0] [hipe] [kernel-poll:false]

Eshell V5.9.1  (abort with ^G)
(foo@spare1)1> M1 = list_to_binary(lists:duplicate(1024*1024, $x)).
<<"xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"...>>
(foo@spare1)2> [erlang:send({x, foo@big8}, {X, M1}, [noconnect, nosuspend]) || X <- lists:seq(1,10)].
[noconnect,noconnect,noconnect,noconnect,noconnect,
 noconnect,noconnect,noconnect,noconnect,noconnect]
(foo@spare1)3> net_adm:ping('foo@big8').
pong
(foo@spare1)4> [erlang:send({x, foo@big8}, {X, M1}, [noconnect, nosuspend]) || X <- lists:seq(1,10)].
[ok,ok,nosuspend,nosuspend,nosuspend,nosuspend,nosuspend,
 nosuspend,nosuspend,nosuspend]
(foo@spare1)5> M5 = list_to_binary(lists:duplicate(5*1024*1024, $x)).
<<"xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"...>>
(foo@spare1)6> [erlang:send({x, foo@big8}, {X, M5}, [noconnect, nosuspend]) || X <- lists:seq(1,10)].
[ok,ok,nosuspend,nosuspend,nosuspend,nosuspend,nosuspend,
 nosuspend,nosuspend,nosuspend]

If I do the sending a bit differently:

[begin erlang:send({x, foo@big8}, {X, M5}, [noconnect, nosuspend]), timer:sleep(2) end || X <- lists:seq(1,1000)].

... then I see that we indeed get what we expect: sending resumes when the buffer drains, and about two of the 5MB chunks get through each time:

Flush = fun(F) -> receive {N, Bin} -> io:format("seq ~p bytes ~p, ", [N, size(Bin)]), F(F); X -> io:format("unknown ~P, ", [X, 10]), F(F) after 0 -> io:format("\n") end end.

Flush(Flush).
seq 1 bytes 5242880, seq 2 bytes 5242880, seq 22 bytes 5242880, seq 23 bytes 5242880, seq 50 bytes 5242880, seq 51 bytes 5242880, seq 77 bytes 5242880, seq 78 bytes 5242880, seq 106 bytes 5242880, seq 107 bytes 5242880, seq 133 bytes 5242880, seq 134 bytes 5242880, seq 162 bytes 5242880, seq 163 bytes 5242880, seq 191 bytes 5242880, seq 192 bytes 5242880, seq 218 bytes 5242880, seq 219 bytes 5242880, seq 247 bytes 5242880, seq 248 bytes 5242880, seq 274 bytes 5242880, seq 275 bytes 5242880, seq 303 bytes 5242880, seq 304 bytes 5242880, seq 331 bytes 5242880, seq 332 bytes 5242880, seq 358 bytes 5242880, seq 359 bytes 5242880, seq 387 bytes 5242880, seq 388 bytes 5242880, seq 416 bytes 5242880, seq 417 bytes 5242880, seq 443 bytes 5242880, seq 444 bytes 5242880, seq 472 bytes 5242880, seq 473 bytes 5242880, seq 499 bytes 5242880, seq 500 bytes 5242880, seq 528 bytes 5242880, seq 529 bytes 5242880, seq 555 bytes 5242880, seq 556 bytes 5242880, seq 584 bytes 5242880, seq 585 bytes 5242880, seq 611 bytes 5242880, seq 612 bytes 5242880, seq 640 bytes 5242880, seq 641 bytes 5242880, seq 669 bytes 5242880, seq 670 bytes 5242880, seq 696 bytes 5242880, seq 697 bytes 5242880, seq 725 bytes 5242880, seq 726 bytes 5242880, seq 752 bytes 5242880, seq 753 bytes 5242880, seq 781 bytes 5242880, seq 782 bytes 5242880, seq 808 bytes 5242880, seq 809 bytes 5242880, seq 837 bytes 5242880, seq 838 bytes 5242880, seq 864 bytes 5242880, seq 865 bytes 5242880, seq 893 bytes 5242880, seq 894 bytes 5242880, seq 922 bytes 5242880, seq 923 bytes 5242880, seq 949 bytes 5242880, seq 950 bytes 5242880, seq 977 bytes 5242880, seq 978 bytes 5242880,
ok

@slfritchie (Contributor, Author)

@jonmeredith So, this patch could solve 80% of the blocking problem by using only [noconnect] for the send: that takes care of the synchronous gen_server call to the net_kernel to reconnect. The remaining 20% chance of the vnode blocking due to busy_dist_port (wild-ass guessing; it's likely less than that) could be punted to proxy-style reply sending, for example #293.
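
For concreteness, a sketch of the two option sets under discussion
(the function names here are illustrative only):

%% Avoids the synchronous net_kernel reconnect, but the caller can
%% still be suspended if the distribution port is busy.
send_noconnect(Dest, Msg) ->
    erlang:send(Dest, Msg, [noconnect]).

%% Never blocks: returns noconnect on a disconnected node and
%% nosuspend on a busy distribution port, dropping the message. The
%% residual busy_dist_port risk above is what a #293-style proxy
%% process would absorb instead.
send_nonblocking(Dest, Msg) ->
    erlang:send(Dest, Msg, [noconnect, nosuspend]).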

I have a prototype for #293, FWIW.

@jtuple Any opinions?

seancribbs added a commit that referenced this pull request Apr 21, 2013
Fix return value -spec for riak_kv_bitcask_backend:get
@jrwest (Contributor) commented Apr 23, 2013

I ran a scenario similar to the one described above with riak master as well as this patch, and I believe I got similar results [1], but @slfritchie could you verify?

[1] https://gist.github.com/jrwest/5e152446a3684f35a476

ok.

%% Send without auto-connecting or suspending on a busy distribution
%% port; drop the message instead of blocking the caller.
bang_unreliable(Dest, Msg) ->
    catch erlang:send(Dest, Msg, [noconnect, nosuspend]).
Contributor

Not sure about the catch here. It seems to change the semantics from the original implementation, irrespective of dropping messages due to noconnect/nosuspend. For example, before this change unregistered_local_process ! something would crash the vnode; here it will fail silently. The former may not necessarily be the better behaviour, but do we want to make that sort of change in this PR? It seems we can remove the catch and still get what we want (from the documentation I don't see erlang:send/3 throwing errors any differently than !).
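
A quick shell illustration of the difference (hypothetical session;
not_registered is a made-up name with no process registered to it):

%% Plain ! to an unregistered local name raises badarg:
1> not_registered ! hello.
** exception error: bad argument
     in operator  !/2
        called as not_registered ! hello
%% With the catch, the same failure is swallowed and the caller lives:
2> catch erlang:send(not_registered, hello, [noconnect, nosuspend]).
{'EXIT',{badarg,[...]}}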

Contributor Author

This isn't a general library; it's for use within riak_core, so I don't have a problem with the change in semantics.

I will add a commit that will change the return value of bang_unreliable/2 to always return Msg.
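
That is, roughly (a sketch of the promised follow-up commit):

bang_unreliable(Dest, Msg) ->
    catch erlang:send(Dest, Msg, [noconnect, nosuspend]),
    Msg.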

Contributor

My concern was about the change of semantics within riak_core for other applications outside of Riak. I agree this behaviour is better; I just wonder if we really need to address it here {shrug}.

@jrwest (Contributor) commented Apr 23, 2013

Assuming @slfritchie confirms my test results are in line with his, this fix clearly addresses the issue. I left a few comments on the code. One concern I have is that right now, if we start dropping a whole lot of messages, we are flying blind. I know adding stats may open a whole new can of worms, but it would be nice to have some way to observe when we are dropping loads of messages, even if it's not an outright counter or sliding window.
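
For illustration, a minimal sketch of one way to count drops (the
module, table name, and counter key are assumptions, not part of this
PR):

-module(dropped_msg_counter).
-export([init/0, send/2]).

init() ->
    %% Public named table so any vnode process can bump the counter.
    ets:new(dropped_msgs, [named_table, public]),
    ets:insert(dropped_msgs, {dropped, 0}).

%% Same unreliable send as bang_unreliable/2, but note when the
%% message was dropped (noconnect/nosuspend) or the send crashed.
send(Dest, Msg) ->
    case catch erlang:send(Dest, Msg, [noconnect, nosuspend]) of
        ok -> ok;
        _DroppedOrExit -> ets:update_counter(dropped_msgs, dropped, 1)
    end,
    Msg.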

@slfritchie (Contributor, Author)

Yup, Jordan's results agree with mine. Via chat, I've suggested a series of multi-minute outages while running:

# riak-admin top -interval 1 -sort msg_q -lines 4 2>&1  | egrep --line-buffered -v '          0 [a-zA-Z]|^$|Load:|processes.*code|runq.*atom.*ets|=======|--------|Current Function|riak@10\.' | awk '{printf("%s %s\n", system("echo -n `date`"), $0)}'

... which can highlight multi-minute bad behavior without this PR, and never shows such bad behavior with it.

https://gist.github.com/slfritchie/41e4dd0e7790635e164b

@slfritchie (Contributor, Author)

cc: @aphyr ... Kyle, my scatterbrained memory says that you were asking via IRC or Twitter about riak_core behavior that might be addressed (at least partially) by this PR?

@evanmcc (Contributor) commented Apr 24, 2013

Yup, I had meant to point it out to him this morning via email, but
blanked. My apologies.

@slfritchie (Contributor, Author)

@evanmcc I don't know if the magic '@' mentions will generate notifications to GitHub users outside of the organization ... and if he's paying attention to GitHub notifications. :-) If you could pester him in parallel, I'd appreciate it.

@ghost commented Apr 24, 2013

Oh, it's OK. I see all @mentions of aphyr on the internet, whether the system is built to send notifications or not. ;-)

This looks great. I'd be happy to check out a branch and give it a shot with my partition test suite, whenever you're ready.

@slfritchie (Contributor, Author)

Does this bound the largest message you can retrieve from a vnode to the size of the disterl buffer?

@jonmeredith Option 1: exceed the limit and queue that one object (as the +zdbbl experiment above shows, where the first oversized send still succeeds).

@slfritchie (Contributor, Author)

my concern was about the change of semantics w/in riak core for other applications outside of Riak. I agree this behaviour is better, I just wonder if we really need to address it here {shrug}.

After looking for catch and try/catch use, I'm a bit surprised to see how rarely riak_core_vnode guards against vnode callback module misbehavior. I hereby declare that lack of such defensiveness is silly, and my first act to fix it is to leave my catch in. @jonmeredith @jtuple Opinions?
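
The kind of guard in question, sketched (safe_callback and the error
tuple shape are illustrative, not the riak_core_vnode source):

%% Keep a crashing vnode callback module from taking the vnode
%% process down with it.
safe_callback(Mod, Fun, Args) ->
    try
        apply(Mod, Fun, Args)
    catch
        Class:Reason ->
            {error, {callback_failed, Mod, Fun, Class, Reason}}
    end.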

@jrwest (Contributor) commented Apr 24, 2013

@slfritchie wfm

@slfritchie (Contributor, Author)

@kingsbury Kyle, this PR only addresses vnodes being blocked while replying: being hosed by net_kernel's synchronous connection attempts, or being suspended due to busy_dist_port. There are plenty of other places where riak_core messaging patterns could be interrupted in bad ways by inter-node messaging. But you're welcome to take it for a spin; it sounds like review is close to done here.

@slfritchie (Contributor, Author)

@jsmartin Can you try this patch in your special hardware demo environment to see if it fixes the problem? Feel free to pester me via email or internal chat.

@slfritchie (Contributor, Author)

[4:37 PM] James Martin: @scott @jonmeredith stuff looks good to me.. cluster timeouts are in-line with net_ticktime

slfritchie added a commit that referenced this pull request May 2, 2013
Vnode nonblocking reply, First draft (3rd edition), ready for some review
@slfritchie slfritchie merged commit b2bb8a3 into master May 2, 2013
@ghost commented May 2, 2013

I'll give this a shot tonight! :)

@seancribbs seancribbs deleted the gh274-vnode-noblocking-reply branch April 1, 2015 23:01