
Vnode nonblocking reply, First draft (3rd edition), ready for some review #299

Merged: slfritchie merged 2 commits into master from gh274-vnode-noblocking-reply on May 2, 2013

Conversation

@slfritchie (Contributor, Author)

Vnode replies always go via reply(), and reply() always uses unreliable
messaging. (As opposed to the usual, and more reliable, send-and-pray
messaging.)

During handoff, all forwarding requests use unreliable vnode master
commands to avoid net_kernel blocking interference.
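
For reference, the heart of the change is a send of roughly this shape
(a sketch mirroring the bang_unreliable/2 hunk reviewed below):

%% noconnect: skip the synchronous net_kernel connection attempt.
%% nosuspend: don't suspend on busy_dist_port; drop the message instead.
bang_unreliable(Dest, Msg) ->
    catch erlang:send(Dest, Msg, [noconnect, nosuspend]).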

My testing procedure:

  1. Use "make stagedevrel" on box A for nodes 1-4. Configure for
    multi-box use (i.e. change all 127.0.0.1 appearances in vm.args and
    app.config as appropriate).

  2. Use "make stage" on box B for node 5. Configure for multi-box
    use.

  3. Start all nodes, join them into a cluster, and commit the cluster plan.

  4. Run basho_bench on box A, using this config:

    {mode, {rate, 50}}.
    {report_interval, 1}.
    {duration, 30}.
    {concurrent, 50}.

    {driver, basho_bench_driver_riakc_pb}.
    {riakc_pb_ips, [
    {{127,0,0,1}, 10017}, %% {Ip, Port}
    {{127,0,0,1}, [10027, 10037, 10047]} %% {Ip, Ports}
    ]}.
    {riakc_pb_replies, 1}.
    {operations, [{get, 1}, {update, 1}]}.
    {pb_timeout_general, 5000}.

    {key_generator, {int_to_bin, {uniform_int, 990000}}}.
    {value_generator, {fixed_bin, 10}}.

  5. On box B:

    sh -c 'for i in 1 2 3 4 5; do date; sleep 1 ; ( date ; ifdown eth0 ; sleep 100 ; ifup eth0 ; date ) > /tmp/slf 2>&1 ; cat /tmp/slf; sleep 70; done'

  6. Check the results via:

    awk -F, '{if ($5 == 0) { print "ok"; } else { print "error" } }' tests/current/summary.csv | uniq -c | less

  7. There's a pretty darn obvious qualitative way to view the results
    of the pre-patch and post-patch behavior. The sampling rate used by
    the basho_bench config above, i.e., {report_interval, 1}, creates
    some noise in the step 6 results. So, the best advice that I can give
    for looking at the quantitative data is to look for ~100 seconds of ok
    stability after node 1's network interface has been shut off.

7a. Good results from step 6:

  1 error
 19 ok
  3 error
  2 ok
  4 error
  1 ok
 57 error (definitely down: ifdown iteration 1)
  2 ok
  1 error
105 ok (~100 seconds of goodness after ifdown iteration 1)
  4 error
  2 ok
  4 error
  1 ok
 61 error (definitely down: ifdown iteration 2)
  1 ok
  2 error
101 ok (~100 seconds of goodness after ifdown iteration 2)
  3 error
  2 ok
 55 error (definitely down: ifdown iteration 3)
  2 ok
  2 error
112 ok (~100 seconds of goodness after ifdown iteration 3)
  4 error
  1 ok
 58 error (definitely down: ifdown iteration 4)
  1 ok
  3 error
109 ok (~100 seconds of goodness after ifdown iteration 4)
  3 error
  2 ok
  4 error
  1 ok
  4 error
  1 ok
 52 error (definitely down: ifdown iteration 5)
  1 ok
  4 error
188 ok  (~100 seconds of goodness after ifdown iteration 5;
         end of interface flapping, ignore the quantity here)

7b. Bad results from step 6:

  1 error
 24 ok
  4 error (probably down: ifdown iteration 1)
  1 ok
 10 error (definitely down: ifdown iteration 1)
  1 ok
 46 error
  1 ok
  7 error
106 ok (~100 seconds of goodness after ifdown iteration 1)
  3 error
  2 ok
  4 error
  1 ok
 61 error (definitely down: ifdown iteration 2 ... but this is
           where it becomes much less obvious where ifdown and
           ifup events actually happened)
  2 ok
  5 error
  1 ok
 21 error
 75 ok
  4 error
  1 ok
  5 error
  1 ok
  4 error
  1 ok
 46 error (down: ifdown iteration 3?)
  1 ok
 38 error
 75 ok
  3 error
  2 ok
  4 error
  1 ok
  5 error
  1 ok
 84 error (Huge problem ... Step 7a's results never had an error
           rate larger than 61 samples in a row.)
 75 ok
  3 error
  2 ok
  5 error
  1 ok
 57 error (probably down: ifdown iteration 5)
  1 ok
 32 error (if "57 error" was ifdown, then node 1's interface is
           now up, and this error rate is too high.)
974 ok (end of interface flapping, ignore the quantity here)

@jonmeredith (Contributor)

Does this bound the largest message you can retrieve from a vnode to the size of the disterl buffer?

E.g., if the disterl buffer were 1MB and you tried to retrieve a 2MB object, what is the behavior of the VM?

  1. Exceed the limit and queue that one object, then deny all after until transmit.
  2. Enforce the limit and deny the send.

@slfritchie (Contributor, Author)

@jonmeredith With "erl -sname foo +zdbbl 1024" on both sides, and the destination node's shell registered via register(x, self()), I see the first send always succeed (once the nodes are connected), both when sending 1MB and 5MB chunks. The second also succeeds; I assume that because both nodes are otherwise idle, transmission of the first chunk starts quickly enough for the second to fit into the buffer. The third, however, is out of luck.

Erlang R15B01 (erts-5.9.1) [source] [64-bit] [smp:6:6] [async-threads:0] [hipe] [kernel-poll:false]

Eshell V5.9.1  (abort with ^G)
(foo@spare1)1> M1 = list_to_binary(lists:duplicate(1024*1024, $x)).
<<"xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"...>>
(foo@spare1)2> [erlang:send({x, foo@big8}, {X, M1}, [noconnect, nosuspend]) || X <- lists:seq(1,10)].
[noconnect,noconnect,noconnect,noconnect,noconnect,
 noconnect,noconnect,noconnect,noconnect,noconnect]
(foo@spare1)3> net_adm:ping('foo@big8').
pong
(foo@spare1)4> [erlang:send({x, foo@big8}, {X, M1}, [noconnect, nosuspend]) || X <- lists:seq(1,10)].
[ok,ok,nosuspend,nosuspend,nosuspend,nosuspend,nosuspend,
 nosuspend,nosuspend,nosuspend]
(foo@spare1)5> M5 = list_to_binary(lists:duplicate(5*1024*1024, $x)).
<<"xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"...>>
(foo@spare1)6> [erlang:send({x, foo@big8}, {X, M5}, [noconnect, nosuspend]) || X <- lists:seq(1,10)].
[ok,ok,nosuspend,nosuspend,nosuspend,nosuspend,nosuspend,
 nosuspend,nosuspend,nosuspend]

If I do the sending a bit differently:

[begin erlang:send({x, foo@big8}, {X, M5}, [noconnect, nosuspend]), timer:sleep(2) end || X <- lists:seq(1,1000)].

... then I see that we indeed get what we expect: sending resumes when the buffer drains, and about two of the 5MB chunks get through each time:

Flush = fun(F) -> receive {N, Bin} -> io:format("seq ~p bytes ~p, ", [N, size(Bin)]), F(F); X -> io:format("unknown ~P, ", [X, 10]), F(F) after 0 -> io:format("\n") end end.

Flush(Flush).
seq 1 bytes 5242880, seq 2 bytes 5242880, seq 22 bytes 5242880, seq 23 bytes 5242880, seq 50 bytes 5242880, seq 51 bytes 5242880, seq 77 bytes 5242880, seq 78 bytes 5242880, seq 106 bytes 5242880, seq 107 bytes 5242880, seq 133 bytes 5242880, seq 134 bytes 5242880, seq 162 bytes 5242880, seq 163 bytes 5242880, seq 191 bytes 5242880, seq 192 bytes 5242880, seq 218 bytes 5242880, seq 219 bytes 5242880, seq 247 bytes 5242880, seq 248 bytes 5242880, seq 274 bytes 5242880, seq 275 bytes 5242880, seq 303 bytes 5242880, seq 304 bytes 5242880, seq 331 bytes 5242880, seq 332 bytes 5242880, seq 358 bytes 5242880, seq 359 bytes 5242880, seq 387 bytes 5242880, seq 388 bytes 5242880, seq 416 bytes 5242880, seq 417 bytes 5242880, seq 443 bytes 5242880, seq 444 bytes 5242880, seq 472 bytes 5242880, seq 473 bytes 5242880, seq 499 bytes 5242880, seq 500 bytes 5242880, seq 528 bytes 5242880, seq 529 bytes 5242880, seq 555 bytes 5242880, seq 556 bytes 5242880, seq 584 bytes 5242880, seq 585 bytes 5242880, seq 611 bytes 5242880, seq 612 bytes 5242880, seq 640 bytes 5242880, seq 641 bytes 5242880, seq 669 bytes 5242880, seq 670 bytes 5242880, seq 696 bytes 5242880, seq 697 bytes 5242880, seq 725 bytes 5242880, seq 726 bytes 5242880, seq 752 bytes 5242880, seq 753 bytes 5242880, seq 781 bytes 5242880, seq 782 bytes 5242880, seq 808 bytes 5242880, seq 809 bytes 5242880, seq 837 bytes 5242880, seq 838 bytes 5242880, seq 864 bytes 5242880, seq 865 bytes 5242880, seq 893 bytes 5242880, seq 894 bytes 5242880, seq 922 bytes 5242880, seq 923 bytes 5242880, seq 949 bytes 5242880, seq 950 bytes 5242880, seq 977 bytes 5242880, seq 978 bytes 5242880,
ok

@slfritchie (Contributor, Author)

@jonmeredith So, this patch could solve 80% of the blocking problem by using only [noconnect] for the send: that takes care of the synchronous gen_server call to the net_kernel to reconnect. The remaining 20% chance of the vnode blocking due to busy_dist_port (wild-ass guessing; it's likely less than that) could be punted to proxy-style reply sending, for example #293.
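
For concreteness, a sketch of the two option sets under discussion
(the function names here are illustrative only):

%% Avoids the synchronous net_kernel reconnect, but the caller can
%% still be suspended if the distribution port is busy.
send_noconnect(Dest, Msg) ->
    erlang:send(Dest, Msg, [noconnect]).

%% Never blocks: returns noconnect on a disconnected node and
%% nosuspend on a busy distribution port, dropping the message. The
%% residual busy_dist_port risk above is what a #293-style proxy
%% process would absorb instead.
send_nonblocking(Dest, Msg) ->
    erlang:send(Dest, Msg, [noconnect, nosuspend]).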

I have a prototype for #293, FWIW.

@jtuple Any opinions?

seancribbs added a commit that referenced this pull request Apr 21, 2013
Fix return value -spec for riak_kv_bitcask_backend:get
@jrwest (Contributor) commented Apr 23, 2013

I ran a scenario similar to the one described above with riak master as well as this patch, and I believe I got similar results [1], but @slfritchie could you verify?

[1] https://gist.github.com/jrwest/5e152446a3684f35a476

ok.

%% Send without auto-connecting or suspending on a busy distribution
%% port; drop the message instead of blocking the caller.
bang_unreliable(Dest, Msg) ->
    catch erlang:send(Dest, Msg, [noconnect, nosuspend]).
Contributor

Not sure about the catch here. It seems to change the semantics from the original implementation, irrespective of dropping messages due to noconnect/nosuspend. For example, before this change unregistered_local_process ! something would crash the vnode; here it will fail silently. The former may not necessarily be the better behaviour, but do we want to make that sort of change in this PR? It seems we can remove the catch and still get what we want (from the documentation I don't see erlang:send/3 throwing errors any differently than !).
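
A quick shell illustration of the difference (hypothetical session;
not_registered is a made-up name with no process registered to it):

%% Plain ! to an unregistered local name raises badarg:
1> not_registered ! hello.
** exception error: bad argument
     in operator  !/2
        called as not_registered ! hello
%% With the catch, the same failure is swallowed and the caller lives:
2> catch erlang:send(not_registered, hello, [noconnect, nosuspend]).
{'EXIT',{badarg,[...]}}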

Contributor Author

This isn't a general library; it's for use within riak_core, so I don't have a problem with the change in semantics.

I will add a commit that will change the return value of bang_unreliable/2 to always return Msg.
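
That is, roughly (a sketch of the promised follow-up commit):

bang_unreliable(Dest, Msg) ->
    catch erlang:send(Dest, Msg, [noconnect, nosuspend]),
    Msg.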

Contributor

My concern was about the change of semantics within riak_core for other applications outside of Riak. I agree this behaviour is better; I just wonder if we really need to address it here {shrug}.

@jrwest (Contributor) commented Apr 23, 2013

Assuming @slfritchie confirms my test results are in line with his, this fix clearly addresses the issue. I left a few comments on the code. One concern I have is that right now, if we start dropping a whole lot of messages, we are flying blind. I know adding stats may open a whole new can of worms, but it would be nice to have some way to observe when we are dropping loads of messages, even if it's not an outright counter or sliding window.
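
For illustration, a minimal sketch of one way to count drops (the
module, table name, and counter key are assumptions, not part of this
PR):

-module(dropped_msg_counter).
-export([init/0, send/2]).

init() ->
    %% Public named table so any vnode process can bump the counter.
    ets:new(dropped_msgs, [named_table, public]),
    ets:insert(dropped_msgs, {dropped, 0}).

%% Same unreliable send as bang_unreliable/2, but note when the
%% message was dropped (noconnect/nosuspend) or the send crashed.
send(Dest, Msg) ->
    case catch erlang:send(Dest, Msg, [noconnect, nosuspend]) of
        ok -> ok;
        _DroppedOrExit -> ets:update_counter(dropped_msgs, dropped, 1)
    end,
    Msg.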

@slfritchie (Contributor, Author)

Yup, Jordan's results agree with mine. Via chat, I've suggested a series of multi-minute outages while running:

# riak-admin top -interval 1 -sort msg_q -lines 4 2>&1  | egrep --line-buffered -v '          0 [a-zA-Z]|^$|Load:|processes.*code|runq.*atom.*ets|=======|--------|Current Function|riak@10\.' | awk '{printf("%s %s\n", system("echo -n `date`"), $0)}'

... which can highlight multi-minute bad behavior without this PR, and never shows such bad behavior with it.

https://gist.github.com/slfritchie/41e4dd0e7790635e164b

@slfritchie (Contributor, Author)

cc: @aphyr ... Kyle, my scatterbrained memory says that you were asking via IRC or Twitter about riak_core behavior that might be addressed (at least partially) by this PR?

@evanmcc (Contributor) commented Apr 24, 2013

Yup, I had meant to point it out to him this morning via email, but
blanked. My apologies.

@slfritchie (Contributor, Author)

@evanmcc I don't know if the magic '@' mentions will generate notifications to GitHub users outside of the organization ... and if he's paying attention to GitHub notifications. :-) If you could pester him in parallel, I'd appreciate it.

@ghost commented Apr 24, 2013

Oh, it's OK. I see all @mentions of aphyr on the internet, whether the system is built to send notifications or not. ;-)

This looks great. I'd be happy to check out a branch and give it a shot with my partition test suite, whenever you're ready.

@slfritchie (Contributor, Author)

Does this bound the largest message you can retrieve from a vnode to the size of the disterl buffer?

@jonmeredith Option 1: exceed the limit and queue that one object (as the +zdbbl experiment above shows, where the first oversized send still succeeds).

@slfritchie (Contributor, Author)

my concern was about the change of semantics w/in riak core for other applications outside of Riak. I agree this behaviour is better, I just wonder if we really need to address it here {shrug}.

After looking for catch and try/catch use, I'm a bit surprised to see how rarely riak_core_vnode guards against vnode callback module misbehavior. I hereby declare that lack of such defensiveness is silly, and my first act to fix it is to leave my catch in. @jonmeredith @jtuple Opinions?
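
The kind of guard in question, sketched (safe_callback and the error
tuple shape are illustrative, not the riak_core_vnode source):

%% Keep a crashing vnode callback module from taking the vnode
%% process down with it.
safe_callback(Mod, Fun, Args) ->
    try
        apply(Mod, Fun, Args)
    catch
        Class:Reason ->
            {error, {callback_failed, Mod, Fun, Class, Reason}}
    end.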

@jrwest (Contributor) commented Apr 24, 2013

@slfritchie wfm

@slfritchie (Contributor, Author)

@kingsbury Kyle, this PR only addresses vnodes being blocked while replying: being hosed by net_kernel's synchronous connection attempts, or being suspended due to busy_dist_port. There are plenty of other places where riak_core messaging patterns could be interrupted in bad ways by inter-node messaging. But you're welcome to take it for a spin; it sounds like review is close to done here.

@slfritchie (Contributor, Author)

@jsmartin Can you try this patch in your special hardware demo environment to see if it fixes the problem? Feel free to pester me via email or internal chat.

@slfritchie (Contributor, Author)

[4:37 PM] James Martin: @scott @jonmeredith stuff looks good to me.. cluster timeouts are in-line with net_ticktime

slfritchie added a commit that referenced this pull request May 2, 2013
Vnode nonblocking reply, First draft (3rd edition), ready for some review
@slfritchie slfritchie merged commit b2bb8a3 into master May 2, 2013
@ghost commented May 2, 2013

I'll give this a shot tonight! :)

@seancribbs seancribbs deleted the gh274-vnode-noblocking-reply branch April 1, 2015 23:01