Vnode nonblocking reply, First draft (3rd edition), ready for some review #299
Conversation
Vnode replies always go via reply(), and reply() always uses unreliable messaging. (As opposed to the usual (and more reliable) send-and-pray messaging.) During handoff, all forwarding requests use unreliable vnode master commands to avoid net_kernel blocking interference.
Does this bound the largest message you can retrieve from a vnode to the size of the disterl buffer? E.g. if the disterl buffer were 1MB and you tried to retrieve a 2MB object, what is the behavior of the VM?
@jonmeredith With `erl -sname foo +zdbbl 1024` on both sides, and the destination node's shell registered via
If I do the sending a bit differently:
... then I see that we indeed get what we expect: sending resumes when the buffer drains, and we get about two of the 5MB chunks through each time:
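A minimal sketch of the send call used in the experiment above (the helper name `try_send/2` is mine, not from the patch). With `nosuspend`, `erlang:send/3` returns `nosuspend` instead of suspending the caller when the distribution buffer set by `+zdbbl` is full, and `noconnect` returns `noconnect` rather than blocking in `net_kernel` while a connection is being established:

```erlang
%% Sketch only: try_send/2 is an assumed helper name, not in the patch.
%% Returns ok | nosuspend | noconnect instead of ever blocking.
try_send(Dest, Msg) ->
    erlang:send(Dest, Msg, [noconnect, nosuspend]).
```

Sending a series of 5MB binaries to a process on the remote node with this helper returns `ok` while the buffer has room, `nosuspend` once it fills, and `ok` again as the buffer drains, which matches the "about two of the 5MB chunks through each time" observation.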
@jonmeredith So, this patch could solve 80% of the blocking problem by using only [noconnect] for the send. That takes care of the synchronous connection-attempt blocking. I have a prototype for #293, FWIW. @jtuple Any opinions?
Fix return value -spec for riak_kv_bitcask_backend:get
Ran a scenario similar to the one described above with riak master as well as this patch; I believe I got similar results [1], but @slfritchie could you verify?
ok.
```erlang
bang_unreliable(Dest, Msg) ->
    catch erlang:send(Dest, Msg, [noconnect, nosuspend]).
```
Not sure about the `catch` here. It seems to change the semantics from the original implementation, irrespective of dropping messages due to `noconnect`, `nosuspend`. For example, pre-this-change `unregistered_local_process ! something` would crash the vnode; here it will fail silently. The former may not necessarily be the better behaviour, but do we want to make that sort of change in this PR? It seems we can remove the `catch` and still get what we want (from the documentation I don't see `erlang:send/3` throwing errors any differently than `!`).
This isn't a general library, it's for use with riak_core, so I don't have a problem with the change in semantics.
I will add a commit that will change the return value of `bang_unreliable/2` to always return `Msg`.
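A sketch of what that follow-up commit could look like (an assumption about the eventual commit, inferred only from the function shown above): evaluate the guarded send, then return `Msg` unconditionally, matching the return value of the `!` operator:

```erlang
%% Sketch of the promised change: failures are still swallowed by the
%% catch, but the function now always returns Msg, as Dest ! Msg would.
bang_unreliable(Dest, Msg) ->
    catch erlang:send(Dest, Msg, [noconnect, nosuspend]),
    Msg.
```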
My concern was about the change of semantics within riak_core for other applications outside of Riak. I agree this behaviour is better; I just wonder if we really need to address it here {shrug}.
Assuming @slfritchie confirms my test results are in line with his, this fix clearly addresses the issue. Left a few comments on the code. One concern I have is that right now, if we start dropping a whole lot of messages, we are flying blind. I know adding stats may open a whole new can of worms, but it would be nice to have some way to observe when we are dropping loads of messages, even if it's not an outright counter or sliding window.
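One lightweight way to get that visibility, sketched here purely as an illustration (the ETS table and key names are invented and are not part of riak_core), is to bump a public counter whenever the unreliable send does not return `ok`:

```erlang
%% Hypothetical sketch only: count sends that were dropped or failed so
%% an operator can poll the counter. Table and key names are made up.
init_drop_stats() ->
    ets:new(vnode_drop_stats, [named_table, public, set]),
    ets:insert(vnode_drop_stats, {dropped, 0}).

bang_unreliable(Dest, Msg) ->
    case catch erlang:send(Dest, Msg, [noconnect, nosuspend]) of
        ok -> ok;
        _NotDelivered ->
            ets:update_counter(vnode_drop_stats, dropped, 1)
    end,
    Msg.
```

Polling `ets:lookup(vnode_drop_stats, dropped)` would at least tell an operator whether drops are happening at all, and such a counter could later feed a sliding-window stat.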
Yup, Jordan's results agree with mine. Via chat, I've suggested a series of multi-minute outages while running:
... which can highlight multi-minute bad behavior without this PR, and never such bad behavior with it.
cc: @aphyr ... Kyle, my scatterbrained memory says that you were asking via IRC or Twitter about riak_core behavior that might be addressed (at least partially) by this PR?
Yup, I had meant to point it out to him this morning via email, but
@evanmcc I don't know if the magic '@' mentions will generate notifications to GitHub users outside of the organization ... and if he's paying attention to GitHub notifications. :-) If you could pester him in parallel, I'd appreciate it.
Oh, it's OK. I see all @mentions of aphyr on the internet, whether the system is built to send notifications or not. ;-) This looks great. I'd be happy to check out a branch and give it a shot with my partition test suite, whenever you're ready.
@jonmeredith Allow one to exceed.
After looking for catch and try/catch use, I'm a bit surprised to see how few times
@slfritchie wfm
@kingsbury Kyle, this PR only addresses vnodes being blocked while replying: being hosed by net_kernel sync connection attempts, or being suspended due to busy_dist_port. There are plenty of other places where riak_core messaging patterns could be interrupted in bad ways by inter-node messaging. But you're welcome to take it for a spin; it sounds like review is close to being done here.
@jsmartin Can you try this patch in your special hardware demo environment to see if it fixes the problem? Feel free to pester me via email or internal chat.
[4:37 PM] James Martin: @scott @jonmeredith stuff looks good to me.. cluster timeouts are in-line with net_ticktime
I'll give this a shot tonight! :)
My testing procedure:

1. Use "make stagedevrel" on box A for nodes 1-4. Configure for multi-box use (i.e. change all 127.0.0.1 appearances in vm.args and app.config as appropriate).
2. Use "make stage" on box B for node 5. Configure for multi-box use.
3. Start all nodes, join them all together, and commit.
4. Run basho_bench on box A, using this config:

```erlang
{mode, {rate, 50}}.
{report_interval, 1}.
{duration, 30}.
{concurrent, 50}.
{driver, basho_bench_driver_riakc_pb}.
{riakc_pb_ips, [
    {{127,0,0,1}, 10017},                 %% {Ip, Port}
    {{127,0,0,1}, [10027, 10037, 10047]}  %% {Ip, Ports}
]}.
{riakc_pb_replies, 1}.
{operations, [{get, 1}, {update, 1}]}.
{pb_timeout_general, 5000}.
{key_generator, {int_to_bin, {uniform_int, 990000}}}.
{value_generator, {fixed_bin, 10}}.
```

5. On box B:

```shell
sh -c 'for i in 1 2 3 4 5; do date; sleep 1; ( date; ifdown eth0; sleep 100; ifup eth0; date ) > /tmp/slf 2>&1; cat /tmp/slf; sleep 70; done'
```

6. Check the results via:

```shell
awk -F, '{if ($5 == 0) { print "ok"; } else { print "error" } }' tests/current/summary.csv | uniq -c | less
```
There's a pretty darn obvious qualitative way to view the results of the pre-patch and post-patch behavior. The sampling rate used by the basho_bench config above, i.e. `{report_interval, 1}`, creates some noise in the step 6 results. So the best advice that I can give for looking at the quantitative data is to look for ~100 seconds of `ok` stability after node 1's network interface has been shut off.
7a. Good results from step 6:
7b. Bad results from step 6: