Porting parallel vnode init fix to 1.3 + revert switch #281
Conversation
Added an extra state to riak_core_vnode to decouple vnode initialization from process creation, plus a function to block until initialization is truly finished. The vnode init call now takes either a single index or a list of indices. The list version creates the vnode processes first and then waits for them to initialize, which allows initialization to happen in parallel. Used the above in riak_core_ring_handler when starting services on the first ring event. Tests show I/O saturation at bitcask startup now, instead of the serialized trickle we had before. This code still needs some work to handle edge cases, especially around how to handle vnode initialization failing.

Add concurrency control to vnode initialization: a new pmap with bounded concurrency utility was added, with unit tests.

Fix cluster info to query the vnode manager, not the supervisor: with the parallel vnode change, there is more reason not to query the supervisor directly, as its children may not have finished initialization yet.
This allows parallel initialization of vnodes without a backwards-incompatible API change. To opt in, the vnode module has to implement start_vnodes/1.
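For illustration only, here is a minimal sketch of what opting in might look like in an application vnode module. The module name is hypothetical, and delegating the index list to riak_core_vnode_master:get_vnode_pid/2 is an assumption modeled on how start_vnode/1 is usually written; the dispatch code quoted below only checks for the start_vnodes/1 export.

```erlang
-module(myapp_vnode).
%% The usual riak_core_vnode callbacks (init/1, handle_command/3, ...) are
%% omitted from this sketch.
-export([start_vnode/1, start_vnodes/1]).

%% Existing single-index entry point.
start_vnode(Index) ->
    riak_core_vnode_master:get_vnode_pid(Index, ?MODULE).

%% Optional new entry point: hand over the whole batch of indices so the
%% vnodes can be created and initialized in parallel.
%% (Assumption: after this change get_vnode_pid/2 also accepts a list.)
start_vnodes(IndexList) ->
    riak_core_vnode_master:get_vnode_pid(IndexList, ?MODULE).
```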
Overall, things work as intended and there don't appear to be any major issues, though I have a few minor comments. Also, given that this work has already merged into […], the main issue is the change to:

```erlang
case lists:member({start_vnodes, 1}, Mod:module_info(exports)) of
    true ->
        Mod:start_vnodes(Startable);
    false ->
        [Mod:start_vnode(I) || I <- Startable]
end.
```

My remaining comments are best left as line notes. Going to go add those now.
src/riak_core_util.erl
I'm not a fan of using selective receive on Pid here. That can lead to O(N^2) mailbox scans, as well as keeping items in the mailbox longer than necessary. To illustrate, let's say we have pids A-D, but due to worker timing or whatever we receive messages in the order D-A. The first iteration of Collect is going to do a receive {pmap_result, <A>, R}, skip over messages D-B, find A, return. Then for B it'll start over, re-scan until it finds B at the end, return. Rinse/repeat for the rest. Likewise, if A took longer than the other workers, we'd stick around waiting for A while keeping D-B messages sitting in the mailbox the entire time.
For vnode init, this may not be a major issue. But, given that pmap/3 is designed to be a general purpose utility function, we might as well consider what happens if someone decides to use 1000 workers or something. Really don't want to scan 1 million messages in the worst case.
Any reason you can't just do a receive on {pmap_result, _, R}, thus handling each response in the order they are sent? It looks like you end up sorting the results on line 297 so you're not particularly sensitive to how you end up constructing All.
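For what it's worth, a minimal sketch of that alternative, assuming the {pmap_result, Pid, Result} message shape described above (the function name and arguments are made up): take whichever result arrives next and count down, rather than receiving per pid.

```erlang
%% Collect N outstanding results in arrival order; the caller can sort
%% the combined list afterwards if ordering matters.
collect(0, Acc) ->
    Acc;
collect(NumPending, Acc) ->
    receive
        {pmap_result, _Pid, Result} ->
            collect(NumPending - 1, [Result | Acc])
    end.
```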
On a different note, I really wish we could use optimized receives in both this pmap/3 as well as the older pmap/2. But the strict requirements on how you must structure your code to get the Erlang compiler to use the optimized receive won't work for the way either of these pmap implementations is coded. This is unfortunate, as each receive for the pmap_result is going to scan over all non-pmap messages in the mailbox. This is kinda annoying for the riak_core_vnode_manager, which regularly receives 100s to 10000s of messages and is one of the top serialization points in all of Riak.
One option would be to do the pmap itself in a separate process, i.e. calling pmap spawns a process that then spawns the workers, waits for results, then sends the combined results back to the original process. This would decouple the calling process's mailbox from the pmap process. Anyway, food for thought.
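A rough sketch of that last idea, assuming the pmap/3 signature discussed here (fun, list of inputs, max concurrency); the wrapper name is made up. The whole pmap runs in a throwaway process, so a busy caller only ever has to match a single reply (plus the monitor, in case the helper dies).

```erlang
%% Hypothetical wrapper: isolate pmap's mailbox traffic from the caller.
pmap_isolated(Fun, List, MaxConcurrent) ->
    Caller = self(),
    Ref = make_ref(),
    {Pid, MRef} =
        spawn_monitor(
          fun() ->
                  Caller ! {Ref, riak_core_util:pmap(Fun, List, MaxConcurrent)}
          end),
    receive
        {Ref, Results} ->
            erlang:demonitor(MRef, [flush]),
            Results;
        {'DOWN', MRef, process, Pid, Reason} ->
            exit({pmap_failed, Reason})
    end.
```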
Collect only handles the last batch of work (a number of messages never bigger than max_concurrent). Within the main fold, messages are received in the order they arrived (see line 328). I can fix the collection of that last batch for 1.4, but I could see that being something easy to mess up right before shipping this.
I'm aware of the perils of selective receives in this general purpose utility, but thought that any application sensitive to that should run pmap in a separate process that will only collect pmap messages anyway. That way it can handle its mailbox in whatever way is more efficient (say a gen_fsm would spawn a function that ran pmap and sent an event back to it, or a gen_server would send itself back a call, etc).
In the case of initialization, we have already been blocking here for hours at a time without processing messages, so I don't think that this particular case is any worse. It should be a lot better now just due to the fact that it will likely block for a 10th of the time it did before.
I'm a bit curious about the […]. Am I mistaken? Or is the entire […]?
@jtuple about the started state: my reading of the OTP code was that timeout is simply the timeout of the receive in the main loop, which is executed after acking the process initialization. This is the race condition I was worried about: somebody initializing the vnode asynchronously and sending it a message immediately (or another process sending a request right then). That message could potentially arrive before the receive code in gen_fsm is reached, and so it would be processed and passed to the initial state (see the loop function in gen_fsm.erl). I think this would only happen once in a blue moon, but it could, unless I missed something.
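To make the pattern under discussion concrete, here is a rough, self-contained gen_fsm sketch of a decoupled "started" state. It is not the actual riak_core_vnode code: init/1 acks immediately and defers the expensive work to a zero timeout, and a synchronous wait_for_init call lets callers block until that work has run. The catch-all clause shows exactly the race described above: any event that arrives first cancels the timeout and is delivered to the started state.

```erlang
-module(deferred_init_fsm).
-behaviour(gen_fsm).

-export([start_link/1, wait_for_init/1]).
-export([init/1, started/2, started/3, ready/2, ready/3,
         handle_event/3, handle_sync_event/4, handle_info/3,
         terminate/3, code_change/4]).

start_link(Index) ->
    gen_fsm:start_link(?MODULE, [Index], []).

%% Block the caller until the expensive initialization has completed.
wait_for_init(Pid) ->
    gen_fsm:sync_send_event(Pid, wait_for_init, infinity).

%% Ack the starter right away; the real work happens when the zero
%% timeout fires in the 'started' state.
init([Index]) ->
    {ok, started, Index, 0}.

started(timeout, Index) ->
    ok = do_expensive_init(Index),
    {next_state, ready, Index};
started(_EarlyEvent, Index) ->
    %% Any other event delivered before the timeout cancels it (this is
    %% the race described above), so initialize here before carrying on.
    %% A real implementation would also have to handle or requeue the
    %% early event itself.
    ok = do_expensive_init(Index),
    {next_state, ready, Index}.

started(wait_for_init, _From, Index) ->
    %% A caller beat the timeout: initialize now, then reply.
    ok = do_expensive_init(Index),
    {reply, ok, ready, Index}.

ready(_Event, Index) ->
    {next_state, ready, Index}.

ready(wait_for_init, _From, Index) ->
    {reply, ok, ready, Index}.

handle_event(_Event, StateName, Index) -> {next_state, StateName, Index}.
handle_sync_event(_Event, _From, StateName, Index) -> {reply, ok, StateName, Index}.
handle_info(_Info, StateName, Index) -> {next_state, StateName, Index}.
terminate(_Reason, _StateName, _Index) -> ok.
code_change(_OldVsn, StateName, Index, _Extra) -> {ok, StateName, Index}.

do_expensive_init(_Index) ->
    %% Placeholder for opening a backend, scanning data, etc.
    ok.
```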
@jtuple Just pushed changes to address your comments (with some stuff left for later). Please let me know if you agree or have objections and I will address them immediately.
@jtuple BTW, don't forget this now has a related PR in riak_kv that is needed for this to work: basho/riak_kv#511
@jtuple I went ahead and changed pmap to avoid using lists and lists:member (it uses sets now). The EUnit test passes. Hopefully this addresses some of your concerns. The list of pending work was bounded by max_concurrent anyway, but I guess why use lists:member when you don't have to...
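As a tiny illustration of the data-structure change (a fragment, not the pmap code itself; the helper names are made up), membership checks on the pending set no longer walk a list:

```erlang
%% Track in-flight worker pids in a set rather than a list, so
%% membership checks do not scan the whole pending collection.
add_pending(Pid, Pending)    -> sets:add_element(Pid, Pending).
is_pending(Pid, Pending)     -> sets:is_element(Pid, Pending).  % vs lists:member/2
remove_pending(Pid, Pending) -> sets:del_element(Pid, Pending).
```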
Like how you refactored to reuse […]. Re-tested things manually. Also tested with and without the accompanying riak_kv change. +1