
Porting parallel vnode init fix to 1.3 + revert switch #281

Merged

engelsanchez merged 5 commits into 1.3 from eas-parallel-vnode-init-backport on Mar 21, 2013

Conversation

@engelsanchez
Contributor

No description provided.

Added an extra state to riak_core_vnode to decouple vnode initialization
from process creation.
Added a function to block until initialization is truly finished.
The vnode init call now takes either a list of indices or a single index.
The list version creates vnode processes first, then waits for them to
initialize, which allows it to happen in parallel.
Used the above in riak_core_ring_handler when starting services
on first ring event.
Tests show I/O saturation at bitcask startup now, instead of the
serialized trickle we had.
This code still needs some work to handle edge cases, especially around
how to handle vnode initialization failures.

Add concurrency control to vnode initialization
New pmap with bounded concurrency utility added with unit tests.

Fix cluster info to query vnode manager, not sup

With the parallel vnode change, there is more reason to not query the
supervisor directly, as its children may not have finished
initialization yet.
ghost assigned jtuple on Mar 11, 2013
Allow parallel initialization of vnodes without changing the API in a
backwards-incompatible way. To use it, the vnode module has to
implement start_vnodes/1.
@jtuple
Contributor

jtuple commented Mar 18, 2013

Overall, things work as intended and there don't appear to be any major issues, though I do have a few minor comments.

Also, given that this work has already been merged into master and this is a backport, it's unclear which changes we want to make here versus on master. We could ship with this PR and change master later, or change things both here and on master. Not sure what works best.


The main issue is the change to Mod:start_vnode, which moves from being passed a single index to being passed a list of indices. This assumes that such a change is safe for all riak_core apps. Currently, this isn't safe for riak_search, but you already have a simple PR to fix that. Of course, we have no idea what people are doing with riak_core in the wild. One option is to instead keep the semantics of start_vnode the same, add a new function such as start_vnodes that accepts a list, and then fall back to start_vnode if the module in question doesn't implement start_vnodes. Of course, this requires all riak_core apps to be modified with this new start_vnodes function in order to use parallel vnode initialization. On the other hand, it does make it so this is no longer an API-breaking change. The main idea would be to move to something like:

case lists:member({start_vnodes, 1}, Mod:module_info(exports)) of
    true ->
        Mod:start_vnodes(Startable);
    false ->
        [Mod:start_vnode(I) || I <- Startable]
end.

My remaining comments are best left as line notes. Going to go add those now.

Contributor

I'm not a fan of using selective receive on Pid here. That can lead to O(N^2) mailbox scans, as well as keeping items in the mailbox longer than necessary. To illustrate, let's say we have pids A-D, but due to worker timing or whatever we receive messages in the order D-A. The first iteration of Collect is going to do a receive {pmap_result, <A>, R}, skip over messages D-B, find A, and return. Then for B it'll start over, re-scan until it finds B at the end, and return. Rinse/repeat for the rest. Likewise, if A took longer than the other workers, we'd stick around waiting for A while keeping the D-B messages sitting in the mailbox the entire time.

For vnode init, this may not be a major issue. But, given that pmap/3 is designed to be a general purpose utility function, we might as well consider what happens if someone decides to use 1000 workers or something. Really don't want to scan 1 million messages in the worst case.

Any reason you can't just do a receive on {pmap_result, _, R}, thus handling each response in the order they are sent? It looks like you end up sorting the results on line 297 so you're not particularly sensitive to how you end up constructing All.
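Roughly what I have in mind (just a sketch, not the PR's code; the {pmap_result, ...} message shape and the Outstanding count are assumed here for illustration):

collect_in_arrival_order(0, Acc) ->
    Acc;
collect_in_arrival_order(Outstanding, Acc) when Outstanding > 0 ->
    receive
        %% Match any worker's result, so each message is consumed as it
        %% arrives and nothing lingers at the back of the mailbox.
        {pmap_result, _Pid, Result} ->
            collect_in_arrival_order(Outstanding - 1, [Result | Acc])
    end.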

On a different note, I really wish we could use optimized receives in both this pmap/3 as well as the older pmap/2. But the strict requirements on how you must structure your code to get the Erlang compiler to use the optimized receive won't work for the way either of these pmap implementations is coded. This is unfortunate, as each receive for the pmap_result is going to scan over all non-pmap messages in the mailbox. This is kinda annoying for the riak_core_vnode_manager, which regularly receives 100s to 10000s of messages. One of the top serialization points in all of Riak.

One option would be to do the pmap itself in a separate process, i.e. calling pmap spawns a process that then spawns the workers, waits for results, then sends the combined results back to the original process. This would decouple the calling process's mailbox from the pmap process. Anyway, food for thought.
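A rough sketch of that middleman idea, assuming the new pmap/3 has the shape pmap(Fun, List, MaxConcurrent) (the pmap_isolated name is made up for illustration):

pmap_isolated(Fun, List, MaxConcurrent) ->
    Caller = self(),
    Ref = make_ref(),
    %% Run the whole pmap inside a throwaway process; only the final
    %% combined result ever lands in the caller's mailbox.
    spawn_link(fun() ->
                       Caller ! {Ref, pmap(Fun, List, MaxConcurrent)}
               end),
    receive
        {Ref, Results} -> Results
    end.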

Contributor Author

Collect only handles the last batch of work (a number of messages never bigger than max_concurrent). Within the main fold, messages are received in the order they arrive (see line 328). I can fix this collection of the last batch for 1.4, but changing it now seems like something that would be easy to mess up right before shipping this.

I'm aware of the perils of selective receives in this general-purpose utility, but I thought that any application sensitive to that should run pmap in a separate process that will only collect pmap messages anyway. That way it can handle its mailbox in whatever way is most efficient (say, a gen_fsm would spawn a function that runs pmap and sends an event back to it, or a gen_server would send itself back a call, etc.).
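For a gen_server, for example, it could look something like this (just a sketch; the message shapes and handle_pmap_done/2 are made up, and pmap/3 stands in for the new bounded-concurrency pmap):

handle_call({run_pmap, Fun, List, MaxConcurrent}, _From, State) ->
    Server = self(),
    %% Offload the pmap to a helper process so this server's mailbox never
    %% sees the per-worker messages; the helper casts the result back.
    spawn_link(fun() ->
                       Results = pmap(Fun, List, MaxConcurrent),
                       gen_server:cast(Server, {pmap_done, Results})
               end),
    {reply, started, State}.

handle_cast({pmap_done, Results}, State) ->
    {noreply, handle_pmap_done(Results, State)}.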

In the case of initialization, we have already been blocking here for hours at a time without processing messages, so I don't think that this particular case is any worse. It should be a lot better now just due to the fact that it will likely block for a 10th of the time it did before.

@jtuple
Contributor

jtuple commented Mar 18, 2013

I'm a bit curious about the started(timeout, ...) and started(wait_for_init, ...) cases in riak_core_vnode. It seems to me that we'll always transition into started(timeout, ...) immediately from init before ever receiving any wait_for_init messages, and the call to riak_core_vnode:wait_for_init in the vnode manager ends up just hitting the default case in the active state here.

Am I mistaken? Or is the entire wait_for_init code path dead?

@engelsanchez
Contributor Author

@jtuple About the started state: my reading of the OTP code was that the timeout is simply the timeout of the receive in the main loop, which is executed after acking the process initialization. This is the race condition I was worried about: somebody initializing the vnode asynchronously and sending it a message immediately (or another process sending a request right then). That message could potentially arrive before the receive code in gen_fsm is reached, and so it would be processed and passed to the initial state (see the loop function in gen_fsm.erl). I think this would only happen once in a blue moon, but it could, unless I missed something.
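To illustrate with a deliberately simplified gen_fsm fragment (not the actual riak_core_vnode code; do_real_init/1 is a stand-in, and wait_for_init is shown as a plain async event):

init(Args) ->
    %% Ack process creation right away; the real init work is deferred.
    {ok, started, Args, 0}.

started(timeout, State) ->
    %% Usual path: the zero timeout fires before anything else arrives.
    {next_state, active, do_real_init(State)};
started(wait_for_init, State) ->
    %% The race: a message already queued when gen_fsm reaches its receive
    %% loop is delivered here instead of the timeout event.
    {next_state, active, do_real_init(State)}.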

@engelsanchez
Contributor Author

@jtuple Just pushed changes to address your comments (with some stuff left for later). Please let me know if you agree or have objections and I will address them immediately.

@engelsanchez
Contributor Author

@jtuple BTW, don't forget that this now has a related PR in riak_kv that is needed for this to work: basho/riak_kv#511

@engelsanchez
Contributor Author

@jtuple I went ahead and changed pmap to avoid using lists and lists:member (it uses sets now). The EUnit test passes. Hopefully this addresses some of your concerns. The list of pending work was bounded by max_concurrent anyway, but I guess there's no reason to use lists:member when you don't have to...
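The shape of the change, roughly (hypothetical fragment, not the exact code): the outstanding workers are tracked in a set, so checking and removing a finished worker no longer walks a list.

collect_pending(Pending, Acc) ->
    case sets:size(Pending) of
        0 ->
            Acc;
        _ ->
            receive
                {pmap_result, Pid, Result} ->
                    %% Pid must be one of the workers we are still waiting on.
                    true = sets:is_element(Pid, Pending),
                    collect_pending(sets:del_element(Pid, Pending),
                                    [Result | Acc])
            end
    end.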

@jtuple
Contributor

jtuple commented Mar 20, 2013

I like how you refactored to reuse parallel_collect_one, and the <= 1 option for vnode_parallel_start to disable the changes is a nice approach. Everything else looks great as well.

Re-tested things manually. Also tested with and without the accompanying riak_kv PR and verified that start_vnode / start_vnodes was called correctly depending on the code.

+1

engelsanchez added a commit that referenced this pull request Mar 21, 2013
Porting parallel vnode init fix to 1.3 + revert switch
engelsanchez merged commit 5893d71 into 1.3 on Mar 21, 2013
engelsanchez deleted the eas-parallel-vnode-init-backport branch on March 21, 2013 at 00:10