Porting parallel vnode init fix to 1.3 + revert switch #281
Conversation
Added an extra state to riak_core_vnode to decouple vnode initialization from process creation, plus a function to block until initialization is truly finished. The vnode init call now takes either a single index or a list of indices. The list version creates the vnode processes first and then waits for them to initialize, which allows initialization to happen in parallel. Used the above in riak_core_ring_handler when starting services on the first ring event. Tests show I/O saturation at bitcask startup now, instead of the serialized trickle we had before. This code still needs some work to handle edge cases, especially around how to handle vnode initialization failing.

Add concurrency control to vnode initialization: a new pmap with bounded concurrency utility was added, with unit tests.

Fix cluster info to query the vnode manager, not the supervisor: with the parallel vnode change, there is more reason not to query the supervisor directly, as its children may not have finished initialization yet.
This allows parallel initialization of vnodes without a backwards-incompatible API change. To opt in, the vnode module has to implement start_vnodes/1.
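For illustration only, here is a minimal sketch of what opting in might look like in an application vnode module. The module name is hypothetical, and delegating the index list to riak_core_vnode_master:get_vnode_pid/2 is an assumption modeled on how start_vnode/1 is usually written; the dispatch code quoted below only checks for the start_vnodes/1 export.

```erlang
-module(myapp_vnode).
%% The usual riak_core_vnode callbacks (init/1, handle_command/3, ...) are
%% omitted from this sketch.
-export([start_vnode/1, start_vnodes/1]).

%% Existing single-index entry point.
start_vnode(Index) ->
    riak_core_vnode_master:get_vnode_pid(Index, ?MODULE).

%% Optional new entry point: hand over the whole batch of indices so the
%% vnodes can be created and initialized in parallel.
%% (Assumption: after this change get_vnode_pid/2 also accepts a list.)
start_vnodes(IndexList) ->
    riak_core_vnode_master:get_vnode_pid(IndexList, ?MODULE).
```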
Overall, things work as intended and there don't appear to be any major issues, though I have a few minor comments. Also, given that this work has already merged into […], the main issue is the change to:

```erlang
case lists:member({start_vnodes, 1}, Mod:module_info(exports)) of
    true ->
        Mod:start_vnodes(Startable);
    false ->
        [Mod:start_vnode(I) || I <- Startable]
end.
```

My remaining comments are best left as line notes. Going to go add those now.
src/riak_core_util.erl
I'm not a fan of using selective receive on Pid here. That can lead to O(N^2) mailbox scans, as well as keeping items in the mailbox longer than necessary. To illustrate, let's say we have pids A-D, but due to worker timing or whatever we receive messages in the order D-A. The first iteration of Collect is going to do a receive {pmap_result, <A>, R}, skip over messages D-B, find A, return. Then for B it'll start over, re-scan until it finds B at the end, return. Rinse/repeat for the rest. Likewise, if A took longer than the other workers, we'd stick around waiting for A while keeping D-B messages sitting in the mailbox the entire time.
For vnode init, this may not be a major issue. But, given that pmap/3 is designed to be a general purpose utility function, we might as well consider what happens if someone decides to use 1000 workers or something. Really don't want to scan 1 million messages in the worst case.
Any reason you can't just do a receive on {pmap_result, _, R}, thus handling each response in the order they are sent? It looks like you end up sorting the results on line 297 so you're not particularly sensitive to how you end up constructing All.
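For what it's worth, a minimal sketch of that alternative, assuming the {pmap_result, Pid, Result} message shape described above (the function name and arguments are made up): take whichever result arrives next and count down, rather than receiving per pid.

```erlang
%% Collect N outstanding results in arrival order; the caller can sort
%% the combined list afterwards if ordering matters.
collect(0, Acc) ->
    Acc;
collect(NumPending, Acc) ->
    receive
        {pmap_result, _Pid, Result} ->
            collect(NumPending - 1, [Result | Acc])
    end.
```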
On a different note, I really wish we could use optimized receives in both this pmap/3 as well as the older pmap/2. But the strict requirements on how you must structure your code to get the Erlang compiler to use the optimized receive won't work for the way either of these pmap implementations is coded. This is unfortunate, as each receive for the pmap_result is going to scan over all non-pmap messages in the mailbox. This is kinda annoying for the riak_core_vnode_manager, which regularly receives 100s to 10000s of messages and is one of the top serialization points in all of Riak.
One option would be to do the pmap itself in a separate process, i.e. calling pmap spawns a process that then spawns the workers, waits for results, then sends the combined results back to the original process. This would decouple the calling process's mailbox from the pmap process. Anyway, food for thought.
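A rough sketch of that last idea, assuming the pmap/3 signature discussed here (fun, list of inputs, max concurrency); the wrapper name is made up. The whole pmap runs in a throwaway process, so a busy caller only ever has to match a single reply (plus the monitor, in case the helper dies).

```erlang
%% Hypothetical wrapper: isolate pmap's mailbox traffic from the caller.
pmap_isolated(Fun, List, MaxConcurrent) ->
    Caller = self(),
    Ref = make_ref(),
    {Pid, MRef} =
        spawn_monitor(
          fun() ->
                  Caller ! {Ref, riak_core_util:pmap(Fun, List, MaxConcurrent)}
          end),
    receive
        {Ref, Results} ->
            erlang:demonitor(MRef, [flush]),
            Results;
        {'DOWN', MRef, process, Pid, Reason} ->
            exit({pmap_failed, Reason})
    end.
```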
Collect only handles the last batch of work (a number of messages never bigger than max_concurrent). Within the main fold, messages are received in the order they arrived (see line 328). I can fix the collection of that last batch for 1.4, but I could see that being something easy to mess up right before shipping this.
I'm aware of the perils of selective receives in this general purpose utility, but thought that any application sensitive to that should run pmap in a separate process that will only collect pmap messages anyway. That way it can handle its mailbox in whatever way is more efficient (say a gen_fsm would spawn a function that ran pmap and sent an event back to it, or a gen_server would send itself back a call, etc).
In the case of initialization, we have already been blocking here for hours at a time without processing messages, so I don't think that this particular case is any worse. It should be a lot better now just due to the fact that it will likely block for a 10th of the time it did before.
I'm a bit curious about the […]. Am I mistaken? Or is the entire […]?
@jtuple about the started state: my reading of the OTP code was that timeout is simply the timeout of the receive in the main loop, which is executed after acking the process initialization. This is the race condition I was worried about: somebody initializing the vnode asynchronously and sending it a message immediately (or another process sending a request right then). That message could potentially arrive before the receive code in gen_fsm is reached, and so it would be processed and passed to the initial state (see the loop function in gen_fsm.erl). I think this would only happen once in a blue moon, but it could, unless I missed something.
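To make the pattern under discussion concrete, here is a rough, self-contained gen_fsm sketch of a decoupled "started" state. It is not the actual riak_core_vnode code: init/1 acks immediately and defers the expensive work to a zero timeout, and a synchronous wait_for_init call lets callers block until that work has run. The catch-all clause shows exactly the race described above: any event that arrives first cancels the timeout and is delivered to the started state.

```erlang
-module(deferred_init_fsm).
-behaviour(gen_fsm).

-export([start_link/1, wait_for_init/1]).
-export([init/1, started/2, started/3, ready/2, ready/3,
         handle_event/3, handle_sync_event/4, handle_info/3,
         terminate/3, code_change/4]).

start_link(Index) ->
    gen_fsm:start_link(?MODULE, [Index], []).

%% Block the caller until the expensive initialization has completed.
wait_for_init(Pid) ->
    gen_fsm:sync_send_event(Pid, wait_for_init, infinity).

%% Ack the starter right away; the real work happens when the zero
%% timeout fires in the 'started' state.
init([Index]) ->
    {ok, started, Index, 0}.

started(timeout, Index) ->
    ok = do_expensive_init(Index),
    {next_state, ready, Index};
started(_EarlyEvent, Index) ->
    %% Any other event delivered before the timeout cancels it (this is
    %% the race described above), so initialize here before carrying on.
    %% A real implementation would also have to handle or requeue the
    %% early event itself.
    ok = do_expensive_init(Index),
    {next_state, ready, Index}.

started(wait_for_init, _From, Index) ->
    %% A caller beat the timeout: initialize now, then reply.
    ok = do_expensive_init(Index),
    {reply, ok, ready, Index}.

ready(_Event, Index) ->
    {next_state, ready, Index}.

ready(wait_for_init, _From, Index) ->
    {reply, ok, ready, Index}.

handle_event(_Event, StateName, Index) -> {next_state, StateName, Index}.
handle_sync_event(_Event, _From, StateName, Index) -> {reply, ok, StateName, Index}.
handle_info(_Info, StateName, Index) -> {next_state, StateName, Index}.
terminate(_Reason, _StateName, _Index) -> ok.
code_change(_OldVsn, StateName, Index, _Extra) -> {ok, StateName, Index}.

do_expensive_init(_Index) ->
    %% Placeholder for opening a backend, scanning data, etc.
    ok.
```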
@jtuple Just pushed changes to address your comments (with some stuff left for later). Please let me know if you agree or have objections and I will address them immediately.
@jtuple BTW, don't forget this now has a related PR in riak_kv that is needed for this to work: basho/riak_kv#511
@jtuple I went ahead and changed pmap to avoid using lists and lists:member (it uses sets now). The EUnit test passes. Hopefully this addresses some of your concerns. The list of pending work was bounded by max_concurrent anyway, but I guess why use lists:member when you don't have to...
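As a tiny illustration of the data-structure change (a fragment, not the pmap code itself; the helper names are made up), membership checks on the pending set no longer walk a list:

```erlang
%% Track in-flight worker pids in a set rather than a list, so
%% membership checks do not scan the whole pending collection.
add_pending(Pid, Pending)    -> sets:add_element(Pid, Pending).
is_pending(Pid, Pending)     -> sets:is_element(Pid, Pending).  % vs lists:member/2
remove_pending(Pid, Pending) -> sets:del_element(Pid, Pending).
```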
Like how you refactored to reuse […]. Re-tested things manually. Also tested with and without the accompanying riak_kv change. +1