
Refactor riak_core vnode management (part 1) #106

Merged: 4 commits merged into master on Dec 13, 2011

Conversation

@jtuple (Contributor) commented Oct 31, 2011

Move vnode spawning and pid tracking logic from riak_core_master to
riak_core_manager. This change sets the stage for future work that will
make the vnodes more pure, with state transitions coordinated by the
vnode manager. A side-effect is that there is now a central process
managing pids for all registered vnode modules, rather than one master
per vnode module.

Add registered and supervised vnode proxy processes. Requests can be
sent directly to a vnode proxy process for a desired node/index/module
and the request will be forwarded to the proper vnode. The proxy process
coordinates with the vnode manager to know the proper pid for routing,
and monitors the vnode pid in order to clean up on shutdown/crashes. While
vnodes may be spun up and down on demand by the vnode manager, the relevant
proxy processes are always alive and can be counted on for direct routing.
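
Roughly, the proxy's forwarding loop looks like the following sketch (simplified and illustrative only; the record fields and the riak_core_vnode_manager:get_vnode_pid/2 call are stand-ins, not the exact code):

-record(state, {mod, index, vnode_pid}).

handle_cast(Req, State=#state{mod=Mod, index=Index, vnode_pid=undefined}) ->
    %% No cached vnode pid: ask the vnode manager for (or to start) the
    %% vnode, then monitor it so a crash/shutdown clears the cache.
    {ok, Pid} = riak_core_vnode_manager:get_vnode_pid(Index, Mod),
    erlang:monitor(process, Pid),
    gen_fsm:send_event(Pid, Req),
    {noreply, State#state{vnode_pid=Pid}};
handle_cast(Req, State=#state{vnode_pid=Pid}) ->
    %% Cached pid: forward directly to the running vnode.
    gen_fsm:send_event(Pid, Req),
    {noreply, State}.

handle_info({'DOWN', _Ref, process, Pid, _Reason},
            State=#state{vnode_pid=Pid}) ->
    %% Vnode exited; forget the pid so the next request re-resolves it.
    {noreply, State#state{vnode_pid=undefined}}.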

Change riak_core_vnode_master to use the new vnode proxy support in most
cases. All requests that are routed through a vnode master process will
now use the proxy logic to dispatch requests. Likewise, the non-sync API
exposed by the vnode master module no longer routes through the master
itself, but directly dispatches requests to the proxy processes. Sync
commands still route through the master, which then routes through the
proxy in handle_call. A side-effect is that sync commands now require
three local hops (master, proxy, vnode-pid) rather than two (master,
vnode-pid).

A few notes. You need to add {legacy_vnode_routing, false} to the riak_core portion of app.config if you want the direct routing (i.e. bypassing vnode_master) behavior; this is necessary for rolling upgrade support. With or without the setting, the vnode master will always route through the proxies, but the setting determines whether non-sync requests are sent directly to a proxy or first through the vnode master (the latter being safe for old nodes that have a vnode master but no proxies).
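
For example, the relevant app.config excerpt would look like this (an excerpt only; any other riak_core settings are omitted):

%% app.config excerpt: opt in to direct proxy routing once every node
%% in the cluster has been upgraded.
[
 {riak_core, [
     {legacy_vnode_routing, false}
 ]}
].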

The changes affect the underlying vnode behavior of riak_core, so for testing I just went with basho_expect using this branch for stage injection. The individual basho_expect node tests seem to pass the same as they do on master (i.e. the same tests fail that fail on master for me). The cluster tests are hit or miss: a few failed, but I think it may have been an issue with the stage beam-clobbering approach. I'm re-running with a package containing my changes to test things out.

I'll likely add an automated integration test that monitors trace messages to determine that requests are proxied as desired. For now, I've manually done something similar with redbug (monitoring the master, manager, and proxy process handle_calls/casts) and everything looks good.

<- supervisor:which_children(riak_core_vnode_sup)],
IdxTable = ets:new(ets_vnodes, [{keypos, 2}]),

%% In case this the vnode master is being restarted, scan the existing
@rzezeski (Contributor) commented on the diff:

s/master/manager/

@rzezeski (Contributor)

I'm probably not fully appreciating the interaction between vnodes/handoff/manager/etc., but why have vnode_proxy at all? Could we simply register the vnode directly under the <Mod>_<Index> name?

@jtuple (Contributor, Author) commented Nov 14, 2011

Vnodes are dynamic and come and go. If we only registered the vnode itself, you could only send requests to already-running vnodes -- not to vnodes that aren't currently running. That may work for primary vnodes, but not for secondary/fallback vnodes, which are only started when a request is sent to them.

@rzezeski (Contributor)

EDIT: Disregard this; it seems the register hack will become unnecessary once we have cached preflists in place.

So we register each proxy vnode under its unique Mod/Idx. This, of course, requires a process and an atom for each registration. Given a larger ring size like 2048 and 3 vnode behaviors, we are looking at 3*2048 extra processes and atoms. Processes are cheap (though not free), so I'm not too concerned there, but atoms are never reclaimed. It's a general guideline to avoid runtime atom creation; atoms should mainly be used as constants/enumerations (i.e. for their own value). Like all guidelines, this one is meant to be broken now and again. There is probably enough space in atom-land for the current implementation to get along just fine (you'd have to really up the ring size and vnode behaviors). However, we have hinted at a dynamic ring size in the future, which would potentially mean dynamic/changing index numbers, which could lead to the atom table filling up. It's probably a case of crossing the bridge when we come to it, but I thought I'd share anyway.
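
For concreteness, the scheme necessarily mints one atom at runtime per {Mod, Index}, something like this sketch (the "proxy_" prefix is illustrative; this is not the exact riak_core code):

%% One runtime-created atom per proxy; with ring size 2048 and
%% 3 vnode modules, that's 3 * 2048 = 6144 atoms.
reg_name(Mod, Index) ->
    list_to_atom("proxy_" ++ atom_to_list(Mod)
                 ++ "_" ++ integer_to_list(Index)).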

end,
Pid.

stop_proxy(Mod, Index) ->
@rzezeski (Contributor) commented on the diff:
This isn't used anywhere. From what I understand you always want all vnode proxies running so you would never actually stop them. So why have this API?

@rzezeski (Contributor)

After running some local benchmarks on my MBA, I'm a little worried by the results, given that I expected this change, if anything, to increase throughput and reduce latency. These graphs were made using the compare script for basho_bench; they compare various bench configs before and after this patch is applied.

Just to verify I didn't cross my wires when generating these graphs, here's what I used to generate the "Read Only" graph.

./priv/compare.r -i tests/get-only-before-vnode-proxy/ -j 'get-only-before-proxy' -k tests/get-only-after-vnode-proxy -l 'get-only-after-proxy'

Read/Write #1 (pareto / 100K keys / 4KB values)

[graph: rw-1]

Read/Write #2 (same as above, just ran again)

[graph: rw-2]

Write Only (load 100K 4KB values using partitioned_int and 4 workers)

[graph: load]

Read Only (pareto on the dataset generated by the previous load)

[graph: read]

@rzezeski (Contributor)

I forgot to set legacy_vnode_routing to false, heh. Rerunning benchmarks. Puts dunce hat on and stands in corner

EDIT: Then again, it's also nice to have a rough approximation of how the extra hop affects things when legacy routing is in play (i.e. during a rolling upgrade).

@rzezeski (Contributor)

So the throughput is worse with legacy routing disabled. I'm not sure why this is so. In the meantime here are more graphs.

[graph: new routing 1]

[graph: new routing 2]

@rzezeski (Contributor)

More of the same, this time on EC2 using 5 nodes (4 Riak / 1 basho_bench).

Before Proxy

[graph: before]

After Proxy

[graph: after]

Comparison

[graph: compare]

@rzezeski (Contributor)

I forgot to add this before; here is a filtered/truncated fprof analysis that I ran on my local benchmark with the vnode proxy enabled. Notice that the accumulated time for reg_name is ~10% of the total accumulated time (21194 / 211882). I wonder if the cached preflist change would bring it back to 100%+ throughput? Also notice that half the time is spent in suspend; I would assume that's waiting on IO.

[{ totals,                                     1799476,61052.826,95483.357}].  %%%
{ {proc_lib,init_p_do_apply,3},               1270,211882.926,  126.633},     %
{ {gen_server,loop,6},                        2540,121746.335,  672.655},     %
{ {gen_server,decode_msg,8},                  2540,121744.697,  317.534},     %
{ {gen_server,handle_msg,5},                  2540,121744.636,  757.012},     %
{ {gen_server,handle_common_reply,6},         1270,121614.336,  109.907},     %
{ suspend,                                    6361,117250.935,    0.000},     %
{ {proc_lib,init_p,5},                        1270,90987.957,  367.902},     %
{ {gen,init_it,6},                            1270,89142.467,   62.885},     %
{ {gen,init_it2,7},                           1270,89079.582,   47.819},     %
{ {gen_fsm,init_it,6},                        1270,89031.763,  255.995},     %
{ {gen_fsm,loop,7},                           7878,88360.083,  580.053},     %
{ {gen_fsm,decode_msg,9},                     7878,88295.455,  412.694},     %
{ {gen_fsm,handle_msg,7},                     7878,88229.291,  956.051},     %
{ {gen_fsm,dispatch,4},                       7878,82296.616,  354.678},     %
{ {riak_kv_get_fsm,waiting_read_repair,2},    2042,33119.713,  360.769},     %
{ {riak_kv_get_fsm,maybe_finalize,1},         3063,32408.216,  369.776},     %
{ {riak_kv_get_fsm,finalize,1},               1021,31887.638,  218.214},     %
{ {riak_kv_get_core,final_action,1},          1021,31622.796,  338.295},     %
{ {riak_kv_get_core,merge,2},                 2042,31603.828,  360.533},     %
{ {riak_object,reconcile,2},                  1700,29280.103,  592.335},     %
{ {riak_object,reconcile,1},                  1700,26111.185,  391.875},     %
{ {riak_core_vnode_master,command,4},         4828,24554.530,  293.710},     %
{ {riak_core_vnode_master,proxy_cast,2},      3621,24192.150,  100.502},     %
{ {riak_core_vnode_master,do_proxy_cast,2},   3621,24090.983,  311.965},     %
{ {riak_kv_get_fsm,execute,2},                1021,21699.999,  110.648},     %
{ {riak_kv_vnode,get,3},                      1021,21284.489,   28.817},     %
{ {riak_core_vnode_proxy,reg_name,3},         3621,21194.938,  143.212},     %
{ {riak_core_vnode_proxy,reg_name,2},         3621,21051.726,  268.916},     %
{ {supervisor,handle_call,3},                 1270,15652.095,  648.218},     %
{ {supervisor,do_start_child_i,3},            1270,13843.279,  399.212},     %
{ {gen_fsm,start_link,3},                     1270,13364.503,   94.683},     %
{ {gen,start,5},                              1270,13269.820,   78.968},     %
{ {gen,do_spawn,5},                           1270,13190.852,  184.821},     %
{ {io_lib,format,2},                          3621,12641.794,  143.492},     %
{ {proc_lib,start_link,5},                    1270,12585.118,  173.877},     %
{ {io_lib_format,fwrite,2},                   3621,12498.302,  204.149},     %
{ {proc_lib,sync_wait,2},                     1270,11157.735,  429.249},     %
{ {riak_kv_get_fsm,start_link,4},             1021,10960.117,   53.020},     %
{ {riak_object,ancestors,1},                  1700,10180.839,  144.214},     %
{ {riak_object,'-ancestors/1-lc$^0/1-0-',2},  5100, 9355.208,  507.159},     %
{ {vclock,descends,2},                        67496, 9141.924, 6793.778},     %
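
(For reference, an analysis like the above can be captured roughly as follows; this is a sketch of the general fprof recipe, not necessarily the exact invocation used here:)

%% Trace all processes for a window during the benchmark run, then
%% profile and analyse; fprof:profile/0 reads the default fprof.trace.
fprof:trace([start, {procs, all}]),
timer:sleep(60000),
fprof:trace(stop),
fprof:profile(),
fprof:analyse([{dest, "fprof.analysis"}]).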

@jtuple (Contributor, Author) commented Nov 22, 2011

I confirmed the throughput regression and looked into improving the speed of riak_core_vnode_proxy:reg_name, given the ~10% fprof result. I've pushed a new commit with a reg_name that is 4x faster than the old one when tested in isolation, which brings throughput nearly even before/after. If these results are reasonable, we can look into further improvements after the next release. Please re-confirm the new performance results, just to be sure I didn't screw things up.
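
For the curious, here is a plausible shape for such a speedup, given that the fprof output shows reg_name going through io_lib:format (a sketch only; see the actual commit for the real change):

%% Old shape (per the fprof trace): atom built via io_lib:format.
reg_name_slow(Mod, Index) ->
    list_to_atom(lists:flatten(io_lib:format("proxy_~s_~b", [Mod, Index]))).

%% Cheaper shape: build the name from binaries, skipping the
%% io_lib formatting machinery entirely.
reg_name_fast(Mod, Index) ->
    ModBin = atom_to_binary(Mod, latin1),
    IdxBin = list_to_binary(integer_to_list(Index)),
    binary_to_atom(<<"proxy_", ModBin/binary, "_", IdxBin/binary>>, latin1).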

Confirm regression (1:1 get:put / pareto / 100K keys / 4KB values)

[graph: rw-3]

Results with new reg_name

[graph: rw-1]

Another run with the new reg_name

[graph: rw-2]

Old reg_name versus new reg_name

[graph: rw-4]

@rzezeski (Contributor)

I've confirmed that with the new reg_name the performance is 1:1 before/after. Personally, I'm a little wary of this change since it adds more moving parts for what seems like no short-term gain. But I understand this is meant as a long-term improvement, so I'll just step in line.

Comparison (pareto 4/1 read/write ratio)

[graph: comparison]

@jtuple merged commit 5d1fa1b into master on Dec 13, 2011