
Refactor riak_core vnode management (part 1) #106

Merged: 4 commits merged into master on Dec 13, 2011

Conversation

@jtuple (Contributor) commented Oct 31, 2011

Move vnode spawning and pid tracking logic from riak_core_master to
riak_core_manager. This change sets the stage for future work that will
make the vnodes more pure, with state transitions coordinated by the
vnode manager. A side-effect is that there is now a central process
managing pids for all registered vnode modules, rather than one master
per vnode module.

Add registered and supervised vnode proxy processes. Requests can be
sent directly to a vnode proxy process for a desired node/index/module
and the request will be forwarded to the proper vnode. The proxy process
coordinates with the vnode manager to know the proper pid for routing,
and monitors the vnode pid in order to clean up on shutdown/crashes. While
vnodes may be spun up and down on demand by the vnode manager, the relevant
proxy processes are always alive and can be counted on for direct routing.
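
Roughly, the proxy's forwarding loop looks like the following sketch (simplified and illustrative only; the record fields and the riak_core_vnode_manager:get_vnode_pid/2 call are stand-ins, not the exact code):

-record(state, {mod, index, vnode_pid}).

handle_cast(Req, State=#state{mod=Mod, index=Index, vnode_pid=undefined}) ->
    %% No cached vnode pid: ask the vnode manager for (or to start) the
    %% vnode, then monitor it so a crash/shutdown clears the cache.
    {ok, Pid} = riak_core_vnode_manager:get_vnode_pid(Index, Mod),
    erlang:monitor(process, Pid),
    gen_fsm:send_event(Pid, Req),
    {noreply, State#state{vnode_pid=Pid}};
handle_cast(Req, State=#state{vnode_pid=Pid}) ->
    %% Cached pid: forward directly to the running vnode.
    gen_fsm:send_event(Pid, Req),
    {noreply, State}.

handle_info({'DOWN', _Ref, process, Pid, _Reason},
            State=#state{vnode_pid=Pid}) ->
    %% Vnode exited; forget the pid so the next request re-resolves it.
    {noreply, State#state{vnode_pid=undefined}}.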

Change riak_core_vnode_master to use the new vnode proxy support in most
cases. All requests that are routed through a vnode master process will
now use the proxy logic to dispatch requests. Likewise, the non-sync API
exposed by the vnode master module no longer routes through the master
itself, but directly dispatches requests to the proxy processes. Sync
commands still route through the master, which then routes through the
proxy in handle_call. A side-effect is that sync commands now require
three local hops (master, proxy, vnode-pid) rather than two (master,
vnode-pid).

A few notes. You need to add {legacy_vnode_routing, false} to the riak_core portion of app.config if you want the direct routing (i.e. bypassing vnode_master) behavior; this is necessary for rolling upgrade support. With or without the setting, the vnode master will always route through the proxies, but the setting determines whether non-sync requests are sent directly to a proxy or first through the vnode master (the latter being safe for old nodes that have a vnode master but no proxies).
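
For example, the relevant app.config excerpt would look like this (an excerpt only; any other riak_core settings are omitted):

%% app.config excerpt: opt in to direct proxy routing once every node
%% in the cluster has been upgraded.
[
 {riak_core, [
     {legacy_vnode_routing, false}
 ]}
].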

The changes affect the underlying vnode behavior of riak_core, so for testing I just went with basho_expect using this branch for stage injection. The individual basho_expect node tests seem to pass the same as they do on master (i.e. the same tests fail that fail on master for me). The cluster tests are hit or miss: a few failed, but I think it may have been an issue with the stage beam-clobbering approach. I'm re-running with a package containing my changes to test things out.

I'll likely add an automated integration test that monitors trace messages to determine that requests are proxied as desired. For now, I've manually done something similar with redbug (monitoring the master, manager, and proxy process handle_calls/casts) and everything looks good.

<- supervisor:which_children(riak_core_vnode_sup)],
IdxTable = ets:new(ets_vnodes, [{keypos, 2}]),

%% In case this the vnode master is being restarted, scan the existing
@rzezeski (Contributor) commented on the diff:

s/master/manager/

@rzezeski (Contributor)

I'm probably not fully appreciating the interaction between vnodes/handoff/manager/etc., but why have vnode_proxy at all? Could we simply register the vnode directly under the <Mod>_<Index> name?

@jtuple (Contributor, Author) commented Nov 14, 2011

Vnodes are dynamic and come and go. If we only registered the vnode itself, you could only send requests to already-running vnodes -- not to vnodes that aren't currently running. That may work for primary vnodes, but not for secondary/fallback vnodes, which are only started when a request is sent to them.

@rzezeski (Contributor)

EDIT: Disregard this; it seems the register hack will become unnecessary once we have cached preflists in place.

So we register each proxy vnode under its unique Mod/Idx. This, of course, requires a process and an atom for each registration. Given a larger ring size like 2048 and 3 vnode behaviors, we are looking at 3*2048 extra processes and atoms. Processes are cheap (though not free), so I'm not too concerned there, but atoms are never reclaimed. It's a general guideline to avoid runtime atom creation; atoms should mainly be used as constants/enumerations (i.e. for their own value). Like all guidelines, this one is meant to be broken now and again. There is probably enough space in atom-land for the current implementation to get along just fine (you'd have to really up the ring size and vnode behaviors). However, we have hinted at a dynamic ring size in the future, which would potentially mean dynamic/changing index numbers, which could lead to the atom table filling up. It's probably a case of crossing the bridge when we come to it, but I thought I'd share anyway.
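
For concreteness, the scheme necessarily mints one atom at runtime per {Mod, Index}, something like this sketch (the "proxy_" prefix is illustrative; this is not the exact riak_core code):

%% One runtime-created atom per proxy; with ring size 2048 and
%% 3 vnode modules, that's 3 * 2048 = 6144 atoms.
reg_name(Mod, Index) ->
    list_to_atom("proxy_" ++ atom_to_list(Mod)
                 ++ "_" ++ integer_to_list(Index)).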

end,
Pid.

stop_proxy(Mod, Index) ->
@rzezeski (Contributor) commented on the diff:
This isn't used anywhere. From what I understand you always want all vnode proxies running so you would never actually stop them. So why have this API?

@rzezeski (Contributor)

After running some local benchmarks on my MBA, I'm a little worried by the results, given that I expected this change, if anything, to increase throughput and reduce latency. These graphs were made using the compare script for basho_bench; they compare various bench configs before and after this patch is applied.

Just to verify I didn't cross my wires when generating these graphs, here's what I used to generate the "Read Only" graph.

./priv/compare.r -i tests/get-only-before-vnode-proxy/ -j 'get-only-before-proxy' -k tests/get-only-after-vnode-proxy -l 'get-only-after-proxy'

Read/Write #1 (pareto / 100K keys / 4KB values)

[graph: rw-1]

Read/Write #2 (same as above, just ran again)

[graph: rw-2]

Write Only (load 100K 4KB values using partitioned_int and 4 workers)

[graph: load]

Read Only (pareto on the dataset generated by the previous load)

[graph: read]

@rzezeski (Contributor)

I forgot to set legacy_vnode_routing to false, heh. Rerunning benchmarks. Puts dunce hat on and stands in corner

EDIT: Then again, it's also nice to have a rough approximation of how the extra hop affects things when legacy routing is in play (i.e. during a rolling upgrade).

@rzezeski (Contributor)

So the throughput is worse with legacy routing disabled. I'm not sure why this is so. In the meantime here are more graphs.

[graph: new routing 1]

[graph: new routing 2]

@rzezeski (Contributor)

More of the same, this time on EC2 using 5 nodes (4 Riak / 1 basho_bench).

Before Proxy

[graph: before]

After Proxy

[graph: after]

Comparison

[graph: compare]

@rzezeski (Contributor)

I forgot to add this before; here is a filtered/truncated fprof analysis that I ran on my local benchmark with the vnode proxy enabled. Notice that the accumulated time for reg_name is ~10% of the total accumulated time (21194 / 211882). I wonder if the cached preflist change would bring it back to 100%+ throughput? Also notice that half the time is spent in suspend; I would assume that's waiting on IO.

[{ totals,                                     1799476,61052.826,95483.357}].  %%%
{ {proc_lib,init_p_do_apply,3},               1270,211882.926,  126.633},     %
{ {gen_server,loop,6},                        2540,121746.335,  672.655},     %
{ {gen_server,decode_msg,8},                  2540,121744.697,  317.534},     %
{ {gen_server,handle_msg,5},                  2540,121744.636,  757.012},     %
{ {gen_server,handle_common_reply,6},         1270,121614.336,  109.907},     %
{ suspend,                                    6361,117250.935,    0.000},     %
{ {proc_lib,init_p,5},                        1270,90987.957,  367.902},     %
{ {gen,init_it,6},                            1270,89142.467,   62.885},     %
{ {gen,init_it2,7},                           1270,89079.582,   47.819},     %
{ {gen_fsm,init_it,6},                        1270,89031.763,  255.995},     %
{ {gen_fsm,loop,7},                           7878,88360.083,  580.053},     %
{ {gen_fsm,decode_msg,9},                     7878,88295.455,  412.694},     %
{ {gen_fsm,handle_msg,7},                     7878,88229.291,  956.051},     %
{ {gen_fsm,dispatch,4},                       7878,82296.616,  354.678},     %
{ {riak_kv_get_fsm,waiting_read_repair,2},    2042,33119.713,  360.769},     %
{ {riak_kv_get_fsm,maybe_finalize,1},         3063,32408.216,  369.776},     %
{ {riak_kv_get_fsm,finalize,1},               1021,31887.638,  218.214},     %
{ {riak_kv_get_core,final_action,1},          1021,31622.796,  338.295},     %
{ {riak_kv_get_core,merge,2},                 2042,31603.828,  360.533},     %
{ {riak_object,reconcile,2},                  1700,29280.103,  592.335},     %
{ {riak_object,reconcile,1},                  1700,26111.185,  391.875},     %
{ {riak_core_vnode_master,command,4},         4828,24554.530,  293.710},     %
{ {riak_core_vnode_master,proxy_cast,2},      3621,24192.150,  100.502},     %
{ {riak_core_vnode_master,do_proxy_cast,2},   3621,24090.983,  311.965},     %
{ {riak_kv_get_fsm,execute,2},                1021,21699.999,  110.648},     %
{ {riak_kv_vnode,get,3},                      1021,21284.489,   28.817},     %
{ {riak_core_vnode_proxy,reg_name,3},         3621,21194.938,  143.212},     %
{ {riak_core_vnode_proxy,reg_name,2},         3621,21051.726,  268.916},     %
{ {supervisor,handle_call,3},                 1270,15652.095,  648.218},     %
{ {supervisor,do_start_child_i,3},            1270,13843.279,  399.212},     %
{ {gen_fsm,start_link,3},                     1270,13364.503,   94.683},     %
{ {gen,start,5},                              1270,13269.820,   78.968},     %
{ {gen,do_spawn,5},                           1270,13190.852,  184.821},     %
{ {io_lib,format,2},                          3621,12641.794,  143.492},     %
{ {proc_lib,start_link,5},                    1270,12585.118,  173.877},     %
{ {io_lib_format,fwrite,2},                   3621,12498.302,  204.149},     %
{ {proc_lib,sync_wait,2},                     1270,11157.735,  429.249},     %
{ {riak_kv_get_fsm,start_link,4},             1021,10960.117,   53.020},     %
{ {riak_object,ancestors,1},                  1700,10180.839,  144.214},     %
{ {riak_object,'-ancestors/1-lc$^0/1-0-',2},  5100, 9355.208,  507.159},     %
{ {vclock,descends,2},                        67496, 9141.924, 6793.778},     %
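
(For reference, an analysis like the above can be captured roughly as follows; this is a sketch of the general fprof recipe, not necessarily the exact invocation used here:)

%% Trace all processes for a window during the benchmark run, then
%% profile and analyse; fprof:profile/0 reads the default fprof.trace.
fprof:trace([start, {procs, all}]),
timer:sleep(60000),
fprof:trace(stop),
fprof:profile(),
fprof:analyse([{dest, "fprof.analysis"}]).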

@jtuple (Contributor, Author) commented Nov 22, 2011

I confirmed the throughput regression and looked into improving the speed of riak_core_vnode_proxy:reg_name, given the ~10% fprof result. I've pushed a new commit with a reg_name that is 4x faster than the old one when tested in isolation, which brings throughput nearly even before/after. If these results are reasonable, we can look into further improvements after the next release. Please re-confirm the new performance results, just to be sure I didn't screw things up.
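
For the curious, here is a plausible shape for such a speedup, given that the fprof output shows reg_name going through io_lib:format (a sketch only; see the actual commit for the real change):

%% Old shape (per the fprof trace): atom built via io_lib:format.
reg_name_slow(Mod, Index) ->
    list_to_atom(lists:flatten(io_lib:format("proxy_~s_~b", [Mod, Index]))).

%% Cheaper shape: build the name from binaries, skipping the
%% io_lib formatting machinery entirely.
reg_name_fast(Mod, Index) ->
    ModBin = atom_to_binary(Mod, latin1),
    IdxBin = list_to_binary(integer_to_list(Index)),
    binary_to_atom(<<"proxy_", ModBin/binary, "_", IdxBin/binary>>, latin1).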

Confirm regression (1:1 get:put / pareto / 100K keys / 4KB values)

[graph: rw-3]

Results with new reg_name

[graph: rw-1]

Another run with the new reg_name

[graph: rw-2]

Old reg_name versus new reg_name

[graph: rw-4]

@rzezeski (Contributor)

I've confirmed that with the new reg_name the performance is 1:1 before/after. Personally, I'm a little wary of this change since it adds more moving parts for what seems like no short-term gain. But I understand this is meant as a long-term improvement, so I'll just step in line.

Comparison (pareto 4/1 read/write ratio)

[graph: comparison]

@jtuple merged commit 5d1fa1b into master on Dec 13, 2011