Refactor riak_core vnode management (part 1) #106
Conversation
Move vnode spawning and pid tracking logic from riak_core_master to riak_core_manager. This change sets the stage for future work that will make the vnodes more pure, with state transitions coordinated by the vnode manager. A side-effect is that there is now a central process managing pids for all registered vnode modules, rather than one master per vnode module.

Add registered and supervised vnode proxy processes. Requests can be sent directly to a vnode proxy process for a desired node/index/module and the request will be forwarded to the proper vnode. The proxy process coordinates with the vnode manager to know the proper pid for routing, and monitors the vnode pid in order to clean up on shutdown/crashes. While vnodes may be spun up and down on demand by the vnode manager, the relevant proxy processes are always alive and can be counted on for direct routing.

Change riak_core_vnode_master to use the new vnode proxy support in most cases. All requests that are routed through a vnode master process will now use the proxy logic to dispatch requests. Likewise, the non-sync API exposed by the vnode master module no longer routes through the master itself, but directly dispatches requests to the proxy processes. Sync commands still route through the master, which then routes through the proxy in handle_call. A side-effect is that sync commands now require three local hops (master, proxy, vnode-pid) rather than two (master, vnode-pid).
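To make the routing concrete, here is a minimal, self-contained sketch of the proxy idea. All names are illustrative; this is not the actual riak_core_vnode_proxy implementation, and the pid resolution is modeled as a caller-supplied fun rather than the real coordination with the vnode manager.

```erlang
-module(vnode_proxy_sketch).
-export([start/2, send/2]).

%% Start a named proxy. LookupFun stands in for asking the vnode
%% manager which pid currently owns this module/index.
start(Name, LookupFun) when is_atom(Name), is_function(LookupFun, 0) ->
    Pid = spawn(fun() -> loop(LookupFun, undefined, undefined) end),
    true = register(Name, Pid),
    {ok, Pid}.

%% Callers address the proxy by registered name, never by vnode pid.
send(Name, Msg) ->
    Name ! {forward, Msg},
    ok.

%% Proxy knows a live vnode pid: forward directly, watch for 'DOWN'.
loop(LookupFun, VnodePid, MonRef) when is_pid(VnodePid) ->
    receive
        {forward, Msg} ->
            VnodePid ! Msg,
            loop(LookupFun, VnodePid, MonRef);
        {'DOWN', MonRef, process, VnodePid, _Reason} ->
            %% vnode shut down or crashed: forget its pid
            loop(LookupFun, undefined, undefined)
    end;
%% No known vnode pid: resolve one on demand, monitor it, then forward.
loop(LookupFun, undefined, undefined) ->
    receive
        {forward, Msg} ->
            NewPid = LookupFun(),
            NewRef = erlang:monitor(process, NewPid),
            NewPid ! Msg,
            loop(LookupFun, NewPid, NewRef)
    end.
```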
<- supervisor:which_children(riak_core_vnode_sup)],
IdxTable = ets:new(ets_vnodes, [{keypos, 2}]),

%% In case this the vnode master is being restarted, scan the existing
s/master/manager/
I'm probably not fully appreciating the interaction between vnodes/handoff/manager/etc., but why have vnode_proxy at all? Could we simply register the vnode directly under the same name?
Vnodes are dynamic and come and go. If you only registered the vnode, then you could only send requests to already running vnodes -- not to vnodes that aren't currently running. This may work for primary vnodes, but not for secondary/fallback vnodes. Secondary vnodes are only started whenever a request is sent to them.
EDIT: Disregard this; it seems the register hack will become unnecessary once we have cached preflists in place.

So we register each proxy vnode under its unique Mod/Idx. This, of course, requires a process and an atom for each registration. Given a larger ring size like 2048 and 3 vnode behaviors, we are looking at 3*2048 extra processes and atoms. Processes are cheap (though not free) so I'm not too concerned there, but atoms are never reclaimed. It's a general guideline to avoid runtime atom creation; atoms should mainly be used as constants/enumerations (i.e. for their own value). Like all guidelines, they are meant to be broken now and again. There is probably enough space in atom-land for the current implementation to get along just fine (you'd have to really up the ring size and the number of vnode behaviors). However, we have hinted at a dynamic ring size in the future, which would potentially mean dynamic/changing index numbers, which could lead to the atom table filling up. It's probably a case of crossing that bridge when we come to it, but I thought I'd share anyway.
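For a sense of scale, here is a tiny illustrative module. The name format is made up (it is not the one riak_core uses); the point is just that each Mod/Idx pair costs one atom and one process.

```erlang
-module(proxy_name_sketch).
-export([reg_name/2, atoms_needed/2]).

%% One registered atom per {Mod, Index} pair (illustrative format).
reg_name(Mod, Index) when is_atom(Mod), is_integer(Index) ->
    list_to_atom(atom_to_list(Mod) ++ "_proxy_" ++ integer_to_list(Index)).

%% e.g. atoms_needed(2048, 3) -> 6144 extra atoms (and as many processes).
atoms_needed(RingSize, VnodeBehaviours) ->
    RingSize * VnodeBehaviours.
```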
end,
Pid.

stop_proxy(Mod, Index) ->
This isn't used anywhere. From what I understand you always want all vnode proxies running so you would never actually stop them. So why have this API?
After running some local benchmarks on my MBA I'm a little worried by the results, given that I expected this, if anything, to increase throughput and drop latency. These graphs were made using the compare script for basho_bench; they compare various bench configs before/after this patch is applied. Just to verify I didn't swap before and after when generating these graphs, here's what I used to gen the "Read Only" graph.
Graphs:

- Read/Write #1 (pareto/100K keys/4KB values)
- Read/Write #2 (same as above, just ran again)
- Write Only (load 100K, 4KB values using partitioned_int and 4 workers)
- Read Only (pareto on dataset generated from previous load)
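For flavor, a basho_bench config along these lines looks roughly like the following. All parameters here are illustrative stand-ins, not the exact configs behind the graphs above, and driver-specific connection settings are omitted.

```erlang
%% basho_bench config sketch (Erlang consult format); values illustrative.
{mode, max}.
{duration, 10}.
{concurrent, 4}.
{driver, basho_bench_driver_riakc_pb}.
{key_generator, {pareto_int, 100000}}.
{value_generator, {fixed_bin, 4096}}.
%% read-only run; a mixed run would use e.g. [{get, 4}, {put, 1}]
{operations, [{get, 1}]}.
```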
I forgot to set legacy_vnode_routing to false for these runs. EDIT: Although, it's also nice to have a rough approximation of how the extra hop affects things in the case where legacy routing is in play (i.e. a rolling upgrade).
I forgot to add this before; here is a filtered/truncated fprof analysis I ran on my local benchmark with vnode proxy enabled. Notice the accumulated time for reg_name.
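For reference, an analysis of this sort can be produced roughly like this; my_bench:run/0 is a hypothetical stand-in for whatever drives the workload, and the exact options used above may have differed.

```erlang
%% Trace one representative run, then build and dump the call graph.
fprof:apply(my_bench, run, []),
fprof:profile(),
fprof:analyse([{dest, "fprof.analysis"}, {cols, 120}]).
```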
I confirmed the throughput regression and looked into improving the speed of reg_name.

Graphs:

- Confirm regression (1:1 get:put/pareto/100K keys/4KB values)
- Results with new reg_name
- Another run with the new reg_name
- Old reg_name versus new reg_name
I've confirmed the results with the new reg_name as well.

Graph:

- Comparison (pareto 4/1 read/write ratio)
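As a rough illustration of the kind of reg_name change being compared here (a sketch of the general technique, not necessarily the actual patch): building the name with io_lib:format plus list_to_atom does a lot of list work on every call, whereas a binary-based construction is considerably cheaper.

```erlang
-module(reg_name_sketch).
-export([old_reg_name/2, new_reg_name/2]).

%% Slower style: format a string, flatten it, convert to an atom.
old_reg_name(Mod, Index) ->
    list_to_atom(lists:flatten(io_lib:format("proxy_~p_~p", [Mod, Index]))).

%% Cheaper style: straight binary concatenation, then binary_to_atom.
new_reg_name(Mod, Index) ->
    ModBin = atom_to_binary(Mod, latin1),
    IdxBin = list_to_binary(integer_to_list(Index)),
    binary_to_atom(<<"proxy_", ModBin/binary, "_", IdxBin/binary>>, latin1).
```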
A few notes. You need to add {legacy_vnode_routing, false} to the riak_core portion of app.config if you want the direct routing (i.e. bypassing vnode_master) behavior. This is necessary for rolling upgrade support. With or without the setting, vnode master will always route through the proxies, but the setting determines whether non-sync requests are sent directly to a proxy or first through vnode master (the second case being safe for old nodes that have a vnode master but no proxies).
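An app.config excerpt with that setting would look roughly like this; everything other than the legacy_vnode_routing flag is omitted here.

```erlang
%% app.config excerpt: opt in to direct proxy routing.
[
 {riak_core, [
     {legacy_vnode_routing, false}
 ]}
].
```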
The changes affect the underlying vnode behavior of riak_core, so for testing I just went with basho_expect, using this branch for stage injection. The basho_expect individual node tests seem to pass the same as they do on master (i.e. the same tests that fail on master fail here). The cluster tests are hit or miss. A few failed, but I think it may have been an issue with the stage beam-clobbering approach. I'm re-running with a package containing my changes to test things out.
I'll likely add an automated integration test that monitors trace messages to verify that requests are proxied as desired. For now, I've done something similar manually with redbug (monitoring master, manager, and proxy process handle_calls/casts) and everything looks good.
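Such a redbug session looks roughly like the following; the module names are my guess at the obvious ones and may not match the actual ones exactly.

```erlang
%% Trace calls into the master, manager, and proxy processes for a
%% minute (or up to 1000 messages), printing each hit to the shell.
redbug:start(["riak_core_vnode_master:handle_call",
              "riak_core_vnode_master:handle_cast",
              "riak_core_vnode_manager:handle_call",
              "riak_core_vnode_proxy:handle_call"],
             [{time, 60000}, {msgs, 1000}]).
```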