Optimize ddoc cache #610
Conversation
    {noreply, St};

handle_cast({evict, DbName}, St) ->
    gen_server:abcast(mem3:nodes(), ?MODULE, {do_evict, DbName}),
Should this be `[node() | mem3:nodes()]` to ensure the `do_evict` logic is triggered on every node? Unless I'm missing something, it looks like you'll trigger `do_evict` on all the other nodes besides this one.
`mem3:nodes()` should include `node()` from what I remember.
Some nodes might not be live. Initially I thought it might be good to only send to the live ones, but abcast will just silently ignore the failed sends, so it's probably no big deal.
@nickva oh right, good call. Even though I actually typed `mem3:nodes()`, I read it as `nodes()` ;-)
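For reference, a minimal sketch of the suggested change (purely illustrative; it assumes `mem3:nodes/0` might not include the local node, so prepending `node()` and deduplicating is a cheap way to be safe):

```erlang
%% Sketch only: explicitly include the local node in the broadcast so the
%% do_evict logic runs everywhere, even if mem3:nodes() omits node().
handle_cast({evict, DbName}, St) ->
    Nodes = lists:usort([node() | mem3:nodes()]),
    gen_server:abcast(Nodes, ?MODULE, {do_evict, DbName}),
    {noreply, St};
```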
handle_db_event(ShardDbName, deleted, St) ->
    gen_server:cast(?MODULE, {evict, mem3:dbname(ShardDbName)}),
    {ok, St};
Would it make sense to also handle the `couch_event:notify(Db2#db.name, {ddoc_updated, DDocId})` event and trigger a refresh directly?
I'd contemplated removing the evict call and handling it purely in ddoc_cache using the events. At the time I figured I'd keep the same pattern, but given that we're already processing (and discarding) them, I guess it makes sense to remove the extra message.
    } = St,
    NewTime = Time + 1,
    NewSt = St#st{time = NewTime},
    Pid = ddoc_cache_refresher:spawn_link(Key, ?REFRESH_TIMEOUT),
Does it make sense to have a separate refresh pid for every single ddoc? That seems like it could potentially get a bit out of hand.
If we do stick with a pid per ddoc, would it be cleaner to use a `simple_one_for_one` supervisor here and just add children to that?
In my experience supervisors are too slow for this sort of spawning since they want to start every child synchronously. The original PR for this work had a custom supervisor that I decided not to use because it seemed to add more complexity than it saved.
    [#entry{val = Val}] ->
        ?EVENT(update_noop, Key),
        loop(Key, Interval);
    [#entry{pid = Pid}] when Pid == self() ->
Is the intention here to force this pid to exit when there's a different pid handling the `#entry{}`?
It's a mild assertion that we don't have more than one pid updating a cache entry, yes. It shouldn't ever kick in, but especially during development it's good to know that the behavior is what we're expecting.
        ddoc_cache_entry:handle_resp(Resp);
    [#entry{val = Val}] ->
        couch_stats:increment_counter([ddoc_cache, hit]),
        ddoc_cache_lru:accessed(Key),
I'm concerned about sending a message here with every successful lookup. We've seen this be problematic in couch_server, and I think we might run into similar issues here. Every single time `ddoc_cache_lru:accessed/1` is called it can trigger one khash read, one khash write, one ets lookup, one `is_process_alive` check, one ets delete, and one ets insert. I could see this getting overwhelmed in a hurry.
    {noreply, St#st{evictor=Evictor}};

handle_info(Msg, St) ->
    {stop, {invalid_info, Msg}, St}.
Related to my comment below in https://github.com/apache/couchdb/pull/610/files#r123584058, we're ignoring refresher pids exiting when another refresher pid has an entry. Is this intentional?
Same as before, this is asserting that the cache entries are behaving as expected. Each entry has exactly one refresher pid that lives for as long as the cache entry. We don't want more than one, and we especially don't want zero. When we close an entry we unlink and stop the refresher, so we should never see an unknown 'EXIT' message.
        error:badarg ->
            recover
    couch_stats:increment_counter([ddoc_cache, miss]),
    Resp = gen_server:call(?MODULE, {open, Key}, infinity),
We're falling back on the same pattern here of sending an open message on every cache miss. We've seen this to be problematic in the existing ddoc_cache, and I think this is an important issue to solve here. Is there a reason you think sticking with this approach is the proper way forward? It seems like we'll still be susceptible to thundering herd problems when there is not an entry in the cache.
Would it make sense to use the opener pattern you used recently in mem3_shards? b71677f
Right. I was only focused on the worst case we've seen, where we have very little load on ddoc_cache and were still hitting the thundering herd (because of the TTL expiration). I'll contemplate the mem3 approach, but if memory serves that requires us to change our hard limit on cache size to a soft limit. Granted, that's not a huge issue, but it's something to think about.
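For context, the opener pattern being discussed can be sketched roughly as follows. All names here (`openers`, `spawn_opener`, the `{opened, ...}` message) are illustrative, not the actual PR code; the point is that concurrent callers for the same key queue up behind one in-flight load, so a cache miss triggers a single fabric call instead of a thundering herd:

```erlang
%% Illustrative sketch: coalesce concurrent opens for the same Key.
handle_call({open, Key}, From, #st{openers = Openers} = St) ->
    case maps:find(Key, Openers) of
        {ok, Waiters} ->
            %% An open for this Key is already in flight; just queue the caller.
            {noreply, St#st{openers = Openers#{Key := [From | Waiters]}}};
        error ->
            %% First caller spawns the opener and waits with everyone else.
            spawn_opener(Key),
            {noreply, St#st{openers = Openers#{Key => [From]}}}
    end.

spawn_opener(Key) ->
    Server = self(),
    spawn_monitor(fun() ->
        Resp = ddoc_cache_entry:open(Key),  %% assumed loader call
        Server ! {opened, Key, Resp}
    end).

handle_info({opened, Key, Resp}, #st{openers = Openers} = St) ->
    Waiters = maps:get(Key, Openers, []),
    [gen_server:reply(W, Resp) || W <- Waiters],
    {noreply, St#st{openers = maps:remove(Key, Openers)}}.
```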
init(_) ->
    BaseOpts = [public, named_table],
If all the tables are public, would it make sense to have callers perform more of the ETS updates rather than funneling it through a gen_server? Obviously not all of that would be appropriate, but it seems like you could potentially do something here with the `?LRU` table and using `os:timestamp()`. For instance:
%% Make a duplicate_bag ets table to store an index of all ?LRU entries per ddoc_cache Key
ets:new(lru_keys, [duplicate_bag, named_table]).

accessed(Key) ->
    T = os:timestamp(),
    %% Delete stale ?LRU entries by their timestamp keys, and clear the old
    %% index entries so the duplicate_bag doesn't grow without bound.
    [ets:delete(?LRU, OldT) || {_, OldT} <- ets:lookup(lru_keys, Key)],
    true = ets:delete(lru_keys, Key),
    true = ets:insert(?LRU, {T, Key}),
    true = ets:insert(lru_keys, {Key, T}).
This would bypass funneling the lru access messages through ddoc_cache_lru and instead use ETS for the concurrent access.
We certainly can't have the LRU updates be anything but single-threaded, as we can't do multi-key atomic compare-and-replace type operations.
And the other changes are mostly due to us having a hard cap on cache size. Going with a soft cap, I think we could do something similar to the mem3 approach. Also, these tables are only public so that I could use a separate process for table ownership.
> Certainly can't have the LRU updates not single-threaded as we can't do multi-key atomic compare and replace type operations.
Is that actually needed? I think the duplicate bag approach above might allow us to skip that entirely. We would still need to figure out what to do for the `?LRU` entries, in that if we used `os:timestamp()` we could potentially end up with duplicate entries, and I don't think there's an `ordered_duplicate_bag` ETS table. But I think there's potential for a less strict LRU that accomplishes the same things.
Ah! I missed the duplicate_bag trick. Though as you point out, keying on os:timestamp() means you could stomp on another key. Maybe include the pid in there or something?
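One way to sketch that suggestion (illustrative only, building on the `lru_keys` idea above): make the `?LRU` key a `{Timestamp, Key, Pid}` tuple, which still sorts by time in an `ordered_set` table but can no longer collide across concurrent writers:

```erlang
%% Sketch: a collision-proof ordered LRU key. Two processes touching
%% the cache in the same microsecond no longer stomp on each other,
%% because the cache Key and writer pid disambiguate the entry.
accessed(Key) ->
    LruKey = {os:timestamp(), Key, self()},
    [ets:delete(?LRU, Old) || {_, Old} <- ets:lookup(lru_keys, Key)],
    true = ets:delete(lru_keys, Key),
    true = ets:insert(?LRU, {LruKey, Key}),
    true = ets:insert(lru_keys, {Key, LruKey}).
```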
I should say that's quite an intriguing approach. I'm poking around at other caching schemes to see if there's not a slight variation that'll play better with ets.
Refreshing my memory on the mem3 PR, I can definitely apply a lot of that to this as well, so I'll update things for that. For the accessed messages there's not a lot I think we can do with the existing structure. Although I could see changing things around a bit to make it an LFU cache instead of an LRU cache. It'd involve a table scan to evict but would remove a lot of the message passing to ddoc_cache_lru and friends. I'll make the mem3-inspired changes first and then will look at this approach.
@chewbranca You managed to nerd snipe me for like four days trying to figure out the mem3 and concurrent LRU stuff. For the mem3 bits I think my conclusion is that it's not appropriate here. For mem3_shards we wanted to make sure that there would always be progress, so we load things outside of the gen_server and send them to the cache. Given that it's a local db read, this isn't the most terrible thing in the world. However, for ddoc_cache it's a clustered fabric call, so if we followed the mem3_shards pattern we'd actually just be putting extra load on other parts of the system like rexi and the couch_db_updaters at 3x the rate of client requests. So having them funnel through the opener actually makes more sense here. As to the distributed LRU approach for writes, I've been playing with a couple of different standalone tests to try and get a feel for our maximum throughput with that. So far (with some extremely synthetic tests) I think we're looking at topping out at around 200K updates a second (at least on my laptop). However, this test doesn't take into account actually evicting things. My plan for today is to write a basho_bench driver to gauge the relative throughput between what I've got written now and a second approach based on your direct ets write idea. I'll post results here when I get them.
@chewbranca Epic nerd snipe. I've sprinkled more parallel on the ddoc_cache. Care to take another look? Currently measuring 1M+ ops/sec against a 1,000 item cache on my MBP. I have to do some work on the test suite tomorrow to finish up the rewritten rewrite work, but I'm currently pretty happy with it. The new style is one pid per entry which does all of the ets writes for the entry; the LRU process only exists to do evictions. I soaked 20,000 clients against a 500 item cache on my laptop for 30-45m and it sustained about 1.5M ops/sec without error and without blowing RAM up beyond 500M. Those numbers are for a custom module that does no loading, but I think it shows that we won't be bound by the cache itself; instead it'll be fabric calls and/or some other bottleneck before the cache.
-module(ddoc_cache_tables).
-behaviour(gen_server).
-vsn(1).
Why the vsn(1) attribute?
Makes it easier to write code_change clauses if we ever get back to hot code upgrades.
Some modules don't have it, like ddoc_cache_entry. Should we add it to those as well?
Will do
% the License.

-module(ddoc_cache_entry).
Add the gen_server behaviour attribute? `-behaviour(gen_server).`
Good call.
        waiters = []
    },
    ?EVENT(started, Key),
    gen_server:enter_loop(?MODULE, [], St).
Why use proc_lib:spawn_link + enter_loop instead of the regular gen_server init pattern?
Performance. The gen_server:start* functions are all synchronous, so we'd be gating ddoc_cache_lru performance on the cost of ddoc_cache_entry's init function plus the overhead of message passing back and forth. Here we link and crash like the world is on fire if anything goes wrong, so it's OK to fire and forget.
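The pattern being described is roughly this (a sketch; the actual module internals in the PR may differ):

```erlang
%% Asynchronous start: proc_lib:spawn_link/3 returns immediately, so the
%% caller (the LRU) isn't blocked on the entry's init. The new process
%% becomes a gen_server by calling enter_loop/3 once its state is built.
start_link(Key) ->
    Pid = proc_lib:spawn_link(?MODULE, do_open, [Key]),
    {ok, Pid}.

do_open(Key) ->
    St = #st{key = Key, waiters = []},
    ?EVENT(started, Key),
    %% From here on the process behaves exactly like a gen_server started
    %% via gen_server:start_link/3, handling calls/casts/infos as usual.
    gen_server:enter_loop(?MODULE, [], St).
```

The tradeoff is that any failure in `do_open/1` shows up as a linked-process crash rather than a `{error, Reason}` return from start, which matches the "link and crash" approach described above.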
@davisp hehehehe happy to help with the sniping ;-)
open(Key) ->
    try ets:lookup(?CACHE, Key) of
In `ddoc_cache:open_doc/{2,3}` the `Key` is `{ddoc_cache_entry_ddocid, {DbName, DocId}}` and `{ddoc_cache_entry_ddocid_rev, {DbName, DocId, RevId}}`, respectively. That implies we'll have separate pids for the "latest version" and "specific version" cases. Is that intentional?

It also means we'll end up triggering two ddoc_cache load cycles on an uncached ddoc view request. For instance, in `fabric:query_view/6` [1] we call the two-arity `open_doc`, giving us `Key = {ddoc_cache_entry_ddocid, {DbName, DocId}}`. We then extract the doc_id/rev pair [2] to ensure all view shards use the same ddoc and ship that in the rpc calls, at which point we use the three-arity `open_doc` variant to open the specific ddoc rev [3]. This means we'll perform a non ddoc-rev cache fetch on the coordinator, followed by up to Q*N ddoc-rev cache lookups on the rpc nodes, leaving the coordinator node with the ddoc entry and no ddoc-rev entry, and the rpc nodes with the ddoc-rev entry and no ddoc entry. Then we'll potentially perform a non ddoc-rev fetch on every single coordinator node until every node has the non ddoc-rev cache entry, even if it already has the latest ddoc-rev entry. So if you have 12 nodes and round robin one view request across all of them, you'll block on caching the non ddoc-rev version 12 times, despite most likely already having the relevant latest rev in the ddoc-rev version.

Perhaps this isn't the end of the world, but it definitely seems awkward and potentially avoidable. In the case of loading the ddoc-rev version you could insert a `?CACHE` entry for the non ddoc-rev version pointing to that pid, but that might be awkward in the inverse case when you start with `ddoc_open/2`.
On a related note, does this mean that we'll have pids sticking around for old ddoc rev versions until they're evicted from the cache?
[1] https://github.com/apache/couchdb/blob/master/src/fabric/src/fabric.erl#L337
[2] https://github.com/apache/couchdb/blob/master/src/fabric/src/fabric_view_map.erl#L28
[3] https://github.com/apache/couchdb/blob/master/src/fabric/src/fabric_rpc.erl#L119
That's all mostly right. The motivation for this was that it would allow the individual entries to be direct lookups rather than table scans, and it also allows the non-rev-specific entry to automatically update its value when a design document is updated. As to the extra non-cache lookups, the way I reasoned about it was that it's only a constant number in the frequently-asked case, so once they're in cache it's not an issue. And on the flip side, if we're iterating over things in such a way that we're evicting entries and then have a lot of traffic to un-cached entries, it doesn't matter either, as we'd have to increase the cache size regardless.

I'll take a ponder on how to insert things into the cache from non-client-requested actions, but it seems to me like it'd open up a lot of race conditions in the already fairly complex logic around managing the cache entries.

Also, for the rev-specific entries that's true for the most part: they'll stick around until evicted or until a quorum of their database shards are compacted. I'll contemplate how easy it'd be to add a check there so that they'll close if they refresh and are not the most recent revision.
There's more response in my PR comment down below, but these two commits address the two main points from @chewbranca's comment:
-include_lib("couch/include/couch_db.hrl").

dbname({DbName, _}) ->
Hrmm... don't we need the 3-tuple variants of these functions to handle keys of the form `{DbName, DocId, RevId}`?
Aha! You're using the different module `ddoc_cache_entry_ddocid_rev` for the 3-tuple case. NVM!
        not_found ->
            {ok, Keys} = khash:from_list([{Key, Pid}]),
            {ok, DDocIds} = khash:from_list([{DDocId, Keys}]),
            khash:put(Dbs, DbName, DDocIds)
Minor nit: the variable name `DDocIds` is a bit confusing because you can have both the two- and three-tuple key variants in here, so it's ddoc_ids and ddoc_id_revs. Perhaps `DDocKeys` instead?
That's not quite right. DDocIds are only ever a single binary. It's a tree of entries made out of the ddoc_cache_entry:dbname/1 and ddocid/1 calls, so that when we go to evict entries based on updates we can avoid scanning the entire ?CACHE table looking for matches.
open(Key) ->
    try ets:lookup(?CACHE, Key) of
        [] ->
            lru_start(Key);
I think it would be useful to add counters for all four clauses here rather than just the miss/hit cases. Perhaps `couch_stats:increment_counter([ddoc_cache, lru_start])` and `couch_stats:increment_counter([ddoc_cache, lru_partial_start])` or some such for this clause and the one immediately below. That would make it nice and easy to see if we're hitting thundering herd problems.
I'm not sure adding those calls here is the best thing. I think all of the miss cases are correctly qualified as misses. However, it might be useful to add a new metric to the entry process that tracks how many clients it responded to, which would show the thundering herd issues.
        {ok, Val}
    catch _:_ ->
        couch_stats:increment_counter([ddoc_cache, recovery]),
        ddoc_cache_entry:recover(Key)
This is a nice fallback pattern to loading things directly through fabric in the event of an error or no ddoc_cache app running. 👍
    {open_ok, Key, {ok, Val}} ->
        if not is_list(St#st.waiters) -> ok; true ->
            respond(St#st.waiters, {open_ok, {ok, Val}})
        end,
I'm not seeing where the waiters list is cleared out. I'm assuming that should be down in the `NewSt` construction on line 191?
Awesome catch! I totally forgot to clear it out on NewSt, just as you mentioned. Apparently none of the tests is long-lived enough to get caught out by having an extra message in its mailbox.
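A sketch of the fix being agreed on here (the clause shape and record fields are assumed from the surrounding diff, not copied from the final commit): reply to the waiters, then reset the list when building `NewSt`:

```erlang
handle_info({'DOWN', _, _, Pid, {open_ok, Key, {ok, Val}}},
        #st{key = Key, opener = Pid} = St) ->
    respond(St#st.waiters, {open_ok, {ok, Val}}),
    NewSt = St#st{
        val = {open_ok, {ok, Val}},
        opener = undefined,
        %% The fix: the waiters have been replied to, so clear the list
        %% rather than carrying the stale From tuples forward.
        waiters = []
    },
    {noreply, NewSt};
```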
        NewSt = St#st{
            val = {Status, Other},
            opener = undefined,
            waiters = undefined
So I see that setting `waiters = undefined` will bypass the respond-to-waiters logic below on line 204; however, this will eventually cause a runtime error while calling `gen_server:reply(undefined, Resp)` in the `respond/2` function if `val` is ever unset for whatever reason and we end up back in the `handle_call(open, From, #st{val = undefined} = St) ->` case. Seems like it would be better to always keep this as a list so we can never encounter that scenario.
Well... I suppose this is somewhat moot given this case clause ends in {stop, normal, NewSt}
No, NewSt has waiters = undefined; the original St record still has the waiters list. This is a switch to know when we're responding vs not. I guess technically we could keep it an empty list and call a function that does a no-op, but that seemed slightly weirder than this way, as it doesn't show the actual intent, so I'd fear it'd get lost in the sands of time.

It's also a slightly subtle assumption that once we switch to "actual value inserted in cache" mode, we don't want to handle open calls.
    catch exit:_ ->
        % Its possible that this process was evicted just
        % before we tried talking to it. Just fallback
        % to a standard recovery
It would be good to update this comment to indicate this is also the fallback for when `Pid` executes `spawn_opener` and that opener pid fails for whatever reason, resulting in the not-`open_ok` clause of `handle_info({'DOWN', _, _, Pid, Resp}, #st{key = Key, opener = Pid} = St) ->` being hit, leading to `{stop, normal, NewSt}`, which will be caught by this clause as `Pid` dies.
That's not the intention, and I think you just missed the slightly subtle use of St (rather than NewSt) after creating NewSt.
-define(CACHE, ddoc_cache_entries).
-define(LRU, ddoc_cache_lru).
-define(REFRESH_TIMEOUT, 67000).
-define(SHUTDOWN_TIMEOUT, 1000).
Any reason to not make these configurable? At the very least `REFRESH_TIMEOUT`?
Any reason to make them configurable?
I made REFRESH_TIMEOUT configurable and this is now just the default. SHUTDOWN_TIMEOUT is entirely too subtle to be turned into a config option. If we hit issues there then we have other problems.
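A sketch of what that could look like, assuming the usual couch `config` application (the section and key names here are illustrative):

```erlang
%% The macro becomes the default; the runtime value comes from config,
%% e.g. an ini setting like [ddoc_cache] refresh_timeout = 67000.
-define(REFRESH_TIMEOUT, 67000).

refresh_timeout() ->
    config:get_integer("ddoc_cache", "refresh_timeout", ?REFRESH_TIMEOUT).
```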
            not_found ->
                ok
        end
    end, [no_ddocid | DDocIdList]);
Might be worth noting `no_ddocid` is for the `ddoc_cache_entry_custom` case.
And VDUs. I'll add a comment.
This is done up above now.
-module(ddoc_cache_speed).
Should this module be in the `test/` dir? Or does that complicate executing it?
It's marked as a TEMP commit; I'll end up removing it before merging the final PR. It's not a very rigorous test. To keep it in history I may remove it with a revert rather than a squash, if you think that'd be worth it.
@davisp overall looks solid to me.
Performance Benchmarking Update: New one is faster!

And now for some graphs. For each of these graphs max_dbs_open was set to 5,000. That's important to remember as I go through the parameter sets, since hitting that limit puts an artificial cap on the performance of the cache. Each of these tests is also just querying an empty view, which leads to a ddoc_cache lookup for the coordinator and then Q lookups for each RPC worker. So the requests-per-second numbers have to be multiplied out to get actual cache performance. Granted, we don't really care, given that it's a constant factor.

These first two runs are 1,000 workers hitting 1,000 different design documents in 1,000 different databases with two different Q values (i.e., 1,000 workers using their own ddoc and db). For the most part their ops/sec is basically identical, while latencies for the new ddoc_cache are slightly better. This suggests that something else was bottlenecking performance: possibly the basho_bench driver, possibly something else in CouchDB.

Old ddoc_cache Q=4:
New ddoc_cache Q=4:
Old ddoc_cache Q=8:
New ddoc_cache Q=8:

So basically, trying to hit a bunch of different databases at once now has slightly better latencies, but both are fast enough to not be the bottleneck.

Next up was a shoot-the-moon test to probe the limits of what sort of performance we could get trying to crush a single entry. These two graphs show 1,000 workers hitting a single design document in a single database. The Q for both of these is 128, which means the ddoc_cache is going to have to maintain 129,000 lookups/sec.

Old ddoc_cache:
New ddoc_cache:

The immediate thing to note here is that the old ddoc_cache flatlines halfway into its test. This was because couch_log_server spiked its message queue, due to the huge flood of fabric_rpc_worker timeout messages being logged.

If you look at the graph for the old ddoc_cache you can see that it has two cliffs, one at 60s and one at 120s, where it then flatlines because it pushed couch_log_server off a cliff. These drops are precisely what motivated this work in the first place: the old ddoc_cache would evict entries every 60s and the thundering herd would knock things off their rocker. For the new ddoc_cache you'll see that it's still kinda crap performance, even though it doesn't totally lose its mind. In the background this was because rexi was unable to keep up with the workload and spiked pretty badly. I'm gonna be investigating that area for more optimizations once this work is wrapped up.

And finally, this last graph is a comparison between the old and new ddoc_caches sweeping through Q=8, 16, and 32 with the same 1,000 workers against a single db and ddoc. Red is old ddoc_cache, green is new ddoc_cache. Yes, I know Tufte would kill me, but those are the colors that got picked and I don't care enough to go fiddle with it. As you can see, the new ddoc_cache is consistently faster as well as much less variable.

Looking into the variance on the best Q=64 run for the old ddoc_cache vs the worst Q=64 run for the new ddoc_cache shows a good example of the variance of the old approach. For these two runs I've also included what the rexi server message queue is doing so we can see its effect on performance.

Old ddoc_cache:
New ddoc_cache:

Again we can see how bad that 60s eviction policy is when we have sustained load against a single design document. This leads to some fairly massive spikes in the system. For the old ddoc_cache, on the third eviction at 180s we see it flatline again, which is why those runs are so variable. For the new ddoc_cache we can see that db3's rexi has sustained elevated message counts, which are holding the benchmark back from meeting some of the old ddoc_cache spikes when rexi had a chance to clear out.

Now that I can duplicate that rexi issue easily enough, I'll be working on figuring out why it's slow and trying to optimize around it. Hopefully this data is as convincing to everyone else as it is to me. If anyone wants me to check into anything else, I certainly have the data and/or can design runs in some other configuration if requested. However, having poked both ends of the spectrum (lots of clients against separate design docs and lots of clients against a single design doc), I'm fairly confident that we're winning across the spectrum, and most specifically for the single design doc case (which was the motivation for this work).
handle_call(Msg, _From, St) ->
    {stop, {invalid_call, Msg}, {invalid_call, Msg}, St}.
I never understood the reasoning behind terminating a server when it receives an unknown message. In fact it is the caller's (sender's) fault, so the caller should be terminated, not the server. I think we should just ignore all unknown messages in this case (and in the other gen_server callbacks in this module).
Given that the tables are supervised by a one_for_all supervisor, we would restart most of ddoc_cache whenever we received an unknown message by accident.
You're quite right that it should be the client that terminates. I've always wanted to have some sort of VM feature that would allow us to define a protocol of message patterns a process was willing to receive. However I haven't got a clue how that'd work when it comes to distributed Erlang.
However, given that something like that isn't available, the next best possibility for enforcing a protocol is to bail if we get something unexpected. In some cases it can be appropriate to ignore messages; however, given that this is a concurrent cache of design documents in a database, it would be a bad idea here. If we're getting unexpected messages, then something we've written is not behaving as intended. And given that this cache handles things like validate doc update functions, it could be a severe issue if something were working incorrectly.
The one_for_all is legacy from a previous iteration of this work that split the LRU work between two gen_servers. In that case if either of the servers died for any reason we would have wanted the other to also be restarted so that state is not inconsistent. Now that all the eviction and opening logic is in a single gen_server I'll remove the tables process and revert to a standard one_for_one supervisor strategy.
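For reference, the simplified supervision tree described here might look roughly like the following sketch. The module and child names are illustrative stand-ins, not the actual PR code:

```erlang
%% Sketch: with the eviction and opening logic merged into a single
%% gen_server there is no shared-fate requirement between siblings,
%% so a plain one_for_one restart strategy suffices.
-module(ddoc_cache_sup_sketch).
-behaviour(supervisor).

-export([start_link/0, init/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

init([]) ->
    Children = [
        {ddoc_cache_lru,
            {ddoc_cache_lru, start_link, []},
            permanent, 5000, worker, [ddoc_cache_lru]}
    ],
    {ok, {{one_for_one, 5, 10}, Children}}.
```

Under one_for_one, a crash of the LRU server restarts only that child rather than tearing down unrelated siblings.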
For gen_server:call, crashing the client is simple, but for "info" messages it is impossible.
some_operation(Pid, Arg, Something) when is_integer(Arg) andalso something(Something) ->
gen_server:call(Pid, {do_some_operation, Arg, Something}).
handle_call({do_some_operation, Arg, Something},....)
If we place the API function in the same module with appropriate guards, we would be able to crash the caller. This technique is not applicable in the case of the ddoc_cache_tables.erl module, since that module is not supposed to handle any messages; it is just an ets owner process.
That's not an explicit assertion though, since it relies on the programmer to guarantee that any message generated by some_operation/3 is handled by a clause in handle_call/3. Having a catch-all is an explicit assertion that prevents the possibility of ignoring messages that were intended to be handled.
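To make the "explicit assertion" style being defended here concrete, the full set of catch-all clauses looks roughly like this (a sketch in the spirit of the diff above, not the exact PR code):

```erlang
%% Sketch: an unknown message indicates a bug in our own code, so
%% each callback stops the server loudly instead of silently
%% discarding the message.
handle_call(Msg, _From, St) ->
    {stop, {invalid_call, Msg}, {invalid_call, Msg}, St}.

handle_cast(Msg, St) ->
    {stop, {invalid_cast, Msg}, St}.

handle_info(Msg, St) ->
    {stop, {invalid_info, Msg}, St}.
```

The handle_call clause also returns the `{invalid_call, Msg}` term as the reply, so the blocked caller gets a descriptive value back rather than hanging until the monitor fires.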
ok ->
true = ets:insert_new(?CACHE, #entry{key = Key}),
{ok, Pid} = ddoc_cache_entry:start_link(Key, Default),
true = ets:update_element(?CACHE, Key, {#entry.pid, Pid}),
It's an optimization to start funneling open messages directly to the entry process even before it's fully initialized, to relieve pressure on the ddoc_cache_lru process. The places you mentioned are creating a full entry; rather than update the individual elements, they just overwrite the entire #entry{}, which involves setting the pid again.
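The two-step pattern being described can be sketched as a single helper (table and record names assumed from the surrounding diff context; this is illustrative, not the PR code verbatim):

```erlang
%% Sketch: reserve the cache slot first so concurrent readers see the
%% key immediately, then patch in the pid once the entry process
%% exists. The entry process buffers any open requests it receives
%% before it finishes initializing, taking load off ddoc_cache_lru.
insert_entry(Key, Default) ->
    true = ets:insert_new(?CACHE, #entry{key = Key}),
    {ok, Pid} = ddoc_cache_entry:start_link(Key, Default),
    true = ets:update_element(?CACHE, Key, {#entry.pid, Pid}),
    {ok, Pid}.
```

ets:update_element/3 mutates just the pid field in place, which is why the later code paths that rewrite the whole #entry{} record don't need it.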
end.
insert(Key, Value) -> |
Just wanted to mention that the return value of this function is inconsistent (and not obvious); it can return any of:
ok
{ok, Pid}
full
It shouldn't be a problem though, since the result of the call is not checked anywhere.
Yes, it's fire-and-forget. I can add an unconditional ok after the gen_server:call if you prefer, but it didn't seem super useful since it's also an internal API.
CurSize = ets:info(?CACHE, memory) * erlang:system_info(wordsize),
couch_log:error("SIZE: ~b :: ~b~n", [CurSize, MaxSize]),
if CurSize =< MaxSize -> ok; true ->
case ets:first(?LRU) of
The fact that we have to access the ?LRU table here is unfortunate. We could implement:
ddoc_cache_entry:oldest_entry() ->
case ets:first(?LRU) of
'$end_of_table' -> nil;
{_Ts, _Key, _Pid} = Entry -> Entry
end.
Not a big deal. Feel free to ignore.
For efficiency reasons you could swap the '$end_of_table' and {_, _, _} clauses, so the most likely case would be matched first. Not a big deal either, since it is a micro-optimization.
Probably a bit clearer as well to change that order so the positive branch is first. Will do that.
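With the clause order flipped, the trim loop from the diff above might read roughly as follows (the remove_entry/2 helper is a hypothetical stand-in for whatever eviction the real code does):

```erlang
%% Sketch: evict the oldest LRU entries until the ets memory
%% footprint is back under the configured limit. The common
%% "an entry exists" clause is matched first; '$end_of_table'
%% is the rare empty-table case.
trim(MaxSize) ->
    CurSize = ets:info(?CACHE, memory) * erlang:system_info(wordsize),
    if
        CurSize =< MaxSize ->
            ok;
        true ->
            case ets:first(?LRU) of
                {_Ts, Key, Pid} ->
                    remove_entry(Key, Pid),  %% hypothetical eviction helper
                    trim(MaxSize);
                '$end_of_table' ->
                    ok
            end
    end.
```

Since the ?LRU table is ordered by timestamp-first keys, ets:first/1 always yields the least recently used entry.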
Pattern = #entry{key = Key, pid = self(), _ = '_'},
CacheMSpec = [{Pattern, [], [true]}],
1 = ets:select_delete(?CACHE, CacheMSpec),
if Ts == undefined -> ok; true ->
These exact 3 lines are repeated 3 times in the module. It might be a good opportunity to extract a function:
remove_slot_from_lru(Key, Ts) ->
if Ts == undefined -> 0; true ->
MSpec = [{{{Ts, Key, self()}}, [], [true]}],
ets:select_delete(?LRU, MSpec)
end.
or
remove_slot_from_lru(_Key, undefined) ->
0;
remove_slot_from_lru(Key, Ts) ->
MSpec = [{{{Ts, Key, self()}}, [], [true]}],
ets:select_delete(?LRU, MSpec).
Not a big deal. Feel free to ignore.
Only twice. The logic in terminate/2 is slightly different since it's accounting for whether the entry was already removed. I'll go ahead and pull it out and add a comment though.
ok.

drain_accessed() ->
This is a very neat idea!!!
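For readers who haven't seen it before, the draining pattern collapses a burst of duplicate mailbox messages into a single action. A minimal generic sketch (the real function drains the module's own access notifications; the message shape here is simplified):

```erlang
%% Sketch of mailbox draining: after handling one 'accessed' message,
%% pull any further queued duplicates out of the mailbox with a
%% non-blocking receive (the 'after 0' clause), so many rapid cache
%% hits collapse into a single LRU timestamp update.
drain_accessed() ->
    receive
        accessed ->
            drain_accessed()
    after 0 ->
        ok
    end.
```

The `after 0` timeout makes the receive return immediately once the mailbox holds no more matching messages, so the loop never blocks.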
@@ -10,25 +10,28 @@
% License for the specific language governing permissions and limitations under
% the License.

-module(ddoc_cache_util).
-module(ddoc_cache_entry_custom).
Are you planning to remove it along with _speed.erl after you finish testing? If not, then why not dispatch the rest of the API functions to Mod?
Not sure on the dispatch question. The diff there is a bit unfortunate, as I removed an unused util module and added this smaller one. The custom module is part of the API though; Mango uses this to cache all indexes for a database.
meck:new(ddoc_cache_ev, [passthrough]),
try
Lru = whereis(ddoc_cache_lru),
State = sys:get_state(Lru),
Would you consider not relying on internals? There are a few approaches to do it:

- Export a small boilerplate with every gen_server which uses the r2l macro:

get_state() -> ?r2l(sys:get_state(whereis(?MODULE))).
get_state(Keys) ->
    %% this would return values in the order of the Keys
    %% so we can use [Foo, Bar] = ddoc_cache_lru:get_state([foo, bar]).
    %% and we wouldn't have to update all integer offsets when we change the record
    [proplists:get_value(K, get_state()) || K <- Keys].

- Export specific introspection functions for testing (get_evictor)
- Export test functions implementing actions (stop_evictor)
It's a test testing internals for code coverage, so it'll be hard to get away from relying on them. ;)
Given that we're assigning all the extracted values to named variables, I don't see how adding an API does anything more than add unnecessary code.
Opener1 = element(4, sys:get_state(Entry)),
Ref1 = erlang:monitor(process, Opener1),
gen_server:cast(Entry, force_refresh),
receive {'DOWN', Ref1, _, _, _} -> ok end,
Do we want a timeout clause here?
Narp. That just adds dead code to the test suite. Tests already have a 5s timeout so no need to add a shorter one.
true = ets:insert_new(?CACHE, #entry{key = Key}),
{ok, Entry} = ddoc_cache_entry:start_link(Key, undefined),
Ref = erlang:monitor(process, Entry),
?assertEqual(1, element(7, sys:get_state(Entry))),
Even if you decide that sys:get_state is the best you can do, please abstract it into a function, since the readability of the test case is affected: the reader has to open the source of ddoc_cache_entry and count the fields to figure out accessed. It could be as simple as:
%% this would make it easier to update when we change state record
%% there is no need to specify fields you are not using in the test
get_state(Entry, opener) -> element(4, sys:get_state(Entry));
get_state(Entry, accessed) -> element(7, sys:get_state(Entry)).
I've labeled the access counts with a variable name so that it's clearer. Given that all of the other instances already use variable names to clarify, it didn't seem hugely useful to add a function for these.
?assertEqual(0, element(7, sys:get_state(Entry))),
ok = gen_server:cast(Entry, refresh),
receive {'DOWN', Ref, _, _, Reason} -> Reason end,
Should we add a timeout clause here as well?
open_dead_entry({DbName, _}) ->
Pid = spawn(fun() -> ok end),
Key = {ddoc_cache_entry_custom, {DbName, ?MODULE}},
?assertEqual(recover(DbName), ddoc_cache_entry:open(Pid, Key)).
awesome test 👍
ddoc_cache_tutil:clear(),
meck:reset(ddoc_cache_ev),
Rev = ddoc_cache_tutil:get_rev(DbName, ?FOOBAR),
ShardName = element(2, hd(mem3:shards(DbName))),
Hmm. Technically we don't include mem3 in the app file, so we are not supposed to call it. We cannot include the mem3 header file either. Tricky problem. However, we include fabric, which depends on mem3.
Should we:
-include_lib("mem3/include/mem3.hrl").
...
%% Replaces: ShardName = element(2, hd(mem3:shards(DbName))),
first_shard(DbName) ->
    [#shard{name = ShardName} | _] = mem3:shards(DbName),
    ShardName.
Listed mem3 in the .app.src just in case fabric somehow eventually comes to not depend on it. And pulled in mem3.hrl to use a pattern match rather than element access. No idea why I did that in the first place.
Key = InitDDoc(I),
couch_log:error("STARTED? ~p", [Key]),
meck:wait(ddoc_cache_ev, event, [started, Key], 1000),
?assert(ets:info(?CACHE, size) > 0)
What does this assertion check (why > 0)? Does it expect:
> 0
> size(BeforeTestRun)
=:= I
I would expect it to check that ets:info(?CACHE, size) =:= I.
It's an old assert from before I switched to memory usage for cache sizing. Originally that assert was == I, and the one below was == I - 5, to assert we were re-filling the cache. They don't make sense any more though, so I'll just remove them.
[FooBar, VDU, Custom].
purge_modules() -> |
Should we move it into test_util (and rename it to purge_modules_for_app maybe)?
Not a bad idea. I've been contemplating writing a Make rule to run eunit tests for each app in a separate Erlang VM. This was mostly a stop gap out of anger.
+1. Very nice work and performance analysis. The test coverage is great as well. Few minor suggestions for tests. Tests pass locally.
This is an old merge artifact that was sending the event notifications twice per design document update.
There were a couple of issues with the previous ddoc_cache implementation that made it possible to tip over the ddoc_cache_opener process. First, there were a lot of messages flowing through a single gen_server. And second, the cache relied on periodically evicting entries to ensure proper behavior in not caching an entry forever after it had changed on disk.

The new version makes two important changes. First, entries now have an associated process that manages the cache entry. This process will periodically refresh the entry, and if the entry has changed or no longer exists the process will remove its entry from the cache. The second major change is that the cache entry process directly mutates the related ets table entries so that our performance is not dependent on the speed of ets table mutations.

Using a custom entry that does no work, the cache can now sustain roughly one million operations a second with twenty thousand clients fighting over a cache limited to one thousand items. In production this means that cache performance will likely be rate limited by other factors, like loading design documents from disk.
Overview
The previous version of ddoc_cache was written to rely on evicting
entries after a maximum TTL. This leads to issues on clusters that have
a large amount of load on databases with a large Q. What ends up
happening is that when a design document is evicted we suddenly have a
thundering herd scenario as every client attempts to reinsert it into
the cache.
This change instead relies on a monitor process for each cache entry
that periodically attempts to refresh the cache. This way normal clients
accessing a popular design document will never hit a point where it
doesn't exist in cache. And we'll have at most one reader trying to
write the value.
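The per-entry refresh idea described above can be sketched as a small loop. Everything here is an illustrative stand-in (module, record, table, and the open_from_disk/1 loader are not the actual PR code):

```erlang
%% Sketch of the per-entry refresh idea: each cache entry has a small
%% process that re-opens the design doc on an interval and either
%% keeps, updates, or removes its cache row, so readers never observe
%% an eviction-then-thundering-herd cycle.
-module(ddoc_cache_refresh_sketch).
-export([refresh_loop/3]).

-define(CACHE, ddoc_cache_entries).
-record(entry, {key, val, pid}).

refresh_loop(Key, Val, IntervalMs) ->
    timer:sleep(IntervalMs),
    case open_from_disk(Key) of
        {ok, Val} ->
            %% Unchanged on disk: keep serving the cached copy.
            refresh_loop(Key, Val, IntervalMs);
        {ok, NewVal} ->
            %% Changed: mutate the ets row directly, no central server.
            true = ets:insert(?CACHE,
                #entry{key = Key, val = NewVal, pid = self()}),
            refresh_loop(Key, NewVal, IntervalMs);
        {not_found, _} ->
            %% Deleted: drop the entry rather than serve stale data.
            true = ets:delete(?CACHE, Key)
    end.

%% Hypothetical loader; in the real code this would be a fabric call.
open_from_disk(_Key) ->
    {not_found, deleted}.
```

The key property is that at most one process (the entry's own) ever re-reads the design document, no matter how many clients are hitting the cache.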
Testing recommendations
make eunit apps=ddoc_cache
You'll find it's got 100% test coverage if you enable cover.
One note on the tests that's a bit different from other suites: I stumbled across a fun pattern of adding a function call at various locations in the code so that I can meck:wait on it. This is a compile-time switch, so it only affects the eunit runs. Production code has the calls compiled out and replaced by an atom, which does nothing.
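A compile-time switch like this is typically done with an -ifdef; a plausible sketch (the macro name and ignored-atom are assumptions, not necessarily what the PR uses):

```erlang
%% Sketch: under TEST the macro expands to a real function call that
%% meck can intercept and wait on; in production builds it compiles
%% down to a bare atom, costing nothing at runtime.
-ifdef(TEST).
-define(EVENT(Name, Arg), ddoc_cache_ev:event(Name, Arg)).
-else.
-define(EVENT(Name, Arg), ignored).
-endif.
```

Test code can then synchronize on internal state transitions with `meck:wait(ddoc_cache_ev, event, [started, Key], 1000)` without polluting the production hot path.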
Checklist