
Optimize ddoc cache #610

Merged: 2 commits into master, Aug 1, 2017

Conversation

@davisp (Member) commented Jun 22, 2017

Overview

The previous version of ddoc_cache was written to rely on evicting
entries after a maximum TTL. This leads to issues on clusters that have
a large amount of load on databases with a large Q. What ends up
happening is that when a design document is evicted we suddenly have a
thundering herd scenario as every client attempts to reinsert it into
the cache.

This change instead relies on a monitor process for each cache entry
that periodically attempts to refresh the cache. This way normal clients
accessing a popular design document will never hit a point where it
doesn't exist in cache. And we'll have at most one reader trying to
write the value.
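As a rough sketch of that idea (the names here are illustrative, not the PR's actual code), each entry gets a linked process that wakes up on an interval and asks for a refresh:

refresher(Key, Interval) ->
    receive
        stop ->
            ok
    after Interval ->
        %% ask the cache to re-read the entry; if it changed or
        %% disappeared the entry gets updated or removed
        gen_server:cast(ddoc_cache_lru, {refresh, Key}),
        refresher(Key, Interval)
    end.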

Testing recommendations

make eunit apps=ddoc_cache

You'll find it's got 100% test coverage if you enable cover.

One note on the tests that's a bit different from other suites: I stumbled across a fun pattern of adding a function call at various locations in the code so that I can meck:wait on it. This is a compile-time switch, so it only affects the eunit runs. Production code has the calls compiled out and replaced by an atom which does nothing.
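For illustration, the compile-time switch could look roughly like this; the exact macro body is an assumption, though ddoc_cache_ev:event/2 is the call the tests shown later wait on:

-ifdef(TEST).
-define(EVENT(Name, Arg), ddoc_cache_ev:event(Name, Arg)).
-else.
%% compiled out in production: the call site becomes a bare atom
-define(EVENT(Name, Arg), ignored).
-endif.

%% In a test, meck can then synchronize on the call:
%%   meck:new(ddoc_cache_ev, [passthrough]),
%%   meck:wait(ddoc_cache_ev, event, [started, Key], 1000)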

Checklist

  • Code is written and works correctly;
  • Changes are covered by tests;

    {noreply, St};

handle_cast({evict, DbName}, St) ->
    gen_server:abcast(mem3:nodes(), ?MODULE, {do_evict, DbName}),
Contributor:

Should this be [node() | mem3:nodes()] to ensure the do_evict logic is triggered on every node? Unless I'm missing something it looks like you'll trigger do_evict on all the other nodes besides this one.

Contributor:

mem3:nodes() should include node(), from what I remember.

Some nodes might not be live. I initially thought maybe it would be good to only send to live ones, but abcast will just silently ignore those sends, so it's probably no big deal.

Contributor:

@nickva oh right, good call. Even though I actually typed mem3:nodes() I read it as nodes() ;-)


handle_db_event(ShardDbName, deleted, St) ->
    gen_server:cast(?MODULE, {evict, mem3:dbname(ShardDbName)}),
    {ok, St};
Contributor:

Would it make sense to also handle the couch_event:notify(Db2#db.name, {ddoc_updated, DDocId}) event and trigger a refresh directly?

Member Author:

I'd contemplated removing the evict call and handling it purely in ddoc_cache using the events. At the time I figured I'd keep the same pattern, but given that we're already processing (discarding) them, I guess it makes sense to remove the extra message.

    } = St,
    NewTime = Time + 1,
    NewSt = St#st{time = NewTime},
    Pid = ddoc_cache_refresher:spawn_link(Key, ?REFRESH_TIMEOUT),
Contributor:

Does it make sense to have a separate refresh pid for every single ddoc? That seems like it could potentially get a bit out of hand.

Contributor:

If we do stick with a pid per ddoc, would it be cleaner to use a simple_one_for_one supervisor here and just add children to that?

Member Author:

In my experience supervisors are too slow for this sort of spawning since they want to start every child synchronously. The original PR for this work had a custom supervisor that I decided not to use because it seemed to add more complexity than it saved.

        [#entry{val = Val}] ->
            ?EVENT(update_noop, Key),
            loop(Key, Interval);
        [#entry{pid = Pid}] when Pid == self() ->
Contributor:

Is the intention here to force this pid to exit when there's a different pid handling the #entry{}?

Member Author:

It's a mild assertion that we don't have more than one pid updating a cache entry, yes. It shouldn't ever kick in, but especially during development it's good to know that the behavior is what we're expecting.

            ddoc_cache_entry:handle_resp(Resp);
        [#entry{val = Val}] ->
            couch_stats:increment_counter([ddoc_cache, hit]),
            ddoc_cache_lru:accessed(Key),
Contributor:

I'm concerned about sending a message here with every successful lookup. We've seen this be problematic in couch_server, and I think we might run into similar issues here. Every single time ddoc_cache_lru:accessed/1 is called it can trigger one khash read, one khash write, one ets lookup, one is_process_alive check, one ets delete, and one ets insert. I could see this getting overwhelmed in a hurry.

    {noreply, St#st{evictor = Evictor}};

handle_info(Msg, St) ->
    {stop, {invalid_info, Msg}, St}.
Contributor:

Related to my comment below in https://github.com/apache/couchdb/pull/610/files#r123584058, we're ignoring refresher pids exiting when another refresher pid has an entry. Is this intentional?

Member Author:

Same as before, this is asserting that the cache entries are behaving as expected. Each entry has exactly one refresher pid that lives for as long as the cache entry. We don't want more than one, and we especially don't want zero. When we close an entry we unlink and stop the refresher, so we should never see an unknown 'EXIT' message.

error:badarg ->
recover
couch_stats:increment_counter([ddoc_cache, miss]),
Resp = gen_server:call(?MODULE, {open, Key}, infinity),
Contributor:

We're falling back on the same pattern here of sending an open message on every cache miss. We've seen this be problematic in the existing ddoc_cache, and I think this is an important issue to solve here. Is there a reason you think sticking with this approach is the proper way forward? It seems like we'll still be susceptible to thundering herd problems when there is not an entry in the cache.

Would it make sense to use the opener pattern you used recently in mem3_shards? b71677f

Member Author:

Right. I was only focused on the worst case we've seen, where we have very little load on ddoc_cache and were still hitting the thundering herd (because of the TTL expiration). I'll contemplate the mem3 approach, but if memory serves that requires us to change our hard limit on cache size to a soft limit. Granted, that's not a huge issue but it's something to think about.



init(_) ->
    BaseOpts = [public, named_table],
Contributor:

If all the tables are public, would it make sense to have callers perform more of the ETS updates rather than funneling it through a gen_server? Obviously not all of that would be appropriate, but it seems like you could potentially do something here with the ?LRU table and using os:timestamp(). For instance:

%% Make a duplicate_bag ets table to store an index of all LRU entries
%% per ddoc_cache Key
ets:new(lru_keys, [duplicate_bag, named_table, public]).

accessed(Key) ->
    T = os:timestamp(),
    %% delete the old timestamped ?LRU rows for this Key, then reinsert
    [ets:delete(?LRU, OldT) || {_, OldT} <- ets:lookup(lru_keys, Key)],
    ets:delete(lru_keys, Key),
    true = ets:insert(?LRU, {T, Key}),
    true = ets:insert(lru_keys, {Key, T}).

This would bypass funneling the lru access messages through ddoc_cache_lru and instead use ETS for the concurrent access.

Member Author:

We certainly can't have the LRU updates not be single-threaded, as we can't do multi-key atomic compare-and-replace type operations.

Member Author:

And the other changes are mostly due to us having a hard cap on cache size. Going with a soft cap, I think we could do something similar to the mem3 approach. Also, these tables are only public so that I could use a separate process for table ownership.

Contributor:

"Certainly can't have the LRU updates not be single-threaded as we can't do multi-key atomic compare and replace type operations." Is that actually needed? I think the duplicate bag approach above might allow us to skip that entirely. We would still need to figure out what to do for the ?LRU entries, in that if we used os:timestamp() we could potentially end up with duplicate entries, and I don't think there's an ordered_duplicate_bag ETS table. But I think there's potential for a less strict LRU that accomplishes the same things.

Member Author:

Ah! I missed the duplicate_bag trick. Though your os:timestamp() means you could stomp on another key, as you point out. Maybe include the pid in there or something?

Member Author:

I should say that's quite an intriguing approach. I'm poking around at other caching schemes to see if there isn't a slight variation that'll play better with ets.

@davisp (Member Author) commented Jun 23, 2017

Refreshing my memory on the mem3 PR, I can definitely apply a lot of that to this as well, so I'll update things for that.

For the accessed messages there's not a lot I think we can do with the existing structure. Although I could see changing things around a bit to make it an LFU cache instead of an LRU cache. It'd involve a table scan to evict but would remove a lot of the message passing to ddoc_cache_lru and friends. I'll make the mem3-inspired changes first and then will look at this approach.
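For context, a minimal sketch of that LFU idea, assuming a hypothetical ?ACCESS counter table (none of these names appear in the PR):

accessed(Key) ->
    %% atomically bump a per-key access counter, creating it on first use
    ets:update_counter(?ACCESS, Key, 1, {Key, 0}).

least_used(N) ->
    %% the table scan mentioned above: sort by count, take the N coldest
    Counts = ets:tab2list(?ACCESS),
    lists:sublist(lists:keysort(2, Counts), N).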

@davisp (Member Author) commented Jun 27, 2017

@chewbranca You managed to nerd snipe me for like four days trying to figure out the mem3 and concurrent LRU stuff.

For the mem3 bits, I think my conclusion is that it's not appropriate here. For mem3_shards we wanted to make sure that there would always be progress, so we load things outside of the gen_server and send them to cache. Given that it's a local db read, this isn't the most terrible thing in the world. For ddoc_cache, though, it's a clustered fabric call, so if we followed the mem3_shards pattern we'd actually just be putting extra load on other parts of the system like rexi and the couch_db_updaters at 3x the rate of client requests. So having them funnel through the opener actually makes more sense here.

As to the distributed LRU approach for writes, I've been playing with a couple different standalone tests to try to get a feel for our maximum throughput with that. So far (with some extremely synthetic tests) I think we're looking at topping out at around 200K updates a second (at least on my laptop). However, this test doesn't take into account actually evicting things. My plan for today is to write a basho_bench driver to try to gauge the relative throughput between what I've got written now and a second approach based on your direct ets write idea. I'll post results here when I get them.

@davisp (Member Author) commented Jun 28, 2017

@chewbranca Epic nerd snipe. I've sprinkled more parallel on the ddoc_cache. Care to take another look? Currently measuring 1M+ ops/sec against a 1,000 item cache on my MBP. I have to do some work on the test suite tomorrow to finish up the rewritten rewrite work, but I'm currently pretty happy with it.

The new style is one pid per entry which does all of the ets writes for the entry. The LRU process only exists to do evictions. I soaked 20,000 clients against a 500 item cache on my laptop for 30-45m and it sustained about 1.5M ops/sec without error and without blowing RAM up beyond 500M. Those numbers are for a custom module that does no loading, but I think they show that performance won't be bound by the cache itself; instead it'll be fabric calls or some other bottleneck before the cache.


-module(ddoc_cache_tables).
-behaviour(gen_server).
-vsn(1).
Contributor:

Why the vsn(1) attribute?

Member Author:

Makes it easier to write code_change clauses if we ever get back to hot code upgrades.
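As a hypothetical illustration of what that buys you (not code from this PR): OldVsn in code_change/3 is the -vsn() value of the old module version, so upgrade clauses can match on it directly:

code_change(1, OldSt, _Extra) ->
    %% migrate the vsn 1 state record to the new shape here
    {ok, OldSt};
code_change(_OldVsn, St, _Extra) ->
    {ok, St}.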

Contributor:

Some modules don't have it, like ddoc_cache_entry. Should we add it to those as well?

Member Author:

Will do

% the License.

-module(ddoc_cache_entry).

Contributor:

Add the gen_server behaviour attribute?

-behaviour(gen_server).

Member Author:

Good call.

        waiters = []
    },
    ?EVENT(started, Key),
    gen_server:enter_loop(?MODULE, [], St).
Contributor:

Why use the proc_lib:spawn_link + enter_loop instead of the regular gen_server init pattern?

Member Author:

Performance. The gen_server:start* functions are all synchronous, so we'd be gating ddoc_cache_lru performance on the cost of ddoc_cache_entry's init function plus the overhead of message passing back and forth. Here we link and crash like the world is on fire if anything goes wrong, so it's OK to fire and forget.
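A minimal sketch of that pattern, assuming an #st record along the lines shown in the excerpt above (illustrative, not the PR's exact code):

start_link(Key, Default) ->
    %% returns immediately; init/1 runs in the new, linked process
    Pid = proc_lib:spawn_link(?MODULE, init, [{Key, Default}]),
    {ok, Pid}.

init({Key, Default}) ->
    %% any crash here takes the linked caller down with it
    St = #st{key = Key, val = Default, waiters = []},
    ?EVENT(started, Key),
    gen_server:enter_loop(?MODULE, [], St).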

@chewbranca (Contributor):

@davisp hehehehe happy to help with the sniping ;-)



open(Key) ->
    try ets:lookup(?CACHE, Key) of
Contributor:

In ddoc_cache:open_doc/{2,3} the Key is {ddoc_cache_entry_ddocid, {DbName, DocId}} and {ddoc_cache_entry_ddocid_rev, {DbName, DocId, RevId}}, respectively. That implies we'll have separate pids for the "latest version" and "specific versions" cases. Is that intentional?

It also means we'll end up triggering two ddoc_cache load cycles on an uncached ddoc view request. For instance in fabric:query_view/6 [1] we call the two arity open_doc giving us Key = {ddoc_cache_entry_ddocid, {DbName, DocId}}. We then extract the doc_id/rev pair [2] to ensure all view shards use the same ddoc and ship that in the rpc calls, at which point we use the three arity open_doc variant to open the specific ddoc rev [3]. This means we'll perform a non ddoc-rev cache fetch on the coordinator, followed by up to Q*N ddoc-rev cache lookups on the rpc nodes, leaving the coordinator node with the ddoc entry and no ddoc-rev entry, and the rpc nodes with the ddoc-rev entry and no ddoc entry. Then we'll potentially perform a non ddoc-rev fetch on every single coordinator node until every node has the non ddoc-rev cache entry, even if they already have the latest ddoc-rev entry. So if you have 12 nodes, and round robin one view request across all of them you'll block on caching the non ddoc-rev version 12 times, despite most likely already having the relevant latest rev in the ddoc-rev version.

Perhaps this isn't the end of the world, but it definitely seems awkward and potentially avoidable. In the case of loading the ddoc-rev version you could insert a ?CACHE entry for the non ddoc-rev version pointing to that pid, but that might be awkward in the inverse case when you start with ddoc_open/2.

On a related note, does this mean that we'll have pids sticking around for old ddoc rev versions until they're evicted from the cache?

[1] https://github.com/apache/couchdb/blob/master/src/fabric/src/fabric.erl#L337
[2] https://github.com/apache/couchdb/blob/master/src/fabric/src/fabric_view_map.erl#L28
[3] https://github.com/apache/couchdb/blob/master/src/fabric/src/fabric_rpc.erl#L119

Member Author:

That's all mostly right. The motivation for this was that it would allow the individual entries to be direct lookups rather than table scans, and it also allows the non-rev-specific entry to automatically update its value when a design document is updated. As to the extra non-cached lookups, the way I reasoned about it was that it's only a constant number in the frequently-requested case, so once they're in cache it's not an issue. And on the flip side, if we're iterating over things in such a way that we're evicting entries and then have a lot of traffic to un-cached entries, it doesn't matter either, as we'd have to increase the cache size regardless.

I'll take a ponder on how to insert things into the cache from non-client-requested actions, but it seems to me like it'd open up a lot of race conditions in the already fairly complex logic around managing the cache entries.

Also, for the rev-specific entries that's true for the most part. They'll stick around until evicted or until a quorum of their database shards are compacted. I'll contemplate how easy it'd be to add some sort of check there so that they'll close if they refresh and are not the most recent revision.

Member Author:

There's more response in my PR comment down below, but these two commits address the two main points from @chewbranca's comment:

85b5385
32cd254

-include_lib("couch/include/couch_db.hrl").


dbname({DbName, _}) ->
Contributor:

Hrmm... don't we need the 3-tuple variants of these functions to handle keys of the form {DbName, DocId, RevId}?

Contributor:

Aha! You're using the different module ddoc_cache_entry_ddocid_rev for the 3-tuple case. NVM!

        not_found ->
            {ok, Keys} = khash:from_list([{Key, Pid}]),
            {ok, DDocIds} = khash:from_list([{DDocId, Keys}]),
            khash:put(Dbs, DbName, DDocIds)
Contributor:

Minor nit: the name of the variable DDocIds is a bit confusing because you can have the two- and three-tuple key variants in here, so it's ddoc_ids and ddoc_id_revs. Perhaps DDocKeys instead?

Member Author:

That's not quite right. DDocIds are only ever a single binary. It's a tree of entries made out of the ddoc_cache_entry:dbname/1 and ddocid/1 calls, so that when we go to evict entries based on updates we can avoid scanning the entire ?CACHE table looking for matches.
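To make the shape concrete, here is a hedged sketch of walking that Dbs -> DDocIds -> Keys tree on a database eviction; evict_db/2 and ddoc_cache_entry:shutdown/1 are assumptions here, not the PR's code:

evict_db(Dbs, DbName) ->
    case khash:lookup(Dbs, DbName) of
        {value, DDocIds} ->
            %% two-level walk instead of a full ?CACHE table scan
            lists:foreach(fun({_DDocId, Keys}) ->
                lists:foreach(fun({_Key, Pid}) ->
                    ddoc_cache_entry:shutdown(Pid)
                end, khash:to_list(Keys))
            end, khash:to_list(DDocIds)),
            khash:del(Dbs, DbName);
        not_found ->
            ok
    end.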

open(Key) ->
    try ets:lookup(?CACHE, Key) of
        [] ->
            lru_start(Key);
@chewbranca (Contributor) commented Jul 1, 2017:

I think it would be useful to add counters for all four clauses here rather than just the miss/hit cases.

Perhaps couch_stats:increment_counter([ddoc_cache, lru_start]) and couch_stats:increment_counter([ddoc_cache, lru_partial_start]) or some such for this clause and the one immediately below. That would make it nice and easy to see if we're hitting thundering herd problems.

Member Author:

I'm not sure that the best thing is to add those calls here. I think that all of the miss cases are correctly qualified as misses. However, it might be useful to add a new metric to the entry process that tracks how many clients it responded to, which would show the thundering herd issues.

        {ok, Val}
    catch _:_ ->
        couch_stats:increment_counter([ddoc_cache, recovery]),
        ddoc_cache_entry:recover(Key)
Contributor:

This is a nice fallback pattern of loading things directly through fabric in the event of an error or no ddoc_cache app running. 👍

    {open_ok, Key, {ok, Val}} ->
        if not is_list(St#st.waiters) -> ok; true ->
            respond(St#st.waiters, {open_ok, {ok, Val}})
        end,
Contributor:

I'm not seeing where the waiters list is cleared out. I'm assuming that should be down in the NewSt construction on line 191?

Member Author:

Awesome catch! Totally forgot to clean it out on NewSt, just as you mentioned. Apparently none of the tests is long-lived enough to get caught out by having an extra message in its mailbox.

    NewSt = St#st{
        val = {Status, Other},
        opener = undefined,
        waiters = undefined
@chewbranca (Contributor) commented Jul 1, 2017:

So I see that setting waiters = undefined will bypass the respond-to-waiters logic below on line 204; however, this will eventually cause a runtime error while calling gen_server:reply(undefined, Resp) in the respond/2 function if val is ever unset for whatever reason and we end up back in the handle_call(open, From, #st{val = undefined} = St) -> case. Seems like it would be better to always keep this as a list so we can never encounter that scenario.

Contributor:

Well... I suppose this is somewhat moot given this case clause ends in {stop, normal, NewSt}

Member Author:

No, NewSt has waiters = undefined; the original St record still has the waiters list. This is a switch to know when we're responding vs not. I guess technically we could keep it an empty list and call a function that does a no-op, but that seemed slightly weirder than this way, as it doesn't show the actual intent, so I'd fear it'd get lost in the sands of time.

It's also a slightly subtle assumption that once we switch to "actual value inserted in cache" mode we don't want to handle open calls.

    catch exit:_ ->
        % It's possible that this process was evicted just
        % before we tried talking to it. Just fall back
        % to a standard recovery
Contributor:

It would be good to update this comment to indicate this is also the fallback for when Pid executes spawn_opener and that opener pid fails for whatever reason, resulting in the not-open_ok clause of handle_info({'DOWN', _, _, Pid, Resp}, #st{key = Key, opener = Pid} = St) -> being hit, leading to {stop, normal, NewSt}, which will be caught by this clause as Pid dies.

Member Author:

That's not the intention, and I think you were just missing the slightly subtle reference to St after creating NewSt.

-define(CACHE, ddoc_cache_entries).
-define(LRU, ddoc_cache_lru).
-define(REFRESH_TIMEOUT, 67000).
-define(SHUTDOWN_TIMEOUT, 1000).
Contributor:

Any reason to not make these configurable? At the very least REFRESH_TIMEOUT?

Member Author:

Any reason to make them configurable?

Member Author:

I made REFRESH_TIMEOUT configurable and this is now just the default. SHUTDOWN_TIMEOUT is entirely too subtle to be turned into a config option. If we hit issues there then we have other problems.
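For reference, a sketch of how the configurable timeout with the macro as default might read; the exact config section and key names here are assumptions:

refresh_timeout() ->
    %% falls back to the compiled-in default when unconfigured
    config:get_integer("ddoc_cache", "refresh_timeout", ?REFRESH_TIMEOUT).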

            not_found ->
                ok
        end
    end, [no_ddocid | DDocIdList]);
Contributor:

Might be worth noting no_ddocid is for the ddoc_cache_entry_custom case.

Member Author:

And VDUs. I'll add a comment.

Member Author:

This is done up above now.

@@ -0,0 +1,61 @@
-module(ddoc_cache_speed).
Contributor:

Should this module be in the test/ dir? Or does that complicate executing it?

Member Author:

It's marked as a TEMP commit. I'll end up removing it before merging the final PR. It's not a very rigorous test. To keep it in history I may remove it with a revert rather than a squash, if you think it'd be worth it.

@chewbranca (Contributor):

@davisp overall looks solid to me.

@davisp (Member Author) commented Jul 21, 2017

Performance Benchmarking Update: New one is faster!

And now for some graphs. For each of these graphs the max_dbs_open was set to 5,000. That's important to remember as I go through the parameter sets, since hitting that limit puts an artificial cap on the performance of the cache. Each of these tests is also just querying an empty view, which leads to a ddoc_cache lookup for the coordinator and then Q lookups for each RPC worker. So the requests-per-second numbers have to be multiplied out to get actual cache performance. Granted, we really don't care, given that it's a constant factor.

These first two runs are 1,000 workers hitting 1,000 different design documents in 1,000 different databases with two different Q values (i.e., 1,000 workers each using their own ddoc and db). For the most part their ops/sec are basically identical, while latencies for the new ddoc_cache are slightly better. This suggests that something else was bottlenecking performance, possibly the basho_bench driver and possibly something else in CouchDB.

Old ddoc_cache Q=4:

[graph: old-ddoc-cache-multi-conc-1000-q-4]

New ddoc_cache Q=4:

[graph: new-ddoc-cache-multi-conc-1000-q-4]

Old ddoc_cache Q=8:

[graph: old-ddoc-cache-multi-1000-q-8]

New ddoc_cache Q=8:

[graph: new-ddoc-cache-multi-1000-q-8]

So basically, trying to hit a bunch of different databases at once now has slightly better latencies but both are fast enough to not be the bottleneck.

Next up was a shoot the moon test to try and test the limits of what sort of performance we could get trying to crush a single entry. These two graphs show 1,000 workers hitting a single design document in a single database. The Q for both of these is 128 which means that the ddoc_cache is going to have to try and maintain 129,000 lookups/sec.
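Presumably that figure is each request costing one coordinator lookup plus Q = 128 RPC lookups, multiplied across the 1,000 workers: 1,000 × (1 + 128) = 129,000 lookups/sec.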

Old ddoc_cache:

[graph: old-ddoc-cache-single-conc-1000-q-128]

New ddoc_cache:

[graph: new-ddoc-cache-single-conc-1000-q-128]

The immediate thing to note here is that the old ddoc_cache flatlines halfway into its test. This was because couch_log_server spiked its message queue, due to the huge flood of fabric_rpc_worker timeout messages being logged. If you look at the graph for the old ddoc_cache you can see that it has two cliffs, one at 60s and one at 120s, when it then flatlines because it pushed couch_log_server off a cliff. These drops are precisely what motivated this work in the first place, when the old ddoc_cache would evict entries every 60s and the thundering herd would knock things off their rocker.

For the new ddoc_cache you'll see that it's still kinda crap performance, even though it doesn't totally lose its mind. In the background this was because rexi was unable to keep up with the workload and spiked pretty badly. I'm gonna be investigating that area for more optimizations once this work is wrapped up.

And finally, this last graph is a comparison between the old and new ddoc_caches sweeping through Q=8, 16, and 32 with the same 1,000 workers against a single db and ddoc. Red is the old ddoc_cache, green is the new ddoc_cache. Yes, I know Tufte would kill me, but those are the colors that get picked and I don't care enough to go fiddle with it.

[graph: comparison-conc-100-q-8-16-32]

As you can see, the new ddoc_cache is consistently faster as well as much less variable. Looking into the variance, the best Q=64 run for the old ddoc_cache vs the worst Q=64 run for the new ddoc_cache shows a good example of the variance of the old approach. For these two runs I've also included what the rexi server message queue is doing, so we can see its effect on performance.

Old ddoc_cache:

[graph: old-ddoc-cache-single-conc-1000-q-64]
[graph: old-ddoc-cache-single-rexi-conc-1000-q-64]

New ddoc_cache:

[graph: new-ddoc-cache-single-conc-1000-q-64]
[graph: new-ddoc-cache-single-rexi-conc-1000-q-64]

Again we can see how bad that 60s eviction policy is when we have sustained load against a single design document. It leads to some fairly massive spikes in the system. For the old ddoc_cache, on the third eviction at 180s we see it flatline again, which is why those runs are so variable.

For the new ddoc_cache we can see that db3's rexi has sustained elevated message counts, which are holding the benchmark back from meeting some of the old ddoc_cache's spikes when rexi had a chance to clear out. Now that I can duplicate that rexi issue easily enough, I'll be working on trying to figure out why it's slow and trying to optimize around it.

Hopefully this data is as convincing to everyone else as it is to me. If anyone wants me to check into anything else, I certainly have the data and/or can design runs in some other configuration if requested. Having poked both ends of the spectrum (lots of clients against separate design docs and lots of clients against a single design doc), I'm fairly confident that we're winning across the board, most specifically for the single design doc case (which was the motivation for this work).



handle_call(Msg, _From, St) ->
    {stop, {invalid_call, Msg}, {invalid_call, Msg}, St}.
Contributor:

I never understood the reasoning behind terminating a server when it receives an unknown message. In fact it is the caller's (sender's) fault, so the caller should be terminated, not the server. I think we should just ignore all unknown messages in this case (and in the other gen_server callbacks in this module).

Contributor:

Given that the tables are supervised by a one_for_all supervisor, we would restart most of ddoc_cache when we receive an unknown message by accident.

Member Author:

You're quite right that it should be the client that terminates. I've always wanted some sort of VM feature that would allow us to define a protocol of message patterns a process is willing to receive. However, I haven't got a clue how that'd work when it comes to distributed Erlang.

However, given that something like that isn't available, the next best possibility for enforcing a protocol is to bail if we get something unexpected. In some cases it can be appropriate to ignore messages; however, given that this is a concurrent cache of design documents in a database, it would be a bad idea. If we're getting messages that are unexpected then it means that something we've written is not behaving as intended. And given that this cache handles things like validate doc update functions, it could be a severe issue if we were to have something working incorrectly.

The one_for_all is legacy from a previous iteration of this work that split the LRU work between two gen_servers. In that case, if either of the servers died for any reason, we would have wanted the other to also be restarted so that state is not inconsistent. Now that all the eviction and opening logic is in a single gen_server, I'll remove the tables process and revert to a standard one_for_one supervisor strategy.

@iilyak (Contributor) commented Jul 24, 2017:

For "gen_server call" crashing the client is simple, but "info" is impossible.

some_operation(Pid, Arg, Something) when is_integer(Arg), is_atom(Something) ->
    gen_server:call(Pid, {do_some_operation, Arg, Something}).

handle_call({do_some_operation, Arg, Something}, ...)

If we place the API function in the same module with appropriate guards, we would be able to crash the caller.

This technique is not applicable in the case of the ddoc_cache_tables.erl module, since that module is not supposed to handle any messages; it is just an ets owner process.

Member Author:

That's not an explicit assertion, though, since it relies on the programmer to guarantee that any message generated by some_operation/3 is handled by a clause in handle_call/3. Having a catch-all is an explicit assertion that prevents the possibility of ignoring messages that were intended to be handled.

        ok ->
            true = ets:insert_new(?CACHE, #entry{key = Key}),
            {ok, Pid} = ddoc_cache_entry:start_link(Key, Default),
            true = ets:update_element(?CACHE, Key, {#entry.pid, Pid}),
Contributor:

What is the reason for updating the pid here? It seems like we do set it already here and here.

Member Author:

It's an optimization to start funneling open messages directly to the entry process even before it's fully initialized, to relieve pressure on the ddoc_cache_lru process. The places you mentioned are creating a full entry. Rather than do an update on the individual elements, they just overwrite the entire #entry{}, which involves setting the pid again.

end.


insert(Key, Value) ->
Contributor:

Just wanted to mention that the return value of this function is inconsistent (and not obvious); it can return any of:

  • ok
  • {ok, Pid}
  • full

It shouldn't be a problem, though, since the result of the call is not checked anywhere.

Member Author:

Yes, it's fire and forget. I can add an unconditional ok after the gen_server:call if you prefer, but it didn't seem super useful since it's also an internal API.

    CurSize = ets:info(?CACHE, memory) * erlang:system_info(wordsize),
    couch_log:error("SIZE: ~b :: ~b~n", [CurSize, MaxSize]),
    if CurSize =< MaxSize -> ok; true ->
        case ets:first(?LRU) of
Contributor:

The fact that we have to access the ?LRU table here is unfortunate. We could implement:

%% In ddoc_cache_entry:
oldest_entry() ->
    case ets:first(?LRU) of
        '$end_of_table' -> nil;
        {_Ts, _Key, _Pid} = Entry -> Entry
    end.

Not a big deal. Feel free to ignore.

Contributor:

For efficiency reasons you could swap the '$end_of_table' and {_, _, _} clauses, so the most likely case would be first. Not a big deal either, since it is a micro-optimization.

Member Author:

Probably a bit clearer as well to change that order so the positive branch is first. Will do that.

    Pattern = #entry{key = Key, pid = self(), _ = '_'},
    CacheMSpec = [{Pattern, [], [true]}],
    1 = ets:select_delete(?CACHE, CacheMSpec),
    if Ts == undefined -> ok; true ->
Contributor:

These exact 3 lines are repeated 3 times in the module. It might be a good opportunity to extract a function:

remove_slot_from_lru(Key, Ts) ->
    if Ts == undefined -> 0; true ->
        MSpec = [{{{Ts, Key, self()}}, [], [true]}],
        ets:select_delete(?LRU, MSpec)
    end.

or

remove_slot_from_lru(_Key, undefined) ->
    0;
remove_slot_from_lru(Key, Ts) ->
    MSpec = [{{{Ts, Key, self()}}, [], [true]}],
    ets:select_delete(?LRU, MSpec).

Not a big deal. Feel free to ignore.

Member Author:

Only twice. The logic in terminate/2 is slightly different since it's accounting for whether the entry was already removed. I'll go ahead and pull it out and add a comment though.

ok.


drain_accessed() ->
Contributor:

This is a very neat idea!!!
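For readers following along, the general drain shape being praised here is the zero-timeout receive loop; this sketch is an assumption about the idea, not the PR's exact code:

drain_accessed(Acc) ->
    %% pull every pending accessed message off the mailbox in one pass,
    %% collapsing a burst of notifications into a single batch
    receive
        {accessed, Key} ->
            drain_accessed([Key | Acc])
    after 0 ->
        lists:reverse(Acc)
    end.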

@@ -10,25 +10,28 @@
% License for the specific language governing permissions and limitations under
% the License.

-module(ddoc_cache_util).
-module(ddoc_cache_entry_custom).
Contributor:

Are you planning to remove it along with _speed.erl after you finish testing? If not, then why not dispatch the rest of the API functions to Mod?

Member Author:

Not sure on the dispatch question. The diff there is a bit unfortunate as I removed an unused util module and added this one that's smaller. The custom module is part of the API though. Mango uses this to cache all indexes for a database.

    meck:new(ddoc_cache_ev, [passthrough]),
    try
        Lru = whereis(ddoc_cache_lru),
        State = sys:get_state(Lru),
Contributor:

Would you consider not relying on internals? There are a few approaches to doing it:

  1. Export a small boilerplate with every gen_server which uses the r2l macro:
    get_state() ->
        ?r2l(sys:get_state(whereis(?MODULE))).
    get_state(Keys) ->
        %% this would return values in the order of the Keys,
        %% so we can use [Foo, Bar] = ddoc_cache_lru:get_state([foo, bar])
        %% and we wouldn't have to update all integer offsets when we change the record
        [proplists:get_value(K, get_state()) || K <- Keys].
    
  2. Export specific introspection functions for testing (get_evictor).
  3. Export test functions implementing actions (stop_evictor).

Member Author:

It's a test testing internals for code coverage, so it'll be hard to get away from relying on them. ;)

Given that we're assigning all the extracted values to named variables, I don't see how adding an API is doing anything more than just adding unnecessary code.

    Opener1 = element(4, sys:get_state(Entry)),
    Ref1 = erlang:monitor(process, Opener1),
    gen_server:cast(Entry, force_refresh),
    receive {'DOWN', Ref1, _, _, _} -> ok end,
Contributor:

Do we want a timeout clause here?

Member Author:

Narp. That just adds dead code to the test suite. Tests already have a 5s timeout, so there's no need to add a shorter one.

    true = ets:insert_new(?CACHE, #entry{key = Key}),
    {ok, Entry} = ddoc_cache_entry:start_link(Key, undefined),
    Ref = erlang:monitor(process, Entry),
    ?assertEqual(1, element(7, sys:get_state(Entry))),
Contributor:

Even if you decide that sys:get_state is the best you can do, please abstract it into a function, since the readability of the test case is affected. The reader has to open the source of ddoc_cache_entry and count the fields to figure out accessed. It could be as simple as:

%% this would make it easier to update when we change the state record;
%% there is no need to specify fields you are not using in the test
get_state(Entry, opener) -> element(4, sys:get_state(Entry));
get_state(Entry, accessed) -> element(7, sys:get_state(Entry)).

Member Author:

I've labeled the access counts with a variable name so that it's clearer. Given that all of the other instances already use variable names to clarify, it didn't seem hugely useful to add a function for these.


    ?assertEqual(0, element(7, sys:get_state(Entry))),
    ok = gen_server:cast(Entry, refresh),
    receive {'DOWN', Ref, _, _, Reason} -> Reason end,
Contributor:

Should we add a timeout clause here as well?

open_dead_entry({DbName, _}) ->
    Pid = spawn(fun() -> ok end),
    Key = {ddoc_cache_entry_custom, {DbName, ?MODULE}},
    ?assertEqual(recover(DbName), ddoc_cache_entry:open(Pid, Key)).
Contributor:

awesome test 👍

    ddoc_cache_tutil:clear(),
    meck:reset(ddoc_cache_ev),
    Rev = ddoc_cache_tutil:get_rev(DbName, ?FOOBAR),
    ShardName = element(2, hd(mem3:shards(DbName))),
Contributor:

Hmm. Technically we don't include mem3 in the app file, so we are not supposed to call it. We cannot include the mem3 header file either. Tricky problem. However, we include fabric, which depends on mem3. Should we:

-include_lib("mem3/include/mem3.hrl").
...  
%% Replaces: ShardName = element(2, hd(mem3:shards(DbName))),
first_shard(DbName) ->
      [#shard{name = ShardName} | _] = mem3:shards(DbName)),
      ShardName.

Member Author:

Listed mem3 in the .app.src just in case fabric somehow eventually comes to not depend on it. And pulled in mem3.hrl to use a pattern match rather than element access. No idea why I did that in the first place.

    Key = InitDDoc(I),
    couch_log:error("STARTED? ~p", [Key]),
    meck:wait(ddoc_cache_ev, event, [started, Key], 1000),
    ?assert(ets:info(?CACHE, size) > 0)
Contributor:

What does this assertion check (why > 0)? Does it expect:

  • > 0
  • > size(BeforeTestRun)
  • =:= I

I would expect it to check that ets:info(?CACHE, size) =:= I.

Member Author:

It's an old assert from before I switched to memory usage for cache sizing. Originally that assert was == I, and below was == I - 5, to assert we were re-filling the cache. They don't make sense anymore though, so I'll just remove them.

@davisp davisp force-pushed the optimize-ddoc-cache branch 2 times, most recently from d354951 to da5eca6 Compare July 24, 2017 20:44
[FooBar, VDU, Custom].


purge_modules() ->
Contributor:

Should we move it into test_util (and rename to purge_modules_for_app maybe)?

Member Author:

Not a bad idea. I've been contemplating writing a Make rule to run eunit tests for each app in a separate Erlang VM. This was mostly a stopgap out of anger.
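A hedged sketch of the suggested test_util helper (the name comes from the comment above; the implementation is assumed):

purge_modules_for_app(App) ->
    %% purge every module the app declares, so reloaded test builds
    %% don't linger across suites
    {ok, Mods} = application:get_key(App, modules),
    lists:foreach(fun(Mod) ->
        code:delete(Mod),
        code:purge(Mod)
    end, Mods).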

@iilyak (Contributor) left a review:

+1. Very nice work and performance analysis. The test coverage is great as well. A few minor suggestions for tests. Tests pass locally.

This is an old merge artifact that was sending the event notifications twice per design document update.
@davisp davisp force-pushed the optimize-ddoc-cache branch 3 times, most recently from 3821de7 to 8a7eaa7 Compare July 26, 2017 16:16
There were a couple issues with the previous ddoc_cache implementation
that made it possible to tip over the ddoc_cache_opener process. First,
there were a lot of messages flowing through a single gen_server. And
second, the cache relied on periodically evicting entries to ensure
proper behavior in not caching an entry forever after it had changed on
disk.

The new version makes two important changes. First, entries now have an
associated process that manages the cache entry. This process will
periodically refresh the entry and if the entry has changed or no longer
exists the process will remove its entry from cache.

The second major change is that the cache entry process directly mutates
the related ets table entries, so performance is no longer bottlenecked
on a single gen_server serializing every mutation. Using a custom entry
that does no work, the cache can now sustain roughly one million
operations a second with twenty thousand clients fighting over a cache
limited to one thousand items. In production this means that cache
performance will likely be rate limited by other factors, like loading
design documents from disk.
@davisp davisp merged commit 939761b into master Aug 1, 2017