Cache performance is low when num items is high #152
Huh, can you share your full cache config? (Dump the serialized version of CacheAllocatorConfig.) Other questions:
Btw, for negative caching, CacheLib has a feature called compact cache that is much more space-efficient than the regular cache: https://cachelib.org/docs/Cache_Library_User_Guides/compact_cache
(1) About the cache config: I only set the cache size and
For reference, with another query, the empty-value cache can perform very well:
Let me try this.
I am trying to use the ccache, but got a compilation error. Here is a snippet of my code:
Any hints on this?
The compilation error only occurs when I declare the ccache as
@wenhaocs: huh, the compiler errors look like you hit some sort of internal bug in the compiler? (On our end, we have internal use-cases specifying NoType, so it should work.) What happens if you duplicate the "NoValue" definition and pass in your own "no value" type?

Regarding the slower-than-expected lookup latency, it may be because all the keys are in the same allocation class, so each lookup (that promotes an item) hits the same LRU lock. To validate this theory, in your test, can you try sharding your key space into 10 different cache pools? Something like:
Hi @therealgymmy, here are a few findings and questions:
I know
Huh. I guess technically char[0] is illegal, but I expected the compiler to handle it (giving us a zero-sized structure). We don't see this internally, but we switched to clang a while back. Code: https://github.com/facebook/CacheLib/blob/main/cachelib/compact_cache/CCacheDescriptor.h#L177

@wenhaocs: for now, you can work around this by using a value structure of 1 byte. You can just choose not to populate it.
This is a bit odd. Do you have any instrumentation on your end that can tell you which code path is taking CPU cycles? Also, maybe use perf tools to see if we're incurring more cpu cache misses?

My original suggestion of sharding the pool assumed we have contention on the LRU lock, but perhaps it's elsewhere. What if you size your hashtable bigger even for the workload with fewer items? E.g. change both workloads to use a very large hashtable, say a bucket power of 30 (2^30 buckets). (Keep the pools sharded too, to rule out the possibility of LRU contention.) If both workloads are now at similar latency, it's probably due to cpu cache misses from a larger hash table (buckets are more spread out). You can set this manually: https://github.com/facebook/CacheLib/blob/main/cachelib/allocator/ChainedHashTable.h#L234
Yeah, that's right. CCache is intended to be used orthogonally to the regular cachelib. You can get the stats directly from each CCache instance: https://github.com/facebook/CacheLib/blob/main/cachelib/compact_cache/CCache.h#L384
It's in "maintenance mode" indefinitely. We have no plan to remove it, but also no plan to add features. We still have a few odd usages of it here and there internally.
Hi @therealgymmy, thanks for the reply. Regarding the bucket-power influence: using the QPS-limited benchmark, I do see degraded performance when we set it either too high or too low. For example, in a test with 2^21 GETs per minute and 2^17 cached items, setting the bucket power to 15 or 30 gives similar performance, inferior to 25, but not as big a difference as with ccache. I do not have perf tools at hand; I may need help from other teams.

Here are some other findings:

(1) The put latency of ccache is high compared with the regular cache. In one experiment with the fixed-QPS benchmark, the avg cache put latency for ccache is

(2) Using the regular cache for negative items works badly when I run the benchmark as fast as possible, even worse than not using a cache at all. However, if I limit the QPS, the performance is good. For example, for the query that I mentioned earlier as having bad performance, here is a summary of the
That means if I limit the QPS instead of running as fast as possible, the regular cache for negative items is not far from ccache; under a stress test, there is a big difference between the two. Could it be related to differences in how ccache and the regular cache handle multi-threading? Do you have any suggestions on the choice of cache for negative items, where storing only keys is enough? And if using the regular cache, what is the guidance for creating pools?
Yeah, that's right. A CCache put involves rewriting the entire bucket (which contains several items) when evictions occur. You can tune this by changing the bucket size, but it is expected to be more expensive than evicting a regular item. Compact cache's read performance is good only for small items because it performs copy-on-read; object size should typically be less than a cacheline (128 bytes).
I do not really understand why this is. Let's debug this some more. In cachebench tests, when I run a miss-only workload, I typically see higher throughput than with workloads that have a high hit rate. Cachebench runs its operations synchronously, so higher throughput typically also means lower latency. You can try it out by running cachebench with: https://github.com/facebook/CacheLib/blob/main/cachelib/cachebench/test_configs/throughput/miss-workload/ram_cache_get_throughput_all_misses.json

In general, we recommend compact cache for negative caching because it is much more space-efficient. This assumes that your negative cache needs to hold a much larger working set than a non-negative cache (e.g. there are more "does-not-exist" entries in your system than "exists" entries). Does this pattern sound like your usage?
Exactly. In our benchmark, "not-exists" can be 8x more than "exists".
@wenhaocs: Did you resolve this issue? |
Yes, it can be closed for now. I will dig into this further in the future.
Hi, I am using CacheLib to store all items that are requested but missing from the underlying layer, so the values of these items are empty. As a result, for 40GB of data, we end up with around 2e7 items in the cache, which corresponds to a bucket power of 26. I measured the cache-hit and cache-miss latency. Compared with another cachelib cache that stores regular items, this cache's performance really suffers.
BTW, I set the cache size large enough so that no evictions happen.
The increased cache latency actually dilutes the value of using a cache. Can you give some guidance on this?