
Cache performance is low when num items is high #152

Closed · wenhaocs opened this issue Jul 25, 2022 · 16 comments

@wenhaocs (Contributor) commented Jul 25, 2022

Hi, I am using CacheLib to store all items that are requested but missing from the underlying layer, so the values of these items are empty. As a result, for 40GB of data, we end up with around 2e7 items in the cache, which corresponds to a bucket power of 26. I measured the cache hit and cache miss latency. Compared with another cache built on CacheLib that stores regular items, this cache's performance really suffers.

| Cache type                  | Cache hit latency | Cache miss latency | Items in cache | Cache hit ratio |
|-----------------------------|-------------------|--------------------|----------------|-----------------|
| Cache storing regular items | 2us               | 1us                | 2^22           | 7%              |
| Cache storing empty values  | 5us               | 4us                | 2^25           | 44%             |

BTW, I set the cache size large enough that no evictions happen.
The increased cache latency actually dilutes the value of using the cache. Can you give some guidance on this?

@therealgymmy (Contributor) commented Jul 25, 2022

Huh, can you share your full cache config? (Dump the serialized version of CacheAllocatorConfig.)

Other questions:

  1. Are you sending similar load to both caches? (Are cache operations per second similar?)
  2. Does 7% mean 93% are misses in cachelib, or are these misses due to empty values?
  3. Are all empty values the same size (key size + value size always identical)? What about regular items (what's their size distribution)?

@therealgymmy (Contributor)

Btw, for negative caching, CacheLib has a feature called compact cache that is much more space-efficient than the regular cache.

https://cachelib.org/docs/Cache_Library_User_Guides/compact_cache
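
A minimal creation sketch in the spirit of that guide; the alias, cache name, and 1GB size below are illustrative assumptions, not taken from this thread:

// Sketch only: `config` is the CacheAllocatorConfig and `cache` the LruAllocator.
// MyKey stands in for whatever fixed-size key struct you use.
using NegativeCache = facebook::cachelib::CCacheCreator<
    facebook::cachelib::CCacheAllocator, MyKey>::type;

config.enableCompactCache();  // enable before constructing the cache
// ... construct `cache` from config ...
auto* ccache = cache->addCompactCache<NegativeCache>(
    "negative-cache", 1UL * 1024 * 1024 * 1024 /* assumed 1GB */);
auto res = ccache->get(MyKey{42});  // returns facebook::cachelib::CCacheReturn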

@wenhaocs (Contributor, Author) commented Jul 25, 2022

(1) About the cache config: I only set the cache size and config.setAccessConfig(numCacheItems); everything else is default. No hybrid cache. (A sketch follows this list.)
(2) I run the benchmark making the same query against the same data source. The benchmark sends the workload as fast as possible.
(3) My bad about the hit ratio. Actually, 7% is for the cache storing regular items. The hit ratio for the cache storing empty values is 44%.
(4) All empty values are empty strings "". Key sizes are always identical. For regular items, the average value size is 100B and the key size is 16B.
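
For concreteness, a minimal sketch of the config described in (1), assuming LruAllocator; the 40GB size is an illustrative assumption:

#include "cachelib/allocator/CacheAllocator.h"

facebook::cachelib::LruAllocator::Config config;
const size_t numCacheItems = 20'000'000;         // ~2e7 items, per the report above
config.setCacheSize(40UL * 1024 * 1024 * 1024);  // illustrative 40GB
config.setAccessConfig(numCacheItems);           // cachelib derives bucket/lock power
auto cache = std::make_unique<facebook::cachelib::LruAllocator>(config);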

@wenhaocs (Contributor, Author)

For reference, with another query, the empty-value cache can have very good performance:

| Cache type                                 | Cache hit latency | Cache miss latency | Items in cache | Cache hit ratio |
|--------------------------------------------|-------------------|--------------------|----------------|-----------------|
| Cache storing empty values (another query) | 1us               | 1us                | 2^20           | 85%             |

@wenhaocs (Contributor, Author)

> Btw, for negative caching, CacheLib has a feature called compact cache that is much more space-efficient than the regular cache.
>
> https://cachelib.org/docs/Cache_Library_User_Guides/compact_cache

Let me try this.

@wenhaocs (Contributor, Author)

I am trying to use the ccache, but got a compilation error:

[screenshot of compiler error]

Here is a snippet of my code:

struct MyKey {
  uint32_t key;

  /* implicit */ MyKey(uint32_t from) : key(from) {}

  bool operator==(const MyKey& other) const {
    return key == other.key;
  }

  bool isEmpty() const {
    return key == 0;
  }
};

....
bool getKey() {
  MyKey mykey = 12;
  auto res = cCache_->get(mykey);  // <----- compiler error on this line
  return res == facebook::cachelib::CCacheReturn::FOUND;
}

Any hints on this?

@wenhaocs (Contributor, Author)

The compilation error only occurs when I declare the ccache with the NoValue type, i.e. CCacheCreator<CCacheAllocator, MyKey>::type. Once I specify a value type, it compiles.

@therealgymmy (Contributor)

@wenhaocs: huh, the compiler errors look like you hit some sort of internal compiler bug? (On our end, we have internal use cases specifying NoValue, so it should work.) What happens if you duplicate the "NoValue" definition and pass in your own "no value" type?


Regarding the slower-than-expected lookup latency: it may be because all the keys are in the same allocation class, so each lookup (that promotes an item) will hit the same LRU lock. To validate this theory, can you try sharding your key space into 10 different cache pools in your test?

Something like:

auto poolId = static_cast<PoolId>(hash_func(key) % 10);
auto item = cache->allocate(poolId, ...);
// ... insert into cache.
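
Expanded into a fuller sketch; the pool names, totalPoolBytes, and hash_func below are placeholders for illustration:

// Create 10 equal-sized pools up front, then route each key by hash.
std::vector<facebook::cachelib::PoolId> pools;
constexpr size_t kNumPools = 10;
for (size_t i = 0; i < kNumPools; ++i) {
  pools.push_back(
      cache->addPool("shard_" + std::to_string(i), totalPoolBytes / kNumPools));
}

// On each write, pick the pool from the key's hash.
auto poolId = pools[hash_func(key) % kNumPools];
if (auto handle = cache->allocate(poolId, key, valueSize)) {
  cache->insertOrReplace(handle);
}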

@wenhaocs (Contributor, Author) commented Jul 28, 2022

Hi @therealgymmy Here are a few findings and questions:

  1. The floating point exception is caused by char[0] (not to mention the pedantic warning from the compiler). I copied the NoValue definition into my code; after I changed it to char[1], the compilation succeeds. My compilers are:
    gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
    g++ (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0

  2. If I copy the NoValue definition into my code, how can I fake CacheLib into letting me pass a nullptr value on set? The CacheLib code determines this internally: constexpr static bool kHasValues = !std::is_same<typename ValueDescriptor::Value, NoValue>::value;

  3. After I changed to CCache for storing negative keys, the performance is much better: the cache hit and miss latencies are all sub-microsecond. But for the regular cache, the overhead is still high even if I shard my key space into 10 different cache pools: around 4us for a cache hit and 3us for a cache miss. A little improvement over the original case, though.

  4. It seems GlobalCacheStats does not contain the stats of CCache.

I know CCache is to be deprecated. Any guidance on my current situation?

@therealgymmy (Contributor) commented Jul 28, 2022

> The floating point exception is caused by char[0] (not to mention the pedantic warning from the compiler).

Huh. I guess technically char[0] is illegal, but I expected the compiler to handle it (giving us a zero-sized structure). We don't see this internally, but we switched to clang a while back.

Code: https://github.com/facebook/CacheLib/blob/main/cachelib/compact_cache/CCacheDescriptor.h#L177
Found this stackoverflow post about zero-sized structs: https://stackoverflow.com/a/47352963

@wenhaocs: for now, you can work around this by using a value structure of 1 byte. You can just choose not to populate it.
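
A minimal sketch of that workaround; the struct and alias names here are made up for illustration:

// Hypothetical 1-byte stand-in for NoValue; allocated but never read.
struct OneBytePlaceholder {
  char unused[1];
};

using NegativeCCache = facebook::cachelib::CCacheCreator<
    facebook::cachelib::CCacheAllocator, MyKey, OneBytePlaceholder>::type;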

> But the overhead is still high if I choose to shard my key space into 10 different cache pools, around 4us for a cache hit and 3us for a cache miss.

This is a bit odd. Do you have any instrumentation on your end that can tell you which code path is taking CPU cycles? Also, maybe use perf tools to see if we're incurring more CPU cache misses?

My original suggestion to shard the pool assumed contention on the LRU lock, but perhaps it's elsewhere.

What if you size your hashtable bigger even for the workload with fewer items? E.g., change both workloads to use a very large hashtable, say a bucket power of 30 (2^30 buckets). (Keep the pools sharded too, to rule out the possibility of LRU contention.) If both workloads then show similar latency, it's probably due to CPU cache misses from a larger hash table (buckets are more spread out).

You can manually do this: https://github.com/facebook/CacheLib/blob/main/cachelib/allocator/ChainedHashTable.h#L234

config.setAccessConfig({30 /* bucketPower */, 20 /* lockPower */});

@therealgymmy (Contributor)

> It seems GlobalCacheStats does not contain the stats of CCache.

Yeah, that's right. CCache is intended to be used orthogonally to the regular cachelib cache. You can get the stats directly from each CCache instance.

https://github.com/facebook/CacheLib/blob/main/cachelib/compact_cache/CCache.h#L384
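
Something along these lines; note the accessor name below is my assumption based on the linked header, not verified API:

// Assumed accessor; check the linked CCache.h for the exact name/signature.
auto ccacheStats = ccache->getStats();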

> I know CCache is to be deprecated. Any guidance on my current situation?

It's in "maintenance mode" indefinitely. We have no plan to remove it, but also no plan to add additional features. We still have a few odd usages of it here and there internally.

@wenhaocs (Contributor, Author) commented Aug 2, 2022

Hi @therealgymmy, thanks for the reply.

Regarding the influence of bucket power: using the benchmark with limited QPS, I do see degraded performance when we set it either too high or too low. For example, in a test with 2^21 GETs per minute and 2^17 cached items, setting the bucket power to 15 or 30 gives similar performance, inferior to 25. But the difference is not as big as the one against ccache.

I do not have perf tools on hand and may need help from other teams. Here are some other findings:

(1) The put latency of ccache is high compared with the regular cache. In one experiment with a fixed-QPS benchmark, the avg cache put latency was 88us for ccache versus 51us for the regular cache storing negative items. I know I can bypass this by putting to the cache asynchronously, but that may decrease the cache hit ratio when the QPS is extremely high and may not yield the expected results. Of course, when the QPS is low, it works perfectly. Is this higher put latency in ccache expected?

(2) Using the regular cache for negative items works badly when I run the benchmark as fast as possible, even worse than not using a cache at all. However, if I limit the QPS, the performance is good. For example, for the query I mentioned earlier as having bad performance, here is a summary of the p95 latency:

| w/o cache | regular cache with 10 pools | regular cache with 1 pool | ccache |
|-----------|-----------------------------|---------------------------|--------|
| 3575ms    | 1855ms                      | 1811ms                    | 1699ms |

That means that if I limit the QPS instead of running as fast as possible, the regular cache for negative items is not far behind ccache. Under a stress test, there is a big difference between the regular cache and ccache for negative items. Could this be related to a difference in how ccache and the regular cache handle multi-threading?

Do you have any suggestions on the choice of cache for negative items, where storing keys is enough? If using the regular cache, what's the guidance on creating pools?

@therealgymmy (Contributor)

> The put latency of ccache is high compared with the regular cache.

Yeah, that's right. A CCache put involves rewriting the entire bucket (which contains several items) when evictions occur. You can tune this by changing the bucket size, but it is expected to be more expensive than evicting a regular item.

Compact cache's read performance is good only for small items because it performs copy-on-read. Object size should typically be less than a cache line (128 bytes).

> Using the regular cache for negative items works badly when I run the benchmark as fast as possible, even worse than not using a cache at all.

I do not really understand why that is. Let's debug this some more.

In cachebench tests, when I run a miss-only workload, I typically see higher throughput than with workloads that have a high hit rate. Cachebench runs its operations synchronously, so higher throughput typically also means lower latency.

You can try it out by running cachebench with: https://github.com/facebook/CacheLib/blob/main/cachelib/cachebench/test_configs/throughput/miss-workload/ram_cache_get_throughput_all_misses.json
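
For example, assuming a built cachebench binary in the working directory:

./cachebench --json_test_config \
    cachelib/cachebench/test_configs/throughput/miss-workload/ram_cache_get_throughput_all_misses.json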


In general, we recommend compact cache for negative caching because it is much more space-efficient. This assumes that your negative cache needs to hold a much larger working set than a non-negative cache (e.g., there are more "does-not-exist" keys in your system than "exists" keys). Does this pattern sound like your usage?

@wenhaocs (Contributor, Author) commented Aug 2, 2022

Exactly. In our benchmark, "not-exists" keys can be 8x more numerous than "exists" keys.

@therealgymmy (Contributor)

@wenhaocs: Did you resolve this issue?

@wenhaocs (Contributor, Author) commented Aug 8, 2022

Yes. It can be closed for now. I will dig into this further in the future.

wenhaocs closed this as completed Aug 8, 2022