Replies: 3 comments
-
Hi @ben-manes , thank you for the detailed post! I think it's a good idea to add support for hill-climbing window size adaptation to W-TinyLFU. It's unlikely that we can implement it any time soon, since TinyLFU does not have many use cases at Facebook, but we are more than happy to review a pull request if you send one. We can also help supply traces for evaluation in cachebench. Regarding the additional suggestions:
It's nice to see you here, Ben! I also worked at Addepar a while back, on Greenbaum's team, and had the chance to take over a lot of your code in the financial graph.
-
Unfortunately I am not much of a C++ programmer, so I'd leave this to someone else if there's interest. What is nice is that this avoids requiring developers to think about the eviction policy, since those differences mostly go away. On that note, your 2Q policy is misnamed. The algorithm by Johnson & Shasha uses a non-resident queue, whereas cachelib implemented OpenBSD's TU-Q policy, named after Ted Unangst.
This is actually in addition to the CMS. The idea is to avoid polluting the CMS with one-valued counters, which increase the error rate and require more counters to compensate. Instead, the CMS counters are only incremented if the key is found in the BF. For large caches this reduces the CMS space, which at 100M+ entries could be a benefit. Caffeine doesn't use this trick, though, because as an application cache it is usually small enough to not be worth the trouble.
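To make the "doorkeeper" idea above concrete, here is a minimal sketch (not cachelib's or Caffeine's actual code; class and parameter names are illustrative): a Bloom filter absorbs the first occurrence of each key, so one-hit wonders never consume count-min sketch counters, and the frequency estimate adds back the one access the Bloom filter recorded.

```python
import hashlib

def _hash(key, seed, width):
    # Deterministic hash of (seed, key) into [0, width); blake2b is overkill
    # but keeps the sketch simple and dependency-free.
    h = hashlib.blake2b(f"{seed}:{key}".encode(), digest_size=8)
    return int.from_bytes(h.digest(), "big") % width

class DoorkeeperCMS:
    """Count-min sketch guarded by a 'doorkeeper' Bloom filter."""

    def __init__(self, width=1024, depth=4, bloom_bits=8192):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]
        self.bloom_bits = bloom_bits
        self.bloom = bytearray(bloom_bits // 8)

    def _bloom_positions(self, key):
        return [_hash(key, 100 + i, self.bloom_bits) for i in range(3)]

    def _bloom_contains_or_add(self, key):
        # Returns True if every bit was already set (key was seen before),
        # setting any missing bits along the way.
        present = True
        for pos in self._bloom_positions(key):
            byte, bit = divmod(pos, 8)
            if not (self.bloom[byte] >> bit) & 1:
                present = False
                self.bloom[byte] |= 1 << bit
        return present

    def record_access(self, key):
        # Only keys the doorkeeper has already admitted touch the sketch,
        # so one-valued counters never pollute the CMS.
        if self._bloom_contains_or_add(key):
            for i in range(self.depth):
                self.table[i][_hash(key, i, self.width)] += 1

    def estimate(self, key):
        in_bloom = all((self.bloom[p // 8] >> (p % 8)) & 1
                       for p in self._bloom_positions(key))
        cms = min(self.table[i][_hash(key, i, self.width)]
                  for i in range(self.depth))
        # The doorkeeper absorbed the first access, so add it back.
        return cms + (1 if in_bloom else 0)
```

A key accessed once costs only a few Bloom bits; only the second and later accesses spend sketch counters, which is where the space saving for very large caches comes from.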
Ahh, thanks for pointing that out. This is what memcached does too. A flaw is that it is coupled to the entry, so operations like row scans would trigger many tryLock attempts. The docs do discuss lock contention problems, so if that becomes a problem again then this is a technique to consider.
Nope, thanks for pointing that out.
ha, what a mess. I hope you don't hold that against me; I inherited much of it. My "starter project" was to make the graph distributed instead of running all of Addepar on one big machine. I created a transaction log, replayed it to "catch up" the nodes and allow for snapshot reads, added a persistent data structure for concurrent readers, and used select-for-update to exclusively row-level lock the graph during a write transaction. Since everything in a large code base depended on that logic, which had been designed as a SPOF, that solution gave enough breathing room to allow for an eventual rebuild. I hope by now that situation is a lot better. Anyway, small world! 🙂
-
I'll leave this open then.
Sure it is! AMP is a piece of art. Let's leave it there.
-
Goal
Simplify deployment requirements by reducing the need to select the best eviction policy, which is workload dependent and can change over the lifetime of an application. Instead, use simple algorithmic techniques to dynamically reconfigure the cache for the current workload. This allows the cache to robustly make near optimal eviction choices in a wider range of workload patterns.
Context
The TinyLFU implementation uses a static configuration between the tiny / main regions. Per the paper, this defaults to 1% / 99% to favor frequency-biased workloads, such as databases and search engines. However, some workloads are highly skewed towards recency, such as blockchain mining and social networks; see for example Twitter's data sets, where LRU is near optimal. In such cases the frequency filter can degrade the hit ratio, as shown below.
The implementation states that this static parameter does not need to be tuned. While some users might realize their workload bias and choose a different eviction policy, ideally the algorithm is intelligent enough to discover the optimal setting on its own. In the Caffeine library, this is done by using hill climbing (short article, paper).
Suggested Approach
Use simple hill climbing: guess at a new configuration, sample the hit rate, calculate a new step size based on whether the change was an improvement, adjust, and repeat. In Caffeine the initial step size is 6.25% and it decays at a rate of 0.98, so that the policy converges on (rather than oscillates around) the best configuration. This process restarts if the hit rate changes by 5% or more. The sample period should be large enough to avoid a noisy hit rate and can piggyback on the reset interval for decaying the access frequency counters. As shown below, this approach can handle highly skewed workloads that change with the environment.
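The loop above can be sketched as follows. This is a simplified illustration using the numbers quoted for Caffeine (6.25% initial step, 0.98 decay, restart on a 5%+ hit-rate shift), not Caffeine's or cachelib's actual implementation; the class and method names are hypothetical.

```python
class HillClimber:
    """Adapts the window size by hill climbing on the sampled hit rate."""

    INITIAL_STEP = 0.0625       # 6.25% of capacity, as quoted for Caffeine
    DECAY = 0.98                # shrinks the step so the climber converges
    RESTART_THRESHOLD = 0.05    # a 5%+ hit-rate shift suggests a new workload

    def __init__(self):
        self.step = self.INITIAL_STEP  # signed: sign encodes direction
        self.prev_hit_rate = None

    def adjust(self, hit_rate):
        """Called once per sample period; returns the window-size delta
        (as a fraction of capacity) to apply for the next period."""
        if self.prev_hit_rate is None:
            self.prev_hit_rate = hit_rate
            return self.step
        delta = hit_rate - self.prev_hit_rate
        self.prev_hit_rate = hit_rate
        if abs(delta) >= self.RESTART_THRESHOLD:
            # Hit rate shifted sharply: the workload changed, so restart
            # climbing with the full initial step size.
            self.step = self.INITIAL_STEP if delta >= 0 else -self.INITIAL_STEP
            return self.step
        if delta < 0:
            self.step = -self.step  # last move hurt the hit rate: reverse
        self.step *= self.DECAY     # decay so we settle rather than oscillate
        return self.step
```

Each period the cache applies the returned delta to the window/main split, so repeated small losses shrink and flip the step until it settles near the best configuration, while a large hit-rate swing resets the search.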
Success Metrics
Additional Suggestions