Conversation

@bitfaster (Owner) commented Jul 14, 2022

For the Glimpse scenario ConcurrentLru performs significantly worse than LRU:

Tested 6015 keys in 00:00:00.0494263
| Size | Classic HitRate | Concurrent HitRate |
| ---- | --------------- | ------------------ |
| 250  | 0.91%           | 1.38%              |
| 500  | 0.95%           | 1.38%              |
| 750  | 1.15%           | 1.38%              |
| 1000 | 11.21%          | 1.38%              |
| 1250 | 21.25%          | 1.38%              |
| 1500 | 36.56%          | 1.38%              |
| 1750 | 45.04%          | 1.38%              |
| 2000 | 57.41%          | 1.38%              |


@bitfaster bitfaster marked this pull request as ready for review July 14, 2022 21:21
@bitfaster bitfaster merged commit f68daf5 into main Jul 14, 2022
@ben-manes

fyi, when I try the simulator's version of this policy (TuQueue) I get better hit rates than LRU. Maybe there is a bug in your clock-adapted variant?

./gradlew simulator:simulate -q \
    --title=EqualCapacityPartition \
    --maximumSize=250,500,750,1000,1250,1500,1750,2000 \
    -Dcaffeine.simulator.files.paths.0=gli.trace.gz \
    -Dcaffeine.simulator.admission.0=Always \
    -Dcaffeine.simulator.policies.0=two-queue.TuQueue \
    -Dcaffeine.simulator.policies.1=linked.Lru \
    -Dcaffeine.simulator.tu-queue.percent-hot=0.33 \
    -Dcaffeine.simulator.tu-queue.percent-warm=0.33

[chart: EqualCapacityPartition hit rates]

./gradlew simulator:simulate -q \
    --title=FavorFrequencyPartition \
    --maximumSize=250,500,750,1000,1250,1500,1750,2000 \
    -Dcaffeine.simulator.files.paths.0=gli.trace.gz \
    -Dcaffeine.simulator.admission.0=Always \
    -Dcaffeine.simulator.policies.0=two-queue.TuQueue \
    -Dcaffeine.simulator.policies.1=linked.Lru \
    -Dcaffeine.simulator.tu-queue.percent-hot=0.1 \
    -Dcaffeine.simulator.tu-queue.percent-warm=0.8

[chart: FavorFrequencyPartition hit rates]

@bitfaster bitfaster deleted the users/alexpeck/glimpse branch July 15, 2022 17:39
@bitfaster (Owner, Author)

@ben-manes yeah - for sure not a good result. Previously I had only tested against Zipf distributions and traces from production systems (similar to your Wikipedia efficiency test).

Thanks for providing the comparison with 2Q. ConcurrentLru uses FIFO queues for all buffers, instead of the linked-list LRU that 2Q uses for its Am buffer (there are other differences, but I suspect this is why it is much worse here). I used queues because .NET already has a performant ConcurrentQueue, and in my earlier tests the results were decent with that simplification. For the glimpse trace, ConcurrentLru is not a good solution. Out of curiosity I will retest with EqualCapacityPartition and FavorRecencyPartition (allocate a larger hot queue), to see whether there is scope for tuning or whether it still tanks.

I replicated your efficiency tests from Caffeine, and this looks like the toughest workload. For the ARC workloads ConcurrentLru is comparable or slightly worse than LRU. For Wikipedia, ConcurrentLru does much better.

@ben-manes

I think you are using memcached's 2Q-inspired policy? If so, the terms from the 2Q paper can be confusing since it is not the same policy. In this blog post the author names his variant TU-Q.

I would have expected your fifo idea to be pretty decent, actually. If I understand correctly, you use it as a Second Chance fifo where the mark bit lets it re-insert the item when searching for the victim. That's a pseudo-LRU that matches LRU very closely, with a worst case of O(n) since it has to sweep over all entries, unmarking them, to find the victim. Since it seems to be stuck even when the cache size increases, I think maybe something is wrong, like it is not using up free capacity (e.g. does it evict if hot + cold are full and warm is empty?).
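
A minimal sketch of that Second Chance victim search, assuming a plain FIFO and a per-entry mark bit (illustrative Java, not code from either library):

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Second Chance (Clock) victim search: a hit only sets a mark bit, and the
// sweep reinserts marked entries instead of evicting them immediately.
final class SecondChanceFifo<K> {
  static final class Entry<K> {
    final K key;
    boolean wasAccessed;
    Entry(K key) { this.key = key; }
  }

  private final Queue<Entry<K>> queue = new ArrayDeque<>();

  void add(Entry<K> entry) { queue.add(entry); }

  void recordAccess(Entry<K> entry) { entry.wasAccessed = true; }

  // Worst case O(n): if every entry is marked, the sweep unmarks all of them
  // before the (now unmarked) head is returned as the victim.
  Entry<K> findVictim() {
    for (Entry<K> candidate = queue.poll(); candidate != null; candidate = queue.poll()) {
      if (candidate.wasAccessed) {
        candidate.wasAccessed = false; // give it a second chance
        queue.add(candidate);
      } else {
        return candidate;              // first unmarked entry is the victim
      }
    }
    return null;                       // queue was empty
  }
}
```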

glimpse is a loopy workload so MRU is optimal, making LRU-like policies struggle. It's a nice stresser for scan resistance, but is probably not as realistic as the db/search traces. I think it is a multi-pass text analyzer (this tool if I'm right).

@bitfaster (Owner, Author)

Excellent point - the policy is flawed during warmup and doesn't use the empty warm buffer at all - fixing that is the purpose of this PR.

I started building out all the hit rate tests so I could validate whether fixing this actually matters. After I read your Caffeine wiki/TinyLfu paper I had the realization that it is most likely better to enlarge the warm buffer by default, favoring frequent items (similar to Caffeine's default allocation of 80% protected 20% probation - I figured you did that for a reason). And 80% warm worked better in the Zipf test (90% and 70% were worse). But with the smaller hot+cold, it is harder for items to get into warm and as you point out it gets stuck (I hadn't yet made the connection that this is likely what caused glimpse to get stuck - really good observation). It's also completely unintuitive that 80% of the allocated cache capacity won't be used until an item is accessed twice.

Terminology is confusing - reading that blog post, they write "In the OpenBSD code, they are named hot, cold, and warm, and each is an LRU queue." I interpreted LRU as not FIFO - thinking hard on that now, I'm still not totally sure what it means :)

Looking at your TuQueue code for onHit(), which is a much more precise definition, hot/warm items move to the tail of the linked lists like an LRU (moving to the tail marks them as most recently used). So the item lists are not FIFO.

In ConcurrentLru, each item list is a FIFO queue. I was thinking this simplification hurt the glimpse test, but the empty warm buffer is a more likely cause. Another difference: in ConcurrentLru, cache hits don't move items at all - only the accessed flag is updated. This makes hits on hot items very cheap, but it also makes the LRU order less accurate. In steady state, with only hits and no misses, ConcurrentLru has almost no overhead.

On miss, items transition across queues only when a queue is above capacity (otherwise item order remains fixed and new items are simply enqueued). The victim search is more limited than an O(n) search, although over several misses the whole of warm is eventually searched. Within a single cache miss, warm and cold are each cycled only when above capacity, and at most twice - so the worst-case miss cycles warm and cold twice each, for 4x dequeue + 4x enqueue ops of overhead.
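
As a rough sketch of that shape (illustrative Java only - the real ConcurrentLru is C#, lock-free, and more involved):

```java
import java.util.ArrayDeque;
import java.util.Queue;

// A hit only sets a flag; a miss enqueues into hot and then cycles each
// queue only if it is over capacity, so the per-miss overhead is a handful
// of dequeue/enqueue calls rather than a full scan.
final class FifoLruSketch<K> {
  static final class Entry<K> {
    final K key;
    boolean wasAccessed;
    Entry(K key) { this.key = key; }
  }

  private final Queue<Entry<K>> hot = new ArrayDeque<>();
  private final Queue<Entry<K>> warm = new ArrayDeque<>();
  private final Queue<Entry<K>> cold = new ArrayDeque<>();
  private final int hotCapacity, warmCapacity, coldCapacity;

  FifoLruSketch(int hotCapacity, int warmCapacity, int coldCapacity) {
    this.hotCapacity = hotCapacity;
    this.warmCapacity = warmCapacity;
    this.coldCapacity = coldCapacity;
  }

  void onHit(Entry<K> entry) {
    entry.wasAccessed = true;      // no queue movement at all on a hit
  }

  void onMiss(Entry<K> entry) {
    hot.add(entry);                // new items always enter hot
    cycleHot();
    cycleWarm(); cycleCold();      // each cycle is a no-op unless that queue
    cycleWarm(); cycleCold();      // is over capacity: worst case 4 dequeues
  }                                // and 4 enqueues per miss

  private void cycleHot() {
    if (hot.size() <= hotCapacity) return;
    Entry<K> e = hot.poll();
    if (e.wasAccessed) { e.wasAccessed = false; warm.add(e); }  // touched: promote
    else { cold.add(e); }                                       // untouched: demote
  }

  private void cycleWarm() {
    if (warm.size() <= warmCapacity) return;
    Entry<K> e = warm.poll();
    if (e.wasAccessed) { e.wasAccessed = false; warm.add(e); }  // second chance
    else { cold.add(e); }                                       // demote to cold
  }

  private void cycleCold() {
    if (cold.size() <= coldCapacity) return;
    Entry<K> e = cold.poll();
    if (e.wasAccessed) { e.wasAccessed = false; warm.add(e); }  // promote to warm
    // else: evicted; in this sketch the entry is simply dropped
  }
}
```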

Scanning warm O(n) to find and discard the first non-accessed item would be an interesting test.

@ben-manes commented Jul 15, 2022

Those new numbers in that PR are really great! Congrats 😃

In TuQueue, hot is an LRU, and cold + warm form a Segmented LRU (aka probation and protected). The structure of W-TinyLFU is identical, except we have a filter that can disallow migration from the hot to cold regions (window to probation) and evict that candidate entry instead.
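
That admission decision is essentially a frequency comparison; a minimal sketch (not Caffeine's actual code - the frequency estimator here stands in for the CountMin sketch):

```java
import java.util.function.ToIntFunction;

// When the window/hot region overflows, the candidate is only allowed into
// the main region if its estimated frequency beats the main region's
// eviction victim; otherwise the candidate itself is discarded.
final class TinyLfuAdmittor<K> {
  private final ToIntFunction<K> frequency; // e.g. backed by a CountMin sketch

  TinyLfuAdmittor(ToIntFunction<K> frequency) {
    this.frequency = frequency;
  }

  /** Returns true if the window candidate should replace the main-region victim. */
  boolean admit(K candidate, K victim) {
    return frequency.applyAsInt(candidate) > frequency.applyAsInt(victim);
  }
}
```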

To follow TU-Q more strictly, your hot region could be a Clock / SecondChance circular fifo queue. Adapting those together, I would expect the policy to have logic roughly like the following:

  1. HIT: Set wasAccessed to true
  2. Miss and not full...
    1. Add entry and let region exceed its threshold
  3. Miss and full:
    1. HOT
      • victim wasAccessed = true: reset flag, reinsert to hot, retry eviction
      • else, insert (evict) into cold
    2. COLD
      • victim wasAccessed = true: reset flag, insert into warm, retry eviction
      • else, evict item
    3. WARM
      • victim wasAccessed = true: reset flag, reinsert to warm, retry eviction
      • else, insert (evict) into cold

The worst case is that all entries have wasAccessed = true and you have to circulate through resetting the flags. You could avoid this by having a scan threshold and simply evicting a wasAccessed entry regardless, thereby limiting the penalty and knowing that you probably wouldn't have found a much better victim in a fuller scan. Your new results are so good I don't know if this variation would even be any better!
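
Putting that outline and the scan threshold together, a minimal sketch might look like the following (all names are illustrative; this is not code from either library):

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Clock-adapted TU-Q sketch: once the search has reset too many flags it
// stops giving second chances and just demotes/evicts whatever it finds.
final class ClockTuQueueSketch<K> {
  static final class Entry<K> {
    final K key;
    boolean wasAccessed;
    Entry(K key) { this.key = key; }
  }

  private static final int SCAN_THRESHOLD = 64;  // arbitrary illustrative bound

  private final Queue<Entry<K>> hot = new ArrayDeque<>();
  private final Queue<Entry<K>> cold = new ArrayDeque<>();
  private final Queue<Entry<K>> warm = new ArrayDeque<>();
  private final int hotCapacity, coldCapacity, maximumSize;

  ClockTuQueueSketch(int hotCapacity, int coldCapacity, int maximumSize) {
    this.hotCapacity = hotCapacity;
    this.coldCapacity = coldCapacity;
    this.maximumSize = maximumSize;
  }

  void onHit(Entry<K> entry) {
    entry.wasAccessed = true;                     // 1. HIT: just set the flag
  }

  void onMiss(Entry<K> entry) {
    hot.add(entry);                               // 2. new entries enter hot
    evictIfFull();                                // 3. over the maximum: find a victim
  }

  private void evictIfFull() {
    int scans = 0;
    while (hot.size() + cold.size() + warm.size() > maximumSize) {
      boolean giveSecondChance = ++scans < SCAN_THRESHOLD;
      if (hot.size() > hotCapacity) {             // 3.1 HOT
        Entry<K> victim = hot.poll();
        if (victim.wasAccessed && giveSecondChance) {
          victim.wasAccessed = false;
          hot.add(victim);                        // reinsert and retry
        } else {
          cold.add(victim);                       // demote to cold
        }
      } else if (cold.size() > coldCapacity) {    // 3.2 COLD
        Entry<K> victim = cold.poll();
        if (victim.wasAccessed && giveSecondChance) {
          victim.wasAccessed = false;
          warm.add(victim);                       // promote to warm and retry
        }
        // else: evicted - the entry is simply dropped, shrinking the cache
      } else {                                    // 3.3 WARM
        Entry<K> victim = warm.poll();
        if (victim == null) {
          return;                                 // degenerate sizing; bail out
        }
        if (victim.wasAccessed && giveSecondChance) {
          victim.wasAccessed = false;
          warm.add(victim);                       // reinsert and retry
        } else {
          cold.add(victim);                       // demote to cold
        }
      }
    }
  }
}
```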

For W-TinyLFU the SLRU paper recommended 20/80 to avoid cache pollution, and this made sense as we expected the probabilistic sketch would make some admission mistakes. At one point I saw it help on the wiki trace compared to an LRU main region, but that easily could have been (since-fixed) flaws in my CountMin4 that it corrected for. It's likely that the SLRU is not needed anymore for similar quality, but theoretically I like the probation region as a correction for wrong admission decisions, which for an LRU could take a long time to evict. This somewhat became moot when we were able to avoid the sizing dilemma by using hill climbing to make adjustments dynamically, so that a bad initial configuration did not have much of an impact.

Edit: Rereading the TU-Q blog entry, code, and code comments... I am not sure my implementation is correct. Maybe HOT is a FIFO after all? I may have been confused by calling each region an LRU and other odd phrasing. It seems oddly pointless to make HOT a FIFO, as that only serves to disallow promotions to WARM until an item has aged. Certainly the author made the explanations unnecessarily confusing...

@ben-manes commented Jul 16, 2022

The memcached PR matches what I implemented. I think the OpenBSD code follows how you interpreted it, though. The important part of memcached's description is:

LRU updates only happen as items reach the bottom of an LRU. If active in HOT, stay in HOT, if active in WARM, stay in WARM. If active in COLD, move to WARM.

However, this text was also modified in the official docs:

  • LRU updates only happen as items reach the bottom of an LRU. If active in
    HOT, move to WARM, if active in WARM, stay in WARM. If active in COLD, move
    to WARM.

The blog post has a similar description,

Each item has two bit flags indicating activity level.

FETCHED: Set if an item has ever been requested
ACTIVE: set if an item has been accessed for a second time. Removed when an item is bumped or moved.

HOT acts as a probationary queue, since items are likely to exhibit strong temporal locality or very short TTLs (time-to-live). As a result, items are never bumped within HOT: once an item reaches the tail of the queue, it will be moved to WARM if the item is active (3), or COLD if it is inactive (5).

I guess my implementation is wrong by making HOT an LRU instead of a FIFO. I don't see a good reason for that, though, so will keep it as is for the time being. That misinterpreted structure was helpful as it led me to clarify my thinking for W-TinyLFU.

@bitfaster (Owner, Author)

Turned out much better than I expected! 😃

I missed this detail of the hot buffer re-circulating - I had never seen the OpenBSD docs, only the memcached blog post. It's a good thing to try - I can implement it pretty easily and compare against this as a baseline. Hopefully I will get some time to try it next week. I'm curious whether it makes much difference in practice; the behavior in both cases would be pretty similar, except that a slot in warm would effectively be swapped for hot in the case of a very high-frequency key. This could give slightly more warm capacity, since transient hot keys would not be retained in warm at the expense of longer-lived repeat keys.

The key takeaway is that without adaptive queue sizing, the problem I fixed in the warmup PR still exists - it's just masked because glimpse is a repeating pattern. In other words, if I ran the Wikipedia bench and then glimpse, it would be back to a 1.38% hit rate for the glimpse part. So I think your addition of hill climbing is very important. It's an edge case, but one that is good to cover.

BTW I think your hint on the queue sizing/warmup probably saved me weeks of thinking/debugging/testing - thank you!

And also, I got a huge head start by looking at your tests - you distilled down the most interesting scenarios, without which it would be tough to draw conclusions about adaptive queue sizes etc. I was using the hit rate test methodology from the 2Q paper, which, now that I look, is from 1994.

@ben-manes

For hill climbing I took a very straightforward approach. It uses a high sampling period (10x max size) and decays from an initially large step size for convergence. This came about when charting multiple traces and their configurations, where none was optimal for all, and realizing I could simply walk the curve for that workload. I was surprised that no one else seemed to do that, with ARC taking a less direct route (as did my early attempts at feedback).
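
In sketch form, that kind of climber is roughly the following (illustrative only - Caffeine's actual implementation differs, e.g. in when and how the step size adapts). The caller applies the returned delta by moving capacity between the window and the main region:

```java
// After each sample period the climber compares the hit rate against the
// previous period, keeps stepping in the same direction if things improved,
// reverses otherwise, and decays the step size so the split converges.
final class WindowHillClimber {
  private final long samplePeriod;     // e.g. 10x the maximum size
  private double stepSize;             // e.g. 5% of capacity initially
  private final double decayRate;      // e.g. 0.98 per adjustment

  private long hits, misses;
  private double previousHitRate;
  private double direction = 1.0;      // +1 grows the window, -1 shrinks it

  WindowHillClimber(long samplePeriod, double initialStepSize, double decayRate) {
    this.samplePeriod = samplePeriod;
    this.stepSize = initialStepSize;
    this.decayRate = decayRate;
  }

  void recordHit() { hits++; }

  void recordMiss() { misses++; }

  /** Returns the change to apply to the window size, or 0 until enough samples. */
  double adjustment() {
    long events = hits + misses;
    if (events < samplePeriod) {
      return 0.0;
    }
    double hitRate = (double) hits / events;
    if (hitRate < previousHitRate) {
      direction = -direction;          // the last move hurt: walk the other way
    }
    double delta = direction * stepSize;
    stepSize *= decayRate;             // decay toward a stable configuration
    previousHitRate = hitRate;
    hits = 0;
    misses = 0;
    return delta;
  }
}
```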

I tried small periods and small step sizes, but the climber got confused due to noise. I also tried gradient descent, but couldn't figure out the derivative for the slope, so all of its variations (popularized by ML articles) didn't work. However, they have some neat optimizations like momentum and adaptive learning rates (step sizes) to speed up convergence. I was able to borrow the adaptive learning rate idea when implementing it in the library (our paper predates that, when the code was merely an early PoC).

I think that adding momentum would be a nice improvement, but I have not seen a workload justifying that experiment (so the simplest code wins out without data). My hypothesis is to start with a small sample period that gradually gets larger (e.g. 0.5x ... 10x) and a large step size that gradually decays (e.g. 5% ... 0%). Then, hopefully, early on it makes rapid guesses that are large enough to dampen the noise but more likely to be wrong, finds the configuration to oscillate around, and slowly converges to that optimal, long-term value. That intuitively seems nicer for a large cache, but I don't know yet.

Similarly, I don't have data to justify trying to adapt a very tiny cache, where the current math fails (10 entries x 5% = 0 initial step size). I'm sure it is not optimal, but minuscule caches are probably safe to be given a higher value and are likely not monitored, so their usefulness is a developer's random guess anyway. Again, without data I'd probably do more harm than good if I tried, so I am waiting for some unfortunate user to complain.

Outside of eviction, I was also quite happy with how my approaches towards concurrency (replay buffers) and expiration (hierarchical timing wheels) turned out. Lots of tradeoffs that aren't right for everyone, but fun problems and those overlooked techniques were enjoyable to code.
