
cache maintains size greater than the defined maximumSize for a long time (1h+) #420

Closed
jakubzytka opened this issue May 14, 2020 · 16 comments


@jakubzytka

jakubzytka commented May 14, 2020

I'm observing a weird behavior of caffeine cache v2.4.0. This issue presumably goes away when caffeine is upgraded to v2.8.1.

I'm seeking confirmation whether this is an old issue that was already fixed (possibly by e8ff6d3), or a new issue that I was simply lucky not to see after the upgrade.
I'd greatly appreciate any comments.

The issue:
I have a cache with a removalListener and a maximumSize, say 32767.
I'm observing that the cache size (measured as the number of put calls minus the number of removalListener calls) sometimes gets stuck at 32768. There is a constant getIfPresent load, but no put happens while the cache size stays at 32768.
Sometimes this condition magically resolves itself, but it usually takes hours.
The moment of resolution is often correlated with a reduction in the number of cache accesses (as if some cleanup task were being starved? On the other hand, I do see parked ForkJoinPool threads).

This is a problem because a certain component (specifically, Solr's BlockCache) relies on the cache size coming down to maximumSize in a reasonably short time. Is such an expectation reasonable?
Side note: I ensured immutability of the cache key (BlockCacheKey in Solr's case), so no hanky-panky with hashCode/equals is involved.

I gathered stack traces while the cache was "stuck". They show that some sort of cleanup is triggered:

   java.lang.Thread.State: RUNNABLE
        at sun.misc.Unsafe.unpark(Native Method)
        at java.util.concurrent.ForkJoinPool.signalWork(ForkJoinPool.java:1649)
        at java.util.concurrent.ForkJoinPool.externalPush(ForkJoinPool.java:2414)
        at java.util.concurrent.ForkJoinPool.execute(ForkJoinPool.java:2648)
        at com.datastax.bdp.shade.com.github.benmanes.caffeine.cache.BoundedLocalCache.scheduleDrainBuffers(BoundedLocalCache.java:976)
        at com.datastax.bdp.shade.com.github.benmanes.caffeine.cache.BoundedLocalCache.afterRead(BoundedLocalCache.java:815)
        at com.datastax.bdp.shade.com.github.benmanes.caffeine.cache.BoundedLocalCache.getIfPresent(BoundedLocalCache.java:1443)
        at com.datastax.bdp.shade.com.github.benmanes.caffeine.cache.LocalManualCache.getIfPresent(LocalManualCache.java:49)
...
   Locked ownable synchronizers:
        - <0x0000000410f48a70> (a java.util.concurrent.ThreadPoolExecutor$Worker)
        - <0x000000047fab2498> (a com.github.benmanes.caffeine.cache.NonReentrantLock$Sync)

I can also see that there are unutilized ForkJoinPool threads afterward, e.g.:

   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x0000000410f2ac48> (a java.util.concurrent.ForkJoinPool)
        at java.util.concurrent.ForkJoinPool.awaitWork(ForkJoinPool.java:1824)
        at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1693)
        at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:157)
...
@ben-manes
Owner

The cache does use a clean-up process that is scheduled on the ForkJoinPool. It uses read and write buffers to decouple the hash table from the eviction policy, which allows us to make the policy work concurrently rather than locking all operations. The write buffer is sized at 32768, but should be aggressively drained on any read/write if not empty. The scheduling of this work is guarded by a flag indicating whether it was submitted to the ForkJoinPool.
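For illustration, here is a minimal sketch of the guard-flag scheduling pattern described above (a simplification for this discussion, not Caffeine's actual BoundedLocalCache code; the class and method names are made up):

import java.util.concurrent.Executor;
import java.util.concurrent.atomic.AtomicBoolean;

// Reads and writes enqueue events into buffers; a maintenance task drains them
// into the eviction policy. Only one drain task is submitted at a time, guarded
// by a flag, so a task silently dropped by the executor leaves the flag set and
// prevents any further drains from being scheduled.
final class DrainScheduler {
  private final AtomicBoolean drainScheduled = new AtomicBoolean();
  private final Executor executor; // ForkJoinPool.commonPool() by default

  DrainScheduler(Executor executor) {
    this.executor = executor;
  }

  void afterReadOrWrite() {
    if (drainScheduled.compareAndSet(false, true)) {
      executor.execute(this::drainBuffers);
    }
  }

  private void drainBuffers() {
    try {
      // replay buffered read/write events onto the eviction policy,
      // evicting down to the maximum size if necessary
    } finally {
      drainScheduled.set(false);
    }
  }
}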

There was a bug in some JDKs where ForkJoinPool would drop tasks rather than running them. The cache is less trusting of the configured Executor now, so it will fall back to having writers block and perform this work themselves if the write buffer is full. However, even in a synthetic stress test it is difficult to fill the write buffer and trigger this.

From the trace it looks like FJP is blocked trying to signal the free worker thread after submitting into its work queue. We submit after obtaining a tryLock that guards the eviction policy, so unfortunately our recovery measures would also block, since they need this lock. It appears to be a JVM bug where it is unable to unpark the thread and blocks the submitter. There isn't any way for us to recover from or be resilient to this type of error, as it looks like a data race within their logic. The only workarounds would be to specify a different executor (e.g. same-thread) or to upgrade to a later JVM (ideally the latest patch release of your major version).

@ben-manes
Owner

Another possibility is if you are on an old Linux kernel version. There was a nasty bug, quietly fixed, that fundamentally broke all locks (futexes) under load. So you might be affected by a kernel bug, or by a missing workaround for a processor bug (e.g. Intel's TSX, which was used to optimize locks, was very buggy).

@jakubzytka
Author

Thanks for your prompt responses.

There was a bug in some JDKs where ForkJoinPool would drop tasks rather than running them.

So far I have run into this issue on Oracle's 1.8.0_241 (and some older version too). I might check a couple of older versions, but probably not before I find a small reproduction case.
Right now I'm verifying on build 1.8.0_252-8u252-b09-1~16.04-b09.

The cache is less trusting of the configured Executor now, so it will fall back to having writers block and perform this work themselves if the write buffer is full. However, even in a synthetic stress test it is difficult to fill the write buffer and trigger this.

There is only a read load (getIfPresent) during this stall period. I can't rely on new writes rectifying the problem, unfortunately.

From the trace it looks like FJP is blocked trying to signal the free worker thread after submitting into its work queue

Actually, I just selected one specific stack trace that shows that some cleanup is scheduled.
This specific stack trace was obtained ~30 minutes after the onset of the condition, during a period of constant read load.
As far as I can see no thread is blocked anywhere in FJP (only workers are idle). In particular, subsequently obtained traces do not show any frames near caffeine.

To me, this means that:

  • scheduling via scheduleDrainBuffers does happen; is that all the cleanup that needs to be done?
  • if the FJP drops tasks only occasionally, then either draining the buffers is not enough, or it doesn't work (e.g. it sees that another drain is in progress, namely the one that was scheduled and presumably lost?)

Another possibility is if you are on an old Linux kernel version

Oh yeah, I remember that infamous futex issue :) I verified our kernel version against known synchronization issues; no findings there.
Anyway, a kernel/HW issue is IMO highly unlikely. We have lots of threads, locks, and synchronization, but the only thing that repeatedly and consistently breaks is this caffeine cache.

There isn't any way for us to recover from or be resilient to this type of error, as it looks like a data race within their logic. The only workarounds would be to specify a different executor (e.g. same-thread) or to upgrade to a later JVM (ideally the latest patch release of your major version).

On my side, I can probably detect the stall and manually call cleanup. Similarly, it would be possible in caffeine to execute the cleanup directly if, e.g., the most recently scheduled cleanup has not completed after some number of nanos. But these are lame workarounds for an issue that shouldn't happen...

I'll try to create a simple test that reproduces the issue to possibly learn more... A naive, highly concurrent put and getIfPresent pattern didn't reproduce it, unfortunately.

@ben-manes
Owner

So far I have run into this issue on Oracle's 1.8.0_241 (and some older version too).

Great, the issue I knew of was #77 (JDK-8078490) which was fixed by then.

On my side, I can probably detect the stall and manually call cleanup

In that stacktrace the cache held its eviction lock and was stuck trying to schedule the task. All cleanups have to be performed under that lock, so if that's an error condition a manual cleanUp would block as well and be broken.

scheduling via scheduleDrainBuffers does happen; is that all the cleanup that needs to be done?

Yes, the caller just needs to schedule the work and move on. The read buffer is lossy so when full it just drops that event. The write buffer can't be, so it blocks.

if the FJP drops tasks only occasionally, then either draining the buffers is not enough, or it doesn't work (e.g. it sees that another drain is in progress, namely the one that was scheduled and presumably lost?)

Yes, since it thinks the FJP task was scheduled, it won't bother scheduling another one. The improvements in later versions still won't schedule another task in that case, but they have writers assist by doing the work themselves if the write buffer is full. This means the FJP may still be broken by dropping our task, but each time the write buffer fills up the cache corrects itself by having the writer do the full cleanup. So it is broken, but periodically fixed.

The easiest workaround would be to simply set Caffeine.executor(Runnable::run) instead of using the default FJP. The cleanup work is very cheap, so this shouldn't impact your latencies. However, it will run the removal listener and any other async work directly on that thread, which could be a penalty. If you specified a dedicated ThreadPoolExecutor-backed executor instead, that's perhaps a little wasteful thread-wise but should work correctly. If either of those worked, it would show that Caffeine's logic isn't at fault and that it is something to do with the FJP or a lower level.
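For concreteness, either configuration would look roughly like this (a sketch; the maximumSize value is just the one from this report):

import java.util.concurrent.Executors;

import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;

class ExecutorWorkarounds {
  // Same-thread executor: maintenance and the removal listener run on the calling thread.
  static Cache<Object, Object> sameThread() {
    return Caffeine.newBuilder()
        .maximumSize(32_767)
        .executor(Runnable::run)
        .build();
  }

  // Dedicated executor: keeps async work off caller threads, at the cost of an extra thread.
  static Cache<Object, Object> dedicated() {
    return Caffeine.newBuilder()
        .maximumSize(32_767)
        .executor(Executors.newSingleThreadExecutor())
        .build();
  }
}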

@jakubzytka
Author

In that stacktrace the cache held its eviction lock and was stuck trying to schedule the task. All cleanups have to be performed under that lock, so if that's an error condition a manual cleanUp would block as well and be broken.

If a manual cleanUp blocked for a long time, that would be some indication of what is going on.
However, as I wrote, I have many stack traces in which nothing is waiting on the eviction lock...
Also, I happen to have some JFR recordings which do not show any contention around the eviction lock or anything FJP-related.

The easiest workaround would be to simply set Caffeine.executor(Runnable::run) instead of using the default FJP. The cleanup work is very cheap, so this shouldn't impact your latencies. However, it will run the removal listener and any other async work directly on that thread, which could be a penalty. If you specified a dedicated ThreadPoolExecutor-backed executor instead, that's perhaps a little wasteful thread-wise but should work correctly.

Good ideas, thanks! My removal listener is cheap enough to make testing both approaches worthwhile. At this point, I'm mostly concerned about correctness.

@ben-manes
Owner

Great, please let me know the results with an alternative executor. I think the latest release is correct, but I would very much appreciate your feedback from these experiments.

@jakubzytka
Author

jakubzytka commented May 18, 2020

I managed to reproduce the issue in a simple example. It mimics the behavior of the original code.
The Key/Value classes are pretty much copied. I didn't check whether the same happens with a simpler Key class.

My expectation is that if maximumSize is set to 32767, the cache size will drop to that size rather than staying at 32768.
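For reference, a minimal sketch of the kind of workload described here (this is not the gist code linked below; the class name and constants are illustrative):

import java.util.concurrent.atomic.AtomicLong;

import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.cache.RemovalCause;

public class ReadOnlyLoadSketch {
  public static void main(String[] args) throws InterruptedException {
    AtomicLong removals = new AtomicLong();
    Cache<Integer, byte[]> cache = Caffeine.newBuilder()
        .maximumSize(32_767)
        .removalListener((Integer key, byte[] value, RemovalCause cause) -> removals.incrementAndGet())
        .build();

    // Fill slightly past the bound; writes are buffered, so the size may briefly exceed it.
    int puts = 32_768 + 16;
    for (int i = 0; i < puts; i++) {
      cache.put(i, new byte[64]);
    }

    // Apply a read-only getIfPresent load and check whether the size comes back down to 32767.
    Runnable reader = () -> {
      for (int i = 0; i < 20_000_000; i++) {
        cache.getIfPresent(i % 32_768);
      }
    };
    Thread[] threads = new Thread[4];
    for (int t = 0; t < threads.length; t++) {
      threads[t] = new Thread(reader);
      threads[t].start();
    }
    for (Thread t : threads) {
      t.join();
    }
    System.out.println("estimatedSize=" + cache.estimatedSize()
        + ", puts - removals = " + (puts - removals.get()));
  }
}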

The problem reproduces with both the default and the Runnable::run executor.
It also reproduces with caffeine-2.8.1.jar.

https://gist.github.com/jakubzytka/349c6738a71f249743171a78864a795d

Reproduced with:

javac -cp caffeine-2.4.0.jar Caffeine420Test.java && java -cp caffeine-2.4.0.jar:. Caffeine420Test

under

$ java -version
openjdk version "1.8.0_242"
OpenJDK Runtime Environment (build 1.8.0_242-8u242-b08-0ubuntu3~16.04-b08)
OpenJDK 64-Bit Server VM (build 25.242-b08, mixed mode)

and also

$ java -version
openjdk version "11.0.1" 2018-10-16
OpenJDK Runtime Environment (build 11.0.1+13-Ubuntu-3ubuntu116.04ppa1)
OpenJDK 64-Bit Server VM (build 11.0.1+13-Ubuntu-3ubuntu116.04ppa1, mixed mode, sharing)

Example output (cache with maximumSize set to 32767):

[14:45:05] [Jakub] $ javac -cp caffeine-2.8.1.jar Caffeine420Test.java && java -cp caffeine-2.8.1.jar:. Caffeine420Test
1/999999: since last updated: 503577984; cache stuck? size=32768, estSize=32768, puts=32779, gets=9118281, releases=11, diff=32768
1/1999999: since last updated: 990143754; cache stuck? size=32768, estSize=32768, puts=32779, gets=19499796, releases=11, diff=32768
1/2999999: since last updated: 1368141690; cache stuck? size=32768, estSize=32768, puts=32779, gets=27619995, releases=11, diff=32768
1/3999999: since last updated: 1737031701; cache stuck? size=32768, estSize=32768, puts=32779, gets=35710050, releases=11, diff=32768
1/4999999: since last updated: 2115552524; cache stuck? size=32768, estSize=32768, puts=32779, gets=43803900, releases=11, diff=32768
1/5999999: since last updated: 2510485553; cache stuck? size=32768, estSize=32768, puts=32779, gets=52477515, releases=11, diff=32768
1/6999999: since last updated: 2877702313; cache stuck? size=32768, estSize=32768, puts=32779, gets=60543353, releases=11, diff=32768
1/7999999: since last updated: 3269783751; cache stuck? size=32768, estSize=32768, puts=32779, gets=69009634, releases=11, diff=32768
1/8999999: since last updated: 3635635807; cache stuck? size=32768, estSize=32768, puts=32779, gets=77065404, releases=11, diff=32768
1/9999999: since last updated: 3999235336; cache stuck? size=32768, estSize=32768, puts=32779, gets=85043550, releases=11, diff=32768
1/10999999: since last updated: 4380717158; cache stuck? size=32768, estSize=32768, puts=32779, gets=93235758, releases=11, diff=32768
1/11999999: since last updated: 4744316964; cache stuck? size=32768, estSize=32768, puts=32779, gets=101227193, releases=11, diff=32768
1/12999999: since last updated: 5149292686; cache stuck? size=32768, estSize=32768, puts=32779, gets=109798196, releases=11, diff=32768
1/13999999: since last updated: 5515600650; cache stuck? size=32768, estSize=32768, puts=32779, gets=117872079, releases=11, diff=32768
1/14999999: since last updated: 5885781636; cache stuck? size=32768, estSize=32768, puts=32779, gets=126023722, releases=11, diff=32768
1/15999999: since last updated: 6269300647; cache stuck? size=32768, estSize=32768, puts=32779, gets=134187054, releases=11, diff=32768
1/16999999: since last updated: 6629441418; cache stuck? size=32768, estSize=32768, puts=32779, gets=142080498, releases=11, diff=32768
1/17999999: since last updated: 6990405203; cache stuck? size=32768, estSize=32768, puts=32779, gets=150019930, releases=11, diff=32768
1/18999999: since last updated: 7370606500; cache stuck? size=32768, estSize=32768, puts=32779, gets=158125246, releases=11, diff=32768
1/19999999: since last updated: 7747605588; cache stuck? size=32768, estSize=32768, puts=32779, gets=166442908, releases=11, diff=32768
1/20999999: since last updated: 8137270561; cache stuck? size=32768, estSize=32768, puts=32779, gets=174837672, releases=11, diff=32768
1/21999999: since last updated: 8509989017; cache stuck? size=32768, estSize=32768, puts=32779, gets=183030570, releases=11, diff=32768
1/22999999: since last updated: 8881343545; cache stuck? size=32768, estSize=32768, puts=32779, gets=191192377, releases=11, diff=32768
1/23999999: since last updated: 9266829880; cache stuck? size=32768, estSize=32768, puts=32779, gets=199393422, releases=11, diff=32768
1/24999999: since last updated: 9639681279; cache stuck? size=32768, estSize=32768, puts=32779, gets=207582200, releases=11, diff=32768
1/25999999: since last updated: 9989359063; cache stuck? size=32768, estSize=32768, puts=32779, gets=215252218, releases=11, diff=32768
7: last updated long time ago: 3970959990814596; cache stuck? size=32768, estSize=32768, puts=32779, releases=11, diff=32768
threads still running: 7
threads still running: 6
threads still running: 5
threads still running: 4
threads still running: 3
threads still running: 2
threads still running: 1
detected after: 10155199594
cache size: 32768
cache size after sleep: 32768
threads still running: 0

@jakubzytka
Author

It also reproduces with 2.8.3:

$ javac -cp caffeine-2.8.3.jar Caffeine420Test.java && java -cp caffeine-2.8.3.jar:. Caffeine420Test
1/999999: since last updated: 480198999; cache stuck? size=32768, estSize=32768, puts=32771, gets=9424740, releases=3, diff=32768
1/1999999: since last updated: 855332901; cache stuck? size=32768, estSize=32768, puts=32771, gets=18949919, releases=3, diff=32768
1/2999999: since last updated: 1155671787; cache stuck? size=32768, estSize=32768, puts=32771, gets=26826811, releases=3, diff=32768
1/3999999: since last updated: 1450474828; cache stuck? size=32768, estSize=32768, puts=32771, gets=34722907, releases=3, diff=32768
1/4999999: since last updated: 1743405381; cache stuck? size=32768, estSize=32768, puts=32771, gets=42593745, releases=3, diff=32768
1/5999999: since last updated: 2035892093; cache stuck? size=32768, estSize=32768, puts=32771, gets=50448679, releases=3, diff=32768
1/6999999: since last updated: 2340232941; cache stuck? size=32768, estSize=32768, puts=32771, gets=58299586, releases=3, diff=32768
1/7999999: since last updated: 2635325314; cache stuck? size=32768, estSize=32768, puts=32771, gets=66213103, releases=3, diff=32768
1/8999999: since last updated: 2926512547; cache stuck? size=32768, estSize=32768, puts=32771, gets=74039932, releases=3, diff=32768
1/9999999: since last updated: 3243603241; cache stuck? size=32768, estSize=32768, puts=32771, gets=81916362, releases=3, diff=32768
1/10999999: since last updated: 3548161326; cache stuck? size=32768, estSize=32768, puts=32771, gets=89981566, releases=3, diff=32768
1/11999999: since last updated: 3859138351; cache stuck? size=32768, estSize=32768, puts=32771, gets=97723293, releases=3, diff=32768
1/12999999: since last updated: 4188741715; cache stuck? size=32768, estSize=32768, puts=32771, gets=105691447, releases=3, diff=32768
1/13999999: since last updated: 4508487342; cache stuck? size=32768, estSize=32768, puts=32771, gets=113668024, releases=3, diff=32768
1/14999999: since last updated: 4821698243; cache stuck? size=32768, estSize=32768, puts=32771, gets=121541082, releases=3, diff=32768
1/15999999: since last updated: 5124199151; cache stuck? size=32768, estSize=32768, puts=32771, gets=128799286, releases=3, diff=32768
1/16999999: since last updated: 5444884372; cache stuck? size=32768, estSize=32768, puts=32771, gets=136833193, releases=3, diff=32768
1/17999999: since last updated: 5855453159; cache stuck? size=32768, estSize=32768, puts=32771, gets=147021729, releases=3, diff=32768
1/18999999: since last updated: 6147956700; cache stuck? size=32768, estSize=32768, puts=32771, gets=154137515, releases=3, diff=32768
1/19999999: since last updated: 6439966378; cache stuck? size=32768, estSize=32768, puts=32771, gets=161409883, releases=3, diff=32768
1/20999999: since last updated: 6728663136; cache stuck? size=32768, estSize=32768, puts=32771, gets=168771466, releases=3, diff=32768
1/21999999: since last updated: 7021352124; cache stuck? size=32768, estSize=32768, puts=32771, gets=176408077, releases=3, diff=32768
1/22999999: since last updated: 7341180748; cache stuck? size=32768, estSize=32768, puts=32771, gets=184721896, releases=3, diff=32768
1/23999999: since last updated: 7637468710; cache stuck? size=32768, estSize=32768, puts=32771, gets=192660430, releases=3, diff=32768
1/24999999: since last updated: 7931016339; cache stuck? size=32768, estSize=32768, puts=32771, gets=200526492, releases=3, diff=32768
1/25999999: since last updated: 8236657915; cache stuck? size=32768, estSize=32768, puts=32771, gets=208508105, releases=3, diff=32768
1/26999999: since last updated: 8533140077; cache stuck? size=32768, estSize=32768, puts=32771, gets=216419982, releases=3, diff=32768
1/27999999: since last updated: 8832919242; cache stuck? size=32768, estSize=32768, puts=32771, gets=224511723, releases=3, diff=32768
1/28999999: since last updated: 9155587264; cache stuck? size=32768, estSize=32768, puts=32771, gets=232901109, releases=3, diff=32768
1/29999999: since last updated: 9449908284; cache stuck? size=32768, estSize=32768, puts=32771, gets=240810045, releases=3, diff=32768
1/30999999: since last updated: 9744465158; cache stuck? size=32768, estSize=32768, puts=32771, gets=248704140, releases=3, diff=32768
7: last updated long time ago: 3971604947835100; cache stuck? size=32768, estSize=32768, puts=32771, releases=3, diff=32768
detected after: 10142186091
threads still running: 7
threads still running: 6
cache size: 32768
threads still running: 5
threads still running: 4
threads still running: 3
threads still running: 2
threads still running: 1
cache size after sleep: 32768
threads still running: 0
$ md5sum caffeine-2.8.3.jar 
65289340c381904e81ddfc1b9a08d6d5  caffeine-2.8.3.jar

@ben-manes
Owner

ben-manes commented May 18, 2020

Thanks for the test case. Now I understand the confusion, which you correctly identified originally. I had thought you meant the cache was broken and never catching up, blocking future writes, as that had happened in the past due to FJP bugs. But now I see what you said all along: there is an incorrect assumption that the maximum size is a strict threshold that the cache will not exceed.

Solr's BlockCache used to be implemented on ConcurrentLinkedHashMap, which, like Caffeine, did not offer a strict guarantee. It sounds like you found a long-standing bug in their assumptions, and a reasonable misunderstanding. One option would be to switch to Guava's cache, which does provide a strict bound and often evicts prematurely in order to do so.

A strict bound is only possible if the writers are serialized with the eviction policy, e.g. made atomic under an exclusive lock. Guava does this by splitting the hash table into multiple segments, each with its own LRU policy. This approach limits throughput, since writes to independent keys block each other, and it makes advanced operations like Map.compute harder to implement because they block on long-running user operations.

In Caffeine and CLHM, we instead decouple the hash table and the eviction policy under two different locks. This is a pub-sub style approach where the write buffer captures the event and replays it on the policy. This allows the buffer to absorb a burst of writes and to schedule the drain using a tryLock to avoid blocking. However, it does mean a race could leave the cache slightly above its maximum, where the writer set the flag but the evictor had just completed. The flag does ensure that the next cache operation will schedule the work, but if that operation is far off in the future then it could take a while. A constant getIfPresent load should trigger a cleanup if the entry is found, because that calls afterRead, which does the scheduling.

The commit you referenced does help mask this race when the FJP.commonPool is used. It won't trigger for a custom executor like Runnable::run, because that might be a user-facing thread and we don't want to penalize it by repeatedly evicting under load.

Most of the time a small excess is okay, much like GCs leave garbage that is cleared out when activity occurs again. The main case where this caused trouble was expiration, when application code expected a prompt notification (e.g. to close an idle session) but no cache activity meant no cleanup occurred. We offer a scheduler option to mitigate that, as a scheduling thread can track the next expiration time.
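For example, the scheduler option is configured roughly like this (a sketch; the expiry duration is arbitrary, and Scheduler.systemScheduler() needs Java 9+ to actually schedule):

import java.time.Duration;

import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.cache.Scheduler;

class SchedulerExample {
  // A scheduler lets the cache arrange its own maintenance around the next
  // expiration event, even when there is no other cache activity.
  static Cache<String, String> build() {
    return Caffeine.newBuilder()
        .expireAfterAccess(Duration.ofMinutes(10))
        .scheduler(Scheduler.systemScheduler())
        .build();
  }
}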

If there is no activity, then you could schedule a task that calls cache.cleanUp periodically. Otherwise the assumptions, or the library, have to be switched. However, it also sounded like maybe the cleanup task was scheduled but the FJP simply didn't perform the work. If that was the case, as your original stacktrace showed, then that seems like a bug in the underlying system.
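A periodic cleanup could be as simple as the following sketch (the one-second interval is arbitrary):

import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import com.github.benmanes.caffeine.cache.Cache;

class PeriodicCleanup {
  // Runs any pending maintenance even when the cache itself sees no traffic.
  static void schedule(Cache<?, ?> cache, ScheduledExecutorService scheduler) {
    scheduler.scheduleWithFixedDelay(cache::cleanUp, 1, 1, TimeUnit.SECONDS);
  }
}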

Sorry for my misunderstandings early on. Does that answer your questions?

@jakubzytka
Author

But now I see what you said all along: there is an incorrect assumption that the maximum size is a strict threshold that the cache will not exceed.

I'm sorry, I still think we may not be on the same page :)
There is no assumption that the maximum size is a strict threshold that the cache doesn't exceed.
The assumption is that when the cache does exceed the maximum size, it eventually comes back down to it under a read-only load.

There is a hard limit on the cache size in Solr's code, but it is enforced differently (I used getAndUpdate for that; I guarantee that the cache size never exceeds 32768).
I expect the cache size to be reduced to maximumSize when only getIfPresent operations are executed.

IIUC that expectation holds only as long as getIfPresent finds an entry? That's tricky :)
Is there an upper bound for the excess?

Thanks for the hint about a scheduler. I'll try to figure out some reasonable solution.

@ben-manes
Owner

We can certainly add a hint on a miss to check and drain if needed. It simply never came up before as a problem :)

We would change the current logic,

public @Nullable V getIfPresent(Object key, boolean recordStats) {
  Node<K, V> node = data.get(nodeFactory.newLookupKey(key));
  if (node == null) {
    if (recordStats) {
      statsCounter().recordMisses(1);
    }
    return null;
  }
  ...

to something like,

public @Nullable V getIfPresent(Object key, boolean recordStats) {
  Node<K, V> node = data.get(nodeFactory.newLookupKey(key));
  if (node == null) {
    if (recordStats) {
      statsCounter().recordMisses(1);
    }
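    // on a miss, trigger any pending maintenance so evictions are not deferred indefinitely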
    if (drainStatus() == REQUIRED) {
      scheduleDrainBuffers();
    }
    return null;
  }
  ...

and write a unit test to assert the behavior. If that would solve your problem then I can look into it tonight.

@jakubzytka
Author

jakubzytka commented May 18, 2020

It simply never came up before as a problem :)

yeah, I can perfectly see why :)

I'd appreciate it if this use case could be supported by caffeine.
Obviously, if such a change impacted performance negatively, for example, I would understand any reluctance to make it.

There is one more tricky thing... I do have a stack trace where draining the buffers was scheduled during the perceived "stuck cache" period... I'll recheck the metrics; maybe I missed something (like an accidental cache hit).

@ben-manes
Owner

This won't hurt performance, and we similarly trigger it if the retrieved entry is expired. It will of course be best effort, so the read of the drain status might be stale due to memory-barrier piggybacking on an initial call or other races. You can build locally to test with that snippet if desired, or else I'll have a snapshot for you tonight to verify with prior to a new release.

@ben-manes
Owner

Please try the snapshot, e.g. against your test case, to see if that satisfies your needs. If so then I'll cut a release. Otherwise reopen if there's more for us to dig into.

@jakubzytka
Author

I confirmed 2.8.4-SNAPSHOT works both in the synthetic example given above and in my real-life scenario.

Thank you very much for your outstanding help!

@ben-manes
Owner

Released 2.8.4!
