Queue capacity exceeded #57

Closed
nosideeffects opened this issue Mar 7, 2016 · 12 comments
@nosideeffects

I am getting the following exception adding items to cache. Is this an actionable error?

WARNING: Exception thrown when submitting maintenance task
java.util.concurrent.RejectedExecutionException: Queue capacity exceeded
        at java.util.concurrent.ForkJoinPool$WorkQueue.growArray(ForkJoinPool.java:884)
        at java.util.concurrent.ForkJoinPool.externalSubmit(ForkJoinPool.java:2354)
        at java.util.concurrent.ForkJoinPool.externalPush(ForkJoinPool.java:2416)
        at java.util.concurrent.ForkJoinPool.execute(ForkJoinPool.java:2645)
        at com.github.benmanes.caffeine.cache.BoundedLocalCache.scheduleDrainBuffers(BoundedLocalCache.java:816)
        at com.github.benmanes.caffeine.cache.BoundedLocalCache.afterWrite(BoundedLocalCache.java:799)
        at com.github.benmanes.caffeine.cache.BoundedLocalCache.putFast(BoundedLocalCache.java:1360)
        at com.github.benmanes.caffeine.cache.BoundedLocalCache.put(BoundedLocalCache.java:1296)
        at com.github.benmanes.caffeine.cache.LocalManualCache.put(LocalManualCache.java:64)
@ben-manes
Owner

This is surprising, because the maximum capacity of the work queue is 64M. The ForkJoinPool commonPool should spin up more threads to handle the load, so this wasn't expected. The logging was intended for the case where someone supplies their own bounded executor, to indicate a resource constraint.

This should be benign, unless you have a RemovalListener. Currently, if that task is rejected, it logs an error and drops the event. It would probably be better to fall back to executing on the caller's thread instead, so there is a performance loss but no correctness problem (e.g. a resource leak).
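For what it's worth, here is a minimal sketch of the caller-runs idea from the user's side, not the internal fix: it supplies a bounded ThreadPoolExecutor with CallerRunsPolicy via Caffeine.executor(...), so work that would otherwise be rejected runs on the submitting thread instead of being dropped. The class name and the pool/queue sizes are illustrative only.

```java
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class CallerRunsExecutorExample {
  public static void main(String[] args) {
    // Bounded executor that runs a task on the submitting thread instead of
    // rejecting it when the queue fills up (CallerRunsPolicy).
    ThreadPoolExecutor executor = new ThreadPoolExecutor(
        1, 4, 60, TimeUnit.SECONDS,
        new ArrayBlockingQueue<>(1_000),
        new ThreadPoolExecutor.CallerRunsPolicy());

    Cache<String, String> cache = Caffeine.newBuilder()
        .maximumSize(10_000)
        .removalListener((key, value, cause) ->
            System.out.println("removed " + key + " (" + cause + ")"))
        .executor(executor) // maintenance and removal notifications run here
        .build();

    cache.put("a", "1");
    cache.invalidate("a");
    executor.shutdown();
  }
}
```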

@ben-manes
Owner

Is there any chance you can provide a test program to reproduce this? I'm not sure how you were able to exhaust the work queue, or whether there's much I can do about it. I'll fix the removal listener issue, though.

@nosideeffects
Author

I will see if I can extract a reproduction. The code in question seems to put pretty infrequently (like once per web request), so this is extremely strange.

@ben-manes
Owner

The executor is used by Caffeine for periodic maintenance (after a write or a batch of reads), removal notifications, refresh / refreshAfterWrite, and AsyncLoadingCache futures. In your case of a manual cache, that means only periodic maintenance or removal notifications should be enqueued by the cache.

Periodic maintenance is guarded by a tryLock and a simple state machine. In general that should result in only one maintenance task queued up, or a little less than one per write, so that there aren't too many filling the work queue. I did experiment with avoiding any duplicates, but my write benchmarks didn't show scheduling that was excessive enough to be a problem.

The actual maintenance work is pretty small, except that it calls into a CacheWriter#delete on eviction. If the delete is excessively slow and writes keep coming in, then I could see a potential problem. A slow writer should be replaced with one that mostly queues up the work for asynchronous processing. If that's the issue then we could update the documentation to reflect best usage patterns.
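A rough sketch of that pattern, assuming the 2.x CacheWriter API: the expensive cleanup is handed off to a separate executor so eviction (and the maintenance task driving it) is not blocked. The cleanupExecutor and slowCleanup names are hypothetical.

```java
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.CacheWriter;
import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.cache.RemovalCause;

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class AsyncDeleteWriterExample {
  // Illustrative single-threaded executor that drains the slow cleanup work
  private static final ExecutorService cleanupExecutor =
      Executors.newSingleThreadExecutor();

  public static void main(String[] args) {
    Cache<String, byte[]> cache = Caffeine.newBuilder()
        .maximumSize(1_000)
        .writer(new CacheWriter<String, byte[]>() {
          @Override public void write(String key, byte[] value) {
            // nothing expensive on the write path
          }
          @Override public void delete(String key, byte[] value, RemovalCause cause) {
            // Hand the expensive work off so the delete call returns quickly
            cleanupExecutor.execute(() -> slowCleanup(key, cause));
          }
        })
        .build();

    cache.put("k", new byte[64]);
    cache.invalidate("k");
    cleanupExecutor.shutdown();
  }

  // Hypothetical slow work, e.g. deleting a file or a database row
  private static void slowCleanup(String key, RemovalCause cause) {
    System.out.println("cleaned up " + key + " due to " + cause);
  }
}
```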

@nosideeffects
Author

So we do have a long-lived cache for each web thread, which we invalidateAll at the beginning of each request. Could this be causing the issue? Should we be building a new cache instead?

@ben-manes
Owner

invalidateAll() should be safe. I think invalidateAll(keys) could use some tuning.

Are these caches thread local? If so a bounded LinkedHashMap might be best regardless.

I think if we can reproduce the issue then I should be able to make the cache more resilient.

@nosideeffects
Author

The caches are ThreadLocal, and they often contain only a few entries. I haven't been able to reproduce this locally yet.

@ben-manes
Owner

I cleaned up the stress test to help investigate this. It prints out useful information like the ForkJoinPool pending queue size.
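(For reference, a minimal snippet, separate from the stress test itself, that prints the kind of commonPool statistics mentioned here:)

```java
import java.util.concurrent.ForkJoinPool;

public class PoolStats {
  public static void main(String[] args) throws InterruptedException {
    ForkJoinPool pool = ForkJoinPool.commonPool();
    for (int i = 0; i < 5; i++) {
      // The submission count approximates the pending external work queue size
      System.out.printf("pending submissions=%d, queued tasks=%d, active=%d, size=%d%n",
          pool.getQueuedSubmissionCount(), pool.getQueuedTaskCount(),
          pool.getActiveThreadCount(), pool.getPoolSize());
      Thread.sleep(1000);
    }
  }
}
```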

For write-heavy workloads the state machine isn't perfect, so additional tasks do get scheduled. However, that mostly doesn't occur if there is any intermediate work, like a Thread.yield(), which minimizes the race and lets the maintenance work catch up. If we don't yield then the maintenance task is de-prioritized, as the other threads fight for all of the CPU time, and the cache can't catch up with the work. This results in the known problem of the cache exceeding its maximum size, but only in that synthetic case, and it shows that the executor's work queue is similarly lagging behind.

With a perfected state machine there is no race for the executor's work queue. Then, even in the synthetic non-yielding case, the queue stays empty, though as described before the cache can't evict fast enough to keep up with the writes. Again, that issue isn't a real concern because it requires the machine be dedicated to doing nothing but writing into the cache.

So the state machine change might help in your case and be a decent safety net. I'll play with it a little more and probably add it into the next release. It has a tiny performance penalty, but not enough to be noticeable outside of a micro-benchmark.

However, I think your resolution is best served by asking why a concurrent cache is being used as a thread-local cache. A cache usually isn't write-heavy, nor cleared so frequently. It sounds like you need a bounded scratchpad, so a LinkedHashMap in LRU mode is probably both faster and a better fit. Can you explain the use case?
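For illustration, a bounded LRU scratchpad along those lines; the class name and capacity are arbitrary:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// A bounded map in LRU (access-order) mode: once the capacity is exceeded,
// the least recently accessed entry is dropped automatically.
public class LruScratchpad<K, V> extends LinkedHashMap<K, V> {
  private final int maxEntries;

  public LruScratchpad(int maxEntries) {
    super(16, 0.75f, /* accessOrder */ true);
    this.maxEntries = maxEntries;
  }

  @Override
  protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
    return size() > maxEntries;
  }
}
```

Since the caches are thread-local, no synchronization is needed, and clear() at the start of each request would take the place of invalidateAll().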

@ben-manes
Owner

Integrated the improved scheduling, but feel free to continue the discussion.

ben-manes added a commit that referenced this issue Mar 9, 2016
The simpler state machine and tryLock guard protected against most
unnecessary scheduling of the maintenance task. Extra scheduling could
still occur, though, due to some benign races and a high write rate,
which pronounced the effect of a clogged executor. The perfected
scheduling has a negligible cost while ensuring that even in synthetic
stress tests no excess scheduling occurs.
@nosideeffects
Author

Thanks for looking into this so much! I've taken your suggestion and used a LinkedHashMap in this instance instead, as there was no need for a concurrent structure. No doubt your improvement to the scheduling will help others, and it should prevent this issue from occurring in our caches that do need concurrent access.

@ben-manes
Owner

Thanks for the update. I hope to finish #56 this weekend and cut a release. I've been a little slow on that one because it has an API change, and I'm unsure if there is a better method name. If you can take a look and offer a suggestion I'd appreciate it.

@ben-manes
Owner

Released 2.2.3 with this improvement. Thanks again for the bug report.
