Queue capacity exceeded #57

Closed
nosideeffects opened this issue Mar 7, 2016 · 12 comments
@nosideeffects

I am getting the following exception adding items to cache. Is this an actionable error?

WARNING: Exception thrown when submitting maintenance task
java.util.concurrent.RejectedExecutionException: Queue capacity exceeded
        at java.util.concurrent.ForkJoinPool$WorkQueue.growArray(ForkJoinPool.java:884)
        at java.util.concurrent.ForkJoinPool.externalSubmit(ForkJoinPool.java:2354)
        at java.util.concurrent.ForkJoinPool.externalPush(ForkJoinPool.java:2416)
        at java.util.concurrent.ForkJoinPool.execute(ForkJoinPool.java:2645)
        at com.github.benmanes.caffeine.cache.BoundedLocalCache.scheduleDrainBuffers(BoundedLocalCache.java:816)
        at com.github.benmanes.caffeine.cache.BoundedLocalCache.afterWrite(BoundedLocalCache.java:799)
        at com.github.benmanes.caffeine.cache.BoundedLocalCache.putFast(BoundedLocalCache.java:1360)
        at com.github.benmanes.caffeine.cache.BoundedLocalCache.put(BoundedLocalCache.java:1296)
        at com.github.benmanes.caffeine.cache.LocalManualCache.put(LocalManualCache.java:64)
@ben-manes
Owner

This is surprising, because the maximum capacity of the work queue is 64M. The ForkJoinPool commonPool should spin up more threads to handle the load, so this wasn't expected. The logging was intended for the case where someone supplies their own bounded executor, to indicate a resource constraint.

This should be benign, unless you have a RemovalListener. Currently, if that task is rejected, it logs an error and drops the event. It would probably be better to fall back to executing on the caller's thread instead, so there is a performance loss but no correctness problem (e.g. a resource leak).
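For what it's worth, here is a minimal sketch of the caller-runs idea from the user's side, not the internal fix: it supplies a bounded ThreadPoolExecutor with CallerRunsPolicy via Caffeine.executor(...), so work that would otherwise be rejected runs on the submitting thread instead of being dropped. The class name and the pool/queue sizes are illustrative only.

```java
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class CallerRunsExecutorExample {
  public static void main(String[] args) {
    // Bounded executor that runs a task on the submitting thread instead of
    // rejecting it when the queue fills up (CallerRunsPolicy).
    ThreadPoolExecutor executor = new ThreadPoolExecutor(
        1, 4, 60, TimeUnit.SECONDS,
        new ArrayBlockingQueue<>(1_000),
        new ThreadPoolExecutor.CallerRunsPolicy());

    Cache<String, String> cache = Caffeine.newBuilder()
        .maximumSize(10_000)
        .removalListener((key, value, cause) ->
            System.out.println("removed " + key + " (" + cause + ")"))
        .executor(executor) // maintenance and removal notifications run here
        .build();

    cache.put("a", "1");
    cache.invalidate("a");
    executor.shutdown();
  }
}
```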

@ben-manes
Owner

Is there any chance you can provide a test program to reproduce this? I'm not sure how you were able to exhaust the work queue, or whether there's much I can do about it. I'll fix the removal listener issue, though.

@nosideeffects
Author

I will see if I can extract a reproduction. The code in question seems to put pretty infrequently (like once per web request), so this is extremely strange.

@ben-manes
Owner

The executor is used by Caffeine for periodic maintenance (after a write or a batch of reads), removal notifications, refresh / refreshAfterWrite, and AsyncLoadingCache futures. In your case of a manual cache, that means only periodic maintenance or removal notifications should be enqueued by the cache.

Periodic maintenance is guarded by a tryLock and a simple state machine. In general that should result in only one maintenance task queued up, or a little less than one per write, so that there aren't too many filling the work queue. I did experiment with avoiding any duplicates, but my write benchmarks didn't show scheduling that was excessive enough to be a problem.

The actual maintenance work is pretty small, except that it calls into a CacheWriter#delete on eviction. If the delete is excessively slow and writes keep coming in, then I could see a potential problem. A slow writer should be replaced with one that mostly queues up the work for asynchronous processing. If that's the issue then we could update the documentation to reflect best usage patterns.
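A rough sketch of that pattern, assuming the 2.x CacheWriter API: the expensive cleanup is handed off to a separate executor so eviction (and the maintenance task driving it) is not blocked. The cleanupExecutor and slowCleanup names are hypothetical.

```java
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.CacheWriter;
import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.cache.RemovalCause;

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class AsyncDeleteWriterExample {
  // Illustrative single-threaded executor that drains the slow cleanup work
  private static final ExecutorService cleanupExecutor =
      Executors.newSingleThreadExecutor();

  public static void main(String[] args) {
    Cache<String, byte[]> cache = Caffeine.newBuilder()
        .maximumSize(1_000)
        .writer(new CacheWriter<String, byte[]>() {
          @Override public void write(String key, byte[] value) {
            // nothing expensive on the write path
          }
          @Override public void delete(String key, byte[] value, RemovalCause cause) {
            // Hand the expensive work off so the delete call returns quickly
            cleanupExecutor.execute(() -> slowCleanup(key, cause));
          }
        })
        .build();

    cache.put("k", new byte[64]);
    cache.invalidate("k");
    cleanupExecutor.shutdown();
  }

  // Hypothetical slow work, e.g. deleting a file or a database row
  private static void slowCleanup(String key, RemovalCause cause) {
    System.out.println("cleaned up " + key + " due to " + cause);
  }
}
```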

@nosideeffects
Author

So we do have a long-lived cache for each web thread, which we invalidateAll at the beginning of each request. Could this be causing the issue? Should we be building a new cache instead?

@ben-manes
Owner

invalidateAll() should be safe. I think invalidateAll(keys) could use some tuning.

Are these caches thread local? If so a bounded LinkedHashMap might be best regardless.

I think if we can reproduce the issue then I should be able to make the cache more resilient.

@nosideeffects
Author

The caches are ThreadLocal, and they often contain only a few entries. I haven't been able to reproduce this locally yet.

@ben-manes
Owner

I cleaned up the stress test to help investigate this. It prints out useful information like the ForkJoinPool pending queue size.
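(For reference, a minimal snippet, separate from the stress test itself, that prints the kind of commonPool statistics mentioned here:)

```java
import java.util.concurrent.ForkJoinPool;

public class PoolStats {
  public static void main(String[] args) throws InterruptedException {
    ForkJoinPool pool = ForkJoinPool.commonPool();
    for (int i = 0; i < 5; i++) {
      // The submission count approximates the pending external work queue size
      System.out.printf("pending submissions=%d, queued tasks=%d, active=%d, size=%d%n",
          pool.getQueuedSubmissionCount(), pool.getQueuedTaskCount(),
          pool.getActiveThreadCount(), pool.getPoolSize());
      Thread.sleep(1000);
    }
  }
}
```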

For write-heavy workloads the state machine isn't perfect, so additional tasks do get scheduled. However, that mostly doesn't occur if there is any intermediate work, like a Thread.yield(), which minimizes the race and lets the maintenance work catch up. If we don't yield then the maintenance task is de-prioritized, as the other threads fight for all of the CPU time, and the cache can't catch up with the work. This results in the known problem of the cache exceeding its maximum size, but only in that synthetic case, and it shows that the executor's work queue is similarly lagging behind.

With a perfected state machine there is no race for the executor's work queue. Then, even in the synthetic non-yielding case, the queue stays empty, though as described before the cache can't evict fast enough to keep up with the writes. Again, that issue isn't a real concern because it requires the machine be dedicated to doing nothing but writing into the cache.

So the state machine change might help in your case and be a decent safety net. I'll play with it a little more and probably add it into the next release. It has a tiny performance penalty, but not enough to be noticeable outside of a micro-benchmark.

However, I think your resolution is best served by asking why a concurrent cache is being used as a thread-local cache. A cache usually isn't write-heavy, nor cleared so frequently. It sounds like you need a bounded scratchpad, so a LinkedHashMap in LRU mode is probably both faster and a better fit. Can you explain the use case?
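For illustration, a bounded LRU scratchpad along those lines; the class name and capacity are arbitrary:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// A bounded map in LRU (access-order) mode: once the capacity is exceeded,
// the least recently accessed entry is dropped automatically.
public class LruScratchpad<K, V> extends LinkedHashMap<K, V> {
  private final int maxEntries;

  public LruScratchpad(int maxEntries) {
    super(16, 0.75f, /* accessOrder */ true);
    this.maxEntries = maxEntries;
  }

  @Override
  protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
    return size() > maxEntries;
  }
}
```

Since the caches are thread-local, no synchronization is needed, and clear() at the start of each request would take the place of invalidateAll().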

@ben-manes
Owner

Integrated the improved scheduling, but feel free to continue the discussion.

ben-manes added a commit that referenced this issue Mar 9, 2016
The simpler state machine and tryLock guard protected against most
unnecessary scheduling of the maintenance task. Extra scheduling could
still occur, though, due to some benign races and a high write rate,
which pronounced the effect of a clogged executor. The perfected
scheduling has a negligible cost while ensuring that even in synthetic
stress tests no excess scheduling occurs.
@nosideeffects
Author

Thanks for looking into this so much! I've taken your suggestion and used a LinkedHashMap in this instance instead, as there was no need for a concurrent structure. No doubt your improvement to the scheduling will help others, and it should prevent this issue from occurring in our caches that do need concurrent access.

@ben-manes
Owner

Thanks for the update. I hope to finish #56 this weekend and cut a release. I've been a little slow on that one because it has an API change, and I'm unsure if there is a better method name. If you can take a look and offer a suggestion I'd appreciate it.

@ben-manes
Owner

Released 2.2.3 with this improvement. Thanks again for the bug report.
