Expiry after Access / Time to Idle / Idle Scan #39
Support of expiry after an entry is accessed is missing in core cache2k.
What is it?
An entry expires and is removed from the cache after not being accessed for a period of time. The main use is freeing resources and shrinking the cache when the application is not in use.
The feature is available in the JCache support. However, it is not implemented very efficiently, and the standard JCache functionality is not very practical: JCache defines TTI, but it cannot be combined with expire after write.
Why isn't it available in cache2k yet?
We almost never need it. Typically we run caches with a TTL between 30 seconds and 30 minutes, which means that entries expire via the TTL setting and the cache shrinks. TTI would be needed if entries never expire via TTL or have a higher TTL than TTI. For example, a web resource might be allowed to be cached for 6 days; however, if it is not accessed, we would like to remove it from memory after one hour.
The second reason is the overhead. TTI is typically implemented via an LRU style linked list. This brings lock contention and concurrency issues.
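To illustrate where the overhead comes from, here is a minimal sketch of the classic approach, assuming an access-ordered `LinkedHashMap` as the LRU list and a caller-supplied clock value. The class name `LinkedListTti` and its methods are hypothetical and not part of cache2k. Note that every read has to take the lock and relink the entry to the tail of the list, which is exactly the contention point mentioned above.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of list-based time-to-idle; not cache2k code. */
class LinkedListTti<K, V> {
  private static final class Slot<V> {
    final V value;
    long lastAccess;
    Slot(V value, long lastAccess) { this.value = value; this.lastAccess = lastAccess; }
  }

  private final long idleMillis;
  // accessOrder=true: get() and put() move the entry to the tail of the list,
  // so the head is always the entry that was idle the longest.
  private final LinkedHashMap<K, Slot<V>> map = new LinkedHashMap<>(16, 0.75f, true);

  LinkedListTti(long idleMillis) { this.idleMillis = idleMillis; }

  // Even a read mutates the shared list, so it needs the same lock as a write.
  synchronized V get(K key, long now) {
    expire(now);
    Slot<V> s = map.get(key);
    if (s == null) return null;
    s.lastAccess = now;
    return s.value;
  }

  synchronized void put(K key, V value, long now) {
    expire(now);
    map.put(key, new Slot<>(value, now));
  }

  synchronized int size(long now) {
    expire(now);
    return map.size();
  }

  // Remove entries from the head until one is found that is not idle yet.
  // Iterating the entry set does not count as an access and keeps the order.
  private void expire(long now) {
    Iterator<Map.Entry<K, Slot<V>>> it = map.entrySet().iterator();
    while (it.hasNext()) {
      Slot<V> s = it.next().getValue();
      if (now - s.lastAccess < idleMillis) break;
      it.remove();
    }
  }
}
```

The sketch assumes that `now` never goes backwards; otherwise the list order and the access timestamps would disagree.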
If needed several options are available via the existing cache2k features, e.g. the expiry can be modified in a cache request via the entry processor.
We also plan to do an efficient implementation of a TTI approximation which is based on the eviction algorithm.
@ricardojlrufino Thanks for reaching out! It's good timing, since I have now stabilized the next version (1.2) and need to do some planning about which important things to focus on next.
Two reasons it's not there yet: we rarely need it, and providing this feature always requires a compromise on the performance we have achieved.
To drive it in the right direction, I'd love to hear from people who actually need it. Can you share something about the scenario you want to use it for?
@cruftex My application is in the IoT space. I cache the devices in use and, after a while, I need to remove them from memory.
@cruftex This is a very important feature that I need now. I have more than 100 million elements to cache, but not all of them are accessed continuously (~5 million). The cache eats my RAM (> 70 GB), so I cannot run two instances simultaneously (total 125 GB).
@detohaz, @ricardojlrufino: Thanks for your feedback! Unfortunately, I am a bit held up with other activities but there will be some progress here during 2019... For the use cases you describe I still wonder whether TTI is really the best solution. So I'd like to move in two directions here: implement TTI, but also maybe come up with a better cache feature for 90% of the use cases that use TTI now.
@detohaz: You know that around 5 million entries are accessed continuously and should therefore be cached. What about limiting the cache size to 8 million entries to keep the RAM usage in check?
@detohaz: Did my suggestion to limit the cache size help and solve your problem?
If you use a TTI-like feature, the actual cache size depends on the usage pattern. This can lower your heap memory usage in times of low utilization. However, you don't have a guarantee for an upper memory limit.
I've been tinkering with a neat solution for this over the last days / weeks.
Main configuration parameter is
Early testing results look promising. Just checked in the current state. The solution has the following advantages over a usual TTI implementation based on a queue / linked list:
I'll add more hard data once the test scenarios stabilize.
…emporal information, #39
Updated the numbers in the table after changes, 23rd Nov
As promised some comparison data. For the comparison I use a recorded trace, which I included in the last commit.
About the trace:
Access trace on the product catalog of an eCommerce website, covering 24 hours from 4am to 4am the next morning. All requests of robots (web spiders from search engines like Google or Bing) were removed, so the trace contains mostly user activity. The requests were reduced by removing a random selection of keys to get a more compact size for inclusion in the project. The first integer of the trace is the time in seconds, the second one the accessed key. Both integers are normalized and start at 0. The trace characteristics are:
For the comparison I implemented the established semantics (removal exactly after passing the idle duration) and ran a playback of the above trace. Next I ran the playback against the current cache2k code via a simulated clock source.
Find the code and traces via the referenced commit above.
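For readers who want to reproduce the idea without the repository, a playback against the established semantics can be sketched roughly as follows. This is not the code from the referenced commit. It assumes the trace is already loaded as an array of `{timeInSeconds, key}` pairs sorted by time, and the names `TtiPlayback`, `Stats` and `replay` are made up for the sketch. It only tracks hit rate and maximum size.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical trace playback with exact time-to-idle removal. */
class TtiPlayback {

  /** Counters collected during one playback run. */
  static final class Stats {
    long requests;
    long hits;
    int maxSize;
    double hitRate() { return requests == 0 ? 0.0 : (double) hits / requests; }
  }

  /**
   * Replays the trace; each element is {timeInSeconds, key}, sorted by time.
   * An entry is removed as soon as it has been idle for ttiSeconds or more.
   */
  static Stats replay(int[][] trace, int ttiSeconds) {
    // key -> time of last access, kept in access order (least recent first)
    LinkedHashMap<Integer, Integer> lastAccess = new LinkedHashMap<>(16, 0.75f, true);
    Stats stats = new Stats();
    for (int[] request : trace) {
      int now = request[0];
      int key = request[1];
      // The trace time acts as the simulated clock: drop everything idle too long.
      Iterator<Map.Entry<Integer, Integer>> it = lastAccess.entrySet().iterator();
      while (it.hasNext()) {
        if (now - it.next().getValue() < ttiSeconds) break;
        it.remove();
      }
      stats.requests++;
      if (lastAccess.containsKey(key)) stats.hits++;
      lastAccess.put(key, now); // insert on miss, refresh and move to tail on hit
      stats.maxSize = Math.max(stats.maxSize, lastAccess.size());
    }
    return stats;
  }
}
```

Usage would look like `TtiPlayback.replay(trace, 60 * 60)` for a one hour time to idle. An average cache size over time could be added by integrating the size over the gaps between requests.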
Established Time To Idle semantics
Time To Idle emulation via scanning in cache2k 2.6
Using about two thirds of the TTI as scan time produces a similar average cache size, while having a better hit rate and a lower maximum cache size. Thus, it saves resources while being more effective.
With the established TTI semantics, resource usage grows or shrinks linearly with the time setting. With the scanning approach the resource numbers are "jumping". This can happen because the results depend on how the scanning activity aligns with the requests.
For the above results cache2k runs with unbounded capacity and the eviction algorithm does not work as usual. Instead of splitting up the data between a cold and hot set, everything is in the hot set, that is, one clock data structure that is scanned through. The next test is combined with normal eviction activity by setting the capacity to 2000 entries.
Time To Idle emulation via scanning in cache2k 2.6 with capacity limit of 2000 entries
While there is more validation and testing work to do, it can be seen that the general direction makes sense and the approach is a valid alternative to the established straightforward TTI implementation.
When we cap the cache size to 2000 entries, the scanning TTI emulation runs in cooperation with the normal eviction, resulting in a lower maximum cache size but not much lower hit rates. That means there is potential for more improvement by making more use of the eviction algorithm, e.g. by removing "idle" entries that have never seen any hits faster.
I see different directions of "improvement":
For now I will stabilize the current approach, which seems "good enough" to make it into production and then we can learn from it.
Did some more testing and fixed a logical error that led to alternating between one round of scanning and one round of pausing. Corrected the numbers in the previous comment.
Here is a histogram of the actual time from last access to eviction for the 45 minutes scan time setting:
The time column represents a time span of 5 minutes.
Idle evictions between 45 and 90 minutes could be expected; however, we see evictions a bit before and after. Longer times happen during a phase of cache growth, shorter times during a phase of cache shrinking.
As an example, let's look at the shrink phase: an entry could be in the middle of the scan round. If most of the entries are removed in that round, the entry would be at the beginning of the next scan round. This means there is a chance that entries are removed much earlier during shrinking.
These outliers are not problematic. On the contrary, the behavior is in sync with the workload phase. During the insert phase (rising activity) it makes sense to keep entries a bit longer; during the shrink phase (ebbing activity) evictions can be done earlier without doing harm.
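The basic "between one and two scan times" behavior can be illustrated with a strongly simplified sketch. It is not the cache2k implementation: it assumes that each entry carries a "touched" flag and that a scan round visits all entries at once at fixed intervals, whereas the real scan is spread over the round and cooperates with the eviction algorithm. The class name `IdleScanSketch` and its methods are hypothetical. Because the sketch scans in one batch, it does not show the shrink or growth outliers discussed above.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

/** Hypothetical, simplified idle scan; single threaded, not cache2k code. */
class IdleScanSketch<K, V> {
  private static final class Slot<V> {
    final V value;
    boolean touched = true; // set on insert and on every access
    Slot(V value) { this.value = value; }
  }

  private final Map<K, Slot<V>> map = new HashMap<>();
  private final long scanTime;
  private long nextScan;

  IdleScanSketch(long scanTime, long startTime) {
    this.scanTime = scanTime;
    this.nextScan = startTime + scanTime;
  }

  // A read only sets a flag; there is no list to relink.
  V get(K key, long now) {
    runDueScans(now);
    Slot<V> s = map.get(key);
    if (s == null) return null;
    s.touched = true;
    return s.value;
  }

  void put(K key, V value, long now) {
    runDueScans(now);
    map.put(key, new Slot<>(value));
  }

  int size(long now) {
    runDueScans(now);
    return map.size();
  }

  // Each due scan evicts entries not touched since the previous scan and
  // clears the flag on the others. An entry therefore survives at least one
  // and at most two scan intervals after its last access.
  private void runDueScans(long now) {
    while (now >= nextScan) {
      Iterator<Slot<V>> it = map.values().iterator();
      while (it.hasNext()) {
        Slot<V> s = it.next();
        if (s.touched) {
          s.touched = false;
        } else {
          it.remove();
        }
      }
      nextScan += scanTime;
    }
  }
}
```

With a scan time of 45 minutes, an entry last accessed just before a scan is evicted after a bit more than 45 minutes of idling, and one accessed just after a scan is evicted after almost 90 minutes.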
Notes on compensating for inserts and removals while scanning:
Inserts: There is no compensation for inserts. We could speed up the scan since the cache size increases, however, this happens at the start of the new round anyway.
Completed and shipped a while ago in: