Initial support for secondary cache in LRUCache #8271
anand1976 wants to merge 9 commits into facebook:master from
Conversation
pdillinger
left a comment
Although the design is general, it seems to make the "simple" or "normal" case complicated.
Normally, we store a mix of "parsed" (e.g. ParsedFullFilterBlock or Block) and "no parsing necessary" (BlockContents) data structures in block cache.
I expect the "normal" case for tiered cache to be storing just the BlockContents, perhaps with recompression or something. Do we have plans for something more elaborate?
Did we consider an alternate design like this: instead of adding an "unparsing" layer to get block cache data to tiered cache, codify the parsing part of the volatile tier into the block cache itself, so that the tiered cache has direct access to the unparsed version (BlockContents)? It seems like this would involve exposing some CacheableEntry / BlocklikeTraits functionality into block cache (seems reasonable) and save the ugliness (and execution cost?) of coding up (and executing) an unparsing step.
cache/lru_cache.cc
Outdated
LRUCache is among the most performance-sensitive pieces of code we have. Have you evaluated the performance impact of these changes (e.g. cache_bench)? Do you see an efficient path to planned changes?
Good point. I'll run benchmarks to check for regression.
cache/lru_cache.h
Outdated
It also seems awkward to wedge this new capability into one implementation, rather than extending the interface to (perhaps) more naturally support the data.
Except perhaps for Cache API compatibility, which I would argue is not so critical.
We store BlockContents only in the compressed block cache. The uncompressed block cache mostly has objects that require some parsing.
For the time being, yes. We may be able to do something more elaborate with PM, like a deep copy, so it'd be good to retain the flexibility.
Interesting. I hadn't considered this design. I'm not sure the block cache would need to be exposed to the BlocklikeTraits functionality. The Insert API could just take a pointer to the raw data, right, since the assumption is that no special parsing is needed? I'm not too concerned about the execution cost of "unparsing", since it's just extracting the raw data pointer. The bigger overhead is memcpy'ing the data. OTOH, I think it's too restrictive to assume that the tiered cache would only ever need access to the BlockContents/raw data.
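A minimal sketch of the "unparsing" callbacks being discussed here: a small helper struct of function pointers the cache could invoke to serialize a parsed in-memory object back to raw bytes for the secondary tier. All names below (ItemHelper, ParsedBlock, ParsedBlockSaveTo) are illustrative assumptions for the discussion, not RocksDB's actual API.

```cpp
// Illustrative sketch only: ItemHelper, ParsedBlock, and the callback
// names are assumptions for this discussion, not RocksDB's actual API.
#include <cstddef>
#include <cstring>
#include <string>

// Callbacks the cache could use to "unparse" an object before handing
// its raw bytes to the secondary tier.
using SizeCallback = size_t (*)(void* obj);
using SaveToCallback = bool (*)(void* obj, size_t offset, size_t size,
                                char* out);

struct ItemHelper {
  SizeCallback size_cb;
  SaveToCallback saveto_cb;
};

// Stand-in for a parsed block that still holds its raw contents
// (e.g. the BlockContents data mentioned above).
struct ParsedBlock {
  std::string raw;
  // ... parsed indexes, restart points, etc. would live here ...
};

size_t ParsedBlockSize(void* obj) {
  return static_cast<ParsedBlock*>(obj)->raw.size();
}

bool ParsedBlockSaveTo(void* obj, size_t offset, size_t size, char* out) {
  const std::string& raw = static_cast<ParsedBlock*>(obj)->raw;
  if (offset + size > raw.size()) return false;
  // "Unparsing" is just exposing the raw pointer; the dominant cost is
  // this memcpy into the secondary cache's buffer.
  std::memcpy(out, raw.data() + offset, size);
  return true;
}
```

This illustrates the point above: the unparsing step itself is cheap (a pointer extraction), while the memcpy into the secondary tier's buffer is the real overhead.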
@anand1976 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
zhichao-cao
left a comment
Thanks for working on this. In general it looks good to me. The title and description need to be updated to "Secondary cache". Since it is experimental, maybe we do not need to log the change in HISTORY.md for now?
@anand1976 has updated the pull request. You must reimport the pull request before landing.
pdillinger
left a comment
I didn't dig into tests, but I'm now satisfied with the interface and indications of no notable performance regression.
include/rocksdb/cache.h
Outdated
With the waiting functionality, consider (probably in a future PR) whether it's worth fixing the race condition between a failed lookup and the subsequent insert. Although our "redundant" block cache insertion statistics suggest it's not a problem currently, with more moves toward remote storage, we could run into more "thundering herd" type problems. My prototype is here: pdillinger@513affa#diff-e9d84127b0d5363ff842cce3fca54dfbffec5d9955f810e7678d69c5c5904e3fR227
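A hedged sketch (not the linked prototype) of one way to close that race: the first thread to miss on a key registers it as pending and performs the fetch; later threads wait for that insert to complete instead of all fetching the same block. The class and method names are illustrative, not an actual RocksDB API.

```cpp
// Sketch of a per-key "pending insert" guard to avoid thundering herds.
// Names (PendingInserts, ClaimOrWait, DoneInsert) are illustrative.
#include <condition_variable>
#include <mutex>
#include <set>
#include <string>

class PendingInserts {
 public:
  // Returns true if the caller is the first to miss on `key` and should
  // perform the fetch+insert; otherwise blocks until the owning thread
  // calls DoneInsert(), then returns false so the caller can retry the
  // cache lookup.
  bool ClaimOrWait(const std::string& key) {
    std::unique_lock<std::mutex> lock(mu_);
    if (pending_.insert(key).second) {
      return true;  // caller owns the fetch
    }
    cv_.wait(lock, [&] { return pending_.count(key) == 0; });
    return false;
  }

  void DoneInsert(const std::string& key) {
    std::lock_guard<std::mutex> lock(mu_);
    pending_.erase(key);
    cv_.notify_all();  // wake any threads waiting on this key
  }

 private:
  std::mutex mu_;
  std::condition_variable cv_;
  std::set<std::string> pending_;
};
```

A single condition variable shared across keys keeps the sketch simple at the cost of spurious wakeups; a sharded design would reduce contention.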
@anand1976 merged this pull request in feb06e8.
…ache (#8315)

Summary:
Secondary cache is implemented to achieve the secondary cache tier for block cache. New Insert and Lookup APIs were introduced in #8271. To support and use the secondary cache in the block based table reader, this PR introduces the corresponding callback functions that will be used in the secondary cache, and updates the Insert and Lookup APIs accordingly.

Benchmarking:
./db_bench --benchmarks="fillrandom" -num=1000000 -key_size=32 -value_size=256 -use_direct_io_for_flush_and_compaction=true -db=/tmp/rocks_t/db -partition_index_and_filters=true
./db_bench -db=/tmp/rocks_t/db -use_existing_db=true -benchmarks=readrandom -num=1000000 -key_size=32 -value_size=256 -use_direct_reads=true -cache_size=1073741824 -cache_numshardbits=5 -cache_index_and_filter_blocks=true -read_random_exp_range=17 -statistics -partition_index_and_filters=true -stats_dump_period_sec=30 -reads=50000000

master benchmarking results:
readrandom : 3.923 micros/op 254881 ops/sec; 33.4 MB/s (23849796 of 50000000 found)
rocksdb.db.get.micros P50 : 2.820992 P95 : 5.636716 P99 : 16.450553 P100 : 8396.000000 COUNT : 50000000 SUM : 179947064

Current PR benchmarking results:
readrandom : 4.083 micros/op 244925 ops/sec; 32.1 MB/s (23849796 of 50000000 found)
rocksdb.db.get.micros P50 : 2.967687 P95 : 5.754916 P99 : 15.665912 P100 : 8213.000000 COUNT : 50000000 SUM : 187250053

About 3.8% throughput reduction. P50: 5.2% increase, P95: 2.09% increase, P99: 4.77% improvement.

Pull Request resolved: #8315
Test Plan: added the testing case
Reviewed By: anand1976
Differential Revision: D28599774
Pulled By: zhichao-cao
fbshipit-source-id: 098c4df0d7327d3a546df7604b2f1602f13044ed
Defined the abstract interface for a secondary cache in include/rocksdb/secondary_cache.h, and updated LRUCacheOptions to take a std::shared_ptr to the secondary cache. An item is initially inserted into the LRU (primary) cache. When it ages out and is evicted from memory, it is inserted into the secondary cache. On an LRU cache miss and a successful lookup in the secondary cache, the item is promoted to the LRU cache. Only synchronous lookup is currently supported. The secondary cache could be used to implement a persistent (flash) cache or a compressed cache.
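The insertion/eviction/promotion flow described above can be sketched with a toy two-tier cache. This is a simplified illustration, not the actual implementation: the real code works on LRU handles and sharded hash tables, and the names (TieredCache, InSecondary) and int-valued entries are assumptions for clarity.

```cpp
// Toy sketch of the primary/secondary flow: evicted entries move to the
// secondary tier; a secondary-tier hit promotes the entry back.
#include <list>
#include <string>
#include <unordered_map>

class TieredCache {
 public:
  explicit TieredCache(size_t primary_capacity)
      : capacity_(primary_capacity) {}

  void Insert(const std::string& key, int value) {
    Touch(key);
    primary_[key] = value;
    EvictIfNeeded();
  }

  // Returns true and sets *value on a hit in either tier.
  bool Lookup(const std::string& key, int* value) {
    auto it = primary_.find(key);
    if (it != primary_.end()) {
      Touch(key);
      *value = it->second;
      return true;
    }
    auto sit = secondary_.find(key);
    if (sit != secondary_.end()) {
      *value = sit->second;
      secondary_.erase(sit);
      Insert(key, *value);  // promote into the primary (LRU) cache
      return true;
    }
    return false;
  }

  bool InSecondary(const std::string& key) const {
    return secondary_.count(key) > 0;
  }

 private:
  void Touch(const std::string& key) {
    lru_.remove(key);       // O(n); fine for a sketch
    lru_.push_front(key);   // front = most recently used
  }

  void EvictIfNeeded() {
    while (primary_.size() > capacity_) {
      const std::string victim = lru_.back();
      lru_.pop_back();
      // Aged-out entries move to the secondary tier instead of being dropped.
      secondary_[victim] = primary_[victim];
      primary_.erase(victim);
    }
  }

  size_t capacity_;
  std::list<std::string> lru_;
  std::unordered_map<std::string, int> primary_;
  std::unordered_map<std::string, int> secondary_;
};
```

Note that promotion itself can push the primary tier over capacity, which in turn demotes the coldest primary entry to the secondary tier.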
Tests:
Results from cache_bench and db_bench don't show any regression due to these changes.
cache_bench results before and after this change -
Command:
./cache_bench -ops_per_thread=10000000 -threads=1

Before:
Complete in 40.688 s; QPS = 245774
Complete in 40.486 s; QPS = 246996
Complete in 42.019 s; QPS = 237989

After:
Complete in 40.672 s; QPS = 245869
Complete in 44.622 s; QPS = 224107
Complete in 42.445 s; QPS = 235599

db_bench results before this change, and with this change + #8213 and #8191 -

Commands:
./db_bench --benchmarks="fillseq,compact" -num=30000000 -key_size=32 -value_size=256 -use_direct_io_for_flush_and_compaction=true -db=/home/anand76/nvm_cache/db -partition_index_and_filters=true
./db_bench -db=/home/anand76/nvm_cache/db -use_existing_db=true -benchmarks=readrandom -num=30000000 -key_size=32 -value_size=256 -use_direct_reads=true -cache_size=1073741824 -cache_numshardbits=6 -cache_index_and_filter_blocks=true -read_random_exp_range=17 -statistics -partition_index_and_filters=true -threads=16 -duration=300

Before:
After: