Add a blob-specific cache priority #10461
Conversation
Summary: RocksDB's `Cache` abstraction currently supports two priority levels for items: high (used for frequently accessed/highly valuable SST metablocks like index/filter blocks) and low (used for SST data blocks). Blobs are typically lower-value targets for caching than data blocks, since 1) with BlobDB, data blocks containing blob references conceptually form an index structure which has to be consulted before we can read the blob value, and 2) cached blobs represent only a single key-value, while cached data blocks generally contain multiple KVs. Since we would like to make it possible to use the same backing cache for the block cache and the blob cache, it would make sense to add a new, lower-than-low cache priority level (bottom level) for blobs so data blocks are prioritized over them. This task is a part of facebook#10156
The previous PR (#10309) was reverted by #10434 because it caused regression tests to fail (degraded performance). Today we reproduced the potential bug and fixed it by comparing the performance gap between the three versions (#10451); this PR contains the latest implementation. The performance test results of the three different binaries were provided by @mdcallag; more details can be found in [bug, revert, fix]
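For context, here is a minimal sketch of how a shared block/blob cache could be configured once this lands. It is illustrative only: the low_pri_pool_ratio field is the one introduced by this PR, the capacity and ratios are arbitrary, and the assumption is that blob insertions then use the new bottom priority as described in the summary.

#include <rocksdb/cache.h>
#include <rocksdb/options.h>
#include <rocksdb/table.h>

rocksdb::Options MakeOptionsWithSharedCache() {
  rocksdb::LRUCacheOptions cache_opts;
  cache_opts.capacity = 1 << 30;         // 1 GiB, illustrative
  cache_opts.num_shard_bits = 6;
  cache_opts.high_pri_pool_ratio = 0.1;  // metablocks (index/filter)
  cache_opts.low_pri_pool_ratio = 0.4;   // data blocks; field added by this PR
  std::shared_ptr<rocksdb::Cache> cache = rocksdb::NewLRUCache(cache_opts);

  rocksdb::BlockBasedTableOptions table_opts;
  table_opts.block_cache = cache;
  table_opts.cache_index_and_filter_blocks = true;
  table_opts.cache_index_and_filter_blocks_with_high_priority = true;

  rocksdb::Options options;
  options.table_factory.reset(rocksdb::NewBlockBasedTableFactory(table_opts));
  options.enable_blob_files = true;
  options.blob_cache = cache;  // blobs share the same backing cache
  return options;
}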
LGTM, thanks @gangliao! Let's wait for the results from the regression tests
if (num_shard_bits >= 20) {
  return nullptr;  // The cache cannot be sharded into too many fine pieces.
}
if (high_pri_pool_ratio < 0.0 || high_pri_pool_ratio > 1.0) {
  // Invalid high_pri_pool_ratio
  return nullptr;
}
if (low_pri_pool_ratio < 0.0 || low_pri_pool_ratio > 1.0) {
  // Invalid high_pri_pool_ratio
Typo: low_pri_pool_ratio
// Percentage of cache reserved for low priority entries.
// If greater than zero, the LRU list will be split into a high-pri list, a
// low-pri list and a bottom-pri list. High-pri entries will be inserted to
// the tail of high-pri list, while low-pri entries will be first inserted to
// the low-pri list (the midpoint) and bottom-pri entries will be first
// inserted to the bottom-pri list.
We might want to adjust/clarify the wording here a bit (and also for high_pri_pool_ratio above); for instance, there are corner cases like the one we hit earlier where high_pri_pool_ratio == 0 but low_pri_pool_ratio > 0. The simplest way of putting it might be to say that high_pri_pool_ratio > 0 results in a dedicated high-priority pool being created, low_pri_pool_ratio > 0 results in a dedicated low-priority pool, and items are inserted into the highest-priority pool that exists and does not exceed the item's priority.
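A tiny illustration of the rule being proposed here (a hypothetical helper for the comment's wording, not actual RocksDB code):

enum class Pri { kBottom, kLow, kHigh };

// Returns the pool an entry would land in: the highest-priority pool that
// both exists and does not exceed the entry's own priority. If no dedicated
// pools are configured, everything falls through to the bottom/common list.
Pri TargetPool(Pri item_pri, bool has_high_pool, bool has_low_pool) {
  if (item_pri == Pri::kHigh && has_high_pool) {
    return Pri::kHigh;
  }
  if (item_pri != Pri::kBottom && has_low_pool) {
    return Pri::kLow;
  }
  return Pri::kBottom;
}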
I don't think we should introduce small passive performance regressions (affecting performance when the new feature is not enabled) for features that haven't been proven necessary. One possibility that I'm not in love with but might be a decent compromise would be to add two bool template parameters to LRUCache for whether certain priority pools exist. This way the appropriately optimized implementation can be used when those pools don't (need to) exist. This code is hot enough that I think we can accept the code size increase.
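Roughly what that template-parameter idea might look like (a sketch only, with hypothetical names and stubbed bodies; not the actual LRUCache declaration):

struct LRUHandle;  // placeholder for the real handle type

template <bool kHasHighPriPool, bool kHasLowPriPool>
class LRUCacheShardImpl {
 public:
  void LRU_Remove(LRUHandle* e) {
    Unlink(e);
    // Pool bookkeeping compiles away entirely when a pool is disabled, so the
    // common no-pool configuration pays no extra cost for the new feature.
    if constexpr (kHasHighPriPool) {
      MaybeAdjustHighPriPoolUsage(e);
    }
    if constexpr (kHasLowPriPool) {
      MaybeAdjustLowPriPoolUsage(e);
    }
  }

 private:
  void Unlink(LRUHandle* /*e*/) {}
  void MaybeAdjustHighPriPoolUsage(LRUHandle* /*e*/) {}
  void MaybeAdjustLowPriPoolUsage(LRUHandle* /*e*/) {}
};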
I'd say let's first establish whether there is actually a performance impact for the common case with the updated code. This is still in progress. (The earlier patch did have a bug that we have since fixed.) Personally, the change doesn't strike me as something that would introduce a significant overhead; then again, I might turn out to be wrong on this one, in which case we can obviously reconsider.
This is among the hottest of hot code, and this change introduces strictly more tracking to be done regardless of enabling the feature. Just look at LRU_Remove code, which alone is 1% of total CPU (nearly 2% of RocksDB CPU) of a large workload looking for CPU improvements and 0.3% of a very large workload. (All of LRUCache is roughly 9% and 2.5% of those workloads.)
I know it's hot - what I'm saying is that I don't expect the cost to be noticeable if the feature is not enabled.
I will share more benchmark results later today.
The low pri pool was added as "mid-point" insertion. The purpose is to make sure occasional full scans (logical backups, shard movements, administrative scans, etc.) that touch a large range of otherwise cold blocks do not thrash all the hot blocks: the scanned blocks die in the low pri pool while blocks in the high pri pool remain in cache. With the bottom pri pool, does it mean those scans will thrash both the low pri pool and the bottom pool which contains blobs?
Our goal is to prioritize SST data blocks over blobs, since with BlobDB data blocks sort of function as an index. So when those scans happen, we would prefer to keep data blocks in the cache at the expense of blobs.
My test shows an unexplained small improvement (~2% ops/sec on a benchmark with 50% CPU in LRUCache) with the feature present but disabled, and ~9% overhead with the feature enabled. Assuming Mark's tests come out OK, that satisfies me on the performance front.
The answer to my question is yes, or no?
Sorry, I'm not sure I understand your question. If you have all three pools enabled (i.e. high ratio > 0, low ratio > 0), metablocks will go in the high-pri pool, data blocks will go in the low-pri pool, and blobs will go in the bottom-pri pool. So if we have a full scan, blobs will see more thrashing than data blocks, which in turn will see more thrashing than metablocks (which I think is what we want).
I have results from 3 benchmarks. I will summarize them here, then add 2 more comments with more detail per benchmark. Two binaries were compared; I assume these are close in the git log
The test server has 256G RAM, 40 cores, and 80 HW threads with hyperthreading enabled. The benchmarks are:
For my benchmark.sh
Everything was run 3 times and I focus on the median values for QPS and CPU/query. I expect by RX at 4 threads to be the most likely to show a regression. The summary is ... For by OS:
For by RX:
For viewstate:
More detail for by RX and by OS. Using tools/benchmark.sh with 40M KV pairs, 20-byte keys, 400-byte values and no compression. There were 2 configs:
For "by OS":
For "by RX":
For viewstate
The first time a data block is inserted, it goes to the low-pri pool, but if it gets hit again, it is promoted to the high-pri pool. On the other hand, what is evicted from the high-pri pool will be inserted into the low-pri pool. Keeping metablocks in the high-pri pool is just a paranoid optimization, rather than the main purpose of the two-pool approach.
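A self-contained toy model of the behavior described above (illustrative only; if I recall correctly the real LRUCacheShard keeps everything in a single intrusive list with pool-boundary pointers rather than three separate std::lists):

#include <list>

struct Entry { int key; };

struct ThreeSegmentLru {
  // Each segment is ordered LRU front -> MRU back.
  std::list<Entry> high, low, bottom;

  void InsertHigh(Entry e) { high.push_back(e); }      // metablocks
  void InsertLow(Entry e) { low.push_back(e); }        // data blocks ("midpoint")
  void InsertBottom(Entry e) { bottom.push_back(e); }  // blobs

  // A repeat hit on a data block promotes it into the high-pri segment.
  void PromoteOnHit(std::list<Entry>::iterator it) {
    high.splice(high.end(), low, it);
  }

  // Entries aging out of the high-pri segment drop back into the low-pri one.
  void DemoteOldestHigh() {
    if (!high.empty()) low.splice(low.end(), high, high.begin());
  }

  // Eviction starts at the bottom segment (blobs), then low (data blocks),
  // then high (metablocks) -- so scans thrash blobs before anything else.
  bool EvictOne() {
    for (auto* seg : {&bottom, &low, &high}) {
      if (!seg->empty()) { seg->pop_front(); return true; }
    }
    return false;
  }
};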
One data point regarding that: technically the root cause of the earlier regression was that when there was a low-pri pool configured but no high-pri one, metablocks ended up at the bottom (below data blocks) and got evicted first. This suggests to me that where we insert them in the LRU list does make a difference.
Did a test with BlobDB to quantify the effects of the bottommost cache priority. After populating a smallish DB (fits in the OS page cache) with 1KB-sized values, I repeated the tests three times. Here are the data block cache miss counters for the baseline (note: there were no cache misses for index blocks after the initial load with either configuration):
With the low-priority pool enabled:
So data blocks saw much less churn with the low-priority pool enabled, which is what we would want since we have to go through them to get to the blobs. This also translated to higher ops/sec: baseline was 617698/633135/635381 for the three runs vs 620729/636272/643619 with the feature enabled.
Good results! I assume total read I/O bandwidth also improved. Can you post the number here for a record?
Some more stats about overall cache misses, as discussed with @siying. Baseline:
With the feature enabled:
So there are fewer misses overall and also less I/O: there's a ~270,000 reduction in SST block misses in exchange for a ~10,000-20,000 increase in blob misses (with 1-KB blobs, which are thus smaller than SST blocks).
@ltamasi has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@gangliao has updated the pull request. You must reimport the pull request before landing.
Hi @gangliao! Thank you for your pull request. We require contributors to sign our Contributor License Agreement, and yours needs attention. You currently have a record in our system, but the CLA is no longer valid, and will need to be resubmitted.
Process: In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA. Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged accordingly. If you have received this in error or have any questions, please contact us at cla@fb.com. Thanks!
Summary: RocksDB's `Cache` abstraction currently supports two priority levels for items: high (used for frequently accessed/highly valuable SST metablocks like index/filter blocks) and low (used for SST data blocks). Blobs are typically lower-value targets for caching than data blocks, since 1) with BlobDB, data blocks containing blob references conceptually form an index structure which has to be consulted before we can read the blob value, and 2) cached blobs represent only a single key-value, while cached data blocks generally contain multiple KVs. Since we would like to make it possible to use the same backing cache for the block cache and the blob cache, it would make sense to add a new, lower-than-low cache priority level (bottom level) for blobs so data blocks are prioritized over them. This task is a part of facebook#10156
Pull Request resolved: facebook#10461
Reviewed By: siying
Differential Revision: D38672823
Pulled By: ltamasi
fbshipit-source-id: 90cf7362036563d79891f47be2cc24b827482743
* Add a blob-specific cache priority (facebook#10461)
Summary: RocksDB's `Cache` abstraction currently supports two priority levels for items: high (used for frequently accessed/highly valuable SST metablocks like index/filter blocks) and low (used for SST data blocks). Blobs are typically lower-value targets for caching than data blocks, since 1) with BlobDB, data blocks containing blob references conceptually form an index structure which has to be consulted before we can read the blob value, and 2) cached blobs represent only a single key-value, while cached data blocks generally contain multiple KVs. Since we would like to make it possible to use the same backing cache for the block cache and the blob cache, it would make sense to add a new, lower-than-low cache priority level (bottom level) for blobs so data blocks are prioritized over them. This task is a part of facebook#10156
Pull Request resolved: facebook#10461
Reviewed By: siying
Differential Revision: D38672823
Pulled By: ltamasi
fbshipit-source-id: 90cf7362036563d79891f47be2cc24b827482743
* make format
Signed-off-by: Connor1996 <zbk602423539@gmail.com>
* make format
Signed-off-by: Connor1996 <zbk602423539@gmail.com>
---------
Signed-off-by: Connor1996 <zbk602423539@gmail.com>
Co-authored-by: Gang Liao <gangliao@cs.umd.edu>