common/options: Update RocksDB CF Tuning #51821
Signed-off-by: Mark Nelson <mark.nelson@clyso.com>
Note: As mentioned above, we can decrease write-amp to be even closer to the quincy default (it was already slightly lower in the rate-limited tests but higher in the rate-unlimited tests) by reverting min_write_buffer_number_to_merge to its default, but we would likely lose some of the tail latency improvement (and potentially some of the 4K random write performance gains) by doing so. An alternative might be to enable lz4 compression in RocksDB by default. This mildly reduced rbd write-amplification but had a more significant impact on RGW metadata write-amp. David Orman's team is seeing very good results with lz4 compression on their cluster.
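For context, these knobs are passed to RocksDB as a comma-separated options string. A minimal Python sketch of what adding lz4 to such a string looks like — the option names are standard RocksDB options, but the values here are hypothetical illustrations, not the defaults this PR ships:

```python
# Sketch: building a RocksDB options string of the kind Ceph passes through
# its rocksdb options setting. Option names are real RocksDB options; the
# values are illustrative placeholders, not this PR's defaults.
opts = {
    "compression": "kLZ4Compression",        # lz4 instead of kNoCompression
    "max_write_buffer_number": 64,           # hypothetical value
    "min_write_buffer_number_to_merge": 6,   # hypothetical value
}
opts_str = ",".join(f"{key}={val}" for key, val in opts.items())
print(opts_str)
```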
Here are the results from my tests:
- 4 KiB block size, 160K IOPS, 80:20 random R/W, Qdepth=4
- 8 KiB block size, 160K IOPS, 80:20 random R/W, Qdepth=4
- 16 KiB block size, 160K IOPS, 80:20 random R/W, Qdepth=4
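As a quick sanity check on the workload above (assuming the 160K figure is total IOPS), the client-side write load implied by the 80:20 mix at the 4 KiB block size can be backed out directly:

```python
# Back-of-envelope client write bandwidth for the rate-limited 4 KiB test.
total_iops = 160_000
write_share = 20                 # percent writes in the 80:20 read:write mix
block_size = 4 * 1024            # 4 KiB in bytes
write_iops = total_iops * write_share // 100
write_mib_s = write_iops * block_size / (1024 * 1024)
print(write_iops, write_mib_s)   # 32000 write IOPS, 125.0 MiB/s
```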
In my environment, it looks like the changes benefit the long tail latencies for 4KiB and 8KiB IO sizes at the cost of a slight increase in CPU and disk IO; IMO, it's a definite improvement over the original changes. My only concern is what happens at 16KiB and above. The latency improvement at 16KiB is negligible, but the increase in physical disk IOPS is around 9% compared to Quincy and around 4% more than Reef running the original rocksdb defaults.
So we could be a little less aggressive on onode flushes by raising the default number of memtables that must fill before a flush (maybe 7 or 8 instead of 6). The trade-off would likely be higher tail latency in exchange for lower DB write-amp (bringing us a little closer to quincy, but hopefully still with some performance improvement). We won't be able to squeeze much more out of pgmeta/deferred, I'm afraid. In my tests I saw slightly lower write-amp vs quincy in the rate-limited tests but higher write-amp vs quincy in the rate-unlimited tests (rotating between 4MB, 128KB, and 4KB IOs). That would support the theory that we're still making a write-amp vs latency trade-off here; we've just mitigated most of it by specifically keeping pgmeta entries around as long as possible, which is closer to the quincy behavior. It's also possible that we've actually (very slightly) lowered DB write-amp for small IOs but increased it for larger ones. On the other hand, large writes should result in less DB write traffic relative to block write traffic...
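To make that knob concrete: the number of memtables that must fill before a flush (together with the memtable size) bounds how much data stays mutable in memory, which is where tombstones can still cancel entries before anything reaches L0. A rough sketch, using a hypothetical 64 MiB memtable size purely for illustration:

```python
# Data held in memtables before a flush, for different values of the
# "memtables to fill before flush" knob discussed above. The 64 MiB
# memtable size is a hypothetical figure, not the value this PR sets.
write_buffer_size_mib = 64
held = {n: n * write_buffer_size_mib for n in (6, 7, 8)}
for n, mib in held.items():
    print(f"{n} memtables -> {mib} MiB held before flush")
```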
Perhaps the other way to look at the increase in IOPS is that disk devices are consumables - we expect them to wear and we expect them to fail. Looking at where the read/write bias is, it's mostly reads: reads are up 17% and writes are up 4%. Given the obvious latency benefit to 4K and 8K workloads, and the fact that this PR brings CPU consumption closer to Quincy, I think this is a step forward for block. @markhpc have you tested for impact to object workloads?
@pcuzner yeah, I did some tests with slightly less aggressive onode flushing behavior and it did appear to lower write-amp a little. Probably not enough to change things at this point, though, since there's also likely a slight latency hit (the tests were too noisy to tell for sure). Let's do some quick object sanity checks and, if things look good, get this change in before we release reef to the community.
|
Very interesting results when testing 4K object PUT/LIST/GET/DELETE operations in RGW:
The new tuning generally appears to be a win for RGW. The number of compactions is lower, the total compaction duration is lower, and the write-amp appears to be lower. Performance-wise it was pretty close, with some variations here and there. Deletes were maybe a little faster and listing was maybe a little slower. I suspect that on this setup RGW is still the major bottleneck. Personally I think we should go ahead and merge this to get more user feedback at this point.
|
Thank you @ljflores!
In #47221 we updated the bluestore RocksDB tunings to reduce overhead in the bluestore kv sync thread and improve small random write performance. In the original testing we saw a slight write-amp increase, but it was far lower than in previous testing of similar options:
https://ceph.io/en/news/blog/2022/rocksdb-tuning-deep-dive/
In recent rate-limited testing at IBM, Paul Cuzner observed higher CPU overhead in RocksDB get requests via extentmap::fault_range, along with higher IOPS amplification on the underlying devices. Last week I was able to replicate those findings by running similar tests on the upstream mako cluster. Reverting to the quincy rocksdb defaults reduced the disk write-amplification, but also lowered 4K IOPS (in rate-unlimited tests), increased latency, and especially increased 99% tail latency.

This is the historic problem we've faced with RocksDB. On one hand, we want to hold onto pgmeta data long enough that tombstones prevent pglog entries from entering the database. On the other hand, we don't want to hold onto any data longer than we have to, because this increases the work that RocksDB has to do in the bstore_kv_sync thread to keep records in sorted order. Sadly, the new tuning in #47221 didn't reduce leakage of pgmeta data into the database when using small memtables as much as I initially thought.
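For clarity, the disk write-amplification compared throughout these tests is just the ratio of bytes reaching the device to bytes the client wrote. A trivial helper (the counter names are placeholders, not Ceph perf-counter names):

```python
def write_amp(device_bytes_written: float, client_bytes_written: float) -> float:
    """Bytes hitting the device per byte the client wrote."""
    return device_bytes_written / client_bytes_written

# Example: 2.5 GiB reached the device for 1 GiB of client writes.
amp = write_amp(2.5 * 2**30, 1.0 * 2**30)
print(amp)  # 2.5
```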
The good news is that back in 2021 we laid the groundwork in PR #38855 to separate pgmeta data into its own column family, allowing us to tune the pgmeta and deferred column families differently than the other CFs (mainly onode data). In an attempt to retain the benefits of the new reef rocksdb tunings along with the write-amplification behavior of the old settings, we specifically tune the pgmeta and deferred column families to retain a large amount of data in memtables before flushing, while doubling down on quickly flushing the other column families (mainly onode) to help reduce the amount of work being done in the bstore_kv_sync thread.
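The per-column-family idea can be sketched as a base option set with overrides for pgmeta and deferred. The descriptive CF names and all values below are placeholders for illustration, not the actual CF identifiers or values this PR sets:

```python
# Sketch of per-CF tuning: flush most CFs (onode etc.) aggressively, but let
# pgmeta and deferred accumulate many memtables so tombstones keep pglog
# entries out of the DB. All names and values are illustrative placeholders.
base = {"min_write_buffer_number_to_merge": 1}       # flush quickly
overrides = {
    "pgmeta":   {"min_write_buffer_number_to_merge": 8},
    "deferred": {"min_write_buffer_number_to_merge": 8},
}

def options_for(cf: str) -> dict:
    """Merge the base options with any per-CF override."""
    merged = dict(base)
    merged.update(overrides.get(cf, {}))
    return merged

print(options_for("pgmeta"), options_for("onode"))
```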
The RocksDB behavior overall looks much closer to Quincy:
rate-limited tests mimicking Paul Cuzner's testing:
And in full-speed tests:
At the same time, the new tuning shows the same 4K random write performance improvement as the existing reef tuning, along with better latency and significantly better tail latencies in the rate-limited tests (though not as good as the original reef tuning). While performance in the rate-unlimited 4K random write tests showed the same improvement vs Quincy as the original tuning, tail latencies were only slightly better than the Quincy tuning; the original reef tuning is still superior in this regard.
Overall, this PR is expected to reduce RocksDB write-amplification back toward Quincy levels while retaining most of the benefits of the original Reef tunings.
Overview of the specific changes:
Detailed performance data for Quincy, the original Reef tuning, and this change is available:
Tests to mimic Paul Cuzner's rate-limited mixed-workload tests:
https://docs.google.com/spreadsheets/d/1HFFn0dAqLazLe8AYIUum5r8zHDBFzxEpZRu8OveT9K4/edit?usp=sharing
Rate-Unlimited Tests:
https://docs.google.com/spreadsheets/d/1srz_vY0_cllgRWxQb1UyzdMw2F1iaIfD-AT1E9Paqpo/edit?usp=sharing