
common/options: Update RocksDB CF Tuning #51821

Merged: yuriw merged 1 commit into ceph:main from markhpc:wip-bs-rocksdb-cf-tuning on Jun 2, 2023

Conversation

@markhpc markhpc commented May 30, 2023

In #47221 we updated the bluestore RocksDB tunings to reduce overhead in the bluestore kv sync thread and improve small random write performance. In the original testing we saw a slight write-amp increase, but it was far lower than in previous testing of similar options:

https://ceph.io/en/news/blog/2022/rocksdb-tuning-deep-dive/

In recent rate-limited testing at IBM, Paul Cuzner observed higher CPU overhead in RocksDB get requests via extentmap::fault_range, along with higher IOPS amplification on the underlying devices. Last week I was able to replicate those findings by running similar tests on the upstream mako cluster. Reverting to the quincy rocksdb defaults reduced the disk write-amplification, but also lowered 4K IOPS (in rate-unlimited tests), increased latency, and especially increased 99% tail latency.

This is the historic problem we've faced with RocksDB. On one hand, we want to hold onto pgmeta data long enough that tombstones prevent pglog entries from entering the database. On the other hand, we don't want to hold onto any data longer than we have to, because doing so increases the work RocksDB must do in the bstore_kv_sync thread to keep records in sorted order. Sadly, the new tuning in #47221 didn't reduce leakage of pgmeta data into the database when using small memtables as much as I initially thought.

The good news is that back in 2021 we laid the groundwork in PR #38855 to separate pgmeta data into its own column family, allowing us to tune the pgmeta and deferred column families differently than the other CFs (mainly onode data). In an attempt to retain the benefits of the new reef rocksdb tunings while getting the write-amplification behavior of the old settings, we specifically tune the pgmeta and deferred column families to retain a large amount of data in memtables before flushing, while doubling down on quickly flushing the other column families (mainly onode) to reduce the amount of work done in the bstore_kv_sync thread.

The RocksDB behavior overall looks much closer to Quincy:

rate-limited tests mimicking Paul Cuzner's testing:

| Compaction Statistics | v17.2.6 (librbd) | v17.2.6 (krbd) | v17.2.6 (kcephfs) | reef (librbd) | reef (krbd) | reef (kcephfs) | reef new (librbd) | reef new (krbd) | reef new (kcephfs) |
|---|---|---|---|---|---|---|---|---|---|
| Total OSD Log Duration (seconds) | 3174.725 | 3660.688 | 3352.91 | 3187.756 | 3753.734 | 3360.273 | 3174.826 | 3676.494 | 3331.977 |
| Number of Compaction Events | 79 | 72 | 62 | 108 | 103 | 93 | 58 | 51 | 42 |
| Avg Compaction Time (seconds) | 1.2 | 1.5 | 1.5 | 1.9 | 2.1 | 2.1 | 1.6 | 1.9 | 2.1 |
| Total Compaction Time (seconds) | 91.6 | 104.6 | 92.7 | 207.6 | 218.7 | 194.5 | 95.0 | 99.2 | 88.4 |
| Avg Output Size (MB) | 143.3 | 171.6 | 172.2 | 258.5 | 284.1 | 282.0 | 192.6 | 240.1 | 240.3 |
| Total Output Size (MB) | 11316.9 | 12352.8 | 10678.4 | 27918.6 | 29263.2 | 26229.4 | 11168.5 | 12244.9 | 10092.4 |
| Total Input Records | 172517209 | 183830873 | 172921746 | 341281440 | 354601953 | 330362204 | 151630197 | 159038412 | 143201639 |
| Total Output Records | 106776939 | 114160771 | 107240383 | 266427868 | 278340843 | 258230363 | 83881563 | 86861138 | 76251202 |
| Avg Output Throughput (MB/s) | 115.2 | 133.2 | 128.3 | 125.1 | 134.5 | 135.5 | 100.0 | 128.4 | 118.1 |
| Avg Input Records/second | 1551454.4 | 1759918.6 | 1915039.9 | 1496781.8 | 1605701.4 | 1682837.4 | 1182177.6 | 1396317.2 | 1466980.0 |
| Avg Output Records/second | 802347.3 | 906671.3 | 990708.0 | 1151879.9 | 1255532.3 | 1311169.5 | 620226.0 | 719748.2 | 748543.6 |
| Avg Output/Input Ratio | 0.51 | 0.50 | 0.51 | 0.69 | 0.72 | 0.73 | 0.44 | 0.45 | 0.45 |

And in full-speed tests:

| Compaction Statistics | Reef - New Tuning | Reef - Original Tuning | v17.2.6 - Stock |
|---|---|---|---|
| Total OSD Log Duration (seconds) | 11478.59 | 11520.46 | 11490.44 |
| Number of Compaction Events | 200 | 412 | 254 |
| Avg Compaction Time (seconds) | 2.47 | 2.81 | 1.96 |
| Total Compaction Time (seconds) | 494.62 | 1158.81 | 499.10 |
| Avg Output Size (MB) | 205.04 | 268.45 | 129.98 |
| Total Output Size (MB) | 41007.52 | 110602.42 | 33015.77 |
| Total Input Records | 615400246 | 1381251268 | 646524292 |
| Total Output Records | 348137162 | 1103275376 | 413326509 |
| Avg Output Throughput (MB/s) | 85.78 | 94.76 | 79.52 |
| Avg Input Records/second | 1100049.03 | 1156592.64 | 1397685.03 |
| Avg Output Records/second | 556637.88 | 884706.70 | 710541.65 |
| Avg Output/Input Ratio | 0.44 | 0.72 | 0.51 |
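A quick arithmetic check of the write-amplification claim, using the total compaction output sizes from the full-speed table above: the new tuning produces roughly 37% of the original Reef tuning's compaction output (a ~63% reduction) while writing about 24% more than stock Quincy.

```python
# Total compaction output (MB) from the full-speed table above.
reef_new, reef_orig, quincy_stock = 41007.52, 110602.42, 33015.77

# New tuning vs. original Reef tuning: ~0.37x the compaction output.
vs_reef_orig = reef_new / reef_orig
# New tuning vs. stock Quincy: ~1.24x, i.e. within ~24% of Quincy levels.
vs_quincy = reef_new / quincy_stock

print(f"vs original Reef tuning: {vs_reef_orig:.2f}x")
print(f"vs Quincy stock:         {vs_quincy:.2f}x")
```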

However, the new tuning shows the same 4K random write performance improvement as the existing reef tuning, along with better latency and significantly better tail latencies in the rate-limited tests (though not as good as with the original reef tuning). While performance in the rate-unlimited 4K random write tests showed the same improvement vs Quincy as the original tuning, tail latencies were only slightly better than with the Quincy tuning; the original reef tuning is still superior in this regard.

Overall this PR is expected to reduce write-amplification in RocksDB to being closer to Quincy levels while retaining most of the benefits of the original Reef tunings.

Overview of the specific changes:

  1. Remove ttl=21600. This was added in reef to help combat the tombstone iteration issues we've had, but tends to introduce write-amplification over time. With the introduction of the compact-on-iteration feature in #47221, this no longer appears to be needed based on community testing and feedback. We'll no longer introduce this as a default behavior in Reef.
  2. Switch from a maximum of 128 8MB buffers to 64 16MB buffers. This produced a very slight latency reduction in the rate-limited tests while still allowing fine tuning of the write-amplification (such as increasing the number of memtables to fill before flushing). Jumping to 32 32MB buffers slightly hurt the rate-unlimited tests, so we stick with 16MB buffers.
  3. Switch the default min_write_buffer_number_to_merge from 16 8MB buffers to 6 16MB buffers. This primarily affects the onode column families. This is set even lower than the Reef defaults to help offset the change we are making for the pgmeta and deferred write column families. If we want to lower write-amplification (potentially at the expense of higher CPU in the bstore_kv_sync thread, higher tail latency, and lower 4K randwrite IOPS), we can increase this to 7 or 8 buffers.
  4. For the pgmeta (P) and deferred (L) column families, set min_write_buffer_number_to_merge to 32 16MB buffers. This allows more data to accumulate in the memtables, leaking fewer of these kinds of writes into the database, since both are generally short lived.
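As a rough sketch of what per-CF overrides like these can look like in ceph.conf form, using the `bluestore_rocksdb_cfs` per-CF override syntax introduced in #38855. The values below are illustrative only, not the literal option strings merged by this PR:

```ini
[osd]
# Base RocksDB options: 16MB write buffers, up to 64 of them,
# merge 6 before flushing (applies to onode and other default CFs).
bluestore_rocksdb_options = write_buffer_size=16777216,max_write_buffer_number=64,min_write_buffer_number_to_merge=6

# Per-CF overrides: the pgmeta (P) and deferred (L) column families
# accumulate 32 memtables before flushing, so short-lived entries
# (pglog, deferred writes) can die in memory instead of leaking to L0.
bluestore_rocksdb_cfs = m(3) p(3,0-12) O(3,0-13) L=min_write_buffer_number_to_merge=32 P=min_write_buffer_number_to_merge=32
```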

Detailed performance data for Quincy, the original Reef tuning, and this change is available:

Tests to mimic Paul Cuzner's rate-limited mixed-workload tests:
https://docs.google.com/spreadsheets/d/1HFFn0dAqLazLe8AYIUum5r8zHDBFzxEpZRu8OveT9K4/edit?usp=sharing

Rate-Unlimited Tests:
https://docs.google.com/spreadsheets/d/1srz_vY0_cllgRWxQb1UyzdMw2F1iaIfD-AT1E9Paqpo/edit?usp=sharing

Signed-off-by: Mark Nelson <mark.nelson@clyso.com>
markhpc commented May 30, 2023

Note: As mentioned above, we can decrease write-amp to be even closer to the quincy default (it was already slightly lower in the rate-limited tests but higher in the rate-unlimited tests) by increasing the default min_write_buffer_number_to_merge, but we will likely lose some of the tail latency improvement (and potentially some of the 4K random write performance gains) by doing so. An alternative might be to enable lz4 compression in RocksDB by default. This gave a mild reduction in rbd write-amplification but a more significant reduction in RGW metadata write-amp. David Orman's team is seeing very good results with lz4 compression on their cluster.
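If we did go the compression route, a sketch of what it might look like, assuming `bluestore_rocksdb_options_annex` (which appends to the base RocksDB option string) and RocksDB's standard `compression=kLZ4Compression` option-string value. Whether this becomes a default is not decided in this thread:

```ini
[osd]
# Append lz4 compression to the existing RocksDB options without
# replacing the rest of the tuning string (illustrative, not merged here).
bluestore_rocksdb_options_annex = compression=kLZ4Compression
```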

@markhpc markhpc requested review from benhanokh, cfsnyder and ifed01 May 30, 2023 01:21
pcuzner commented May 31, 2023

Here are the results from my tests.

4 KiB block size, 160K IOPS, 80:20 random R/W, Qdepth=4

| Ceph Version | IOPS Achieved | IOPS Goal | CPU | Physical Read | Physical Write | Total Physical | Client IOPS/Core | 99%ile Read | 99%ile Write | Sample Period |
|---|---|---|---|---|---|---|---|---|---|---|
| Pacific 16.2.13 | 160,055 | 100.03% | 73.8 | 127,837 | 278,519 | 406,356 | 2,168 | 0.98 | 3.96 | 05/15 16:30-17:30 |
| Quincy 17.2.6 | 160,025 | 100.02% | 71 | 127,806 | 280,338 | 408,144 | 2,254 | 1.02 | 1.71 | 05/19 15:55-16:55 |
| Reef (default) | 160,012 | 100.01% | 77 | 127,832 | 281,662 | 409,494 | 2,077 | 1.2 | 1.72 | 05/05 20:25-21:25 |
| Reef (rocksdb=quincy) | 159,997 | 100.00% | 70.9 | 127,809 | 280,633 | 408,442 | 2,256 | 0.711 | 1.42 | 05/17 18:20-19:20 |
| Reef (PR #51821) | 159,945 | 99.97% | 73.2 | 127,836 | 282,099 | 409,935 | 2,184 | 0.562 | 1.06 | 05/30 22:46-23:45 |

8 KiB block size, 160K IOPS, 80:20 random R/W, Qdepth=4

| Ceph Version | IOPS Achieved | IOPS Goal | CPU | Physical Read | Physical Write | Total Physical | Client IOPS/Core | 99%ile Read | 99%ile Write | Sample Period |
|---|---|---|---|---|---|---|---|---|---|---|
| Pacific 16.2.13 | 160,237 | 100.15% | 77 | 195,362 | 330,450 | 525,812 | 2,082 | 2.6 | 7.35 | 05/15 20:15-21:15 |
| Quincy 17.2.6 | 160,219 | 100.14% | 74.2 | 190,478 | 339,711 | 530,189 | 2,159 | 1.9 | 2.83 | 05/19 19:10-20:10 |
| Reef (default) | 160,154 | 100.10% | 80.5 | 238,899 | 351,277 | 590,176 | 1,989 | 1.93 | 2.74 | 05/07 12:00-13:00 |
| Reef (rocksdb=quincy) | 160,146 | 100.09% | 73.9 | 189,042 | 338,514 | 527,555 | 2,168 | 2.46 | 2.96 | 05/17 21:40-22:40 |
| Reef (PR #51821) | 160,230 | 100.14% | 76 | 224,444 | 349,880 | 574,324 | 2,108 | 0.818 | 1.33 | 05/31 10:15-11:15 |

16 KiB block size, 160K IOPS, 80:20 random R/W, Qdepth=4

| Ceph Version | IOPS Achieved | IOPS Goal | CPU | Physical Read | Physical Write | Total Physical | Client IOPS/Core | 99%ile Read | 99%ile Write | Sample Period |
|---|---|---|---|---|---|---|---|---|---|---|
| Pacific 16.2.13 | 117,685 | 73.55% | 72.8 | 239,795 | 423,141 | 662,936 | 1,616 | 5.1 | 8.33 | 05/16 12:25-13:25 |
| Quincy 17.2.6 | 119,449 | 74.66% | 70 | 214,448 | 414,531 | 628,979 | 1,707 | 6.06 | 6.42 | 05/19 22:30-23:30 |
| Reef (default) | 120,815 | 75.51% | 75 | 260,172 | 426,471 | 686,643 | 1,611 | 5.57 | 5.92 | 05/08 13:50-14:50 |
| Reef (rocksdb=quincy) | 120,329 | 75.21% | 70.1 | 241,161 | 418,373 | 659,534 | 1,717 | 5.65 | 6 | 05/18 12:00-13:00 |
| Reef (PR #51821) | 118,134 | 73.83% | 71.3 | 251,384 | 431,695 | 683,079 | 1,658 | 5.59 | 5.86 | 05/31 13:25-14:25 |

In my environment, it looks like the changes benefit the long tail latencies for 4KiB and 8KiB IO sizes at the cost of a slight increase in CPU and disk IO - IMO, it's a definite improvement over the original changes. My only concern is what happens at 16KiB and above. The latency improvement at 16KiB is negligible, but the increase in physical disk IOPS is around 9% compared to Quincy and around 4% more than Reef running the original rocksdb defaults.
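Sanity-checking those percentages against the total physical IOPS column of the 16 KiB table above:

```python
# Total physical IOPS at 16 KiB from the table above.
quincy         = 628_979  # Quincy 17.2.6
reef_rocksdb_q = 659_534  # Reef with quincy rocksdb settings
reef_pr        = 683_079  # Reef with this PR

pct_vs_quincy = (reef_pr / quincy - 1) * 100          # ~8.6%, i.e. "around 9%"
pct_vs_reef_q = (reef_pr / reef_rocksdb_q - 1) * 100  # ~3.6%, i.e. "around 4%"

print(f"vs Quincy:                 +{pct_vs_quincy:.1f}%")
print(f"vs Reef (rocksdb=quincy):  +{pct_vs_reef_q:.1f}%")
```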

markhpc commented May 31, 2023

So we could be a little less aggressive on onode flushes by increasing the default number of memtables to fill before flushing (maybe 7 or 8 instead of 6). The trade-off would likely be higher tail latency in exchange for lower DB write-amp (bringing us a little closer to quincy, but hopefully still with some performance improvement). We won't be able to squeeze much more out of pgmeta/deferred, I'm afraid. In my tests I saw slightly lower write-amp vs quincy in the rate-limited tests but higher write-amp vs quincy in the rate-unlimited tests (rotating between 4MB, 128KB, and 4KB IOs). That supports the theory that we're still making a write-amp vs latency trade-off here; we've just mitigated most of it by keeping pgmeta entries around as long as possible, which is closer to the quincy behavior. It's also possible that we've actually (very slightly) lowered DB write-amp for small IOs but increased it for larger ones. On the other hand, large writes should result in less DB write traffic relative to block write traffic...
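A back-of-envelope view of the flush thresholds being traded off here. The function name below is illustrative, not a Ceph option; the buffer sizes and merge counts come from the tuning described in this PR:

```python
MIB = 1024 * 1024
WRITE_BUFFER_SIZE = 16 * MIB  # per-memtable size under the new tuning

def flush_threshold(min_write_buffer_number_to_merge: int) -> int:
    """Approximate bytes a column family accumulates across immutable
    memtables before RocksDB merges and flushes them to L0."""
    return WRITE_BUFFER_SIZE * min_write_buffer_number_to_merge

onode_cf = flush_threshold(6)    # default CFs (mainly onode): 96 MiB
pgmeta_cf = flush_threshold(32)  # P/L CFs: 512 MiB, so short-lived entries
                                 # can die in memtables instead of leaking to L0
less_aggressive = flush_threshold(8)  # the "7 or 8" variant discussed: 128 MiB
```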

pcuzner commented May 31, 2023

Perhaps the other way to look at the increase in IOPS is that disk devices are consumables - we expect them to wear and we expect them to fail. Looking at where the read/write bias is, it's mostly read: reads are up 17% and writes are up 4%. Given the obvious latency benefit to 4K and 8K workloads, and the fact that this PR brings CPU consumption closer to Quincy, I think for block this is a step forward. @markhpc have you tested for impact to object workloads?

markhpc commented May 31, 2023

@pcuzner yeah, I did some tests with slightly less aggressive onode flushing behavior and it did appear to lower write amp a little. Probably not enough to change things at this point though since there's also likely a slight latency hit (Tests were too noisy to tell for sure). Let's do some quick object sanity checks and if things look good, get this change in before we give reef out to the community.

markhpc commented Jun 2, 2023

Very interesting results when testing 4K object PUTS/LIST/GETS/DELS in RGW:

| Compaction Statistics | v17.2.6 | reef-9d5a260e | reef-9d5a260e-newtuning |
|---|---|---|---|
| Total OSD Log Duration (seconds) | 9857.44 | 9300.18 | 9166.42 |
| Number of Compaction Events | 2808 | 2160 | 1724 |
| Avg Compaction Time (seconds) | 1.88 | 2.76 | 2.72 |
| Total Compaction Time (seconds) | 5270.85 | 5969.43 | 4686.30 |
| Avg Output Size (MB) | 290.80 | 340.60 | 370.06 |
| Total Output Size (MB) | 816566.94 | 735698.55 | 637987.92 |
| Total Input Records | 3230220383 | 4165810416 | 2676878737 |
| Total Output Records | 2308556504 | 3139196730 | 1788896685 |
| Avg Output Throughput (MB/s) | 161.34 | 125.24 | 140.55 |
| Avg Input Records/second | 645043.65 | 734816.94 | 601287.37 |
| Avg Output Records/second | 456709.21 | 571266.58 | 427716.74 |
| Avg Output/Input Ratio | 0.83 | 0.82 | 0.81 |

The new tuning generally appears to be a win for RGW. The number of compactions is lower, the total compaction duration is lower, and the write amp appears to be lower. Performance wise it was pretty close with some variations here and there. Deletes were maybe a little faster and listing was maybe a little slower. I suspect on this setup that RGW is still the major bottleneck.

Personally I think we should go ahead and merge this to get more user feedback at this point.

ljflores commented Jun 2, 2023

@yuriw yuriw merged commit a728862 into ceph:main Jun 2, 2023
markhpc commented Jun 2, 2023

Thank you @ljflores!
