
common/options: Update RocksDB CF Tuning #51821

Merged: yuriw merged 1 commit into ceph:main from markhpc:wip-bs-rocksdb-cf-tuning on Jun 2, 2023

Conversation

@markhpc markhpc commented May 30, 2023

In #47221 we updated the bluestore RocksDB tunings to reduce overhead in the bluestore kv sync thread and improve small random write performance. In the original testing we saw a slight write-amp increase, but it was far lower than in previous testing of similar options:

https://ceph.io/en/news/blog/2022/rocksdb-tuning-deep-dive/

In recent rate-limited testing at IBM, Paul Cuzner observed higher CPU overhead in RocksDB get requests via extentmap::fault_range, along with higher IOPS amplification on the underlying devices. Last week I was able to replicate those findings by running similar tests on the upstream mako cluster. Reverting to the quincy rocksdb defaults reduced the disk write-amplification, but also lowered 4K IOPS (in rate-unlimited tests), increased latency, and especially increased 99% tail latency.

This is the historic problem we've faced with RocksDB. On one hand, we want to hold onto pgmeta data long enough that tombstones prevent pglog entries from entering the database. On the other hand, we don't want to hold onto any data longer than we have to, because doing so increases the work RocksDB must do in the bstore_kv_sync thread to keep records in sorted order. Sadly, the new tuning in #47221 didn't reduce leakage of pgmeta data into the database when using small memtables as much as I initially thought.

The good news is that back in 2021 we laid the groundwork in PR #38855 to separate pgmeta data into its own column family, allowing us to tune the pgmeta and deferred column families differently than the other CFs (mainly onode data). In an attempt to retain the benefits of the new reef rocksdb tunings while getting the write-amplification behavior of the old settings, we specifically tune the pgmeta and deferred column families to retain a large amount of data in memtables before flushing, while doubling down on quickly flushing the other column families (mainly onode) to reduce the amount of work done in the bstore_kv_sync thread.

The RocksDB behavior overall looks much closer to Quincy:

rate-limited tests mimicking Paul Cuzner's testing:

| Compaction Statistics | v17.2.6 (librbd) | v17.2.6 (krbd) | v17.2.6 (kcephfs) | reef (librbd) | reef (krbd) | reef (kcephfs) | reef new (librbd) | reef new (krbd) | reef new (kcephfs) |
|---|---|---|---|---|---|---|---|---|---|
| Total OSD Log Duration (seconds) | 3174.725 | 3660.688 | 3352.91 | 3187.756 | 3753.734 | 3360.273 | 3174.826 | 3676.494 | 3331.977 |
| Number of Compaction Events | 79 | 72 | 62 | 108 | 103 | 93 | 58 | 51 | 42 |
| Avg Compaction Time (seconds) | 1.2 | 1.5 | 1.5 | 1.9 | 2.1 | 2.1 | 1.6 | 1.9 | 2.1 |
| Total Compaction Time (seconds) | 91.6 | 104.6 | 92.7 | 207.6 | 218.7 | 194.5 | 95.0 | 99.2 | 88.4 |
| Avg Output Size (MB) | 143.3 | 171.6 | 172.2 | 258.5 | 284.1 | 282.0 | 192.6 | 240.1 | 240.3 |
| Total Output Size (MB) | 11316.9 | 12352.8 | 10678.4 | 27918.6 | 29263.2 | 26229.4 | 11168.5 | 12244.9 | 10092.4 |
| Total Input Records | 172517209 | 183830873 | 172921746 | 341281440 | 354601953 | 330362204 | 151630197 | 159038412 | 143201639 |
| Total Output Records | 106776939 | 114160771 | 107240383 | 266427868 | 278340843 | 258230363 | 83881563 | 86861138 | 76251202 |
| Avg Output Throughput (MB/s) | 115.2 | 133.2 | 128.3 | 125.1 | 134.5 | 135.5 | 100.0 | 128.4 | 118.1 |
| Avg Input Records/second | 1551454.4 | 1759918.6 | 1915039.9 | 1496781.8 | 1605701.4 | 1682837.4 | 1182177.6 | 1396317.2 | 1466980.0 |
| Avg Output Records/second | 802347.3 | 906671.3 | 990708.0 | 1151879.9 | 1255532.3 | 1311169.5 | 620226.0 | 719748.2 | 748543.6 |
| Avg Output/Input Ratio | 0.51 | 0.50 | 0.51 | 0.69 | 0.72 | 0.73 | 0.44 | 0.45 | 0.45 |

And in full-speed tests:

| Compaction Statistics | Reef - New Tuning | Reef - Original Tuning | v17.2.6 - Stock |
|---|---|---|---|
| Total OSD Log Duration (seconds) | 11478.59 | 11520.46 | 11490.44 |
| Number of Compaction Events | 200 | 412 | 254 |
| Avg Compaction Time (seconds) | 2.47 | 2.81 | 1.96 |
| Total Compaction Time (seconds) | 494.62 | 1158.81 | 499.10 |
| Avg Output Size (MB) | 205.04 | 268.45 | 129.98 |
| Total Output Size (MB) | 41007.52 | 110602.42 | 33015.77 |
| Total Input Records | 615400246 | 1381251268 | 646524292 |
| Total Output Records | 348137162 | 1103275376 | 413326509 |
| Avg Output Throughput (MB/s) | 85.78 | 94.76 | 79.52 |
| Avg Input Records/second | 1100049.03 | 1156592.64 | 1397685.03 |
| Avg Output Records/second | 556637.88 | 884706.70 | 710541.65 |
| Avg Output/Input Ratio | 0.44 | 0.72 | 0.51 |
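A quick arithmetic check of the write-amplification claim, using the total compaction output sizes from the full-speed table above: the new tuning produces roughly 37% of the original Reef tuning's compaction output (a ~63% reduction) while writing about 24% more than stock Quincy.

```python
# Total compaction output (MB) from the full-speed table above.
reef_new, reef_orig, quincy_stock = 41007.52, 110602.42, 33015.77

# New tuning vs. original Reef tuning: ~0.37x the compaction output.
vs_reef_orig = reef_new / reef_orig
# New tuning vs. stock Quincy: ~1.24x, i.e. within ~24% of Quincy levels.
vs_quincy = reef_new / quincy_stock

print(f"vs original Reef tuning: {vs_reef_orig:.2f}x")
print(f"vs Quincy stock:         {vs_quincy:.2f}x")
```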

However, the new tuning shows the same 4K random write performance improvement as the existing reef tuning, along with better latency and significantly better tail latencies in the rate-limited tests (though not as good as with the original reef tuning). While performance in the rate-unlimited 4K random write tests showed the same improvement vs Quincy as the original tuning, tail latencies were only slightly better than with the Quincy tuning; the original reef tuning is still superior in this regard.

Overall this PR is expected to reduce write-amplification in RocksDB to being closer to Quincy levels while retaining most of the benefits of the original Reef tunings.

Overview of the specific changes:

  1. Remove ttl=21600. This was added in reef to help combat the tombstone iteration issues we've had, but tends to introduce write-amplification over time. With the introduction of the compact-on-iteration feature in #47221, this no longer appears to be needed based on community testing and feedback. We'll no longer introduce this as a default behavior in Reef.
  2. Switch from a maximum of 128 8MB buffers to 64 16MB buffers. This produced a very slight latency reduction in the rate-limited tests while still allowing fine tuning of the write-amplification (such as increasing the number of memtables to fill before flushing). Jumping to 32 32MB buffers slightly hurt the rate-unlimited tests, so we stick with 16MB buffers.
  3. Switch the default min_write_buffer_number_to_merge from 16 8MB buffers to 6 16MB buffers. This primarily affects the onode column families. This is set even lower than the Reef defaults to help offset the change we are making for the pgmeta and deferred write column families. If we want to lower write-amplification (potentially at the expense of higher CPU in the bstore_kv_sync thread, higher tail latency, and lower 4K randwrite IOPS), we can increase this to 7 or 8 buffers.
  4. For the pgmeta (P) and deferred (L) column families, set min_write_buffer_number_to_merge to 32 16MB buffers. This allows more data to accumulate in the memtables, leaking fewer of these kinds of writes into the database, since both are generally short lived.
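As a rough sketch of what per-CF overrides like these can look like in ceph.conf form, using the `bluestore_rocksdb_cfs` per-CF override syntax introduced in #38855. The values below are illustrative only, not the literal option strings merged by this PR:

```ini
[osd]
# Base RocksDB options: 16MB write buffers, up to 64 of them,
# merge 6 before flushing (applies to onode and other default CFs).
bluestore_rocksdb_options = write_buffer_size=16777216,max_write_buffer_number=64,min_write_buffer_number_to_merge=6

# Per-CF overrides: the pgmeta (P) and deferred (L) column families
# accumulate 32 memtables before flushing, so short-lived entries
# (pglog, deferred writes) can die in memory instead of leaking to L0.
bluestore_rocksdb_cfs = m(3) p(3,0-12) O(3,0-13) L=min_write_buffer_number_to_merge=32 P=min_write_buffer_number_to_merge=32
```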

Detailed performance data for Quincy, the original Reef tuning, and this change is available:

Tests to mimic Paul Cuzner's rate-limited mixed-workload tests:
https://docs.google.com/spreadsheets/d/1HFFn0dAqLazLe8AYIUum5r8zHDBFzxEpZRu8OveT9K4/edit?usp=sharing

Rate-Unlimited Tests:
https://docs.google.com/spreadsheets/d/1srz_vY0_cllgRWxQb1UyzdMw2F1iaIfD-AT1E9Paqpo/edit?usp=sharing

Signed-off-by: Mark Nelson <mark.nelson@clyso.com>
markhpc commented May 30, 2023

Note: As mentioned above, we can decrease write-amp to be even closer to the quincy default (it was already slightly lower in the rate-limited tests but higher in the rate-unlimited tests) by increasing the default min_write_buffer_number_to_merge, but we will likely lose some of the tail latency improvement (and potentially some of the 4K random write performance gains) by doing so. An alternative might be to enable lz4 compression in RocksDB by default. This gave a mild reduction in rbd write-amplification but a more significant reduction in RGW metadata write-amp. David Orman's team is seeing very good results with lz4 compression on their cluster.
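If we did go the compression route, a sketch of what it might look like, assuming `bluestore_rocksdb_options_annex` (which appends to the base RocksDB option string) and RocksDB's standard `compression=kLZ4Compression` option-string value. Whether this becomes a default is not decided in this thread:

```ini
[osd]
# Append lz4 compression to the existing RocksDB options without
# replacing the rest of the tuning string (illustrative, not merged here).
bluestore_rocksdb_options_annex = compression=kLZ4Compression
```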

@markhpc markhpc requested review from benhanokh, cfsnyder and ifed01 May 30, 2023 01:21
pcuzner commented May 31, 2023

Here are the results from my tests.

4 KiB block size, 160K IOPS, 80:20 random R/W, Qdepth=4

| Ceph Version | IOPS Achieved | IOPS Goal | CPU | Physical Read | Physical Write | Total Physical | Client IOPS/Core | 99%ile Read | 99%ile Write | Sample Period |
|---|---|---|---|---|---|---|---|---|---|---|
| Pacific 16.2.13 | 160,055 | 100.03% | 73.8 | 127,837 | 278,519 | 406,356 | 2,168 | 0.98 | 3.96 | 05/15 16:30-17:30 |
| Quincy 17.2.6 | 160,025 | 100.02% | 71 | 127,806 | 280,338 | 408,144 | 2,254 | 1.02 | 1.71 | 05/19 15:55-16:55 |
| Reef (default) | 160,012 | 100.01% | 77 | 127,832 | 281,662 | 409,494 | 2,077 | 1.2 | 1.72 | 05/05 20:25-21:25 |
| Reef (rocksdb=quincy) | 159,997 | 100.00% | 70.9 | 127,809 | 280,633 | 408,442 | 2,256 | 0.711 | 1.42 | 05/17 18:20-19:20 |
| Reef (PR #51821) | 159,945 | 99.97% | 73.2 | 127,836 | 282,099 | 409,935 | 2,184 | 0.562 | 1.06 | 05/30 22:46-23:45 |

8 KiB block size, 160K IOPS, 80:20 random R/W, Qdepth=4

| Ceph Version | IOPS Achieved | IOPS Goal | CPU | Physical Read | Physical Write | Total Physical | Client IOPS/Core | 99%ile Read | 99%ile Write | Sample Period |
|---|---|---|---|---|---|---|---|---|---|---|
| Pacific 16.2.13 | 160,237 | 100.15% | 77 | 195,362 | 330,450 | 525,812 | 2,082 | 2.6 | 7.35 | 05/15 20:15-21:15 |
| Quincy 17.2.6 | 160,219 | 100.14% | 74.2 | 190,478 | 339,711 | 530,189 | 2,159 | 1.9 | 2.83 | 05/19 19:10-20:10 |
| Reef (default) | 160,154 | 100.10% | 80.5 | 238,899 | 351,277 | 590,176 | 1,989 | 1.93 | 2.74 | 05/07 12:00-13:00 |
| Reef (rocksdb=quincy) | 160,146 | 100.09% | 73.9 | 189,042 | 338,514 | 527,555 | 2,168 | 2.46 | 2.96 | 05/17 21:40-22:40 |
| Reef (PR #51821) | 160,230 | 100.14% | 76 | 224,444 | 349,880 | 574,324 | 2,108 | 0.818 | 1.33 | 05/31 10:15-11:15 |

16 KiB block size, 160K IOPS, 80:20 random R/W, Qdepth=4

| Ceph Version | IOPS Achieved | IOPS Goal | CPU | Physical Read | Physical Write | Total Physical | Client IOPS/Core | 99%ile Read | 99%ile Write | Sample Period |
|---|---|---|---|---|---|---|---|---|---|---|
| Pacific 16.2.13 | 117,685 | 73.55% | 72.8 | 239,795 | 423,141 | 662,936 | 1,616 | 5.1 | 8.33 | 05/16 12:25-13:25 |
| Quincy 17.2.6 | 119,449 | 74.66% | 70 | 214,448 | 414,531 | 628,979 | 1,707 | 6.06 | 6.42 | 05/19 22:30-23:30 |
| Reef (default) | 120,815 | 75.51% | 75 | 260,172 | 426,471 | 686,643 | 1,611 | 5.57 | 5.92 | 05/08 13:50-14:50 |
| Reef (rocksdb=quincy) | 120,329 | 75.21% | 70.1 | 241,161 | 418,373 | 659,534 | 1,717 | 5.65 | 6 | 05/18 12:00-13:00 |
| Reef (PR #51821) | 118,134 | 73.83% | 71.3 | 251,384 | 431,695 | 683,079 | 1,658 | 5.59 | 5.86 | 05/31 13:25-14:25 |

In my environment, it looks like the changes benefit the long tail latencies for 4KiB and 8KiB IO sizes at the cost of a slight increase in CPU and disk IO - IMO, it's a definite improvement over the original changes. My only concern is what happens at 16KiB and above. The latency improvement at 16KiB is negligible, but the increase in physical disk IOPS is around 9% compared to Quincy and around 4% more than Reef running the original rocksdb defaults.
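Sanity-checking those percentages against the total physical IOPS column of the 16 KiB table above:

```python
# Total physical IOPS at 16 KiB from the table above.
quincy         = 628_979  # Quincy 17.2.6
reef_rocksdb_q = 659_534  # Reef with quincy rocksdb settings
reef_pr        = 683_079  # Reef with this PR

pct_vs_quincy = (reef_pr / quincy - 1) * 100          # ~8.6%, i.e. "around 9%"
pct_vs_reef_q = (reef_pr / reef_rocksdb_q - 1) * 100  # ~3.6%, i.e. "around 4%"

print(f"vs Quincy:                 +{pct_vs_quincy:.1f}%")
print(f"vs Reef (rocksdb=quincy):  +{pct_vs_reef_q:.1f}%")
```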

markhpc commented May 31, 2023

So we could be a little less aggressive on onode flushes by increasing the default number of memtables to fill before flushing (maybe 7 or 8 instead of 6). The trade-off would likely be higher tail latency in exchange for lower DB write-amp (bringing us a little closer to quincy, but hopefully still with some performance improvement). We won't be able to squeeze much more out of pgmeta/deferred, I'm afraid. In my tests I saw slightly lower write-amp vs quincy in the rate-limited tests but higher write-amp vs quincy in the rate-unlimited tests (rotating between 4MB, 128KB, and 4KB IOs). That supports the theory that we're still making a write-amp vs latency trade-off here; we've just mitigated most of it by keeping pgmeta entries around as long as possible, which is closer to the quincy behavior. It's also possible that we've actually (very slightly) lowered DB write-amp for small IOs but increased it for larger ones. On the other hand, large writes should result in less DB write traffic relative to block write traffic...
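A back-of-envelope view of the flush thresholds being traded off here. The function name below is illustrative, not a Ceph option; the buffer sizes and merge counts come from the tuning described in this PR:

```python
MIB = 1024 * 1024
WRITE_BUFFER_SIZE = 16 * MIB  # per-memtable size under the new tuning

def flush_threshold(min_write_buffer_number_to_merge: int) -> int:
    """Approximate bytes a column family accumulates across immutable
    memtables before RocksDB merges and flushes them to L0."""
    return WRITE_BUFFER_SIZE * min_write_buffer_number_to_merge

onode_cf = flush_threshold(6)    # default CFs (mainly onode): 96 MiB
pgmeta_cf = flush_threshold(32)  # P/L CFs: 512 MiB, so short-lived entries
                                 # can die in memtables instead of leaking to L0
less_aggressive = flush_threshold(8)  # the "7 or 8" variant discussed: 128 MiB
```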

pcuzner commented May 31, 2023

Perhaps the other way to look at the increase in IOPS is that disk devices are consumables - we expect them to wear and we expect them to fail. Looking at where the read/write bias is, it's mostly read: reads are up 17% and writes are up 4%. Given the obvious latency benefit to 4K and 8K workloads, and the fact that this PR brings CPU consumption closer to Quincy, I think for block this is a step forward. @markhpc have you tested for impact to object workloads?

markhpc commented May 31, 2023

@pcuzner yeah, I did some tests with slightly less aggressive onode flushing behavior and it did appear to lower write amp a little. Probably not enough to change things at this point though since there's also likely a slight latency hit (Tests were too noisy to tell for sure). Let's do some quick object sanity checks and if things look good, get this change in before we give reef out to the community.

markhpc commented Jun 2, 2023

Very interesting results when testing 4K object PUTS/LIST/GETS/DELS in RGW:

| Compaction Statistics | v17.2.6 | reef-9d5a260e | reef-9d5a260e-newtuning |
|---|---|---|---|
| Total OSD Log Duration (seconds) | 9857.44 | 9300.18 | 9166.42 |
| Number of Compaction Events | 2808 | 2160 | 1724 |
| Avg Compaction Time (seconds) | 1.88 | 2.76 | 2.72 |
| Total Compaction Time (seconds) | 5270.85 | 5969.43 | 4686.30 |
| Avg Output Size (MB) | 290.80 | 340.60 | 370.06 |
| Total Output Size (MB) | 816566.94 | 735698.55 | 637987.92 |
| Total Input Records | 3230220383 | 4165810416 | 2676878737 |
| Total Output Records | 2308556504 | 3139196730 | 1788896685 |
| Avg Output Throughput (MB/s) | 161.34 | 125.24 | 140.55 |
| Avg Input Records/second | 645043.65 | 734816.94 | 601287.37 |
| Avg Output Records/second | 456709.21 | 571266.58 | 427716.74 |
| Avg Output/Input Ratio | 0.83 | 0.82 | 0.81 |

The new tuning generally appears to be a win for RGW. The number of compactions is lower, the total compaction duration is lower, and the write amp appears to be lower. Performance wise it was pretty close with some variations here and there. Deletes were maybe a little faster and listing was maybe a little slower. I suspect on this setup that RGW is still the major bottleneck.

Personally I think we should go ahead and merge this to get more user feedback at this point.

ljflores commented Jun 2, 2023

@yuriw yuriw merged commit a728862 into ceph:main Jun 2, 2023
markhpc commented Jun 2, 2023

Thank you @ljflores!
