[WIP] kv/RocksDBStore: Improved RocksDB Settings and Tombstone behavior #47221

Merged
merged 4 commits into ceph:main on Nov 28, 2022

Conversation

@markhpc (Member) commented Jul 22, 2022

Multi-part PR implementing RocksDB tuning improvements for higher performance and better tombstone cleanup during iteration. So far this is mostly the RocksDB tuning settings from the new blog article, last-ditch attempts to compact when data gets too old, and compaction during iteration when we see an excessively high number of tombstone entries. It does not yet solve the problem of slow iteration over tombstones in memtables; that will likely require issuing a per-column-family memtable flush. Thus, the next step is per-column-family memtable flushing and DB compaction on rmkey(s).

See the commit messages from this PR and additional information at:

https://tracker.ceph.com/issues/53926
ceph/ceph.io#413
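
For illustration only, here is a rough sketch (not the code merged in this PR) of the general shape of the compaction-during-iteration idea: count tombstones skipped by the iterator via RocksDB's perf context, and if too many show up within a sliding window of keys, compact the range that was just walked. The window size, trigger, and function name below are hypothetical placeholders, not options introduced here.

```cpp
// Hedged sketch only; the actual RocksDBStore changes are structured differently.
#include <rocksdb/db.h>
#include <rocksdb/perf_context.h>
#include <rocksdb/perf_level.h>

#include <cstdint>
#include <memory>
#include <string>

// Hypothetical tunables, not the option names added by this PR.
constexpr uint64_t kSlidingWindowKeys = 32768;
constexpr uint64_t kTombstoneTrigger  = 16384;

void iterate_and_maybe_compact(rocksdb::DB* db,
                               rocksdb::ColumnFamilyHandle* cf) {
  // Enable counting so internal_delete_skipped_count is maintained.
  rocksdb::SetPerfLevel(rocksdb::PerfLevel::kEnableCount);
  rocksdb::get_perf_context()->Reset();

  std::unique_ptr<rocksdb::Iterator> it(
      db->NewIterator(rocksdb::ReadOptions(), cf));

  uint64_t keys_in_window = 0;
  uint64_t deletes_at_window_start = 0;
  std::string window_start_key;

  for (it->SeekToFirst(); it->Valid(); it->Next()) {
    if (keys_in_window == 0) {
      window_start_key = it->key().ToString();
      deletes_at_window_start =
          rocksdb::get_perf_context()->internal_delete_skipped_count;
    }
    if (++keys_in_window >= kSlidingWindowKeys) {
      uint64_t skipped =
          rocksdb::get_perf_context()->internal_delete_skipped_count -
          deletes_at_window_start;
      if (skipped >= kTombstoneTrigger) {
        // Too many tombstones were skipped in this window: compact the
        // range we just walked so future iteration over it is cheaper.
        rocksdb::Slice begin(window_start_key);
        rocksdb::Slice end = it->key();
        rocksdb::Status s = db->CompactRange(
            rocksdb::CompactRangeOptions(), cf, &begin, &end);
        (void)s;  // a real implementation would log/handle failure
      }
      keys_in_window = 0;
    }
  }
}
```

Note that, as with the merged change, this only helps with tombstones that have already been flushed to SST files; tombstones still sitting in memtables are not visible to compaction.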

Contribution Guidelines

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Available Jenkins commands:
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows

yaarith and others added 4 commits July 22, 2022 01:06
When telemetry requires re-opting-in (either when new collections that
require nagging are available, or on major upgrades), a health warning
is set by the module. This health warning should be reset once the user
re-opts-in (with `ceph telemetry on`), but currently it might take
longer. Fix this by waking up serve() immediately after re-opting-in,
which will refresh the health checks.

Fixes: https://tracker.ceph.com/issues/56486
Signed-off-by: Yaarit Hatuka <yaarit@redhat.com>
Change RocksDB options based on research findings documented here:

ceph/ceph.io#413

Signed-off-by: Mark Nelson <mnelson@redhat.com>
In many different contexts we are seeing issues with RocksDB tombstones causing extremely slow iteration performance.  In the past we've tried to solve this using RangeDelete, with unfortunate consequences.  There are a couple of things we can do to mitigate some of the impact of this, however.

One option is to set a compaction TTL.  This is documented in the RocksDB wiki here:

https://github.com/facebook/rocksdb/wiki/RocksDB-Tuning-Guide#periodic-and-ttl-compaction

The idea here is that no data is allowed to sit in RocksDB for a given length of time without being compacted.  Several users, including Alexandre Marangone and Josh Baergen from DigitalOcean, have documented significantly better performance under deletion workloads while using this option:

https://tracker.ceph.com/issues/53926

We are setting this to a fairly conservative 6 hours by default (the same value Josh Baergen reported using at DigitalOcean).  This should limit the write-amplification impact that could occur with a much more aggressive compaction TTL.
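
As a rough illustration (not the exact wiring BlueStore uses, which goes through its RocksDB option string), a 6-hour TTL can be expressed through RocksDB's column family options; `ttl` and `periodic_compaction_seconds` are the upstream option names described in the wiki page linked above:

```cpp
// Hedged sketch: open a DB so that no SST data older than 6 hours escapes
// compaction.  Ceph sets the equivalent values via its own option parsing
// rather than code like this.
#include <rocksdb/db.h>
#include <rocksdb/options.h>

#include <string>

rocksdb::DB* open_with_compaction_ttl(const std::string& path) {
  rocksdb::Options options;
  options.create_if_missing = true;
  // 6 hours: far more conservative than aggressive per-hour settings, but
  // much shorter than RocksDB's historical ~30-day default.
  options.ttl = 6 * 60 * 60;                         // seconds
  options.periodic_compaction_seconds = 6 * 60 * 60;

  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(options, path, &db);
  return s.ok() ? db : nullptr;
}
```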

Caveats to this approach:

1) It only works with tombstones accumulating in SST files.
2) It will only help with a gradual accumulation of tombstones over long periods of time.
3) It does nothing to help with accumulation of tombstones in memtables.

Additional mitigation methods (especially compaction triggered by deletes) will be necessary, though this can still serve as a useful "last line of defense" if tombstones are accumulating in SST files.

Signed-off-by: Mark Nelson <mnelson@redhat.com>
This commit adds support to compact column families when a certain number
of tombstone entries have been observed within a certain sliding window
during iteration.  It only helps when iterating over entries already in
SST files and not when iterating over ranges in memtables.

Likely we will still need to provide a mechanism to flush memtables and
compact column families once a certain number of rmkey or rm_range_key
calls are made.

Signed-off-by: Mark Nelson <mnelson@redhat.com>
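
The flush-on-delete mechanism mentioned in the commit message above is not part of this PR, but a minimal sketch of the idea might look like the following; the counter threshold and helper names are hypothetical:

```cpp
// Hedged sketch of a possible future mechanism, not merged code.
#include <rocksdb/db.h>

#include <atomic>
#include <cstdint>

// Hypothetical threshold; not an option introduced by this PR.
constexpr uint64_t kDeletesBeforeFlush = 1000000;

struct CFDeleteTracker {
  rocksdb::ColumnFamilyHandle* cf = nullptr;
  std::atomic<uint64_t> deletes_since_flush{0};
};

// Called after each rmkey()/rm_range_keys()-style operation on this CF.
void note_delete(rocksdb::DB* db, CFDeleteTracker& t) {
  if (++t.deletes_since_flush < kDeletesBeforeFlush) {
    return;
  }
  t.deletes_since_flush = 0;
  // Move the accumulated tombstones out of the memtable into SSTs...
  rocksdb::Status s = db->Flush(rocksdb::FlushOptions(), t.cf);
  // ...and compact the whole column family so they can actually be dropped.
  if (s.ok()) {
    s = db->CompactRange(rocksdb::CompactRangeOptions(), t.cf,
                         nullptr, nullptr);
  }
  (void)s;  // a real implementation would log/handle failure
}
```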
@markhpc (Member, Author) commented Jul 22, 2022

Some generic 1 OSD sanity benchmarks:

RBD 4KB IOPS (higher is better)

test        main     wip-faster-rocksdb
randread    114597   113057
randwrite    75530    94072
read         28192    29888
write       298033   322859

RBD 4KB Cycles/OP (lower is better)

test        main     wip-faster-rocksdb
randread    221096   224686
randwrite   585403   560758
read        189125   186563
write        30013    26119

RBD 4KB Lat (lower is better)

test        main (avg ms)   main (99% ms)   wip-faster-rocksdb (avg ms)   wip-faster-rocksdb (99% ms)
randread    4.701           43.385          4.812                         41.353
randwrite   8.065           67.158          6.060                         34.275
read        19.359          163.414         17.319                        90.735
write       1.714           3.113           1.583                         2.695

RGW:

test                  main    wip-faster-rocksdb
puts/second           5536    5515
list time (seconds)   593     718
gets/second           12743   12760
deletes/second        7187    7832

Overall write and delete gains as expected (though we may need more RGW instances to see the write benefit in the hsbench test). Bucket list performance is down, which may be consistent with the extra overhead of reads in a bigger L0 with more data and overlapping key ranges (there is some potential evidence of this with limited CPU and compression enabled in the blog article as well). It's possible that on a real cluster with multiple OSDs we won't see the list performance impact, due to bucket index sharding across multiple OSDs.

@baergj commented Jul 28, 2022

@markhpc A note on that TTL setting - I've been hesitant to apply that to all OSD types simply because I don't know what the wear characteristics due to write amp will be like for object store flash drives. If someone is running QLC in the field, for example, how many write cycles will it eat up?

Happy to see it there, though, and maybe it'll just be fine and there's nothing to worry about.

@markhpc (Member, Author) commented Jul 28, 2022

@baergj yeah, my hope is that we can perhaps back this off even further once the other methods land (compacting during iteration and compacting/flushing after hitting a certain number of deletes between compactions). Maybe we only fall back on the TTL every 24 hours or something.

Generally speaking we are going to increase write amp here, though. One possible way to claw some of that back could be compression. I don't think it helps much with write amp for RBD, but for RGW it appeared to have a pretty massive effect. On the other hand, compression will likely use more CPU, and especially in CPU-limited scenarios might hurt bucket list performance, so I guess we just have to find the right combination of default trade-offs.

@baergj commented Jul 28, 2022

I wonder if this is starting to approach the limit of what makes sense for index OSDs vs. data OSDs, given the vastly different workloads between the two of them?

FWIW, I had also wondered about backing the TTL off to 24h+ for data OSDs, but we haven't had much reason to apply a TTL to our data OSDs. IIRC RocksDB had a 30-day default TTL; not sure if that's still there at HEAD.

@neha-ojha requested a review from aclamk on July 29, 2022 15:30
@aclamk (Contributor) left a comment

Let's get it testing!

@markhpc merged commit 85e2757 into ceph:main on Nov 28, 2022
@ljflores (Contributor) commented

@markhpc @yaarith somehow this PR has a telemetry commit in it?

@yaarith (Contributor) commented Dec 12, 2022

@ljflores @markhpc interesting, looks like the original commit (b8bbf64) was merged on July 21, when Mark created this PR, so this might have been a rebase/fork issue. I'll check to make sure there are no issues with the second commit in the telemetry code.

@yaarith (Contributor) commented Dec 14, 2022

The telemetry code looks ok to me.
