[WIP] kv/RocksDBStore: Improved RocksDB Settings and Tombstone behavior #47221

Merged
merged 4 commits into ceph:main on Nov 28, 2022

Conversation

@markhpc (Member) commented Jul 22, 2022

Multi-part PR implementing RocksDB tuning improvements for higher performance and better tombstone cleanup during iteration. So far this is mostly the RocksDB tuning settings from the new blog article, last-ditch attempts to compact when data gets too old, and compaction during iteration when we see an excessively high number of tombstone entries. It does not yet solve the problem of slow iteration over tombstones in memtables; that will likely require issuing a per-column-family memtable flush. Thus, the next step is per-column-family memtable flushing and DB compaction on rmkey(s).

See the commit messages from this PR and additional information at:

https://tracker.ceph.com/issues/53926
ceph/ceph.io#413
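
For illustration only, here is a rough sketch (not the code merged in this PR) of the general shape of the compaction-during-iteration idea: count tombstones skipped by the iterator via RocksDB's perf context, and if too many show up within a sliding window of keys, compact the range that was just walked. The window size, trigger, and function name below are hypothetical placeholders, not options introduced here.

```cpp
// Hedged sketch only; the actual RocksDBStore changes are structured differently.
#include <rocksdb/db.h>
#include <rocksdb/perf_context.h>
#include <rocksdb/perf_level.h>

#include <cstdint>
#include <memory>
#include <string>

// Hypothetical tunables, not the option names added by this PR.
constexpr uint64_t kSlidingWindowKeys = 32768;
constexpr uint64_t kTombstoneTrigger  = 16384;

void iterate_and_maybe_compact(rocksdb::DB* db,
                               rocksdb::ColumnFamilyHandle* cf) {
  // Enable counting so internal_delete_skipped_count is maintained.
  rocksdb::SetPerfLevel(rocksdb::PerfLevel::kEnableCount);
  rocksdb::get_perf_context()->Reset();

  std::unique_ptr<rocksdb::Iterator> it(
      db->NewIterator(rocksdb::ReadOptions(), cf));

  uint64_t keys_in_window = 0;
  uint64_t deletes_at_window_start = 0;
  std::string window_start_key;

  for (it->SeekToFirst(); it->Valid(); it->Next()) {
    if (keys_in_window == 0) {
      window_start_key = it->key().ToString();
      deletes_at_window_start =
          rocksdb::get_perf_context()->internal_delete_skipped_count;
    }
    if (++keys_in_window >= kSlidingWindowKeys) {
      uint64_t skipped =
          rocksdb::get_perf_context()->internal_delete_skipped_count -
          deletes_at_window_start;
      if (skipped >= kTombstoneTrigger) {
        // Too many tombstones were skipped in this window: compact the
        // range we just walked so future iteration over it is cheaper.
        rocksdb::Slice begin(window_start_key);
        rocksdb::Slice end = it->key();
        rocksdb::Status s = db->CompactRange(
            rocksdb::CompactRangeOptions(), cf, &begin, &end);
        (void)s;  // a real implementation would log/handle failure
      }
      keys_in_window = 0;
    }
  }
}
```

Note that, as with the merged change, this only helps with tombstones that have already been flushed to SST files; tombstones still sitting in memtables are not visible to compaction.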

Contribution Guidelines

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Component impact
    • Affects Dashboard, opened tracker ticket
    • Affects Orchestrator, opened tracker ticket
    • No impact that needs to be tracked
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)
Available Jenkins commands:
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox
  • jenkins test windows

yaarith and others added 4 commits July 22, 2022 01:06
When telemetry requires re-opting-in (either when new collections that
require nagging are available, or on major upgrades), a health warning
is set by the module. This health warning should be reset once the user
re-opts-in (with `ceph telemetry on`), but currently it might take
longer. Fix this by waking up serve() immediately after re-opting-in,
which will refresh the health checks.

Fixes: https://tracker.ceph.com/issues/56486
Signed-off-by: Yaarit Hatuka <yaarit@redhat.com>
Change RocksDB options based on research findings documented here:

ceph/ceph.io#413

Signed-off-by: Mark Nelson <mnelson@redhat.com>
In many different contexts we are seeing issues with RocksDB tombstones causing extremely slow iteration performance.  In the past we've tried to solve this using RangeDelete, with unfortunate consequences.  There are a couple of things we can do to mitigate some of the impact of this, however.

One option is to set a compaction TTL.  This is documented in the RocksDB wiki here:

https://github.com/facebook/rocksdb/wiki/RocksDB-Tuning-Guide#periodic-and-ttl-compaction

The idea here is that no data is allowed to sit in RocksDB for a given length of time without being compacted.  Several users, including Alexandre Marangone and Josh Baergen from DigitalOcean, have documented significantly better performance under deletion workloads while using this option:

https://tracker.ceph.com/issues/53926

We are setting this to a fairly conservative 6 hours by default (the same value Josh Baergen reported using at DigitalOcean).  This should limit the write-amplification impact that could occur with a much more aggressive compaction TTL.
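
As a rough illustration (not the exact wiring BlueStore uses, which goes through its RocksDB option string), a 6-hour TTL can be expressed through RocksDB's column family options; `ttl` and `periodic_compaction_seconds` are the upstream option names described in the wiki page linked above:

```cpp
// Hedged sketch: open a DB so that no SST data older than 6 hours escapes
// compaction.  Ceph sets the equivalent values via its own option parsing
// rather than code like this.
#include <rocksdb/db.h>
#include <rocksdb/options.h>

#include <string>

rocksdb::DB* open_with_compaction_ttl(const std::string& path) {
  rocksdb::Options options;
  options.create_if_missing = true;
  // 6 hours: far more conservative than aggressive per-hour settings, but
  // much shorter than RocksDB's historical ~30-day default.
  options.ttl = 6 * 60 * 60;                         // seconds
  options.periodic_compaction_seconds = 6 * 60 * 60;

  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(options, path, &db);
  return s.ok() ? db : nullptr;
}
```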

Caveats to this approach:

1) It only works with tombstones accumulating in SST files.
2) It will only help with a gradual accumulation of tombstones over long periods of time.
3) It does nothing to help with accumulation of tombstones in memtables.

Additional mitigation methods (especially compaction triggered by deletes) will be necessary, though this can still serve as a useful "last line of defense" if tombstones are accumulating in SST files.

Signed-off-by: Mark Nelson <mnelson@redhat.com>
This commit adds support to compact column families when a certain number
of tombstone entries have been observed within a certain sliding window
during iteration.  It only helps when iterating over entries already in
SST files and not when iterating over ranges in memtables.

Likely we will still need to provide a mechanism to flush memtables and
compact column families once a certain number of rmkey or rm_range_key
calls are made.

Signed-off-by: Mark Nelson <mnelson@redhat.com>
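
The flush-on-delete mechanism mentioned in the commit message above is not part of this PR, but a minimal sketch of the idea might look like the following; the counter threshold and helper names are hypothetical:

```cpp
// Hedged sketch of a possible future mechanism, not merged code.
#include <rocksdb/db.h>

#include <atomic>
#include <cstdint>

// Hypothetical threshold; not an option introduced by this PR.
constexpr uint64_t kDeletesBeforeFlush = 1000000;

struct CFDeleteTracker {
  rocksdb::ColumnFamilyHandle* cf = nullptr;
  std::atomic<uint64_t> deletes_since_flush{0};
};

// Called after each rmkey()/rm_range_keys()-style operation on this CF.
void note_delete(rocksdb::DB* db, CFDeleteTracker& t) {
  if (++t.deletes_since_flush < kDeletesBeforeFlush) {
    return;
  }
  t.deletes_since_flush = 0;
  // Move the accumulated tombstones out of the memtable into SSTs...
  rocksdb::Status s = db->Flush(rocksdb::FlushOptions(), t.cf);
  // ...and compact the whole column family so they can actually be dropped.
  if (s.ok()) {
    s = db->CompactRange(rocksdb::CompactRangeOptions(), t.cf,
                         nullptr, nullptr);
  }
  (void)s;  // a real implementation would log/handle failure
}
```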
@markhpc (Member, Author) commented Jul 22, 2022

Some generic 1 OSD sanity benchmarks:

RBD 4KB IOPS (higher is better)

test        main     wip-faster-rocksdb
randread    114597   113057
randwrite    75530    94072
read         28192    29888
write       298033   322859

RBD 4KB Cycles/OP (lower is better)

test        main     wip-faster-rocksdb
randread    221096   224686
randwrite   585403   560758
read        189125   186563
write        30013    26119

RBD 4KB Lat (lower is better)

test        main (avg ms)   main (99% ms)   wip-faster-rocksdb (avg ms)   wip-faster-rocksdb (99% ms)
randread    4.701           43.385          4.812                         41.353
randwrite   8.065           67.158          6.060                         34.275
read        19.359          163.414         17.319                        90.735
write       1.714           3.113           1.583                         2.695

RGW:

test                  main    wip-faster-rocksdb
puts/second           5536    5515
list time (seconds)   593     718
gets/second           12743   12760
deletes/second        7187    7832

Overall write and delete gains as expected (though we may need more RGW instances to see the write benefit in the hsbench test). Bucket list performance is down, which may be consistent with the extra overhead of reads in a bigger L0 with more data and overlapping key ranges (there is some potential evidence of this with limited CPU and compression enabled in the blog article as well). It's possible that on a real cluster with multiple OSDs we won't see the list performance impact, due to bucket index sharding across multiple OSDs.

@baergj commented Jul 28, 2022

@markhpc A note on that TTL setting - I've been hesitant to apply that to all OSD types simply because I don't know what the wear characteristics due to write amp will be like for object store flash drives. If someone is running QLC in the field, for example, how many write cycles will it eat up?

Happy to see it there, though, and maybe it'll just be fine and there's nothing to worry about.

@markhpc (Member, Author) commented Jul 28, 2022

@baergj yeah, my hope is that we can perhaps back this off even further once the other methods land (compacting during iteration and compacting/flushing after hitting a certain number of deletes between compactions). Maybe we only fall back on the TTL every 24 hours or something.

Generally speaking we are going to increase write amp here, though. One possible way to claw some of that back could be compression. I don't think it helps much with write amp for RBD, but for RGW it appeared to have a pretty massive effect. On the other hand, compression will likely use more CPU, and especially in CPU-limited scenarios might hurt bucket list performance, so I guess we just have to find the right combination of default trade-offs.

@baergj commented Jul 28, 2022

I wonder if this is starting to approach the limit of what makes sense for index OSDs vs. data OSDs, given the vastly different workloads between the two of them?

FWIW, I had also wondered about backing the TTL off to 24h+ for data OSDs, but we haven't had much reason to apply a TTL to our data OSDs. IIRC RocksDB had a 30-day default TTL; not sure if that's still there at HEAD.

@neha-ojha requested a review from aclamk on July 29, 2022 15:30
@aclamk (Contributor) left a comment

Let's get it testing!

@markhpc merged commit 85e2757 into ceph:main on Nov 28, 2022
@ljflores (Contributor) commented

@markhpc @yaarith somehow this PR has a telemetry commit in it?

@yaarith (Contributor) commented Dec 12, 2022

@ljflores @markhpc interesting, looks like the original commit (b8bbf64) was merged on July 21, when Mark created this PR, so this might have been a rebase/fork issue. I'll check to make sure there are no issues with the second commit in the telemetry code.

@yaarith (Contributor) commented Dec 14, 2022

The telemetry code looks ok to me.
