[WIP] kv/RocksDBStore: Improved RocksDB Settings and Tombstone behavior #47221
Conversation
When telemetry requires re-opting in (either when new collections that require nagging become available, or on major upgrades), a health warning is set by the module. This health warning should be cleared once the user re-opts in (with `ceph telemetry on`), but currently that can take longer than expected. Fix this by waking up serve() immediately after re-opting in, which refreshes the health checks.

Fixes: https://tracker.ceph.com/issues/56486
Signed-off-by: Yaarit Hatuka <yaarit@redhat.com>
Change RocksDB options based on research findings documented here: ceph/ceph.io#413

Signed-off-by: Mark Nelson <mnelson@redhat.com>
In many different contexts we are seeing issues with RocksDB tombstones causing extremely slow iteration performance. In the past we've tried to solve this using RangeDelete, with unfortunate consequences. There are a couple of things we can do to mitigate some of the impact, however. One option is to set a compaction TTL, documented in the RocksDB wiki here: https://github.com/facebook/rocksdb/wiki/RocksDB-Tuning-Guide#periodic-and-ttl-compaction

The idea is that no data is allowed to sit in RocksDB for a given length of time without being compacted. Several users, including Alexandre Marangone and Josh Baergen from DigitalOcean, have documented significantly better performance under deletion workloads while using this option: https://tracker.ceph.com/issues/53926

We are setting this to a fairly conservative 6 hours by default (the same value Josh Baergen reported using at DigitalOcean). This should limit the write-amplification impact that could occur with a much more aggressive compaction TTL.

Caveats to this approach:
1) It only works for tombstones accumulating in SST files.
2) It only helps with a gradual accumulation of tombstones over long periods of time.
3) It does nothing for tombstones accumulating in memtables.

Additional mitigation methods (especially compaction triggered by deletes) will be necessary, though this can still serve as a useful "last line of defense" when tombstones accumulate in SST files.

Signed-off-by: Mark Nelson <mnelson@redhat.com>
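For reference, the TTL knob itself is a plain RocksDB column-family option. A minimal sketch of setting it through the stock RocksDB C++ API (assuming a standalone DB rather than Ceph's RocksDBStore wiring; the `/tmp/kvdb` path is illustrative only):

```cpp
// Minimal sketch (not this PR's actual code) of a 6-hour compaction TTL.
#include <cassert>

#include <rocksdb/db.h>
#include <rocksdb/options.h>

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;

  // SST files whose data is older than this many seconds become
  // candidates for compaction even if no other trigger fires, so
  // tombstones cannot sit uncompacted indefinitely.
  options.ttl = 6 * 60 * 60;  // 21600 s == 6 hours

  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/kvdb", &db);
  assert(s.ok());
  delete db;
  return 0;
}
```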
This commit adds support for compacting column families when a certain number of tombstone entries have been observed within a sliding window during iteration. It only helps when iterating over entries already in SST files, not when iterating over ranges in memtables. We will likely still need a mechanism to flush memtables and compact column families once a certain number of rmkey or rm_range_key calls have been made.

Signed-off-by: Mark Nelson <mnelson@redhat.com>
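For illustration, the general shape of that technique against the stock RocksDB API could look like the sketch below. The function name and threshold are assumptions, not this PR's code, and the actual RocksDBStore change differs in detail (e.g. it uses a sliding window rather than a per-scan total):

```cpp
// Hedged sketch of the general technique only: count the tombstones an
// iterator skips (via RocksDB's perf context) and compact the column
// family once a threshold is crossed.
#include <memory>

#include <rocksdb/db.h>
#include <rocksdb/iterator.h>
#include <rocksdb/perf_context.h>
#include <rocksdb/perf_level.h>

void scan_with_tombstone_guard(rocksdb::DB* db,
                               rocksdb::ColumnFamilyHandle* cf,
                               uint64_t tombstone_limit /* hypothetical */) {
  rocksdb::SetPerfLevel(rocksdb::PerfLevel::kEnableCount);
  rocksdb::get_perf_context()->Reset();

  std::unique_ptr<rocksdb::Iterator> it(
      db->NewIterator(rocksdb::ReadOptions(), cf));
  for (it->SeekToFirst(); it->Valid(); it->Next()) {
    // ... consume it->key() / it->value() ...
  }

  // internal_delete_skipped_count is how many deletion markers the
  // iterator stepped over; a large value means this scan paid heavily
  // for tombstones that compaction has not yet cleaned up.
  if (rocksdb::get_perf_context()->internal_delete_skipped_count >
      tombstone_limit) {
    // Manually compact the whole key range of this column family.
    db->CompactRange(rocksdb::CompactRangeOptions(), cf, nullptr, nullptr);
  }
}
```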
Some generic 1 OSD sanity benchmarks (chart captions; images omitted):

- RBD 4KB IOPS (higher is better)
- RBD 4KB Cycles/OP (lower is better)
- RBD 4KB Lat (lower is better)

RGW:
Overall write and delete gains are as expected (though we may need more RGW instances to see the write benefit in the hsbench test). Bucket list performance is down, which may be consistent with the extra overhead of reads in a bigger L0 with more data and overlapping key ranges (there is some potential evidence of this with limited CPU and compression enabled in the blog article as well). It's possible that on a real cluster with multiple OSDs we wouldn't see the list performance impact, due to bucket index sharding across multiple OSDs.
@markhpc A note on that TTL setting: I've been hesitant to apply it to all OSD types simply because I don't know what the wear characteristics due to write amp will be like for object store flash drives. If someone is running QLC in the field, for example, how many write cycles will it eat up? Happy to see it there, though; maybe it'll just be fine and there's nothing to worry about.
@baergj Yeah, my hope is that we can perhaps back this off even further if the other methods (compacting during iteration and compacting/flushing after hitting a certain number of deletes between compactions) prove effective. Maybe we only fall back on the TTL every 24 hours or something. Generally speaking, we are going to increase write amp here, though. One possible way to claw some of that back could be compression. I don't think it's helping much with write amp for RBD, but for RGW it appeared to have a pretty massive effect. On the other hand, compression will likely use more CPU, and especially in CPU-limited scenarios it might hurt bucket list performance, so I guess we just have to find the right combination of default trade-offs.
I wonder if this is starting to approach the limit of what makes sense for index OSDs vs. data OSDs, given the vastly different workloads between the two? FWIW, I had also wondered about backing the TTL off to 24h+ for data OSDs, but we haven't had much reason to apply a TTL to our data OSDs. IIRC RocksDB had a 30-day default TTL; not sure if that's still there at HEAD.
Let's get it testing!
The telemetry code looks ok to me.
Multi-part PR implementing RocksDB tuning improvements for higher performance and better tombstone cleanup during iteration. So far this is mostly the RocksDB tuning settings from the new blog article, last-ditch compaction when data gets too old, and compaction during iteration when we see an excessively high number of tombstone entries. It does not yet solve the problem of slow iteration over tombstones in memtables; that will likely require issuing a per-column-family memtable flush. Thus, the next step is per-column-family memtable flushing and DB compaction on rmkey(s) (see the sketch after the links below).
See the commit messages from this PR and additional information at:
https://tracker.ceph.com/issues/53926
ceph/ceph.io#413
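As a rough illustration of that next step, flushing a memtable and then compacting a single column family is straightforward with the stock RocksDB API. Everything here (the counter, the threshold, the function name) is a hypothetical sketch, not the eventual implementation:

```cpp
// Hypothetical sketch of flushing/compacting a column family after a
// certain number of deletes; counter, threshold, and function name are
// assumptions for illustration only.
#include <atomic>

#include <rocksdb/db.h>

std::atomic<uint64_t> deletes_since_compaction{0};  // hypothetical counter

void note_delete(rocksdb::DB* db, rocksdb::ColumnFamilyHandle* cf) {
  constexpr uint64_t kDeleteThreshold = 100000;  // hypothetical value
  if (++deletes_since_compaction >= kDeleteThreshold) {
    deletes_since_compaction = 0;
    // Push memtable contents (including tombstones) out to L0...
    db->Flush(rocksdb::FlushOptions(), cf);
    // ...then compact the whole CF so the tombstones can be dropped.
    db->CompactRange(rocksdb::CompactRangeOptions(), cf, nullptr, nullptr);
  }
}
```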
Contribution Guidelines
To sign and title your commits, please refer to Submitting Patches to Ceph.
If you are submitting a fix for a stable branch (e.g. "pacific"), please refer to Submitting Patches to Ceph - Backports for the proper workflow.
Show available Jenkins commands
jenkins retest this please
jenkins test classic perf
jenkins test crimson perf
jenkins test signed
jenkins test make check
jenkins test make check arm64
jenkins test submodules
jenkins test dashboard
jenkins test dashboard cephadm
jenkins test api
jenkins test docs
jenkins render docs
jenkins test ceph-volume all
jenkins test ceph-volume tox
jenkins test windows