common/options: Set LZ4 compression for bluestore RocksDB. #53343
In the fall of 2022, we tested LZ4 RocksDB compression in bluestore on NVMe backed OSDs here: https://ceph.io/en/news/blog/2022/rocksdb-tuning-deep-dive/

Since then, we've gotten feedback from users in the field testing compression with extremely positive results. Clyso has also worked with a customer with a large RGW deployment that has seen extremely positive results.

Advantages of using compression
===============================

1) Significantly lower write amplification and space amplification

In the article above, we saw a 4X reduction in space usage in RocksDB when writing very small (4KB) objects to RGW. On a real production cluster with 1.3 billion objects, Clyso observed a space usage reduction closer to 2.2X, which was still a substantial improvement. This win is important in multiple cluster configurations:

1A) Pure HDD

Pure HDD clusters are often seek limited under load. Compression means RocksDB can write more data out with less work, which can dramatically improve compaction times (especially as concurrent client workloads accumulate more data in L0/L1 that also needs to be compacted).

1B) Hybrid clusters (HDD block + flash DB/WAL)

In this configuration, spillover to the HDD can become a concern when there isn't enough space on the flash devices to hold all RocksDB SST files for all of the associated OSDs on flash. Compression has a dramatic effect on being able to store all SST files in flash and avoid spillover.

1C) Pure flash based clusters

A primary concern for pure flash based clusters is write amplification and eventual wear out of the flash under write-intensive scenarios. RocksDB compression reduces not only space amplification but also write amplification. That means lower wear on the flash cells and longer flash life.

2) Reduced compaction times

The customer cluster that Clyso worked with used an HDD-only configuration. Prior to enabling RocksDB compression, this cluster could take up to several days to complete a manual compaction of a given OSD during live operation. Enabling LZ4 compression in RocksDB reduced manual compaction time to closer to 25-30 minutes, with ~2 hours being the longest manual compaction time observed.

Potential disadvantages of RocksDB compression
==============================================

1) Increased CPU usage

While there is CPU overhead associated with compression, the effect appeared to be negligible, even on an NVMe backed cluster. Despite restricting NVMe OSDs to 2 cores so that they were extremely CPU bound during PUT operations, enabling compression had no notable effect on PUT performance.

2) Lower GET throughput on NVMe

We noticed a very slight performance hit on NVMe backed clusters during GET operations, though the effect was primarily observed when using Snappy compression and not LZ4 compression. LZ4 GET performance was very close to performance with RocksDB uncompressed.

3) Other performance impact

Potential other concerns might include lower performance during iteration or other actions; however, I expect this to be unlikely. RocksDB typically performs best when it can read data from SST files in large chunks and then work from the block cache. Large readahead values tend to be a win, either to read data into the block cache or so that data can be read quickly from the kernel page cache. As far as I can tell, compression is not having a negative impact here and in fact may be helping in cases where the disk is already quite busy. In general, we are already completely dependent on our own in-memory caches for things like bluestore onodes to achieve high performance on NVMe backed OSDs. More importantly, the goal on 16.2.13+ should be to reduce the overhead of iterating over tombstones, and our primary method to do this right now is to issue compactions on iteration when too many tombstones are encountered. Reducing the impact of compaction directly benefits this goal.

Why LZ4 compression?
====================

Snappy and LZ4 compression are both potential default options. Ceph previously had a bug related to LZ4 compression that could corrupt data, so on the surface it might be tempting to default to Snappy compression. There are several reasons, however, why I believe we should use LZ4 compression by default.

1) The LZ4 bug is fixed, and there have been no reports of issues since the fix was put in place several years ago.

2) The Google developers have made changes to Snappy's build system that impact Ceph. Many distributions are working around these changes, but the Google developers have explicitly stated that they plan to support only Google-specific use cases:

"We are unlikely to accept contributions to the build configuration files, such as CMakeLists.txt. We are focused on maintaining a build configuration that allows us to test that the project works in a few supported configurations inside Google. We are not currently interested in supporting other requirements, such as different operating systems, compilers, or build systems."

https://github.com/google/snappy/blob/main/README.md#contributing-to-the-snappy-project

3) LZ4 compression showed less of a performance impact during RGW 4KB object GETs than Snappy. Snappy showed no performance gains over LZ4 in any of the other tests, nor did it appear to show a meaningful compression advantage.

Impact on existing clusters
===========================

Enabling/disabling compression in RocksDB requires an OSD restart but otherwise does not require user action. SST files will gradually be compressed over time as part of the compaction process, and a manual compaction can be issued to accelerate this. The same goes if users would like to disable compression: new uncompressed SST files will be written over time as part of the compaction process, and a manual compaction can be issued to accelerate this as well.

Conclusion
==========

In general, enabling RocksDB compression in bluestore appears to be a dramatic win. I would like to make this our default behavior for Squid going forward, assuming no issues are uncovered during teuthology testing.

Signed-off-by: Mark Nelson <mark.nelson@clyso.com>
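The change itself is small: it amounts to switching the compression field inside the `bluestore_rocksdb_options` default in `common/options/global.yaml.in`. A sketch of the intended shape of that entry follows; the real default string carries many additional RocksDB tunables that are elided here, so this is illustrative rather than the exact shipped entry:

```
# Assumed shape of the option entry in common/options/global.yaml.in.
# Only the compression field is the point of this PR; the other RocksDB
# tunables in the real default string are omitted.
- name: bluestore_rocksdb_options
  type: str
  level: advanced
  desc: Full set of RocksDB settings to override
  default: compression=kLZ4Compression
```

As noted below, changing this value only takes effect after an OSD restart, and existing SST files are rewritten gradually via compaction.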
We've been using LZ4 in production for quite some time now on very large RGW clusters (10+ PiB raw) in combination with compaction on deletion, and have seen a significant reduction in DB space requirements, allowing more data to be stored on the NVMe-based storage (we deploy with DB/WAL on NVMe, but data on rotational drives). This significantly improved performance on all clusters it has been applied to.
Let it be!
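The manual compaction mentioned above can be triggered with standard Ceph admin commands; a minimal sketch, assuming a running cluster and that the target OSDs are up (shown for reference, not part of this change):

```
# Trigger a manual RocksDB compaction on a single OSD so its existing
# SST files are rewritten (compressed or uncompressed) sooner than
# background compaction would get to them.
ceph tell osd.0 compact

# Or issue the same command to every OSD in the cluster.
ceph tell osd.* compact
```

On large HDD-backed OSDs this can take a long time (see the compaction-time numbers above), so compacting a few OSDs at a time is the safer approach on a live cluster.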