Skip to content

Improve distinct compression for index and data blocks#14140

Closed
pdillinger wants to merge 4 commits intofacebook:mainfrom
pdillinger:compress_index_vs_data_blocks
Closed

Improve distinct compression for index and data blocks#14140
pdillinger wants to merge 4 commits intofacebook:mainfrom
pdillinger:compress_index_vs_data_blocks

Conversation

@pdillinger
Copy link
Contributor

@pdillinger pdillinger commented Nov 21, 2025

Summary: This change enables a custom CompressionManager / Compressor to adopt custom handling for data and index blocks. In particular, index blocks for format_version >= 4 use a distinct variant of the block format. Thus, a potentially format-aware compression algorithm such as OpenZL should be told which kind of block we are compressing. (And previously I avoided passing block type in CompressBlock for efficient handling of things like dictionaries but also avoiding checks on every CompressBlock call.)

Most of the change is in BlockBasedTableBuilder to call MaybeCloneSpecialized for both kDataBlock and for kIndexBlock. But I also needed some small tweaks/additions to the public API also:

  • Require a Clone() function from Compressors, to support proper implementations of MaybeCloneSpecialized() in wrapper Compressors.
  • Assert that the default implementation of CompressorWrapper::MaybeCloneSpecialized() is only used in allowable cases.
  • Convenience function Compressor::CloneMaybeSpecialized()

This also fixes a serious bug/oversight in ManagedPtr for (ManagedWorkingArea) that somehow wasn't showing up before. It probably doesn't need a release note because CompressionManager stuff is still considered experimental.

Test Plan: Greatly expanded DBCompressionTest.CompressionManagerWrapper to make sure the distinction between data blocks and index blocks is properly communicated to a custom CompressionManager/Compressor. The test includes processing the expected structure of data and index blocks, to serve as a tested example for structure-aware compressors.

Summary: This change enables a custom CompressionManager / Compressor to
adopt custom handling for data and index blocks. In particular, index
blocks for format_version >= 4 use a distinct variant of the block
format. Thus, a potentially format-aware compression algorithm such as
OpenZL should be told which kind of block we are compressing. (And
previously I avoided passing block type in CompressBlock for efficient
handling of things like dictionaries but also avoiding checks on every
CompressBlock call.)

Most of the change is in BlockBasedTableBuilder to call
MaybeCloneSpecialized for both kDataBlock and for kIndexBlock. But I
also needed some small tweaks/additions to the public API also:
* Require a Clone() function from Compressors, to support proper
  implementations of MaybeCloneSpecialized() in wrapper Compressors.
* Assert that the default implementation of
  CompressorWrapper::MaybeCloneSpecialized() is only used in allowable
  cases.
* Convenience function Compressor::CloneMaybeSpecialized()

Test Plan: Greatly expanded DBCompressionTest.CompressionManagerWrapper
to make sure the distinction between data blocks and index blocks is
properly communicated to a custom CompressionManager/Compressor. The
test includes processing the expected structure of data and index
blocks, to serve as a tested example for structure-aware compressors.
@pdillinger pdillinger requested a review from hx235 November 21, 2025 01:04
@meta-cla meta-cla bot added the CLA Signed label Nov 21, 2025
@meta-codesync
Copy link

meta-codesync bot commented Nov 21, 2025

@pdillinger has imported this pull request. If you are a Meta employee, you can view this in D87600019.

verify_decomp = r->data_block_verify_decompressor.get();
} else {
compressor = r->basic_compressor.get();
compressor = r->index_block_compressor;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused here - when is_data_block=false, is it necessarily index block? If it's not, can you explain more why it's okay to use index_block_compressor? Similarly for "is_data_block ? r->data_block_working_area : r->index_block_working_area,".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently only index blocks and data blocks are compressed

Comment on lines +1320 to +1325
if (data_block_compressor != basic_compressor.get()) {
delete data_block_compressor;
}
if (index_block_compressor != basic_compressor.get()) {
delete index_block_compressor;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible that index_block_compressor == data_block_compressor in custom implementation? Do we have enough API comment to disallow that?

Copy link
Contributor Author

@pdillinger pdillinger Nov 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Compressor is managed by unique_ptr. So these raw pointers came from distinct unique_ptrs (if they are not equal to basic_compressor). That's unlikely to change any time soon

@meta-codesync
Copy link

meta-codesync bot commented Nov 22, 2025

@pdillinger merged this pull request in 35148ac.

pdillinger added a commit to pdillinger/rocksdb that referenced this pull request Nov 24, 2025
Summary: ... from facebook#14140. The assertion in the default implementation of
CompressorWrapper::MaybeCloneSpecialized() could fail because this
wrapper wasn't overriding it when it should. (See the NOTE on that
implementation.)

Because this release already has a breaking modification to the Compressor API
(adding Clone()), I took this opportunity to add 'const' to
MaybeCloneSpecialized(). Also marked some compression classes as 'final'
that could be marked as such.

Test Plan: unit test expanded to cover this case (verified failing
before). Audited the rest of our CompressorWrappers.
meta-codesync bot pushed a commit that referenced this pull request Nov 24, 2025
Summary:
... from #14140. The assertion in the default implementation of CompressorWrapper::MaybeCloneSpecialized() could fail because this wrapper wasn't overriding it when it should. (See the NOTE on that implementation.)

Because this release already has a breaking modification to the Compressor API (adding Clone()), I took this opportunity to add 'const' to MaybeCloneSpecialized(). Also marked some compression classes as 'final' that could be marked as such.

Pull Request resolved: #14150

Test Plan: unit test expanded to cover this case (verified failing before). Audited the rest of our CompressorWrappers.

Reviewed By: archang19

Differential Revision: D87793987

Pulled By: pdillinger

fbshipit-source-id: 61c4469b84e4a47451a9942df09277faeeccfe63
pdillinger added a commit that referenced this pull request Nov 24, 2025
Summary:
... from #14140. The assertion in the default implementation of CompressorWrapper::MaybeCloneSpecialized() could fail because this wrapper wasn't overriding it when it should. (See the NOTE on that implementation.)

Because this release already has a breaking modification to the Compressor API (adding Clone()), I took this opportunity to add 'const' to MaybeCloneSpecialized(). Also marked some compression classes as 'final' that could be marked as such.

Pull Request resolved: #14150

Test Plan: unit test expanded to cover this case (verified failing before). Audited the rest of our CompressorWrappers.

Reviewed By: archang19

Differential Revision: D87793987

Pulled By: pdillinger

fbshipit-source-id: 61c4469b84e4a47451a9942df09277faeeccfe63
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants