Improve distinct compression for index and data blocks#14140
Improve distinct compression for index and data blocks#14140pdillinger wants to merge 4 commits intofacebook:mainfrom
Conversation
Summary: This change enables a custom CompressionManager / Compressor to adopt custom handling for data and index blocks. In particular, index blocks for format_version >= 4 use a distinct variant of the block format. Thus, a potentially format-aware compression algorithm such as OpenZL should be told which kind of block we are compressing. (And previously I avoided passing block type in CompressBlock for efficient handling of things like dictionaries but also avoiding checks on every CompressBlock call.) Most of the change is in BlockBasedTableBuilder to call MaybeCloneSpecialized for both kDataBlock and for kIndexBlock. But I also needed some small tweaks/additions to the public API also: * Require a Clone() function from Compressors, to support proper implementations of MaybeCloneSpecialized() in wrapper Compressors. * Assert that the default implementation of CompressorWrapper::MaybeCloneSpecialized() is only used in allowable cases. * Convenience function Compressor::CloneMaybeSpecialized() Test Plan: Greatly expanded DBCompressionTest.CompressionManagerWrapper to make sure the distinction between data blocks and index blocks is properly communicated to a custom CompressionManager/Compressor. The test includes processing the expected structure of data and index blocks, to serve as a tested example for structure-aware compressors.
|
@pdillinger has imported this pull request. If you are a Meta employee, you can view this in D87600019. |
| verify_decomp = r->data_block_verify_decompressor.get(); | ||
| } else { | ||
| compressor = r->basic_compressor.get(); | ||
| compressor = r->index_block_compressor; |
There was a problem hiding this comment.
I'm confused here - when is_data_block=false, is it necessarily index block? If it's not, can you explain more why it's okay to use index_block_compressor? Similarly for "is_data_block ? r->data_block_working_area : r->index_block_working_area,".
There was a problem hiding this comment.
Currently only index blocks and data blocks are compressed
| if (data_block_compressor != basic_compressor.get()) { | ||
| delete data_block_compressor; | ||
| } | ||
| if (index_block_compressor != basic_compressor.get()) { | ||
| delete index_block_compressor; | ||
| } |
There was a problem hiding this comment.
Is it possible that index_block_compressor == data_block_compressor in custom implementation? Do we have enough API comment to disallow that?
There was a problem hiding this comment.
Compressor is managed by unique_ptr. So these raw pointers came from distinct unique_ptrs (if they are not equal to basic_compressor). That's unlikely to change any time soon
|
@pdillinger merged this pull request in 35148ac. |
Summary: ... from facebook#14140. The assertion in the default implementation of CompressorWrapper::MaybeCloneSpecialized() could fail because this wrapper wasn't overriding it when it should. (See the NOTE on that implementation.) Because this release already has a breaking modification to the Compressor API (adding Clone()), I took this opportunity to add 'const' to MaybeCloneSpecialized(). Also marked some compression classes as 'final' that could be marked as such. Test Plan: unit test expanded to cover this case (verified failing before). Audited the rest of our CompressorWrappers.
Summary: ... from #14140. The assertion in the default implementation of CompressorWrapper::MaybeCloneSpecialized() could fail because this wrapper wasn't overriding it when it should. (See the NOTE on that implementation.) Because this release already has a breaking modification to the Compressor API (adding Clone()), I took this opportunity to add 'const' to MaybeCloneSpecialized(). Also marked some compression classes as 'final' that could be marked as such. Pull Request resolved: #14150 Test Plan: unit test expanded to cover this case (verified failing before). Audited the rest of our CompressorWrappers. Reviewed By: archang19 Differential Revision: D87793987 Pulled By: pdillinger fbshipit-source-id: 61c4469b84e4a47451a9942df09277faeeccfe63
Summary: ... from #14140. The assertion in the default implementation of CompressorWrapper::MaybeCloneSpecialized() could fail because this wrapper wasn't overriding it when it should. (See the NOTE on that implementation.) Because this release already has a breaking modification to the Compressor API (adding Clone()), I took this opportunity to add 'const' to MaybeCloneSpecialized(). Also marked some compression classes as 'final' that could be marked as such. Pull Request resolved: #14150 Test Plan: unit test expanded to cover this case (verified failing before). Audited the rest of our CompressorWrappers. Reviewed By: archang19 Differential Revision: D87793987 Pulled By: pdillinger fbshipit-source-id: 61c4469b84e4a47451a9942df09277faeeccfe63
Summary: This change enables a custom CompressionManager / Compressor to adopt custom handling for data and index blocks. In particular, index blocks for format_version >= 4 use a distinct variant of the block format. Thus, a potentially format-aware compression algorithm such as OpenZL should be told which kind of block we are compressing. (And previously I avoided passing block type in CompressBlock for efficient handling of things like dictionaries but also avoiding checks on every CompressBlock call.)
Most of the change is in BlockBasedTableBuilder to call MaybeCloneSpecialized for both kDataBlock and for kIndexBlock. But I also needed some small tweaks/additions to the public API also:
This also fixes a serious bug/oversight in ManagedPtr for (ManagedWorkingArea) that somehow wasn't showing up before. It probably doesn't need a release note because CompressionManager stuff is still considered experimental.
Test Plan: Greatly expanded DBCompressionTest.CompressionManagerWrapper to make sure the distinction between data blocks and index blocks is properly communicated to a custom CompressionManager/Compressor. The test includes processing the expected structure of data and index blocks, to serve as a tested example for structure-aware compressors.