feat(compaction): support multiple PersistProcessor in PK compaction#170
Merged
Conversation
lxy-9602
commented
Mar 6, 2026
Pull request overview
This PR introduces pluggable PersistProcessor<T> support for PK compaction, allowing different merge engines (DV, first_row, etc.) to define their own serialization strategy for lookup files. It also adds the underlying RowCompactedSerializer, BinarySerializerUtils, VarLengthIntUtils, and related binary data structures needed to implement these processors.
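To make the idea of a pluggable processor concrete, here is a minimal sketch of how a templated processor and two of its variants could be shaped. Everything in it is an assumption for illustration: the `Bytes` alias, the method names `Persist` and `Read`, the class names `EmptyProcessor` and `PositionProcessor`, and the fixed 8-byte position encoding are hypothetical and are not taken from the PR (which, per the file summary, encodes the position as a varint).

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for the library's Bytes type.
using Bytes = std::vector<uint8_t>;

// Hypothetical stand-in for the FilePosition struct (filename + row position).
struct FilePosition {
  std::string file_name;
  int64_t row_position;
};

// Illustrative interface: T is the type a lookup returns for a given merge engine.
template <typename T>
class PersistProcessor {
 public:
  virtual ~PersistProcessor() = default;
  // Produce the bytes stored in the lookup file for one row.
  virtual Bytes Persist(int64_t row_position) const = 0;
  // Rebuild the lookup result from the stored bytes and the source file name.
  virtual T Read(const Bytes& stored, const std::string& file_name) const = 0;
};

// Stores nothing; a successful lookup only signals that the key exists.
class EmptyProcessor : public PersistProcessor<bool> {
 public:
  Bytes Persist(int64_t /*row_position*/) const override { return {}; }
  bool Read(const Bytes& /*stored*/, const std::string& /*file_name*/) const override {
    return true;
  }
};

// Stores only the row position, here as 8 little-endian bytes (an assumption).
class PositionProcessor : public PersistProcessor<FilePosition> {
 public:
  Bytes Persist(int64_t row_position) const override {
    Bytes out(8);
    const uint64_t v = static_cast<uint64_t>(row_position);
    for (int i = 0; i < 8; ++i) out[i] = static_cast<uint8_t>(v >> (8 * i));
    return out;
  }
  FilePosition Read(const Bytes& stored, const std::string& file_name) const override {
    uint64_t v = 0;
    for (int i = 0; i < 8 && i < static_cast<int>(stored.size()); ++i) {
      v |= static_cast<uint64_t>(stored[i]) << (8 * i);
    }
    return FilePosition{file_name, static_cast<int64_t>(v)};
  }
};
```

The point of the template parameter is that each merge engine pays only for what it needs: an existence check stores zero bytes, while a position-based engine stores a few bytes and never serializes the full value.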
Changes:
- Introduces the `PersistProcessor<T>` base interface and four concrete implementations: `PersistEmptyProcessor`, `PersistPositionProcessor`, `PersistValueProcessor`, and `PersistValueAndPosProcessor`.
- Adds `RowCompactedSerializer` (with Java binary compatibility), `BinarySerializerUtils`, `VarLengthIntUtils`, and `BinaryMap` support.
- Fixes `Bytes::CopyOf` to no longer require `len >= other.size()`, allowing truncation.
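The relaxed `CopyOf` behaviour can be illustrated with a small sketch. It uses a free function over `std::vector<uint8_t>` rather than the real `Bytes` class, and it assumes that a longer `len` zero-pads the result (as Java's `Arrays.copyOf` does); the PR summary only confirms that truncation is now allowed.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical free-function analogue of Bytes::CopyOf.
// Returns a buffer of exactly `len` bytes: the source is truncated when
// len < src.size() and zero-padded (an assumption) when len > src.size().
std::vector<uint8_t> CopyOf(const std::vector<uint8_t>& src, std::size_t len) {
  std::vector<uint8_t> out(len, 0);
  std::copy_n(src.begin(), std::min(len, src.size()), out.begin());
  return out;
}
```

With this shape, `CopyOf({1, 2, 3, 4}, 2)` yields the first two bytes, which is the case the old `len >= other.size()` precondition rejected.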
Reviewed changes
Copilot reviewed 33 out of 33 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| `persist_processor.h` | Defines the templated `PersistProcessor<T>` base class and its inner `Factory` |
| `persist_empty_processor.h` | `bool`-returning processor; serializes nothing |
| `persist_position_processor.h` | `FilePosition`-returning processor; encodes row position as varint |
| `persist_value_processor.h` | `KeyValue`-returning processor; serializes value + sequence + row kind |
| `persist_value_and_pos_processor.h` | `PositionedKeyValue`-returning processor; adds row position to the value layout |
| `positioned_key_value.h` | New `PositionedKeyValue` struct holding a `KeyValue`, filename, and row position |
| `file_position.h` | New `FilePosition` struct holding filename and row position |
| `lookup_serializer_factory.h` | Abstraction for creating serialize/deserialize functions |
| `default_lookup_serializer_factory.h` | Concrete factory using `RowCompactedSerializer` |
| `row_compacted_serializer.h/.cpp` | New row serializer with Java-compatible binary format |
| `binary_serializer_utils.h/.cpp` | Utilities for writing `BinaryRow`/`BinaryArray`/`BinaryMap` from internal types |
| `var_length_int_utils.h` | Variable-length integer encoding/decoding (LongPacker-style) |
| `binary_map.h` | New `BinaryMap` class implementing `InternalMap` on contiguous memory |
| `binary_data_read_utils.h` | Implements `ReadMapData` using the new `BinaryMap` |
| `abstract_binary_writer.h/.cpp` | Adds `WriteMap` to the abstract writer |
| `binary_array_writer.h/.cpp` | Adds `GetElementSize` and type-aware `SetNullAt` |
| `binary_array.h` | Initializes `size_` and `element_offset_` to avoid UB |
| `bytes.h/.cpp` | Adds `EmptyBytes()` static and relaxes `CopyOf` length constraint |
| `internal_row.h` | Removes unused parameter name from `FieldGetterFunc` typedef |
| `persist_processor_test.cpp` | Tests all four processor implementations |
| `var_length_int_utils_test.cpp` | Tests varint encode/decode for int and long |
| `row_compacted_serializer_test.cpp` | Tests serializer correctness and Java compatibility |
| `binary_serializer_utils_test.cpp` | Tests `BinarySerializerUtils` for nested and flat types |
| `binary_map_test.cpp` | Basic test for `BinaryMap::ValueOf` |
| `CMakeLists.txt` | Registers new source files and test targets |
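The varint encoding mentioned for `var_length_int_utils.h` and `persist_position_processor.h` can be sketched as follows. This is a generic 7-bits-per-byte scheme with a continuation bit, low-order group first; the function names `EncodeVarint` and `DecodeVarint` are hypothetical, and the exact byte layout used by the PR is not shown on this page, so treat the details as an assumption.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Append `value` to `out` using 7 payload bits per byte.
// The high bit of each byte is set when more bytes follow.
void EncodeVarint(uint64_t value, std::vector<uint8_t>& out) {
  while (value >= 0x80) {
    out.push_back(static_cast<uint8_t>((value & 0x7F) | 0x80));
    value >>= 7;
  }
  out.push_back(static_cast<uint8_t>(value));
}

// Decode one varint starting at `pos`, advancing `pos` past it.
// Returns false if the input ends early or the value exceeds 64 bits.
bool DecodeVarint(const std::vector<uint8_t>& in, std::size_t& pos, uint64_t& value) {
  uint64_t result = 0;
  for (int shift = 0; shift < 64 && pos < in.size(); shift += 7) {
    const uint8_t b = in[pos++];
    result |= static_cast<uint64_t>(b & 0x7F) << shift;
    if ((b & 0x80) == 0) {
      value = result;
      return true;
    }
  }
  return false;
}
```

Small row positions dominate in practice, so a position below 128 costs one byte instead of a fixed eight, which is the motivation for using a varint in a lookup file that stores one entry per key.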
lucasfang
reviewed
Mar 9, 2026
ChaomingZhangCN
pushed a commit
to ChaomingZhangCN/paimon-cpp
that referenced
this pull request
Apr 10, 2026
Purpose
Linked issue: #93
This PR adds support for multiple `PersistProcessor` implementations in PK Compact for different merge engines.
When performing PK (primary key) compaction with lookup requirements, such as in `Delete Vector` (DV) or `first_row` merge scenarios, the compaction process needs to perform point lookups against historical files during merging. To optimize lookup performance, these historical data files are stored in an SST-like format, where each row is serialized into a byte stream using `RowCompactedSerializer`.
This change generalizes the serialization mechanism by introducing pluggable `PersistProcessor<T>` support, allowing different merge engines to define their own way of persisting and reconstructing rows. Each `PersistProcessor` handles: [...] `Bytes` buffer.
Tests
RowCompactedSerializerTest
BinarySerializerUtilsTest
VarLengthIntUtilsTest
PersistProcessorTest
API and Format
Documentation