feat(compaction): support multiple PersistProcessor in PK compaction #170

Merged
lxy-9602 merged 13 commits into alibaba:main from lxy-9602:add-persist-processor on Mar 9, 2026

lxy-9602 commented Mar 6, 2026

Purpose

Linked issue: #93
This PR adds support for multiple PersistProcessor implementations in PK compaction, so that each merge engine can use its own.

When performing PK (primary key) compaction with lookup requirements, such as in Deletion Vector (DV) or first_row merge scenarios, the compaction process needs to perform point lookups against historical files during merging. To optimize lookup performance, these historical data files are stored in an SST-like format, where each row is serialized into a byte stream using RowCompactedSerializer.

This change generalizes the serialization mechanism by introducing pluggable PersistProcessor<T> support, allowing different merge engines to define their own way of persisting and reconstructing rows. Each PersistProcessor handles:

  • Serializing a key/value into a Bytes buffer.
  • Deserializing the original value from disk during lookup.
  • Ensuring binary compatibility with the Java implementation.
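
For readers unfamiliar with the pattern, the responsibilities above can be sketched roughly as follows. This is an illustrative sketch only, not the code in this PR: `PersistProcessorSketch`, `EmptySketch`, `PositionSketch`, `FilePositionSketch`, `ByteBuffer`, and the method names `Persist` / `ReadFromDisk` are all hypothetical stand-ins. It also stores the row position as a fixed 8-byte little-endian integer for brevity, whereas the position processor in this PR is described as using a varint.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for the project's Bytes type.
using ByteBuffer = std::vector<std::uint8_t>;

// Hypothetical stand-in for a "file name + row position" lookup result.
struct FilePositionSketch {
    std::string file_name;
    std::int64_t row_position;
};

// Hypothetical interface: T is whatever a lookup should return for this merge engine.
template <typename T>
class PersistProcessorSketch {
 public:
    virtual ~PersistProcessorSketch() = default;
    // Turns a serialized value and its row position into the bytes stored in the lookup file.
    virtual ByteBuffer Persist(const ByteBuffer& value, std::int64_t row_position) const = 0;
    // Rebuilds the lookup result from the stored bytes.
    virtual T ReadFromDisk(const ByteBuffer& stored, const std::string& file_name) const = 0;
};

// Stores nothing; a lookup hit only reports that the key exists.
class EmptySketch : public PersistProcessorSketch<bool> {
 public:
    ByteBuffer Persist(const ByteBuffer&, std::int64_t) const override { return {}; }
    bool ReadFromDisk(const ByteBuffer&, const std::string&) const override { return true; }
};

// Stores only the row position (fixed-width here; an assumption made for brevity).
class PositionSketch : public PersistProcessorSketch<FilePositionSketch> {
 public:
    ByteBuffer Persist(const ByteBuffer&, std::int64_t row_position) const override {
        ByteBuffer out;
        const auto bits = static_cast<std::uint64_t>(row_position);
        for (int i = 0; i < 8; ++i) {
            out.push_back(static_cast<std::uint8_t>((bits >> (8 * i)) & 0xFF));
        }
        return out;
    }
    FilePositionSketch ReadFromDisk(const ByteBuffer& stored,
                                    const std::string& file_name) const override {
        if (stored.size() < 8) {
            throw std::invalid_argument("stored buffer too short");
        }
        std::uint64_t bits = 0;
        for (int i = 0; i < 8; ++i) {
            bits |= static_cast<std::uint64_t>(stored[i]) << (8 * i);
        }
        return {file_name, static_cast<std::int64_t>(bits)};
    }
};
```

The point of the template parameter is that a DV-style lookup may only need a position, while a first_row-style lookup may only need to know whether the key exists, so each engine pays only for the bytes it actually reads back.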

Tests

RowCompactedSerializerTest
BinarySerializerUtilsTest
VarLengthIntUtilsTest
PersistProcessorTest

API and Format

Documentation

Comment thread src/paimon/common/data/serializer/binary_serializer_utils.cpp
lucasfang requested a review from Copilot on March 6, 2026 09:23

Copilot AI left a comment

Pull request overview

This PR introduces pluggable PersistProcessor<T> support for PK compaction, allowing different merge engines (DV, first_row, etc.) to define their own serialization strategy for lookup files. It also adds the underlying RowCompactedSerializer, BinarySerializerUtils, VarLengthIntUtils, and related binary data structures needed to implement these processors.

Changes:

  • Introduces PersistProcessor<T> base interface and four concrete implementations: PersistEmptyProcessor, PersistPositionProcessor, PersistValueProcessor, and PersistValueAndPosProcessor.
  • Adds RowCompactedSerializer (with Java binary compatibility), BinarySerializerUtils, VarLengthIntUtils, and BinaryMap support.
  • Fixes Bytes::CopyOf to no longer require len >= other.size(), allowing truncation.
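
To make the last bullet concrete, here is a minimal sketch of a copy routine that accepts a target length shorter than the source. `CopyOfSketch` is a hypothetical free function over `std::vector`, not the actual `Bytes::CopyOf` signature, and the zero-padding of the longer case is an assumption rather than something stated in this PR.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a CopyOf-style helper.
// Returns a buffer of exactly `len` bytes: the source is truncated when
// len < src.size() and zero-padded (assumed behavior) when len > src.size().
inline std::vector<std::uint8_t> CopyOfSketch(const std::vector<std::uint8_t>& src,
                                              std::size_t len) {
    std::vector<std::uint8_t> out(len, 0);
    std::copy_n(src.begin(), std::min(len, src.size()), out.begin());
    return out;
}
```

With a source of `{1, 2, 3, 4}`, a length of 2 yields `{1, 2}` and a length of 6 yields `{1, 2, 3, 4, 0, 0}` under these assumptions.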

Reviewed changes

Copilot reviewed 33 out of 33 changed files in this pull request and generated 5 comments.

Show a summary per file
| File | Description |
| --- | --- |
| persist_processor.h | Defines the templated PersistProcessor<T> base class and its inner Factory |
| persist_empty_processor.h | bool-returning processor; serializes nothing |
| persist_position_processor.h | FilePosition-returning processor; encodes row position as varint |
| persist_value_processor.h | KeyValue-returning processor; serializes value + sequence + row kind |
| persist_value_and_pos_processor.h | PositionedKeyValue-returning processor; adds row position to the value layout |
| positioned_key_value.h | New PositionedKeyValue struct holding a KeyValue, filename, and row position |
| file_position.h | New FilePosition struct holding filename and row position |
| lookup_serializer_factory.h | Abstraction for creating serialize/deserialize functions |
| default_lookup_serializer_factory.h | Concrete factory using RowCompactedSerializer |
| row_compacted_serializer.h/.cpp | New row serializer with Java-compatible binary format |
| binary_serializer_utils.h/.cpp | Utilities for writing BinaryRow/BinaryArray/BinaryMap from internal types |
| var_length_int_utils.h | Variable-length integer encoding/decoding (LongPacker-style) |
| binary_map.h | New BinaryMap class implementing InternalMap on contiguous memory |
| binary_data_read_utils.h | Implements ReadMapData using the new BinaryMap |
| abstract_binary_writer.h/.cpp | Adds WriteMap to the abstract writer |
| binary_array_writer.h/.cpp | Adds GetElementSize and type-aware SetNullAt |
| binary_array.h | Initializes size_ and element_offset_ to avoid UB |
| bytes.h/.cpp | Adds EmptyBytes() static and relaxes CopyOf length constraint |
| internal_row.h | Removes unused parameter name from FieldGetterFunc typedef |
| persist_processor_test.cpp | Tests all four processor implementations |
| var_length_int_utils_test.cpp | Tests varint encode/decode for int and long |
| row_compacted_serializer_test.cpp | Tests serializer correctness and Java compatibility |
| binary_serializer_utils_test.cpp | Tests BinarySerializerUtils for nested and flat types |
| binary_map_test.cpp | Basic test for BinaryMap::ValueOf |
| CMakeLists.txt | Registers new source files and test targets |
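
Several entries above rely on variable-length integer encoding. As a rough illustration of the general technique, the sketch below implements a generic LEB128-style unsigned varint (7 payload bits per byte, least significant group first, high bit set when more bytes follow). The function names are hypothetical, and the exact byte layout used by var_length_int_utils.h may differ; treat this as background, not as the PR's format.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: appends `value` to `out` as a LEB128-style varint
// and returns the number of bytes written (1 to 10 for a 64-bit value).
inline std::size_t EncodeVarUInt64(std::uint64_t value, std::vector<std::uint8_t>* out) {
    std::size_t written = 0;
    while (value >= 0x80) {
        // Emit the low 7 bits and set the continuation bit.
        out->push_back(static_cast<std::uint8_t>((value & 0x7F) | 0x80));
        value >>= 7;
        ++written;
    }
    out->push_back(static_cast<std::uint8_t>(value));  // final byte, continuation bit clear
    return written + 1;
}

// Hypothetical helper: decodes one varint starting at *pos and advances *pos.
// Returns false if the input ends early or the encoding exceeds 10 bytes.
inline bool DecodeVarUInt64(const std::vector<std::uint8_t>& in, std::size_t* pos,
                            std::uint64_t* value) {
    std::uint64_t result = 0;
    for (int shift = 0; shift < 64; shift += 7) {
        if (*pos >= in.size()) {
            return false;  // truncated input
        }
        const std::uint8_t byte = in[(*pos)++];
        result |= static_cast<std::uint64_t>(byte & 0x7F) << shift;
        if ((byte & 0x80) == 0) {
            *value = result;
            return true;
        }
    }
    return false;  // continuation bit still set after 10 bytes
}
```

Small values such as typical row positions fit in one or two bytes this way, which is the usual motivation for preferring a varint over a fixed 8-byte field in a lookup file.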

Comment thread src/paimon/common/utils/var_length_int_utils.h Outdated
Comment thread src/paimon/common/data/binary_map.h Outdated
Comment thread src/paimon/common/data/binary_map.h Outdated
Comment thread src/paimon/common/data/serializer/row_compacted_serializer.cpp
Comment thread src/paimon/common/data/serializer/binary_serializer_utils.cpp Outdated
Comment thread src/paimon/common/data/serializer/row_compacted_serializer.h
Comment thread src/paimon/common/data/abstract_binary_writer.cpp
Comment thread src/paimon/common/data/binary_data_read_utils.h
Comment thread src/paimon/common/data/binary_map.h
Comment thread src/paimon/common/utils/var_length_int_utils.h
Comment thread src/paimon/core/mergetree/lookup/default_lookup_serializer_factory.h Outdated
Comment thread src/paimon/common/data/serializer/binary_serializer_utils_test.cpp Outdated
Comment thread src/paimon/core/mergetree/lookup/file_position.h Outdated

lucasfang left a comment

+1

lxy-9602 merged commit e9e208d into alibaba:main on Mar 9, 2026
8 checks passed
ChaomingZhangCN pushed a commit to ChaomingZhangCN/paimon-cpp that referenced this pull request Apr 10, 2026

3 participants