Skip to content

feat: introduce binary row format with reader/writer support#22

Open
lszskye wants to merge 4 commits into
apache:mainfrom
lszskye:p2-4
Open

feat: introduce binary row format with reader/writer support#22
lszskye wants to merge 4 commits into
apache:mainfrom
lszskye:p2-4

Conversation

@lszskye
Copy link
Copy Markdown
Contributor

@lszskye lszskye commented May 26, 2026

Purpose

Introduce the binary row data format, compatible with Java Paimon's BinaryRow serialization format. This module provides efficient in-memory row representation with fixed-length and variable-length parts within a single MemorySegment.

Key components:

  • BinaryWriter: Abstract interface defining the write protocol.
  • AbstractBinaryWriter: Template-method base class implementing variable-length field writing (strings, bytes, decimals, timestamps, arrays, rows, maps) with auto-growing buffer management.
  • BinaryRowWriter: Concrete writer for BinaryRow, supporting all Paimon data types.
  • BinaryRow: Immutable binary row representation backed by MemorySegment. Implements InternalRow for reading all field types, supports null tracking via bit set, hashing, equality comparison, and copy operations.
  • BinaryDataReadUtils: Static utilities for reading typed fields (Timestamp, Decimal, BinaryString, Array, Row, Map) from memory segments with proper offset/size decoding.
  • BinaryArray: A binary implementation of InternalArray backed by a single MemorySegment. Supports all Paimon data types. Provides bulk conversion methods (ToIntArray, ToLongArray, etc.) and factory methods (FromIntArray, FromLongArray).
  • BinaryArrayWriter: Writer for BinaryArray, inheriting from AbstractBinaryWriter.
  • BinaryMap: A binary implementation of InternalMap backed by a single MemorySegment.

Tests

  • BinaryRowTest
  • BinaryRowWriterTest
  • BinaryArrayTest
  • BinaryMapTest

Copy link
Copy Markdown

@leaves12138 leaves12138 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the binary row implementation. This PR currently does not compile against its base branch because it includes and uses BinaryArray/BinaryMap, but src/paimon/common/data/binary_array.h and src/paimon/common/data/binary_map.h are not present in the PR or in main. Please either add those dependencies in this PR, rebase after the PR that introduces them is merged, or remove the nested array/map support from this change.

Copy link
Copy Markdown

@leaves12138 leaves12138 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update. The previous missing BinaryArray/BinaryMap issue is fixed, but I found one remaining correctness blocker in the inline bytes writer path.

Comment thread src/paimon/common/data/abstract_binary_writer.cpp Outdated
Copy link
Copy Markdown

@leaves12138 leaves12138 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update. The previous inline bytes writer issue is fixed, but there is still one signed left-shift overflow issue in the newly added BinaryRow code.

Comment thread src/paimon/common/data/binary_row.cpp Outdated
Copy link
Copy Markdown

@leaves12138 leaves12138 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update. I re-reviewed the current PR head end-to-end and found two remaining blockers that should be fixed before merging.

};
return field_setter;
}
case arrow::Type::type::DECIMAL: {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This branch still will not handle arrow::decimal128(...). In Arrow C++, decimal128 reports arrow::Type::DECIMAL128 (the rest of this codebase also switches on DECIMAL128, e.g. DataDefine::VariantValueToLiteral and BinaryArrayWriter::GetElementSize tests). As written, CreateFieldSetter(arrow::decimal128(...)) falls through to the unsupported-type error, so the decimal cases added in binary_row_writer_test.cpp cannot work in a real build. Please switch this to DECIMAL128 (and add DECIMAL256 handling only if intended).

arrow::internal::checked_cast<arrow::Decimal128Type*>(field_type.get());
assert(decimal_type);
auto precision = decimal_type->precision();
field_setter = [field_idx, precision](const VariantType& field,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For non-compact decimals, the setter captures only precision and then calls WriteDecimal with a Decimal value whose scale may differ from the Arrow field scale. WriteDecimal currently only asserts the precision and never checks or normalizes scale, but the reader reconstructs decimals with the schema scale. That means a value like Decimal(5, 3, 123) written to a decimal128(5, 2) field will be read back as 1.23 instead of the original 0.123 (or being rejected). Please capture the Arrow scale here and either validate value.Scale() == scale or rescale/reject before writing, so the stored unscaled bytes match the schema scale.

@leaves12138
Copy link
Copy Markdown

Re-reviewed the current head (94b3929). The two blocking issues I found are already listed in my latest review comments: the decimal setter switches on the wrong Arrow decimal enum, and non-compact decimal writes do not validate/normalize scale. I did not find additional blockers in this pass.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants