Skip to content

feat: add common data structures (BinarySection, BinaryString, InternalRow)#18

Merged
JingsongLi merged 2 commits into
apache:mainfrom
lxy-9602:add-data-get-set
May 26, 2026
Merged

feat: add common data structures (BinarySection, BinaryString, InternalRow)#18
JingsongLi merged 2 commits into
apache:mainfrom
lxy-9602:add-data-get-set

Conversation

@lxy-9602
Copy link
Copy Markdown
Contributor

@lxy-9602 lxy-9602 commented May 26, 2026

Purpose

No Linked issue.

Introduce

  • BinarySection — base class for binary data backed by a MemorySegment, providing offset/size-based access (binary_section.h/cpp)
  • BinaryString — variable-length binary string with comparison, copy, hash, and substring operations (binary_string.h/cpp)
  • DataDefine — variant type definitions (NullableDataField, DataField) for Paimon columnar data (data_define.h)
  • DataGetters / DataSetters — abstract interfaces for typed columnar read/write access (data_getters.h, data_setters.h)
  • InternalRow — abstract row interface with typed field accessors and field count (internal_row.h/cpp)
  • InternalArray — abstract array interface for columnar array data (internal_array.h)
  • InternalMap — abstract map interface wrapping key/value InternalArrays (internal_map.h)

Tests

  • binary_section_test.cpp — BinarySection construction and memory access
  • binary_string_test.cpp — BinaryString comparison, copy, hash, substring operations
  • internal_row_test.cpp — InternalRow field access and JoinedRow correctness

API and Format

Documentation

Generative AI tooling

Copy link
Copy Markdown

@leaves12138 leaves12138 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the contribution. I found one blocker that should be fixed before merge.

BinarySection::HIGHEST_FIRST_BIT is defined as 0x80L << 56. In C++, this left-shifts a signed integer into the sign bit, which is undefined behavior when the result is not representable as a signed long. This mask is used by ReadBinary to distinguish inline fixed-length data from variable-length data, so compiler-dependent behavior here can break binary/string decoding.

Please define these masks using unsigned arithmetic and cast after the shift, for example static_cast<int64_t>(0x80ULL << 56) (and preferably use unsigned masks consistently for the related 0x7F mask as well).

Copy link
Copy Markdown

@leaves12138 leaves12138 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update. The previous signed-shift issue is fixed, and I did not find any remaining blocking issue in this round.

@JingsongLi JingsongLi merged commit e7d0f1d into apache:main May 26, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants