feat: introduce data type with JSON serialization #13

Merged
leaves12138 merged 2 commits into apache:main from lszskye:p2-8
May 25, 2026

Conversation

@lszskye (Contributor) commented May 25, 2026

Purpose

Introduce the Paimon data type system for apache-paimon-cpp, including:

  • DataType: Base class representing Paimon data types, with Arrow type mapping and JSON serialization. Supports atomic types (BOOLEAN, TINYINT, SMALLINT, INT, BIGINT, FLOAT, DOUBLE, STRING, BYTES, DATE, DECIMAL, TIMESTAMP, BLOB) and complex types (ARRAY, MAP, ROW).
  • DataField: Represents a field in a row type with id, name, type, and optional description. Provides bidirectional conversion between Paimon DataField and Arrow Field/Schema, including metadata handling for field ids.
  • RowType: A structured type containing named fields, serialized as JSON with field definitions.
  • ArrayType: A list/array type with element type.
  • MapType: A key-value map type. Note: keys are always non-nullable, because Apache Arrow's map type does not allow null keys.
  • RowKind: Represents row kinds (INSERT, UPDATE_BEFORE, UPDATE_AFTER, DELETE) with byte value and short string serialization.
  • DataTypeJsonParser: A comprehensive type parser supporting the full Paimon type syntax (including parameterized types like DECIMAL(10,2), TIMESTAMP(9) WITH LOCAL TIME ZONE, etc.).
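
To make the RowKind bullet above concrete, here is a minimal sketch of a row kind with a byte value and a short string form. The byte values and the short strings (+I, -U, +U, -D) follow the convention of the Java Paimon implementation; the enum name, enumerator names, and function names are hypothetical and are not taken from this PR. The sketch assumes C++17.

```cpp
#include <cstdint>
#include <optional>
#include <string>

// Hypothetical names; the actual apache-paimon-cpp API may differ.
enum class RowKindSketch : std::int8_t {
    kInsert = 0,
    kUpdateBefore = 1,
    kUpdateAfter = 2,
    kDelete = 3
};

// Byte value used when a row kind is stored in a compact form.
inline std::int8_t ToByteValue(RowKindSketch kind) {
    return static_cast<std::int8_t>(kind);
}

// Short string form, e.g. "+I" for an insert.
inline std::string ToShortString(RowKindSketch kind) {
    switch (kind) {
        case RowKindSketch::kInsert:       return "+I";
        case RowKindSketch::kUpdateBefore: return "-U";
        case RowKindSketch::kUpdateAfter:  return "+U";
        case RowKindSketch::kDelete:       return "-D";
    }
    return "";  // not reached for valid enumerators
}

// Inverse of ToByteValue; returns std::nullopt for an unknown byte.
inline std::optional<RowKindSketch> FromByteValue(std::int8_t value) {
    if (value < 0 || value > 3) return std::nullopt;
    return static_cast<RowKindSketch>(value);
}

// Inverse of ToShortString; returns std::nullopt for an unknown string.
inline std::optional<RowKindSketch> FromShortString(const std::string& s) {
    if (s == "+I") return RowKindSketch::kInsert;
    if (s == "-U") return RowKindSketch::kUpdateBefore;
    if (s == "+U") return RowKindSketch::kUpdateAfter;
    if (s == "-D") return RowKindSketch::kDelete;
    return std::nullopt;
}
```

Returning std::optional from the two inverse functions keeps the sketch free of exceptions; a real implementation might prefer a status or result type instead.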

Tests

  • DataTypeTest
  • DataFieldTest
  • DataTypeJsonParserTest
  • RowKindTest

@leaves12138 left a comment

Thanks for the contribution. I found one parser correctness issue that should be fixed before merge.

Tokenize is missing a break after handling CHAR_END_SUBTYPE ('>'). As a result, every '>' also emits a BEGIN_PARAMETER ('(') token at the same position, so SQL-style nested type strings that use subtype syntax, such as ROW<id INT> or ARRAY<INT>, cannot be parsed correctly: the token stream contains an extra '(' after each '>'.

Please add the missing break and include a regression test for a type string containing >.
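
For readers following along, the fallthrough described above can be reproduced with a small stand-alone tokenizer. This is only a sketch under assumptions: TokenizeSketch, the TokenType enumerators, and the simulate_missing_break flag are hypothetical names invented for illustration, and the code is not the PR's actual Tokenize. It assumes C++17 for the [[fallthrough]] attribute.

```cpp
#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical token kinds; the real parser's names may differ.
enum class TokenType {
    kBeginSubtype,    // '<'
    kEndSubtype,      // '>'
    kBeginParameter,  // '('
    kEndParameter,    // ')'
    kListSeparator,   // ','
    kWord             // identifier, keyword, or number
};

struct Token {
    TokenType type;
    std::size_t pos;  // offset of the token's first character
    std::string text;
};

// Splits a type string into tokens. Passing simulate_missing_break = true
// reproduces the reported bug purely for demonstration.
std::vector<Token> TokenizeSketch(const std::string& s,
                                  bool simulate_missing_break = false) {
    std::vector<Token> out;
    std::size_t i = 0;
    while (i < s.size()) {
        if (std::isspace(static_cast<unsigned char>(s[i]))) {
            ++i;
            continue;
        }
        switch (s[i]) {
            case '<':
                out.push_back({TokenType::kBeginSubtype, i, "<"});
                break;
            case '>':
                out.push_back({TokenType::kEndSubtype, i, ">"});
                if (!simulate_missing_break) break;  // the fix: stop here
                [[fallthrough]];  // the bug: the '(' case runs as well
            case '(':
                out.push_back({TokenType::kBeginParameter, i, "("});
                break;
            case ')':
                out.push_back({TokenType::kEndParameter, i, ")"});
                break;
            case ',':
                out.push_back({TokenType::kListSeparator, i, ","});
                break;
            default: {
                const std::size_t start = i;
                while (i < s.size() &&
                       (std::isalnum(static_cast<unsigned char>(s[i])) ||
                        s[i] == '_')) {
                    ++i;
                }
                if (i == start) {
                    throw std::invalid_argument("unexpected character in type string");
                }
                out.push_back({TokenType::kWord, start, s.substr(start, i - start)});
                continue;  // i already points past the word
            }
        }
        ++i;
    }
    return out;
}
```

A regression check along the requested lines could assert that tokenizing ARRAY<INT> ends with the end-of-subtype token and contains no begin-parameter token. With the flag set to true, the same input instead produces a trailing begin-parameter token at the position of the '>'.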

@leaves12138 left a comment

Re-reviewed the latest update. The previously reported CHAR_END_SUBTYPE fallthrough has been fixed. I did not find further blockers in this round.

@leaves12138 merged commit 45d8609 into apache:main on May 25, 2026