Skip to content

[hive] Fix insert into static partitions on managed Paimon tables#7824

Open
ArnavBalyan wants to merge 1 commit into
apache:masterfrom
ArnavBalyan:arnavb/hive-static-partition-insert
Open

[hive] Fix insert into static partitions on managed Paimon tables#7824
ArnavBalyan wants to merge 1 commit into
apache:masterfrom
ArnavBalyan:arnavb/hive-static-partition-insert

Conversation

@ArnavBalyan
Copy link
Copy Markdown
Member

Purpose

  • Insert into static partition against a Hive managed table fails with ArrayIndexOutOfBoundsException in TableWriteImpl.checkNullability.
  • This is because Hive sends data rows without the static partition columns and instead passes their values in path/partition_columns serde property.
  • PaimonOutputFormat.getHiveRecordWriter does not respect these properties, and row arrives incomplete.
  • Read the partition column names, parse the values and wrap the writer which is aware of the extra properies and can construct the row properly.
  • Closes user reported bug [Bug] When executing an insert operation on a Hive partitioned table, an array out-of-bounds error occurs #7064.

Tests

  • UT

@ArnavBalyan
Copy link
Copy Markdown
Member Author

cc @JingsongLi thanks!

@ArnavBalyan
Copy link
Copy Markdown
Member Author

cc @JingsongLi gentle reminder if you could PTAL thanks!

@ArnavBalyan
Copy link
Copy Markdown
Member Author

cc @leaves12138 @JingsongLi gentle reminder thanks ! :)

Copy link
Copy Markdown
Contributor

@JingsongLi JingsongLi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: [hive] Fix insert into static partitions on managed Paimon tables

Overall this is a well-structured fix for a legitimate bug (#7064). The approach of detecting static partition columns via serde properties and wrapping the writer to reconstruct full rows is sound. A few observations:

Correctness

  1. Partition column ordering assumption: PartitionedRecordWriter uses JoinedRow(source, partitionRow) which assumes partition columns are always appended after data columns. This is correct for Hive-managed tables (Hive always places PARTITIONED BY columns at the end of the schema), but if a table were originally created via Flink/Spark with partition columns in the middle and later accessed from Hive, this could silently produce incorrect rows. Consider adding a defensive check that validates the partition column indices are actually at the tail of the schema in buildStaticPartitionRow.

  2. Mixed static/dynamic partitions: If a user writes PARTITION (region='us', year) (one static, one dynamic), META_TABLE_PARTITION_COLUMNS will list both columns, but the path will only contain region=us. The lookupCaseInsensitive for year returns null, causing buildStaticPartitionRow to return null, which falls through to the unwrapped writer. That writer will then also fail because Hive still strips the static partition column from the data. This is not a regression (it was already broken), but it might be worth a code comment noting this limitation.

  3. Separator in META_TABLE_PARTITION_COLUMNS: The split uses org.apache.paimon.fs.Path.SEPARATOR which is "/". This matches the Hive convention for this property. Correct, but somewhat fragile -- a named constant or comment clarifying why / is the right delimiter here would improve readability.

Design

  1. Direct access to inner.batchTableWrite().write(toWrite): PartitionedRecordWriter bypasses PaimonRecordWriter.write() and calls the underlying BatchTableWrite directly. Currently this is fine since PaimonRecordWriter.write() does nothing beyond unwrapping RowDataContainer and forwarding. However, this creates tight coupling -- if PaimonRecordWriter.write() later gains additional logic (metrics, validation, etc.), PartitionedRecordWriter would silently miss it. A safer pattern would be to accept an InternalRow transformer in PaimonRecordWriter, or at minimum add a package-private writeRow(InternalRow) method that both paths invoke.

  2. forWriteOnly extraction: Nice refactoring to separate the WRITE_ONLY option application from writer creation. Clean.

Minor

  1. The exception wrapping differs: PaimonRecordWriter.write() throws RuntimeException on write failure, while PartitionedRecordWriter.write() wraps in IOException. The inconsistency is minor but could confuse error handling upstream.

  2. Test coverage is good -- unit tests for buildStaticPartitionRow cover the key scenarios (typed values, unpartitioned tables, missing path segments), and the integration test validates end-to-end correctness.

Summary

The fix correctly addresses the reported bug for the common case (fully static partition inserts on Hive-managed tables). The main suggestion is to add a defensive assertion that partition columns are indeed at the schema tail, and to reduce the coupling between PartitionedRecordWriter and PaimonRecordWriter's internals.

Copy link
Copy Markdown
Contributor

@JingsongLi JingsongLi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

-1 It seems that there are really issues with the testing

@ArnavBalyan
Copy link
Copy Markdown
Member Author

Thanks for the review! Have addressed all comments

1. Partition column ordering assumption
Added a check that throws a clear error when partition columns aren't the trailing columns of the schema. Covered by a new unit test.

2. Mixed static/dynamic partitions
Documented why we fall back to the unwrapped writer when a partition key has no value in the path, and flags this as a known unsupported case.

3. Separator in the partition columns property
Extracted the forward slash into a constant with a comment explaining it's the Hive metastore convention, not the OS path separator.

4. Direct access to the underlying batch write
Added a write method on the inner writer. Both the wrapped and unwrapped paths now go through this, so any future logic added to the write path won't be silently bypassed.

5. Inconsistent exception wrapping
Aligned both write paths to throw the same exception type, so error handling upstream is consistent.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants