[hive] Fix insert into static partitions on managed Paimon tables #7824
ArnavBalyan wants to merge 1 commit into
cc @JingsongLi thanks!

cc @JingsongLi gentle reminder, if you could PTAL, thanks!

cc @leaves12138 @JingsongLi gentle reminder, thanks! :)
JingsongLi left a comment
Review: [hive] Fix insert into static partitions on managed Paimon tables
Overall this is a well-structured fix for a legitimate bug (#7064). The approach of detecting static partition columns via serde properties and wrapping the writer to reconstruct full rows is sound. A few observations:
Correctness
- Partition column ordering assumption: `PartitionedRecordWriter` uses `JoinedRow(source, partitionRow)`, which assumes partition columns are always appended after data columns. This is correct for Hive-managed tables (Hive always places `PARTITIONED BY` columns at the end of the schema), but if a table were originally created via Flink/Spark with partition columns in the middle and later accessed from Hive, this could silently produce incorrect rows. Consider adding a defensive check in `buildStaticPartitionRow` that validates the partition column indices are actually at the tail of the schema.
- Mixed static/dynamic partitions: If a user writes `PARTITION (region='us', year)` (one static, one dynamic), `META_TABLE_PARTITION_COLUMNS` will list both columns, but the path will only contain `region=us`. The `lookupCaseInsensitive` for `year` returns null, causing `buildStaticPartitionRow` to return null, which falls through to the unwrapped writer. That writer will then also fail because Hive still strips the static partition column from the data. This is not a regression (it was already broken), but it might be worth a code comment noting this limitation.
- Separator in `META_TABLE_PARTITION_COLUMNS`: The split uses `org.apache.paimon.fs.Path.SEPARATOR`, which is `"/"`. This matches the Hive convention for this property. Correct, but somewhat fragile -- a named constant or a comment clarifying why `/` is the right delimiter here would improve readability.
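The three correctness points above can be sketched together. This is a minimal, self-contained illustration and not the PR's code: the class `StaticPartitionSketch` and all of its methods are hypothetical, plain strings stand in for Paimon's row types, and the `/` delimiter is the one the review attributes to the partition-columns property.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch; none of these names are taken from the PR. */
final class StaticPartitionSketch {

    /** Delimiter of the partition-columns property ("/" according to the review). */
    static final String PARTITION_COLUMNS_DELIMITER = "/";

    /** True only if the partition columns are exactly the last columns of the schema, in order. */
    static boolean partitionColumnsAtTail(List<String> schema, List<String> partitionColumns) {
        int offset = schema.size() - partitionColumns.size();
        if (offset < 0) {
            return false;
        }
        for (int i = 0; i < partitionColumns.size(); i++) {
            if (!schema.get(offset + i).equalsIgnoreCase(partitionColumns.get(i))) {
                return false;
            }
        }
        return true;
    }

    /** Collects "key=value" path segments, keyed by lower-cased column name (no unescaping). */
    static Map<String, String> parsePathSegments(String path) {
        Map<String, String> values = new HashMap<>();
        for (String segment : path.split("/")) {
            int eq = segment.indexOf('=');
            if (eq > 0) {
                String key = segment.substring(0, eq).toLowerCase(Locale.ROOT);
                values.put(key, segment.substring(eq + 1));
            }
        }
        return values;
    }

    /**
     * Returns the static partition values in declared order, or null when any declared
     * column has no value in the path (the mixed static/dynamic case).
     */
    static List<String> staticPartitionValues(String partitionColumnsProperty, String path) {
        if (partitionColumnsProperty == null || partitionColumnsProperty.isEmpty()) {
            return null; // unpartitioned table
        }
        Map<String, String> fromPath = parsePathSegments(path);
        List<String> values = new ArrayList<>();
        for (String column : partitionColumnsProperty.split(PARTITION_COLUMNS_DELIMITER)) {
            String value = fromPath.get(column.toLowerCase(Locale.ROOT));
            if (value == null) {
                return null; // dynamic partition column: nothing to reconstruct from the path
            }
            values.add(value);
        }
        return values;
    }
}
```

With `region/year` as the property, a path ending in `region=us/year=2024` yields both values, while a path ending in `region=us` yields null, which is the fall-through the review describes for mixed static/dynamic inserts.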
Design
- Direct access to `inner.batchTableWrite().write(toWrite)`: `PartitionedRecordWriter` bypasses `PaimonRecordWriter.write()` and calls the underlying `BatchTableWrite` directly. Currently this is fine since `PaimonRecordWriter.write()` does nothing beyond unwrapping `RowDataContainer` and forwarding. However, this creates tight coupling -- if `PaimonRecordWriter.write()` later gains additional logic (metrics, validation, etc.), `PartitionedRecordWriter` would silently miss it. A safer pattern would be to accept an `InternalRow` transformer in `PaimonRecordWriter`, or at minimum add a package-private `writeRow(InternalRow)` method that both paths invoke.
- `forWriteOnly` extraction: Nice refactoring to separate the `WRITE_ONLY` option application from writer creation. Clean.
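The shared `writeRow` suggestion could look roughly like the following. This is a sketch under stated assumptions: `RowSink`, `BaseWriterSketch`, and `PartitionedWriterSketch` are hypothetical stand-ins (not Paimon or Hive classes), rows are modeled as plain lists, and wrapping failures in `IOException` is one possible choice for making both paths report errors the same way.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the underlying batch write; not Paimon's interface. */
interface RowSink {
    void write(List<Object> row) throws Exception;
}

/** Hypothetical base writer: every row passes through writeRow, the single choke point. */
class BaseWriterSketch {
    private final RowSink sink;
    long rowsWritten; // example of logic a wrapper would miss if it bypassed writeRow

    BaseWriterSketch(RowSink sink) {
        this.sink = sink;
    }

    public void write(List<Object> row) throws IOException {
        writeRow(row);
    }

    /** Package-private so wrappers in the same package share metrics and error handling. */
    void writeRow(List<Object> row) throws IOException {
        try {
            sink.write(row);
            rowsWritten++;
        } catch (Exception e) {
            // One exception type for both the plain and the partition-aware path.
            throw new IOException(e);
        }
    }
}

/** Hypothetical wrapper that appends static partition values, then delegates. */
class PartitionedWriterSketch {
    private final BaseWriterSketch inner;
    private final List<Object> partitionValues;

    PartitionedWriterSketch(BaseWriterSketch inner, List<Object> partitionValues) {
        this.inner = inner;
        this.partitionValues = partitionValues;
    }

    public void write(List<Object> dataRow) throws IOException {
        List<Object> full = new ArrayList<>(dataRow);
        full.addAll(partitionValues); // partition columns assumed to sit at the schema tail
        inner.writeRow(full);         // same path as the unwrapped writer
    }
}
```

Because the wrapper never touches the sink directly, anything later added to `writeRow` applies to both writers without further changes.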
Minor
- The exception wrapping differs: `PaimonRecordWriter.write()` throws `RuntimeException` on write failure, while `PartitionedRecordWriter.write()` wraps in `IOException`. The inconsistency is minor but could confuse error handling upstream.
- Test coverage is good -- unit tests for `buildStaticPartitionRow` cover the key scenarios (typed values, unpartitioned tables, missing path segments), and the integration test validates end-to-end correctness.
Summary
The fix correctly addresses the reported bug for the common case (fully static partition inserts on Hive-managed tables). The main suggestion is to add a defensive assertion that partition columns are indeed at the schema tail, and to reduce the coupling between PartitionedRecordWriter and PaimonRecordWriter's internals.
JingsongLi left a comment
-1: it seems there really are issues with the tests.
Thanks for the review! I have addressed all comments:
1. Partition column ordering assumption
2. Mixed static/dynamic partitions
3. Separator in the partition columns property
4. Direct access to the underlying batch write
5. Inconsistent exception wrapping
Purpose
- ... `ArrayIndexOutOfBoundsException` in `TableWriteImpl.checkNullability`.
- ... path/`partition_columns` serde property.

Tests