fix: VARIANT Hive sync error when performing CREATE TABLE DDL (#18511)
Conversation
Force-pushed from f4eb005 to dcfbbeb
Force-pushed from f1f780e to 24d45c5
yihua left a comment
🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
Thanks for contributing! The approach of converting VariantType to a Hive-compatible struct is solid, and storing the original schema in properties for reconstruction is a nice touch. However, the conversion currently only handles top-level VariantType fields — please see the inline comment about nested variants.
```scala
 */
private[hudi] def toHiveCompatibleSchema(schema: StructType): StructType = {
  StructType(schema.map { field =>
    if (sparkAdapter.isVariantType(field.dataType)) {
```
🤖 This only converts top-level VariantType fields. If a VariantType is nested inside a StructType, ArrayType, or MapType (e.g., STRUCT<data: VARIANT>), the inner variant won't be converted and Hive will still reject the schema. Could you make this recursive to handle nested types as well? The HiveSchemaUtil.convertField() in the sync layer handles this recursively for reference.
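The recursive shape being suggested could look roughly like this. This is a sketch over simplified stand-in ADTs, not Spark's actual `DataType` hierarchy; the name `toHiveCompatibleType` mirrors the PR, everything else is illustrative:

```scala
// Stand-in types for illustration only (not Spark's real classes).
sealed trait DataType
case object BinaryType extends DataType
case object LongType extends DataType
case object VariantType extends DataType
final case class StructField(name: String, dataType: DataType)
final case class StructType(fields: Seq[StructField]) extends DataType
final case class ArrayType(elementType: DataType) extends DataType
final case class MapType(keyType: DataType, valueType: DataType) extends DataType

// Canonical (metadata, value) order, matching the Parquet variant layout.
val hiveVariantStruct: StructType = StructType(Seq(
  StructField("metadata", BinaryType),
  StructField("value", BinaryType)))

// Recurse through struct/array/map so nested variants are also rewritten.
def toHiveCompatibleType(dt: DataType): DataType = dt match {
  case VariantType        => hiveVariantStruct
  case StructType(fields) =>
    StructType(fields.map(f => f.copy(dataType = toHiveCompatibleType(f.dataType))))
  case ArrayType(elem)    => ArrayType(toHiveCompatibleType(elem))
  case MapType(k, v)      => MapType(toHiveCompatibleType(k), toHiveCompatibleType(v))
  case other              => other // primitives pass through unchanged
}
```

With this shape, a column like `STRUCT<a: VARIANT>` or `ARRAY<VARIANT>` is rewritten all the way down, which is what the inline comment asks for.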
```scala
StructType(schema.map { field =>
  if (sparkAdapter.isVariantType(field.dataType)) {
    field.copy(dataType = StructType(Seq(
      StructField(HoodieSchema.Variant.VARIANT_VALUE_FIELD, BinaryType, nullable = false),
```
🤖 The canonical field order in HoodieSchema.createVariant() is (metadata, value) (see line 734 of HoodieSchema.java: "Field order is (metadata, value) to match the Parquet spec and Iceberg convention"). Here value comes first. While fields are accessed by name so it likely won't break anything, it might be worth matching the canonical order for consistency with the rest of the codebase.
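For reference, a minimal sketch of emitting the struct in the canonical order (plain Scala, not Spark/Hudi types; both field-name constants here are stand-ins, not the real `HoodieSchema.Variant` members):

```scala
// Stand-in constants and a simplified StructField, for illustration only.
case class StructField(name: String, dataType: String, nullable: Boolean)

val VariantMetadataField = "metadata" // stand-in for the metadata-field constant
val VariantValueField = "value"       // stand-in for VARIANT_VALUE_FIELD

// Canonical (metadata, value) order per HoodieSchema.createVariant().
val hiveVariantStruct: Seq[StructField] = Seq(
  StructField(VariantMetadataField, "binary", nullable = false),
  StructField(VariantValueField, "binary", nullable = false))
```

Since consumers look fields up by name, either order works at runtime; matching the canonical order just keeps the Hive-visible struct consistent with the Parquet layout Hudi writes.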
yihua left a comment
Style & Readability Review — Two minor readability suggestions: Javadoc formatting and test field naming could be clearer.
```scala
}

/**
 * Converts Spark DataTypes that Hive doesn't support to their physical representations.
```
🤖 nit: consider using backticks for the type representation to improve source code readability: `struct<value:binary, metadata:binary>` instead of HTML entities.
```scala
assume(HoodieSparkUtils.gteqSpark4_0, "Variant type requires Spark 4.0 or higher")

val schema = StructType(Seq(
  StructField("id", LongType, nullable = false),
```
🤖 nit: the field name 'v' could be more descriptive, like 'variant_col' or 'data', to clarify what's being tested.
yihua left a comment
Thanks for contributing! The fix correctly preserves the original schema in table properties while passing a Hive-compatible schema to the Hive metastore — nice approach. One concern about nested VariantType handling in the inline comment.
```scala
private[hudi] def toHiveCompatibleSchema(schema: StructType): StructType = {
  StructType(schema.map { field =>
    if (sparkAdapter.isVariantType(field.dataType)) {
      field.copy(dataType = StructType(Seq(
```
🤖 This only converts top-level VariantType fields. If a user defines a column like col1 STRUCT<a: VARIANT> or col1 ARRAY<VARIANT>, the nested VariantType won't be converted and will still fail in Hive. Could you make this recursive over nested StructType/ArrayType/MapType?
@voonhous I had the same question: are we supporting nested types that contain VARIANT? If not, maybe we should add a guard for this?
yihua left a comment
Thanks for addressing the feedback — the recursive toHiveCompatibleType now correctly handles VariantType nested inside StructType/ArrayType/MapType, and the expanded tests cover all three nested cases plus a dedicated test for buildHiveCompatibleCatalogTable. Field order was also harmonized to the canonical (metadata, value) ordering to match HoodieSchema.createVariant(). LGTM.
yihua left a comment
Style & Readability Review — one small duplication worth cleaning up, otherwise the code is clean and well-commented.
yihua left a comment
Nice work on this PR! Clean, well-scoped fix for a real pain point with Hive 2.x/3.x not supporting the VARIANT type. The recursion through nested struct/array/map types is thorough, the preservation of the original schema in table properties keeps Spark reads correct, and the canonical (metadata, value) field order matches both the Parquet spec and the physical layout Hudi writes. This looks ready for a Hudi committer or PMC member to take it from here.
Force-pushed from 445d264 to bddbf76
hudi-agent left a comment
Thanks for working on this! The PR converts VariantType (and variants nested inside StructType/ArrayType/MapType) to a Hive-compatible struct<metadata:binary, value:binary> in the CatalogTable schema passed to the Hive metastore, while preserving the original schema in data-source properties so Spark can reconstruct the VariantType on read. No correctness issues found. A few style/readability suggestions in the inline comments. Please take a look, and this should be ready for a Hudi committer or PMC member to take it from here. One simplification opportunity in the production code; tests are well-structured and clean.
cc @yihua
Force-pushed from fe8c9cf to e4cbfd0
- Hive 2.x/3.x does not support the VARIANT type natively.
- When creating a Hudi table with VARIANT columns via SQL CREATE TABLE, Spark's HiveClient passes "variant" as a literal type string, which Hive rejects.
- Convert VariantType to struct<value:binary, metadata:binary> in the CatalogTable schema before passing to HiveClient, while preserving the original VariantType in table properties so Spark can reconstruct it when reading.
- Includes a unit test for the conversion.
- Recursively convert VariantType inside nested StructType/ArrayType/MapType so columns like STRUCT<a:VARIANT>, ARRAY<VARIANT>, and MAP<STRING,VARIANT> are also rewritten to the Hive-compatible physical struct.
- Emit the variant struct with canonical (metadata, value) field order to match HoodieSchema.createVariant() and the Parquet/Iceberg convention.
- Extract a buildHiveCompatibleCatalogTable helper so the schema conversion and property merge are directly unit-testable.
- Expand TestVariantDataType with nested-variant cases, canonical-order assertions, and coverage for buildHiveCompatibleCatalogTable.
- Clean up Scaladoc (use backticks) and rename the test field from v to variant_col.
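The preserve-and-reconstruct round trip described in the changelog can be sketched without Spark. The property key below is a hypothetical stand-in (Hudi stores the original schema in data-source table properties; the exact key is not shown in this conversation):

```scala
// Simplified sketch: the Hive metastore only ever sees the physical struct
// DDL, while the original logical schema is kept in table properties so the
// reader can restore VARIANT. The property key is a hypothetical stand-in.
val OriginalSchemaKey = "hoodie.datasource.original.schema" // hypothetical key

val logicalDdl  = "id BIGINT, variant_col VARIANT"
val physicalDdl = "id BIGINT, variant_col STRUCT<metadata: BINARY, value: BINARY>"

// What accompanies the Hive-compatible schema: the preserved original.
val tableProperties: Map[String, String] = Map(OriginalSchemaKey -> logicalDdl)

// On read, prefer the preserved logical schema over what Hive reports.
def schemaForRead(props: Map[String, String], hiveReported: String): String =
  props.getOrElse(OriginalSchemaKey, hiveReported)
```

The key point is the asymmetry: Hive validates only the physical struct, while Spark reads resolve against the preserved logical schema, so VARIANT columns stay variants for query engines that understand them.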
Force-pushed from e4cbfd0 to afb46bb
Codecov Report: ❌ Patch coverage is
Additional details and impacted files

```
@@             Coverage Diff             @@
##             master   #18511     +/-  ##
=========================================
  Coverage     68.85%   68.86%
- Complexity    28447    28451      +4
=========================================
  Files          2475     2475
  Lines        136549   136580     +31
  Branches      16609    16616      +7
=========================================
+ Hits          94025    94049     +24
- Misses        34963    34971      +8
+ Partials       7561     7560      -1
```
Flags with carried forward coverage won't be shown.
Describe the issue this Pull Request addresses
Closes #18512
Hive 2.x/3.x does not support VARIANT type natively.
When creating a Hudi table with VARIANT columns via SQL CREATE TABLE, Spark's HiveClient passes "variant" as a literal type string, which Hive rejects.
Summary and Changelog
Impact
Users can now perform CREATE TABLE DDL with VARIANT column types.

Risk Level
Low
Documentation Update
None
Contributor's checklist