
fix: VARIANT Hive sync error when performing CREATE table DDL #18511

Merged
voonhous merged 1 commit into apache:master from voonhous:fix-create-table-vector-hive-sync on Apr 22, 2026

Conversation

@voonhous
Member

@voonhous voonhous commented Apr 16, 2026

Describe the issue this Pull Request addresses

Closes: #18512

Hive 2.x/3.x does not support the VARIANT type natively.

When creating a Hudi table with VARIANT columns via a SQL CREATE TABLE statement, Spark's HiveClient passes "variant" as a literal type string, which Hive rejects.

Summary and Changelog

  • Convert VariantType to struct<value:binary, metadata:binary> in the CatalogTable schema before passing to HiveClient, while preserving the original VariantType in table properties so Spark can reconstruct it when reading.
  • Add unit test for the conversion.
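The conversion described above can be sketched self-contained. Spark's real type hierarchy (org.apache.spark.sql.types) is modeled here with a tiny ADT, so names like VariantType and StructField are stand-ins for the actual Spark classes, not the Hudi implementation itself:

```scala
// Toy model of Spark's DataType hierarchy (stand-ins, not Spark classes).
sealed trait DataType
case object LongType extends DataType
case object BinaryType extends DataType
case object VariantType extends DataType
case class StructField(name: String, dataType: DataType)
case class StructType(fields: Seq[StructField]) extends DataType

// Replace each top-level VARIANT column with its physical representation,
// struct<value:binary, metadata:binary>, which Hive can accept.
def toHiveCompatibleSchema(schema: StructType): StructType =
  StructType(schema.fields.map { field =>
    if (field.dataType == VariantType)
      field.copy(dataType = StructType(Seq(
        StructField("value", BinaryType),
        StructField("metadata", BinaryType))))
    else field
  })
```

A schema like (id: LONG, v: VARIANT) passes through with id untouched while v becomes the binary struct, which is what the Hive metastore ultimately sees.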

Impact

Users can now run CREATE TABLE DDL with VARIANT column types.

Risk Level

Low

Documentation Update

None

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

@voonhous voonhous self-assigned this Apr 16, 2026
@voonhous voonhous force-pushed the fix-create-table-vector-hive-sync branch from f4eb005 to dcfbbeb Compare April 16, 2026 15:19
@voonhous voonhous requested review from bvaradar, rahil-c and yihua April 16, 2026 15:20
@github-actions github-actions Bot added the size:S PR with lines of changes in (10, 100] label Apr 16, 2026
@voonhous voonhous linked an issue Apr 16, 2026 that may be closed by this pull request
@voonhous voonhous removed their assignment Apr 16, 2026
@voonhous voonhous force-pushed the fix-create-table-vector-hive-sync branch 2 times, most recently from f1f780e to 24d45c5 Compare April 16, 2026 16:33
Contributor

@yihua yihua left a comment


🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for contributing! The approach of converting VariantType to a Hive-compatible struct is solid, and storing the original schema in properties for reconstruction is a nice touch. However, the conversion currently only handles top-level VariantType fields — please see the inline comment about nested variants.

  */
  private[hudi] def toHiveCompatibleSchema(schema: StructType): StructType = {
    StructType(schema.map { field =>
      if (sparkAdapter.isVariantType(field.dataType)) {
Contributor

🤖 This only converts top-level VariantType fields. If a VariantType is nested inside a StructType, ArrayType, or MapType (e.g., STRUCT<data: VARIANT>), the inner variant won't be converted and Hive will still reject the schema. Could you make this recursive to handle nested types as well? The HiveSchemaUtil.convertField() in the sync layer handles this recursively for reference.
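The recursive shape suggested here can be sketched over a toy model of Spark's DataType hierarchy (the real code would pattern-match on org.apache.spark.sql.types and use sparkAdapter.isVariantType; every name below is a stand-in):

```scala
// Toy model of Spark's type hierarchy, including the container types.
sealed trait DataType
case object BinaryType extends DataType
case object StringType extends DataType
case object VariantType extends DataType
case class ArrayType(elementType: DataType) extends DataType
case class MapType(keyType: DataType, valueType: DataType) extends DataType
case class StructField(name: String, dataType: DataType)
case class StructType(fields: Seq[StructField]) extends DataType

// Recurse through struct/array/map so nested variants are also rewritten.
def toHiveCompatibleType(dt: DataType): DataType = dt match {
  case VariantType =>
    // Physical representation in (metadata, value) order.
    StructType(Seq(
      StructField("metadata", BinaryType),
      StructField("value", BinaryType)))
  case StructType(fields) =>
    StructType(fields.map(f => f.copy(dataType = toHiveCompatibleType(f.dataType))))
  case ArrayType(elem)    => ArrayType(toHiveCompatibleType(elem))
  case MapType(k, v)      => MapType(toHiveCompatibleType(k), toHiveCompatibleType(v))
  case other              => other
}
```

With this shape, columns like STRUCT<a: VARIANT>, ARRAY<VARIANT>, and MAP<STRING, VARIANT> are all rewritten, not just top-level VARIANT columns.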


Member Author

Addressed.

    StructType(schema.map { field =>
      if (sparkAdapter.isVariantType(field.dataType)) {
        field.copy(dataType = StructType(Seq(
          StructField(HoodieSchema.Variant.VARIANT_VALUE_FIELD, BinaryType, nullable = false),
Contributor

🤖 The canonical field order in HoodieSchema.createVariant() is (metadata, value) (see line 734 of HoodieSchema.java: "Field order is (metadata, value) to match the Parquet spec and Iceberg convention"). Here value comes first. While fields are accessed by name so it likely won't break anything, it might be worth matching the canonical order for consistency with the rest of the codebase.


Member Author

Addressed.

Contributor

@yihua yihua left a comment

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Style & Readability Review — Two minor readability suggestions: Javadoc formatting and test field naming could be clearer.

  }

  /**
   * Converts Spark DataTypes that Hive doesn't support to their physical representations.
Contributor

🤖 nit: consider using backticks for the type representation to improve source code readability: struct<value:binary, metadata:binary> instead of HTML entities.


Member Author

Addressed.

    assume(HoodieSparkUtils.gteqSpark4_0, "Variant type requires Spark 4.0 or higher")

    val schema = StructType(Seq(
      StructField("id", LongType, nullable = false),
Contributor

🤖 nit: the field name 'v' could be more descriptive, like 'variant_col' or 'data', to clarify what's being tested.


Member Author

Addressed.

Contributor

@yihua yihua left a comment

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for contributing! The fix correctly preserves the original schema in table properties while passing a Hive-compatible schema to the Hive metastore — nice approach. One concern about nested VariantType handling in the inline comment.

  private[hudi] def toHiveCompatibleSchema(schema: StructType): StructType = {
    StructType(schema.map { field =>
      if (sparkAdapter.isVariantType(field.dataType)) {
        field.copy(dataType = StructType(Seq(
Contributor

🤖 This only converts top-level VariantType fields. If a user defines a column like col1 STRUCT<a: VARIANT> or col1 ARRAY<VARIANT>, the nested VariantType won't be converted and will still fail in Hive. Could you make this recursive over nested StructType/ArrayType/MapType?


Collaborator

@voonhous I had the same question: are we supporting nested types that contain VARIANT? If not, maybe we should add a guard for this?

Member Author

Addressed.

Contributor

@yihua yihua left a comment

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for addressing the feedback — the recursive toHiveCompatibleType now correctly handles VariantType nested inside StructType/ArrayType/MapType, and the expanded tests cover all three nested cases plus a dedicated test for buildHiveCompatibleCatalogTable. Field order was also harmonized to the canonical (metadata, value) ordering to match HoodieSchema.createVariant(). LGTM.

@github-actions github-actions Bot added size:M PR with lines of changes in (100, 300] and removed size:S PR with lines of changes in (10, 100] labels Apr 17, 2026
Contributor

@yihua yihua left a comment

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Style & Readability Review — one small duplication worth cleaning up, otherwise the code is clean and well-commented.

Contributor

@yihua yihua left a comment

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Nice work on this PR! Clean, well-scoped fix for a real pain point with Hive 2.x/3.x not supporting the VARIANT type. The recursion through nested struct/array/map types is thorough, the preservation of the original schema in table properties keeps Spark reads correct, and the canonical (metadata, value) field order matches both the Parquet spec and the physical layout Hudi writes. This looks ready for a Hudi committer or PMC member to take it from here.

@voonhous voonhous force-pushed the fix-create-table-vector-hive-sync branch from 445d264 to bddbf76 Compare April 21, 2026 15:49
Contributor

@hudi-agent hudi-agent left a comment

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for working on this! The PR converts VariantType (and variants nested inside StructType/ArrayType/MapType) to a Hive-compatible struct<metadata:binary, value:binary> in the CatalogTable schema passed to the Hive metastore, while preserving the original schema in data-source properties so Spark can reconstruct the VariantType on read. No correctness issues found. There is one simplification opportunity in the production code and a few style/readability suggestions in the inline comments; the tests are well-structured and clean. Please take a look, and this should be ready for a Hudi committer or PMC member to take it from here.

cc @yihua

@voonhous voonhous force-pushed the fix-create-table-vector-hive-sync branch 2 times, most recently from fe8c9cf to e4cbfd0 Compare April 21, 2026 16:00
- Hive 2.x/3.x does not support VARIANT type natively.
- When creating a Hudi table with VARIANT columns via SQL CREATE TABLE, Spark's HiveClient passes "variant" as a literal type string which Hive rejects.
- Convert VariantType to struct<value:binary, metadata:binary> in the CatalogTable schema before passing to HiveClient, while preserving the original VariantType in table properties so Spark can reconstruct it when reading.
- Includes unit test for the conversion.
- Recursively convert VariantType inside nested StructType/ArrayType/MapType so columns like STRUCT<a:VARIANT>, ARRAY<VARIANT>, and MAP<STRING,VARIANT> are also rewritten to the Hive-compatible physical struct.
- Emit the variant struct with canonical (metadata, value) field order to match HoodieSchema.createVariant() and the Parquet/Iceberg convention.
- Extract buildHiveCompatibleCatalogTable helper so the schema conversion and property merge are directly unit-testable.
- Expand TestVariantDataType with nested-variant cases, canonical-order assertions, and coverage for buildHiveCompatibleCatalogTable.
- Clean up Scaladoc (use backticks) and rename the test field from v to variant_col.
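The buildHiveCompatibleCatalogTable helper named in the changelog can be sketched hypothetically. The types and the property key below are illustrative stand-ins; the real helper works on Spark's CatalogTable and Hudi's own property conventions:

```scala
// Toy stand-ins for Spark's types and CatalogTable (not the real classes).
sealed trait DataType
case object BinaryType extends DataType
case object VariantType extends DataType
case class StructField(name: String, dataType: DataType)
case class StructType(fields: Seq[StructField]) extends DataType
case class CatalogTableLite(schema: StructType, properties: Map[String, String])

// Recursive variant-to-physical-struct conversion, canonical (metadata, value) order.
def toHiveCompatibleType(dt: DataType): DataType = dt match {
  case VariantType => StructType(Seq(
    StructField("metadata", BinaryType), StructField("value", BinaryType)))
  case StructType(fields) =>
    StructType(fields.map(f => f.copy(dataType = toHiveCompatibleType(f.dataType))))
  case other => other
}

// Build the table handed to the Hive metastore: converted schema, plus the
// original schema preserved in properties so a reader can reconstruct
// VariantType. "original.schema" is a made-up key for this sketch.
def buildHiveCompatibleCatalogTable(table: CatalogTableLite): CatalogTableLite =
  CatalogTableLite(
    schema = StructType(table.schema.fields.map(f =>
      f.copy(dataType = toHiveCompatibleType(f.dataType)))),
    properties = table.properties + ("original.schema" -> table.schema.toString))
```

Separating the conversion and the property merge into one pure function like this is what makes it directly unit-testable, as the changelog notes.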
@voonhous voonhous force-pushed the fix-create-table-vector-hive-sync branch from e4cbfd0 to afb46bb Compare April 21, 2026 16:49
@hudi-bot
Collaborator

CI report:

@hudi-bot supports the following commands:
  • @hudi-bot run azure: re-runs the last Azure build

@voonhous voonhous merged commit 0d57435 into apache:master Apr 22, 2026
156 of 162 checks passed
@voonhous voonhous deleted the fix-create-table-vector-hive-sync branch April 23, 2026 13:26
@codecov-commenter

Codecov Report

❌ Patch coverage is 90.00000% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.86%. Comparing base (e303579) to head (afb46bb).
⚠️ Report is 9 commits behind head on master.

Files with missing lines | Patch % | Lines
...rk/sql/hudi/command/CreateHoodieTableCommand.scala | 90.00% | 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff            @@
##             master   #18511   +/-   ##
=========================================
  Coverage     68.85%   68.86%           
- Complexity    28447    28451    +4     
=========================================
  Files          2475     2475           
  Lines        136549   136580   +31     
  Branches      16609    16616    +7     
=========================================
+ Hits          94025    94049   +24     
- Misses        34963    34971    +8     
+ Partials       7561     7560    -1     
Flag | Coverage Δ
common-and-other-modules | 44.46% <0.00%> (-0.01%) ⬇️
hadoop-mr-java-client | 44.77% <ø> (-0.01%) ⬇️
spark-client-hadoop-common | 48.55% <ø> (+<0.01%) ⬆️
spark-java-tests | 49.40% <0.00%> (-0.04%) ⬇️
spark-scala-tests | 45.31% <90.00%> (+<0.01%) ⬆️
utilities | 38.01% <0.00%> (-0.04%) ⬇️


Files with missing lines | Coverage Δ
...rk/sql/hudi/command/CreateHoodieTableCommand.scala | 66.44% <90.00%> (+3.48%) ⬆️

... and 22 files with indirect coverage changes



Labels

size:M PR with lines of changes in (100, 300]

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Creating a table with VARIANT type using DDL throws hive sync error

7 participants