
fix: VARIANT Hive sync error when performing CREATE table DDL #18511

Merged
voonhous merged 1 commit into apache:master from voonhous:fix-create-table-vector-hive-sync on Apr 22, 2026

Conversation

@voonhous
Member

@voonhous voonhous commented Apr 16, 2026

Describe the issue this Pull Request addresses

Closes: #18512

Hive 2.x/3.x does not support the VARIANT type natively.

When creating a Hudi table with VARIANT columns via a SQL CREATE TABLE statement, Spark's HiveClient passes "variant" as a literal type string, which Hive rejects.

Summary and Changelog

  • Convert VariantType to struct<value:binary, metadata:binary> in the CatalogTable schema before passing to HiveClient, while preserving the original VariantType in table properties so Spark can reconstruct it when reading.
  • Add unit test for the conversion.
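The conversion described above can be sketched self-contained. Spark's real type hierarchy (org.apache.spark.sql.types) is modeled here with a tiny ADT, so names like VariantType and StructField are stand-ins for the actual Spark classes, not the Hudi implementation itself:

```scala
// Toy model of Spark's DataType hierarchy (stand-ins, not Spark classes).
sealed trait DataType
case object LongType extends DataType
case object BinaryType extends DataType
case object VariantType extends DataType
case class StructField(name: String, dataType: DataType)
case class StructType(fields: Seq[StructField]) extends DataType

// Replace each top-level VARIANT column with its physical representation,
// struct<value:binary, metadata:binary>, which Hive can accept.
def toHiveCompatibleSchema(schema: StructType): StructType =
  StructType(schema.fields.map { field =>
    if (field.dataType == VariantType)
      field.copy(dataType = StructType(Seq(
        StructField("value", BinaryType),
        StructField("metadata", BinaryType))))
    else field
  })
```

A schema like (id: LONG, v: VARIANT) passes through with id untouched while v becomes the binary struct, which is what the Hive metastore ultimately sees.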

Impact

Users can now run CREATE TABLE DDL with VARIANT column types.

Risk Level

Low

Documentation Update

None

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

@voonhous voonhous self-assigned this Apr 16, 2026
@voonhous voonhous force-pushed the fix-create-table-vector-hive-sync branch from f4eb005 to dcfbbeb Compare April 16, 2026 15:19
@voonhous voonhous requested review from bvaradar, rahil-c and yihua April 16, 2026 15:20
@github-actions github-actions Bot added the size:S PR with lines of changes in (10, 100] label Apr 16, 2026
@voonhous voonhous linked an issue Apr 16, 2026 that may be closed by this pull request
@voonhous voonhous removed their assignment Apr 16, 2026
@voonhous voonhous force-pushed the fix-create-table-vector-hive-sync branch 2 times, most recently from f1f780e to 24d45c5 Compare April 16, 2026 16:33
Contributor

@yihua yihua left a comment


🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for contributing! The approach of converting VariantType to a Hive-compatible struct is solid, and storing the original schema in properties for reconstruction is a nice touch. However, the conversion currently only handles top-level VariantType fields — please see the inline comment about nested variants.

  */
  private[hudi] def toHiveCompatibleSchema(schema: StructType): StructType = {
    StructType(schema.map { field =>
      if (sparkAdapter.isVariantType(field.dataType)) {
Contributor

🤖 This only converts top-level VariantType fields. If a VariantType is nested inside a StructType, ArrayType, or MapType (e.g., STRUCT<data: VARIANT>), the inner variant won't be converted and Hive will still reject the schema. Could you make this recursive to handle nested types as well? The HiveSchemaUtil.convertField() in the sync layer handles this recursively for reference.
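The recursive shape suggested here can be sketched over a toy model of Spark's DataType hierarchy (the real code would pattern-match on org.apache.spark.sql.types and use sparkAdapter.isVariantType; every name below is a stand-in):

```scala
// Toy model of Spark's type hierarchy, including the container types.
sealed trait DataType
case object BinaryType extends DataType
case object StringType extends DataType
case object VariantType extends DataType
case class ArrayType(elementType: DataType) extends DataType
case class MapType(keyType: DataType, valueType: DataType) extends DataType
case class StructField(name: String, dataType: DataType)
case class StructType(fields: Seq[StructField]) extends DataType

// Recurse through struct/array/map so nested variants are also rewritten.
def toHiveCompatibleType(dt: DataType): DataType = dt match {
  case VariantType =>
    // Physical representation in (metadata, value) order.
    StructType(Seq(
      StructField("metadata", BinaryType),
      StructField("value", BinaryType)))
  case StructType(fields) =>
    StructType(fields.map(f => f.copy(dataType = toHiveCompatibleType(f.dataType))))
  case ArrayType(elem)    => ArrayType(toHiveCompatibleType(elem))
  case MapType(k, v)      => MapType(toHiveCompatibleType(k), toHiveCompatibleType(v))
  case other              => other
}
```

With this shape, columns like STRUCT<a: VARIANT>, ARRAY<VARIANT>, and MAP<STRING, VARIANT> are all rewritten, not just top-level VARIANT columns.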


Member Author

Addressed.

    StructType(schema.map { field =>
      if (sparkAdapter.isVariantType(field.dataType)) {
        field.copy(dataType = StructType(Seq(
          StructField(HoodieSchema.Variant.VARIANT_VALUE_FIELD, BinaryType, nullable = false),
Contributor

🤖 The canonical field order in HoodieSchema.createVariant() is (metadata, value) (see line 734 of HoodieSchema.java: "Field order is (metadata, value) to match the Parquet spec and Iceberg convention"). Here value comes first. While fields are accessed by name so it likely won't break anything, it might be worth matching the canonical order for consistency with the rest of the codebase.


Member Author

Addressed.

Contributor

@yihua yihua left a comment

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Style & Readability Review — Two minor readability suggestions: Javadoc formatting and test field naming could be clearer.

  }

  /**
   * Converts Spark DataTypes that Hive doesn't support to their physical representations.
Contributor

🤖 nit: consider using backticks for the type representation to improve source code readability: struct<value:binary, metadata:binary> instead of HTML entities.


Member Author

Addressed.

    assume(HoodieSparkUtils.gteqSpark4_0, "Variant type requires Spark 4.0 or higher")

    val schema = StructType(Seq(
      StructField("id", LongType, nullable = false),
Contributor

🤖 nit: the field name 'v' could be more descriptive, like 'variant_col' or 'data', to clarify what's being tested.


Member Author

Addressed.

Contributor

@yihua yihua left a comment

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for contributing! The fix correctly preserves the original schema in table properties while passing a Hive-compatible schema to the Hive metastore — nice approach. One concern about nested VariantType handling in the inline comment.

  private[hudi] def toHiveCompatibleSchema(schema: StructType): StructType = {
    StructType(schema.map { field =>
      if (sparkAdapter.isVariantType(field.dataType)) {
        field.copy(dataType = StructType(Seq(
Contributor

🤖 This only converts top-level VariantType fields. If a user defines a column like col1 STRUCT<a: VARIANT> or col1 ARRAY<VARIANT>, the nested VariantType won't be converted and will still fail in Hive. Could you make this recursive over nested StructType/ArrayType/MapType?


Collaborator

@voonhous I had the same question: are we supporting nested types that contain VARIANT? If not, maybe we should add a guard for this?

Member Author

Addressed.

Contributor

@yihua yihua left a comment

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for addressing the feedback — the recursive toHiveCompatibleType now correctly handles VariantType nested inside StructType/ArrayType/MapType, and the expanded tests cover all three nested cases plus a dedicated test for buildHiveCompatibleCatalogTable. Field order was also harmonized to the canonical (metadata, value) ordering to match HoodieSchema.createVariant(). LGTM.

@github-actions github-actions Bot added size:M PR with lines of changes in (100, 300] and removed size:S PR with lines of changes in (10, 100] labels Apr 17, 2026
Contributor

@yihua yihua left a comment

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Style & Readability Review — one small duplication worth cleaning up, otherwise the code is clean and well-commented.

Contributor

@yihua yihua left a comment

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Nice work on this PR! Clean, well-scoped fix for a real pain point with Hive 2.x/3.x not supporting the VARIANT type. The recursion through nested struct/array/map types is thorough, the preservation of the original schema in table properties keeps Spark reads correct, and the canonical (metadata, value) field order matches both the Parquet spec and the physical layout Hudi writes. This looks ready for a Hudi committer or PMC member to take it from here.

@voonhous voonhous force-pushed the fix-create-table-vector-hive-sync branch from 445d264 to bddbf76 Compare April 21, 2026 15:49
Contributor

@hudi-agent hudi-agent left a comment

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for working on this! The PR converts VariantType (and variants nested inside StructType/ArrayType/MapType) to a Hive-compatible struct<metadata:binary, value:binary> in the CatalogTable schema passed to the Hive metastore, while preserving the original schema in data-source properties so Spark can reconstruct the VariantType on read. No correctness issues found. There is one simplification opportunity in the production code and a few style/readability suggestions in the inline comments; the tests are well-structured and clean. Please take a look, and this should be ready for a Hudi committer or PMC member to take it from here.

cc @yihua

@voonhous voonhous force-pushed the fix-create-table-vector-hive-sync branch 2 times, most recently from fe8c9cf to e4cbfd0 Compare April 21, 2026 16:00
- Hive 2.x/3.x does not support VARIANT type natively.
- When creating a Hudi table with VARIANT columns via SQL CREATE TABLE, Spark's HiveClient passes "variant" as a literal type string which Hive rejects.
- Convert VariantType to struct<value:binary, metadata:binary> in the CatalogTable schema before passing to HiveClient, while preserving the original VariantType in table properties so Spark can reconstruct it when reading.
- Includes unit test for the conversion.
- Recursively convert VariantType inside nested StructType/ArrayType/MapType so columns like STRUCT<a:VARIANT>, ARRAY<VARIANT>, and MAP<STRING,VARIANT> are also rewritten to the Hive-compatible physical struct.
- Emit the variant struct with canonical (metadata, value) field order to match HoodieSchema.createVariant() and the Parquet/Iceberg convention.
- Extract buildHiveCompatibleCatalogTable helper so the schema conversion and property merge are directly unit-testable.
- Expand TestVariantDataType with nested-variant cases, canonical-order assertions, and coverage for buildHiveCompatibleCatalogTable.
- Clean up Scaladoc (use backticks) and rename the test field from v to variant_col.
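The buildHiveCompatibleCatalogTable helper named in the changelog can be sketched hypothetically. The types and the property key below are illustrative stand-ins; the real helper works on Spark's CatalogTable and Hudi's own property conventions:

```scala
// Toy stand-ins for Spark's types and CatalogTable (not the real classes).
sealed trait DataType
case object BinaryType extends DataType
case object VariantType extends DataType
case class StructField(name: String, dataType: DataType)
case class StructType(fields: Seq[StructField]) extends DataType
case class CatalogTableLite(schema: StructType, properties: Map[String, String])

// Recursive variant-to-physical-struct conversion, canonical (metadata, value) order.
def toHiveCompatibleType(dt: DataType): DataType = dt match {
  case VariantType => StructType(Seq(
    StructField("metadata", BinaryType), StructField("value", BinaryType)))
  case StructType(fields) =>
    StructType(fields.map(f => f.copy(dataType = toHiveCompatibleType(f.dataType))))
  case other => other
}

// Build the table handed to the Hive metastore: converted schema, plus the
// original schema preserved in properties so a reader can reconstruct
// VariantType. "original.schema" is a made-up key for this sketch.
def buildHiveCompatibleCatalogTable(table: CatalogTableLite): CatalogTableLite =
  CatalogTableLite(
    schema = StructType(table.schema.fields.map(f =>
      f.copy(dataType = toHiveCompatibleType(f.dataType)))),
    properties = table.properties + ("original.schema" -> table.schema.toString))
```

Separating the conversion and the property merge into one pure function like this is what makes it directly unit-testable, as the changelog notes.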
@voonhous voonhous force-pushed the fix-create-table-vector-hive-sync branch from e4cbfd0 to afb46bb Compare April 21, 2026 16:49
@hudi-bot
Collaborator

CI report:

@hudi-bot supports the following commands:
  • @hudi-bot run azure: re-runs the last Azure build

@voonhous voonhous merged commit 0d57435 into apache:master Apr 22, 2026
156 of 162 checks passed
@voonhous voonhous deleted the fix-create-table-vector-hive-sync branch April 23, 2026 13:26
@codecov-commenter

Codecov Report

❌ Patch coverage is 90.00000% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.86%. Comparing base (e303579) to head (afb46bb).
⚠️ Report is 9 commits behind head on master.

Files with missing lines | Patch % | Lines
...rk/sql/hudi/command/CreateHoodieTableCommand.scala | 90.00% | 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff            @@
##             master   #18511   +/-   ##
=========================================
  Coverage     68.85%   68.86%           
- Complexity    28447    28451    +4     
=========================================
  Files          2475     2475           
  Lines        136549   136580   +31     
  Branches      16609    16616    +7     
=========================================
+ Hits          94025    94049   +24     
- Misses        34963    34971    +8     
+ Partials       7561     7560    -1     
Flag | Coverage Δ
common-and-other-modules | 44.46% <0.00%> (-0.01%) ⬇️
hadoop-mr-java-client | 44.77% <ø> (-0.01%) ⬇️
spark-client-hadoop-common | 48.55% <ø> (+<0.01%) ⬆️
spark-java-tests | 49.40% <0.00%> (-0.04%) ⬇️
spark-scala-tests | 45.31% <90.00%> (+<0.01%) ⬆️
utilities | 38.01% <0.00%> (-0.04%) ⬇️


Files with missing lines | Coverage Δ
...rk/sql/hudi/command/CreateHoodieTableCommand.scala | 66.44% <90.00%> (+3.48%) ⬆️

... and 22 files with indirect coverage changes



Labels

size:M PR with lines of changes in (100, 300]

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Creating a table with VARIANT type using DDL throws hive sync error

7 participants