
feat(sync): Map VARIANT type to struct in Hive, Spark, and BigQuery sync#18483

Open
voonhous wants to merge 2 commits into apache:master from voonhous:variant_read_check

Conversation

@voonhous
Member

@voonhous voonhous commented Apr 8, 2026

Describe the issue this Pull Request addresses

Map VARIANT type to struct in Hive, Spark, and BigQuery sync:

  • VARIANT columns were not handled in HiveSchemaUtil, SparkSchemaUtils, and BigQuerySchemaResolver, causing UnsupportedOperationException when syncing tables with variant columns.
  • Map VARIANT to its underlying physical type (struct) so external engines can read via metastore.
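
The mapping the bullets above describe can be sketched roughly as follows. This is a hypothetical, self-contained simplification, not the actual HiveSchemaUtil code: the enum and method names are illustrative, and in the real change VARIANT falls through to the existing RECORD/struct handling rather than being special-cased as shown here.

```java
// Hypothetical, simplified sketch (not actual Hudi code): map a logical
// type to its Hive type string. In the real fix, VARIANT falls through to
// the existing RECORD/struct path; here it is shown directly for brevity.
public class VariantMappingSketch {
  enum HoodieType { STRING, INT, VARIANT }

  static String toHiveType(HoodieType type) {
    switch (type) {
      case STRING:
        return "string";
      case INT:
        return "int";
      case VARIANT:
        // VARIANT is physically a struct of two binary fields, so external
        // engines can read it through the metastore as a plain struct.
        return "struct<metadata:binary,value:binary>";
      default:
        // Before the fix, VARIANT landed in a branch like this one,
        // failing the sync with UnsupportedOperationException.
        throw new UnsupportedOperationException("Unsupported type: " + type);
    }
  }

  public static void main(String[] args) {
    System.out.println(toHiveType(HoodieType.VARIANT));
  }
}
```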

Summary and Changelog

Add metasync support for the VARIANT logical type for:

  1. BigQuerySchemaResolver
  2. SparkSchemaUtils
  3. HiveSchemaUtil

Impact

Users are able to perform metasync for VARIANT logical types.

Risk Level

Low.

Documentation Update

None.

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

Contributor

@yihua yihua left a comment


🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

LGTM — clean, targeted fix that correctly maps VARIANT to its physical struct representation across all three sync backends. The fallthrough to the RECORD/struct path is the right approach since VARIANT is physically backed by a RECORD schema with metadata and value BYTES fields, and the tests confirm the expected output for each backend.

standardSQLTypeName = StandardSQLTypeName.NUMERIC;
break;
case RECORD:
case VARIANT:
Contributor


🤖 Now that VARIANT is handled here, it might be worth also adding case BLOB: in this same switch — SparkSchemaUtils and HiveSchemaUtil already cover BLOB, so BigQuery is currently the odd one out and would still throw RuntimeException for BLOB fields. Happy to leave it as a follow-up if out of scope, just flagging the asymmetry.

🤖 This review comment was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Member Author


Ignore, this is addressed in a separate PR here:
#18482

@github-actions github-actions bot added the size:S PR with lines of changes in (10, 100] label Apr 8, 2026
Contributor

@bvaradar bvaradar left a comment


Otherwise LGTM

}

@Test
void convertSchema_variantField() {
Contributor


nit: kindly use camelcase

Member Author


The other tests in this file use a combination of camelCase and snake_case for naming, so I think it's okay to follow that existing convention.


@Test
public void testConvertSchemaWithVariantField() {
HoodieSchema schema = HoodieSchema.createRecord("root", null, null, false, Arrays.asList(
Contributor


can we also cover nested variants ?

Contributor


🤖 Looking at the diff, nested variants are already covered in TestHiveSchemaUtil (the nested_variant_field case with a record containing a variant_data sub-field) and in TestBigQuerySchemaResolver (convertSchema_nestedVariantField). The gap is specifically in TestSparkSchemaUtils — the new testConvertSchemaWithVariantField only exercises a top-level variant. A similar test with a struct-containing-variant would close the parity.

@rahil-c
Collaborator

rahil-c commented Apr 10, 2026

Same feedback as I mentioned on the BLOB PR: just want to confirm whether we need to consider other syncs, and whether we can add one functional or e2e test for this?
#18482

@voonhous voonhous force-pushed the variant_read_check branch from ff034ea to 5fc569d Compare April 13, 2026 11:15
@voonhous voonhous force-pushed the variant_read_check branch from 5fc569d to d970429 Compare April 13, 2026 11:20
Contributor

@yihua yihua left a comment


🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for contributing! The VARIANT→STRUCT mapping is the right physical representation for external engines, and the Hive and BigQuery changes look correct. There's one gap in the Spark sync path worth looking at before merge — inline comments below.

+ ",\"valueContainsNull\":" + valueOptional + "}";
case RECORD:
case BLOB:
case VARIANT:
Contributor


🤖 BLOB fields tag themselves with {"hudi_type":"BLOB"} in the Spark field metadata so that HoodieSparkSchemaConverters.toHoodieType() can reconstruct the logical type on the way back. VARIANT fields don't get that annotation here, so after a sync round-trip the VARIANT column would be reconstructed as a plain STRUCT rather than a VARIANT. Could you add a similar isVariantField()-style check in convertToSparkSchemaJson (the calling method) and a corresponding reconstruction case in HoodieSparkSchemaConverters?
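
The round-trip concern above can be illustrated with a small sketch. Everything here is hypothetical: the hudi_type metadata key is assumed from the BLOB handling the comment describes, and variantFieldJson / isVariantField are illustrative names, not actual Hudi APIs.

```java
// Hypothetical sketch of the round-trip concern: tag the variant field's
// Spark metadata with a hudi_type marker so the logical VARIANT type can
// be reconstructed from the physical struct on the way back.
public class VariantMetadataSketch {
  // Emit a Spark-schema-JSON fragment for a variant field, carrying a
  // hudi_type marker in the field metadata (assumed convention, mirroring
  // the existing BLOB handling described in the review comment).
  static String variantFieldJson(String name, boolean nullable) {
    return "{\"name\":\"" + name + "\","
        + "\"type\":{\"type\":\"struct\",\"fields\":["
        + "{\"name\":\"metadata\",\"type\":\"binary\",\"nullable\":false,\"metadata\":{}},"
        + "{\"name\":\"value\",\"type\":\"binary\",\"nullable\":false,\"metadata\":{}}]},"
        + "\"nullable\":" + nullable + ","
        + "\"metadata\":{\"hudi_type\":\"VARIANT\"}}";
  }

  // On reconstruction, the marker distinguishes a logical VARIANT from a
  // plain two-field struct of binaries (crude string check for brevity).
  static boolean isVariantField(String fieldJson) {
    return fieldJson.contains("\"hudi_type\":\"VARIANT\"");
  }

  public static void main(String[] args) {
    String json = variantFieldJson("v", true);
    System.out.println(isVariantField(json)); // prints true
  }
}
```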

🤖 This review comment was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.


@Test
public void testConvertSchemaWithVariantField() {
HoodieSchema schema = HoodieSchema.createRecord("root", null, null, false, Arrays.asList(
Contributor


🤖 Looking at the diff, nested variants are already covered in TestHiveSchemaUtil (the nested_variant_field case with a record containing a variant_data sub-field) and in TestBigQuerySchemaResolver (convertSchema_nestedVariantField). The gap is specifically in TestSparkSchemaUtils — the new testConvertSchemaWithVariantField only exercises a top-level variant. A similar test with a struct-containing-variant would close the parity.

@codecov-commenter

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 68.83%. Comparing base (fc7f303) to head (d970429).

Additional details and impacted files
@@            Coverage Diff            @@
##             master   #18483   +/-   ##
=========================================
  Coverage     68.83%   68.83%           
- Complexity    28171    28173    +2     
=========================================
  Files          2459     2459           
  Lines        135095   135095           
  Branches      16378    16378           
=========================================
+ Hits          92992    92995    +3     
+ Misses        34737    34736    -1     
+ Partials       7366     7364    -2     
Flag Coverage Δ
common-and-other-modules 44.54% <ø> (+<0.01%) ⬆️
hadoop-mr-java-client 44.83% <ø> (ø)
spark-client-hadoop-common 48.44% <ø> (ø)
spark-java-tests 48.89% <ø> (-0.03%) ⬇️
spark-scala-tests 45.49% <ø> (+<0.01%) ⬆️
utilities 38.20% <ø> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown.

Files with missing lines Coverage Δ
...ache/hudi/gcp/bigquery/BigQuerySchemaResolver.java 90.00% <ø> (ø)
...java/org/apache/hudi/hive/util/HiveSchemaUtil.java 86.39% <ø> (ø)
...apache/hudi/sync/common/util/SparkSchemaUtils.java 92.10% <ø> (ø)

... and 7 files with indirect coverage changes


@hudi-bot
Collaborator

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build


6 participants