
feat(sync): Map VARIANT type to struct in Hive, Spark, and BigQuery sync#18483

Open
voonhous wants to merge 2 commits into apache:master from voonhous:variant_read_check

Conversation

@voonhous
Member

@voonhous voonhous commented Apr 8, 2026

Describe the issue this Pull Request addresses

Map VARIANT type to struct in Hive, Spark, and BigQuery sync:

  • VARIANT columns were not handled in HiveSchemaUtil, SparkSchemaUtils, and BigQuerySchemaResolver, causing UnsupportedOperationException when syncing tables with variant columns.
  • Map VARIANT to its underlying physical type (struct) so external engines can read via metastore.
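
The mapping the bullets above describe can be sketched roughly as follows. This is a hypothetical, self-contained simplification, not the actual HiveSchemaUtil code: the enum and method names are illustrative, and in the real change VARIANT falls through to the existing RECORD/struct handling rather than being special-cased as shown here.

```java
// Hypothetical, simplified sketch (not actual Hudi code): map a logical
// type to its Hive type string. In the real fix, VARIANT falls through to
// the existing RECORD/struct path; here it is shown directly for brevity.
public class VariantMappingSketch {
  enum HoodieType { STRING, INT, VARIANT }

  static String toHiveType(HoodieType type) {
    switch (type) {
      case STRING:
        return "string";
      case INT:
        return "int";
      case VARIANT:
        // VARIANT is physically a struct of two binary fields, so external
        // engines can read it through the metastore as a plain struct.
        return "struct<metadata:binary,value:binary>";
      default:
        // Before the fix, VARIANT landed in a branch like this one,
        // failing the sync with UnsupportedOperationException.
        throw new UnsupportedOperationException("Unsupported type: " + type);
    }
  }

  public static void main(String[] args) {
    System.out.println(toHiveType(HoodieType.VARIANT));
  }
}
```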

Summary and Changelog

Add metasync support for the VARIANT logical type for:

  1. BigQuerySchemaResolver
  2. SparkSchemaUtils
  3. HiveSchemaUtil

Impact

Users are able to perform metasync for VARIANT logical types.

Risk Level

Low.

Documentation Update

None.

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

Contributor

@yihua yihua left a comment


🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

LGTM — clean, targeted fix that correctly maps VARIANT to its physical struct representation across all three sync backends. The fallthrough to the RECORD/struct path is the right approach since VARIANT is physically backed by a RECORD schema with metadata and value BYTES fields, and the tests confirm the expected output for each backend.

standardSQLTypeName = StandardSQLTypeName.NUMERIC;
break;
case RECORD:
case VARIANT:
Contributor


🤖 Now that VARIANT is handled here, it might be worth also adding case BLOB: in this same switch — SparkSchemaUtils and HiveSchemaUtil already cover BLOB, so BigQuery is currently the odd one out and would still throw RuntimeException for BLOB fields. Happy to leave it as a follow-up if out of scope, just flagging the asymmetry.

🤖 This review comment was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Member Author


Ignore, this is addressed in a separate PR here:
#18482

@github-actions github-actions bot added the size:S PR with lines of changes in (10, 100] label Apr 8, 2026
Contributor

@bvaradar bvaradar left a comment


Otherwise LGTM

}

@Test
void convertSchema_variantField() {
Contributor


nit: kindly use camelcase

Member Author


The other tests in this file use a combination of camelCase and snake_case for naming, so I think it's okay to follow that existing convention.


@Test
public void testConvertSchemaWithVariantField() {
HoodieSchema schema = HoodieSchema.createRecord("root", null, null, false, Arrays.asList(
Contributor


can we also cover nested variants ?

Contributor


🤖 Looking at the diff, nested variants are already covered in TestHiveSchemaUtil (the nested_variant_field case with a record containing a variant_data sub-field) and in TestBigQuerySchemaResolver (convertSchema_nestedVariantField). The gap is specifically in TestSparkSchemaUtils — the new testConvertSchemaWithVariantField only exercises a top-level variant. A similar test with a struct-containing-variant would close the parity.

@rahil-c
Collaborator

rahil-c commented Apr 10, 2026

Same feedback as I mentioned on the BLOB PR: just want to confirm whether we need to consider other syncs, and whether we can add one functional or e2e test for this?
#18482

@voonhous voonhous force-pushed the variant_read_check branch from ff034ea to 5fc569d Compare April 13, 2026 11:15
@voonhous voonhous force-pushed the variant_read_check branch from 5fc569d to d970429 Compare April 13, 2026 11:20
Contributor

@yihua yihua left a comment


🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for contributing! The VARIANT→STRUCT mapping is the right physical representation for external engines, and the Hive and BigQuery changes look correct. There's one gap in the Spark sync path worth looking at before merge — inline comments below.

+ ",\"valueContainsNull\":" + valueOptional + "}";
case RECORD:
case BLOB:
case VARIANT:
Contributor


🤖 BLOB fields tag themselves with {"hudi_type":"BLOB"} in the Spark field metadata so that HoodieSparkSchemaConverters.toHoodieType() can reconstruct the logical type on the way back. VARIANT fields don't get that annotation here, so after a sync round-trip the VARIANT column would be reconstructed as a plain STRUCT rather than a VARIANT. Could you add a similar isVariantField()-style check in convertToSparkSchemaJson (the calling method) and a corresponding reconstruction case in HoodieSparkSchemaConverters?
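
The round-trip concern above can be illustrated with a small sketch. Everything here is hypothetical: the hudi_type metadata key is assumed from the BLOB handling the comment describes, and variantFieldJson / isVariantField are illustrative names, not actual Hudi APIs.

```java
// Hypothetical sketch of the round-trip concern: tag the variant field's
// Spark metadata with a hudi_type marker so the logical VARIANT type can
// be reconstructed from the physical struct on the way back.
public class VariantMetadataSketch {
  // Emit a Spark-schema-JSON fragment for a variant field, carrying a
  // hudi_type marker in the field metadata (assumed convention, mirroring
  // the existing BLOB handling described in the review comment).
  static String variantFieldJson(String name, boolean nullable) {
    return "{\"name\":\"" + name + "\","
        + "\"type\":{\"type\":\"struct\",\"fields\":["
        + "{\"name\":\"metadata\",\"type\":\"binary\",\"nullable\":false,\"metadata\":{}},"
        + "{\"name\":\"value\",\"type\":\"binary\",\"nullable\":false,\"metadata\":{}}]},"
        + "\"nullable\":" + nullable + ","
        + "\"metadata\":{\"hudi_type\":\"VARIANT\"}}";
  }

  // On reconstruction, the marker distinguishes a logical VARIANT from a
  // plain two-field struct of binaries (crude string check for brevity).
  static boolean isVariantField(String fieldJson) {
    return fieldJson.contains("\"hudi_type\":\"VARIANT\"");
  }

  public static void main(String[] args) {
    String json = variantFieldJson("v", true);
    System.out.println(isVariantField(json)); // prints true
  }
}
```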

🤖 This review comment was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.


@Test
public void testConvertSchemaWithVariantField() {
HoodieSchema schema = HoodieSchema.createRecord("root", null, null, false, Arrays.asList(
Contributor


🤖 Looking at the diff, nested variants are already covered in TestHiveSchemaUtil (the nested_variant_field case with a record containing a variant_data sub-field) and in TestBigQuerySchemaResolver (convertSchema_nestedVariantField). The gap is specifically in TestSparkSchemaUtils — the new testConvertSchemaWithVariantField only exercises a top-level variant. A similar test with a struct-containing-variant would close the parity.

@codecov-commenter

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 68.83%. Comparing base (fc7f303) to head (d970429).

Additional details and impacted files
@@            Coverage Diff            @@
##             master   #18483   +/-   ##
=========================================
  Coverage     68.83%   68.83%           
- Complexity    28171    28173    +2     
=========================================
  Files          2459     2459           
  Lines        135095   135095           
  Branches      16378    16378           
=========================================
+ Hits          92992    92995    +3     
+ Misses        34737    34736    -1     
+ Partials       7366     7364    -2     
Flag Coverage Δ
common-and-other-modules 44.54% <ø> (+<0.01%) ⬆️
hadoop-mr-java-client 44.83% <ø> (ø)
spark-client-hadoop-common 48.44% <ø> (ø)
spark-java-tests 48.89% <ø> (-0.03%) ⬇️
spark-scala-tests 45.49% <ø> (+<0.01%) ⬆️
utilities 38.20% <ø> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown.

Files with missing lines Coverage Δ
...ache/hudi/gcp/bigquery/BigQuerySchemaResolver.java 90.00% <ø> (ø)
...java/org/apache/hudi/hive/util/HiveSchemaUtil.java 86.39% <ø> (ø)
...apache/hudi/sync/common/util/SparkSchemaUtils.java 92.10% <ø> (ø)

... and 7 files with indirect coverage changes


@hudi-bot
Collaborator

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build


6 participants