feat(sync): Map VARIANT type to struct in Hive, Spark, and BigQuery sync #18483
voonhous wants to merge 2 commits into apache:master from
Conversation
yihua
left a comment
🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
LGTM — clean, targeted fix that correctly maps VARIANT to its physical struct representation across all three sync backends. The fallthrough to the RECORD/struct path is the right approach since VARIANT is physically backed by a RECORD schema with metadata and value BYTES fields, and the tests confirm the expected output for each backend.
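As context for the mapping, here is a minimal sketch of the physical struct a VARIANT column resolves to — binary metadata plus binary value, per the description above. The class, method, and exact type string are illustrative assumptions, not Hudi's actual helpers.

```java
// Illustrative sketch of VARIANT's physical representation as described in
// the review: a struct with binary metadata and value fields. Names and the
// type string are assumptions for illustration, not Hudi's API.
public class VariantPhysicalTypeSketch {
  // Hive-style struct type string backing a VARIANT column.
  static String hiveTypeForVariant() {
    return "struct<metadata:binary,value:binary>";
  }

  public static void main(String[] args) {
    System.out.println(hiveTypeForVariant());
  }
}
```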
  standardSQLTypeName = StandardSQLTypeName.NUMERIC;
  break;
case RECORD:
case VARIANT:
🤖 Now that VARIANT is handled here, it might be worth also adding case BLOB: in this same switch — SparkSchemaUtils and HiveSchemaUtil already cover BLOB, so BigQuery is currently the odd one out and would still throw RuntimeException for BLOB fields. Happy to leave it as a follow-up if out of scope, just flagging the asymmetry.
🤖 This review comment was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
Ignore, this is addressed in a separate PR here:
#18482
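For readers skimming the thread, the fallthrough pattern under discussion can be sketched with a simplified stand-in enum. The real resolver uses Hudi's schema types and BigQuery's StandardSQLTypeName; BLOB is deliberately left unhandled here to mirror the asymmetry flagged above. Everything below is illustrative, not Hudi code.

```java
public class ResolverSwitchSketch {
  enum LogicalType { NUMERIC, RECORD, VARIANT, BLOB }

  // Simplified stand-in for the BigQuery resolver's switch: VARIANT falls
  // through to the RECORD/struct path, while BLOB still hits the default
  // branch and throws, as noted in the review comment.
  static String resolve(LogicalType type) {
    switch (type) {
      case NUMERIC:
        return "NUMERIC";
      case RECORD:
      case VARIANT:
        return "STRUCT";
      default:
        throw new RuntimeException("Unsupported type: " + type);
    }
  }

  public static void main(String[] args) {
    System.out.println(resolve(LogicalType.VARIANT));
  }
}
```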
}

@Test
void convertSchema_variantField() {
The other tests in this file use a combination of camelCase_snakeCase for naming tests, so I think it's okay to follow that.
@Test
public void testConvertSchemaWithVariantField() {
  HoodieSchema schema = HoodieSchema.createRecord("root", null, null, false, Arrays.asList(
Can we also cover nested variants?
🤖 Looking at the diff, nested variants are already covered in TestHiveSchemaUtil (the nested_variant_field case with a record containing a variant_data sub-field) and in TestBigQuerySchemaResolver (convertSchema_nestedVariantField). The gap is specifically in TestSparkSchemaUtils — the new testConvertSchemaWithVariantField only exercises a top-level variant. A similar test with a struct-containing-variant would close the gap.
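A hedged sketch of what such a test could assert — the Spark schema JSON fragment for a VARIANT's physical struct (binary metadata plus binary value). The JSON shape here is an assumption based on the discussion in this thread, not copied from Hudi's actual output.

```java
public class NestedVariantJsonSketch {
  // Builds an illustrative Spark-JSON fragment for a VARIANT's physical
  // struct (binary metadata + binary value). The shape is an assumption
  // based on the review discussion, not Hudi's verified output.
  static String variantStructJson() {
    return "{\"type\":\"struct\",\"fields\":["
        + "{\"name\":\"metadata\",\"type\":\"binary\",\"nullable\":false,\"metadata\":{}},"
        + "{\"name\":\"value\",\"type\":\"binary\",\"nullable\":false,\"metadata\":{}}]}";
  }

  public static void main(String[] args) {
    System.out.println(variantStructJson());
  }
}
```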
Same feedback as what I mentioned for the BLOB PR — just want to confirm whether we need to consider other syncs, and whether we can add one functional or e2e test for this?
- VARIANT columns were not handled in HiveSchemaUtil, SparkSchemaUtils, and BigQuerySchemaResolver, causing UnsupportedOperationException when syncing tables with variant columns.
- Map VARIANT to its underlying physical type (struct) so external engines can read via the metastore.
yihua
left a comment
🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
Thanks for contributing! The VARIANT→STRUCT mapping is the right physical representation for external engines, and the Hive and BigQuery changes look correct. There's one gap in the Spark sync path worth looking at before merge — inline comments below.
+ ",\"valueContainsNull\":" + valueOptional + "}";
case RECORD:
case BLOB:
case VARIANT:
🤖 BLOB fields tag themselves with {"hudi_type":"BLOB"} in the Spark field metadata so that HoodieSparkSchemaConverters.toHoodieType() can reconstruct the logical type on the way back. VARIANT fields don't get that annotation here, so after a sync round-trip the VARIANT column would be reconstructed as a plain STRUCT rather than a VARIANT. Could you add a similar isVariantField()-style check in convertToSparkSchemaJson (the calling method) and a corresponding reconstruction case in HoodieSparkSchemaConverters?
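The suggestion above could be sketched as follows — a hypothetical tagging helper mirroring the BLOB convention the comment describes. The {"hudi_type": ...} key is taken from the comment; the method name and everything else are assumed for illustration and are not Hudi's implementation.

```java
public class VariantMetadataTagSketch {
  // Hypothetical field-metadata tagging, mirroring the BLOB convention the
  // comment describes: annotate the Spark field so the logical type can be
  // reconstructed on the way back. Illustrative only, not Hudi's code.
  static String fieldMetadataJson(String logicalType) {
    if ("VARIANT".equals(logicalType) || "BLOB".equals(logicalType)) {
      return "{\"hudi_type\":\"" + logicalType + "\"}";
    }
    return "{}"; // plain fields carry empty metadata
  }

  public static void main(String[] args) {
    System.out.println(fieldMetadataJson("VARIANT"));
  }
}
```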
🤖 Generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files

@@ Coverage Diff @@
##             master   #18483   +/- ##
=========================================
  Coverage     68.83%   68.83%
- Complexity    28171    28173       +2
=========================================
  Files          2459     2459
  Lines        135095   135095
  Branches      16378    16378
=========================================
+ Hits          92992    92995       +3
+ Misses        34737    34736       -1
+ Partials       7366     7364       -2

Flags with carried forward coverage won't be shown.
Describe the issue this Pull Request addresses
Map VARIANT type to struct in Hive, Spark, and BigQuery sync:
Summary and Changelog
Add metasync support for the VARIANT logical type for:
Impact
Users are able to perform metasync for VARIANT logical types.
Risk Level
Low.
Documentation Update
None.
Contributor's checklist