
fix: prevent parseTypeDescriptor crash for VARIANT#18510

Merged
bvaradar merged 3 commits into apache:master from voonhous:fix-variant-dataframe-write-path
Apr 21, 2026

Conversation

@voonhous
Member

Describe the issue this Pull Request addresses

Closes: #18509

  • The BLOB/VECTOR guard conditions in HoodieSparkSchemaConverters called parseTypeDescriptor() for any StructType/ArrayType with hudi_type metadata, which threw IllegalArgumentException for types like VARIANT that are not custom logical types.

Summary and Changelog

  • Add an isCustomLogicalTypeDescriptor() safe check to short-circuit the guards before parseTypeDescriptor() is called.
  • Add a regression test that reproduces the struct+metadata VARIANT path.
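The short-circuit idea can be sketched as a small, self-contained model. Everything here is a simplified stand-in: a plain Map replaces Spark field metadata, and the type set mimics (but is not) HoodieSchema.CUSTOM_LOGICAL_TYPES before VARIANT was added.

```java
import java.util.Map;
import java.util.Set;

public class TypeDescriptorGuard {
    // Stand-in for the custom logical type registry at the time of this fix.
    private static final Set<String> CUSTOM_LOGICAL_TYPES = Set.of("BLOB", "VECTOR");

    // Safe membership check: never throws, unlike a full parseTypeDescriptor()
    // call on a descriptor it does not know, such as "VARIANT".
    public static boolean isCustomLogicalTypeDescriptor(String descriptor) {
        // Descriptors may carry parameters, e.g. "VECTOR(dim=128)"; use the bare name.
        int paren = descriptor.indexOf('(');
        String typeName = (paren < 0 ? descriptor : descriptor.substring(0, paren)).trim();
        return CUSTOM_LOGICAL_TYPES.contains(typeName.toUpperCase());
    }

    // Guard applied before the (potentially throwing) full parse.
    public static boolean shouldParse(Map<String, String> metadata) {
        String descriptor = metadata.get("hudi_type");
        return descriptor != null && isCustomLogicalTypeDescriptor(descriptor);
    }
}
```

With this guard in front, a struct carrying hudi_type=VARIANT simply falls through to the default conversion path instead of crashing.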

Impact

Users can now create tables with VARIANT types using the Spark 4.0 DataFrame API.

Risk Level

Low

Documentation Update

None

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

…n schema conversion guards

- The BLOB/VECTOR guard conditions in HoodieSparkSchemaConverters called parseTypeDescriptor() for any StructType/ArrayType with hudi_type metadata, which threw IllegalArgumentException for types like VARIANT that are not custom logical types.
- Add isCustomLogicalTypeDescriptor() safe check to short-circuit the guards before parseTypeDescriptor() is called.
- Add regression test that reproduces the struct+metadata VARIANT path.
@github-actions github-actions Bot added the size:S PR with lines of changes in (10, 100] label Apr 16, 2026
@voonhous voonhous assigned voonhous and unassigned voonhous Apr 16, 2026
Contributor

@yihua yihua left a comment


🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

LGTM — clean, targeted fix that correctly short-circuits the parseTypeDescriptor() call for non-custom logical types like VARIANT. The new isCustomLogicalTypeDescriptor() guard safely extracts the type name and checks membership in CUSTOM_LOGICAL_TYPES without throwing, and the regression test covers the exact crash scenario.

@rahil-c
Collaborator

rahil-c commented Apr 17, 2026

@voonhous would it make more sense to just add VARIANT to the CUSTOM_LOGICAL_TYPES and do the same thing we did for VECTOR and BLOB, adding a case in parseTypeDescriptor?

I know that it's technically not a custom logical type since Spark 4.0 has a native VARIANT type; however, I think Spark 3.5 does not have this. My take is that if a user has to pass hudi_type=... (since their engine's API does not have a native type, requiring them to attach this hudi_type metadata field), then it falls under the custom logical type case.

@voonhous
Member Author

@voonhous would it make more sense to just add VARIANT to the CUSTOM_LOGICAL_TYPES and do the same thing we did for VECTOR and BLOB, adding a case in parseTypeDescriptor?

Okay, will just do it.

voonhous added a commit to voonhous/hudi that referenced this pull request Apr 17, 2026
- Address review feedback on apache#18510, restructure the crash fix so a StructType tagged with hudi_type=VARIANT is handled consistently with BLOB/VECTOR.
- The hudi_type metadata is the deliberate escape hatch for engines without a native representation (notably Spark 3.5), so using it is itself a custom-logical-type signal.
- Add VARIANT to CUSTOM_LOGICAL_TYPES and give it a case in parseTypeDescriptor, mirroring BLOB.
- In HoodieSparkSchemaConverters, add a dedicated VARIANT pattern case that validates the expected unshredded structure ({metadata, value} binary fields) and produces HoodieSchema.Variant.
- On Spark 4.0+ the column round-trips as native VariantType via the existing reverse conversion path.
- Remove the isCustomLogicalTypeDescriptor short-circuit helper; with VARIANT now properly registered, the BLOB/VECTOR guards no longer need the pre-check.
- Add unit tests for parseTypeDescriptor VARIANT (success, case insensitivity, parameter rejection) and integration tests asserting VARIANT promotion and malformed-struct rejection.
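The shape validation these commits describe can be sketched as follows. The nested records are simplified stand-ins for Spark's StructType/StructField/BinaryType, not the real classes, and the error message is illustrative.

```java
import java.util.List;

public class VariantShape {
    enum DataType { BINARY, STRING }
    record StructField(String name, DataType dataType, boolean nullable) {}
    record StructType(List<StructField> fields) {}

    // Valid unshredded variant: exactly two non-null binary fields,
    // named "metadata" and "value", in either order.
    public static void validateVariantStructure(StructType structType) {
        boolean isValid = structType.fields().size() == 2
            && structType.fields().stream().allMatch(f ->
                (f.name().equals("metadata") || f.name().equals("value"))
                    && f.dataType() == DataType.BINARY
                    && !f.nullable())
            // Reject two fields with the same name (e.g. metadata twice).
            && structType.fields().stream().map(StructField::name).distinct().count() == 2;
        if (!isValid) {
            throw new IllegalArgumentException(
                "Struct tagged hudi_type=VARIANT must be {metadata: binary, value: binary}, got: " + structType);
        }
    }
}
```

A struct passing this check is promoted to a first-class VARIANT; anything else tagged hudi_type=VARIANT fails fast with a descriptive error.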
Contributor

@yihua yihua left a comment


Nice updates — the previous parseTypeDescriptor crash is now fixed in a cleaner way by adding VARIANT to CUSTOM_LOGICAL_TYPES and giving struct-tagged variants a dedicated pattern-match branch that validates the {metadata: binary, value: binary} shape and promotes them to a first-class VARIANT. The helper isCustomLogicalTypeDescriptor was removed since it's redundant once VARIANT is in the set; tests cover the happy path, case-insensitivity, parameter rejection, and malformed struct rejection. Prior review feedback has been addressed.

@github-actions github-actions Bot added size:M PR with lines of changes in (100, 300] and removed size:S PR with lines of changes in (10, 100] labels Apr 17, 2026
@voonhous voonhous force-pushed the fix-variant-dataframe-write-path branch from 08c7635 to 4c548d6 on April 17, 2026 12:46
Contributor

@yihua yihua left a comment


Style & Readability Review — a few small naming and style nits in the Scala converter, otherwise clean.


case variantStruct: StructType if metadata.contains(HoodieSchema.TYPE_METADATA_FIELD) &&
  HoodieSchema.parseTypeDescriptor(metadata.getString(HoodieSchema.TYPE_METADATA_FIELD)).getType == HoodieSchemaType.VARIANT =>
    validateVariantStructure(variantStruct)
Contributor


🤖 nit: the trailing null is opaque at the call site — could you either use a named argument (if Scala allows it across the Java boundary) or add a short inline comment like /* typedValueSchema */ to clarify what the third parameter represents?



/**
* Validates that a StructType matches the expected unshredded variant schema
* (two non-null {@code BinaryType} fields: {@code metadata} and {@code value}).
Contributor


🤖 nit: {@code BinaryType} is Java-flavored Javadoc — the Scala convention in this codebase is to use backticks instead (e.g., `BinaryType`, `metadata`, `value`).


*/
private def validateVariantStructure(structType: StructType): Unit = {
val fieldsByName = structType.fields.map(f => f.name -> f).toMap
val ok = structType.length == 2 &&
Contributor


🤖 nit: ok is a pretty opaque name here — something like isValid or hasValidShape would make the guard condition and the if (!ok) check a bit easier to parse at a glance.


Contributor

@yihua yihua left a comment


LGTM — targeted fix that extends the existing custom logical type machinery to cover VARIANT, with proper structural validation mirroring the BLOB/VECTOR paths and good regression test coverage.

@voonhous
Member Author

Addressed code coverage complaints.

Contributor

@yihua yihua left a comment


Thanks for adding the negative test coverage — the new tests exercise all four invalid-variant paths (wrong field count, nullable metadata, wrong value type, wrong field names) and assert the expected IllegalArgumentException, which nicely rounds out the validation logic from the prior review.
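For illustration, a toy model of the descriptor parsing those unit tests exercise (success, case-insensitivity, parameter rejection) might look like this. The names, rules, and messages are assumptions for the sketch; the real parseTypeDescriptor lives in HoodieSchema and returns a richer descriptor object.

```java
import java.util.Set;

public class DescriptorModel {
    // Stand-in registry mirroring the post-fix state: VARIANT is now a member.
    private static final Set<String> CUSTOM_LOGICAL_TYPES = Set.of("BLOB", "VECTOR", "VARIANT");

    // Parses "NAME" or "NAME(params)" case-insensitively; in this sketch,
    // VARIANT accepts no parameters, mirroring the parameter-rejection test.
    public static String parseTypeDescriptor(String descriptor) {
        String trimmed = descriptor.trim();
        int paren = trimmed.indexOf('(');
        String name = (paren < 0 ? trimmed : trimmed.substring(0, paren)).trim().toUpperCase();
        if (!CUSTOM_LOGICAL_TYPES.contains(name)) {
            throw new IllegalArgumentException("Unknown type descriptor: " + descriptor);
        }
        if (name.equals("VARIANT") && paren >= 0) {
            throw new IllegalArgumentException("VARIANT does not accept parameters");
        }
        return name;
    }
}
```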

@hudi-bot
Collaborator

CI report:

@hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@bvaradar bvaradar merged commit e303579 into apache:master Apr 21, 2026
56 checks passed
@voonhous voonhous deleted the fix-variant-dataframe-write-path branch April 21, 2026 16:16
@codecov-commenter

Codecov Report

❌ Patch coverage is 80.95238% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.84%. Comparing base (5b68607) to head (d213c06).
⚠️ Report is 71 commits behind head on master.

Files with missing lines Patch % Lines
...e/spark/sql/avro/HoodieSparkSchemaConverters.scala 76.47% 0 Missing and 4 partials ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #18510      +/-   ##
============================================
- Coverage     68.85%   68.84%   -0.02%     
- Complexity    28241    28252      +11     
============================================
  Files          2460     2464       +4     
  Lines        135348   135462     +114     
  Branches      16410    16425      +15     
============================================
+ Hits          93200    93255      +55     
- Misses        34770    34821      +51     
- Partials       7378     7386       +8     
Flag Coverage Δ
common-and-other-modules 44.57% <19.04%> (-0.01%) ⬇️
hadoop-mr-java-client 44.75% <25.00%> (-0.11%) ⬇️
spark-client-hadoop-common 48.43% <4.76%> (-0.09%) ⬇️
spark-java-tests 48.93% <66.66%> (-0.05%) ⬇️
spark-scala-tests 45.46% <71.42%> (-0.04%) ⬇️
utilities 38.19% <4.76%> (-0.05%) ⬇️

Flags with carried forward coverage won't be shown.

Files with missing lines Coverage Δ
...va/org/apache/hudi/common/schema/HoodieSchema.java 87.64% <100.00%> (+0.04%) ⬆️
...e/spark/sql/avro/HoodieSparkSchemaConverters.scala 79.31% <76.47%> (-0.26%) ⬇️

... and 17 files with indirect coverage changes



Labels

size:M PR with lines of changes in (100, 300]

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Variant write error on Dataframe path

6 participants