fix: prevent parseTypeDescriptor crash for VARIANT (#18510)
bvaradar merged 3 commits into apache:master
Conversation
…n schema conversion guards

- The BLOB/VECTOR guard conditions in HoodieSparkSchemaConverters called parseTypeDescriptor() for any StructType/ArrayType with hudi_type metadata, which threw IllegalArgumentException for types like VARIANT that are not custom logical types.
- Add an isCustomLogicalTypeDescriptor() safe check to short-circuit the guards before parseTypeDescriptor() is called.
- Add a regression test that reproduces the struct+metadata VARIANT path.
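The first iteration of the fix can be sketched in miniature. This is a hypothetical, simplified model (plain strings and a local set standing in for the real HoodieSchema API), not the actual Hudi code:

```java
import java.util.Set;

public class SchemaGuardSketch {
    // Registered custom logical type names before this PR; VARIANT was
    // intentionally absent, which is why the full parse threw for it.
    static final Set<String> CUSTOM_LOGICAL_TYPES = Set.of("BLOB", "VECTOR");

    // Extracts the bare type name from a descriptor such as
    // "BLOB" or "VECTOR(dim=128)".
    static String typeName(String descriptor) {
        int paren = descriptor.indexOf('(');
        String name = paren < 0 ? descriptor : descriptor.substring(0, paren);
        return name.trim().toUpperCase();
    }

    // Safe membership test: unlike a full descriptor parse, this never
    // throws on unknown names such as "VARIANT".
    static boolean isCustomLogicalTypeDescriptor(String descriptor) {
        return CUSTOM_LOGICAL_TYPES.contains(typeName(descriptor));
    }
}
```

With this pre-check in front of the guards, a StructType tagged hudi_type=VARIANT simply falls through to the ordinary struct conversion path instead of crashing.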
yihua left a comment
🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
LGTM — clean, targeted fix that correctly short-circuits the parseTypeDescriptor() call for non-custom logical types like VARIANT. The new isCustomLogicalTypeDescriptor() guard safely extracts the type name and checks membership in CUSTOM_LOGICAL_TYPES without throwing, and the regression test covers the exact crash scenario.
@voonhous would it make more sense to just add it? I know that it's technically not a custom logical type, since Spark 4.0 has a native VARIANT type; however, I think Spark 3.5 does not have this. My take is that if a user has to pass the hudi_type metadata explicitly, it is reasonable to treat it as a custom logical type.
Okay, will just do it.
- Address review feedback on apache#18510; restructure the crash fix so a StructType tagged with hudi_type=VARIANT is handled consistently with BLOB/VECTOR.
- The hudi_type metadata is the deliberate escape hatch for engines without a native representation (notably Spark 3.5), so using it is itself a custom-logical-type signal.
- Add VARIANT to CUSTOM_LOGICAL_TYPES and give it a case in parseTypeDescriptor, mirroring BLOB.
- In HoodieSparkSchemaConverters, add a dedicated VARIANT pattern case that validates the expected unshredded structure ({metadata, value} binary fields) and produces HoodieSchema.Variant.
- On Spark 4.0+ the column round-trips as native VariantType via the existing reverse conversion path.
- Remove the isCustomLogicalTypeDescriptor short-circuit helper; with VARIANT now properly registered, the BLOB/VECTOR guards no longer need the pre-check.
- Add unit tests for parseTypeDescriptor VARIANT (success, case insensitivity, parameter rejection) and integration tests asserting VARIANT promotion and malformed-struct rejection.
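The unshredded-structure check described above can be sketched as follows. This is a hedged illustration: the Field record is a minimal stand-in for Spark's StructField, and the boolean validator simplifies the real converter, which raises an error on a malformed struct:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class VariantShapeSketch {
    // Minimal stand-in for Spark's StructField (name, type, nullability).
    record Field(String name, String dataType, boolean nullable) {}

    // Unshredded variant shape per the PR description: exactly two
    // non-nullable binary fields named "metadata" and "value".
    static boolean isUnshreddedVariant(List<Field> fields) {
        if (fields.size() != 2) {
            return false;
        }
        Map<String, Field> byName =
            fields.stream().collect(Collectors.toMap(Field::name, f -> f));
        for (String n : List.of("metadata", "value")) {
            Field f = byName.get(n);
            if (f == null || !f.dataType().equals("binary") || f.nullable()) {
                return false;
            }
        }
        return true;
    }
}
```

A struct passing this check is promoted to HoodieSchema.Variant; on Spark 4.0+ the same column reverse-converts to the native VariantType.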
yihua left a comment
Nice updates — the previous parseTypeDescriptor crash is now fixed in a cleaner way by adding VARIANT to CUSTOM_LOGICAL_TYPES and giving struct-tagged variants a dedicated pattern-match branch that validates the {metadata: binary, value: binary} shape and promotes them to a first-class VARIANT. The helper isCustomLogicalTypeDescriptor was removed since it's redundant once VARIANT is in the set; tests cover the happy path, case-insensitivity, parameter rejection, and malformed struct rejection. Prior review feedback has been addressed.
Force-pushed from 08c7635 to 4c548d6.
yihua left a comment
Style & Readability Review — a few small naming and style nits in the Scala converter, otherwise clean.
    case variantStruct: StructType if metadata.contains(HoodieSchema.TYPE_METADATA_FIELD) &&
        HoodieSchema.parseTypeDescriptor(metadata.getString(HoodieSchema.TYPE_METADATA_FIELD)).getType == HoodieSchemaType.VARIANT =>
      validateVariantStructure(variantStruct)
🤖 nit: the trailing null is opaque at the call site — could you either use a named argument (if Scala allows it across the Java boundary) or add a short inline comment like /* typedValueSchema */ to clarify what the third parameter represents?
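The readability point behind this nit can be illustrated with a self-contained sketch. The buildVariant helper and its parameter list are hypothetical, since the real signature is not visible in this excerpt; Java has no named arguments, so the inline-comment option from the nit is shown:

```java
public class CallSiteClarity {
    // Hypothetical signature standing in for the converter helper.
    static String buildVariant(String name, boolean nullable, String typedValueSchema) {
        return name + "(nullable=" + nullable + ", typed="
            + (typedValueSchema == null ? "none" : typedValueSchema) + ")";
    }

    // Opaque: a reader cannot tell what the trailing null means.
    static final String TERSE = buildVariant("v", false, null);

    // The inline comment documents intent right at the call site.
    static final String CLEAR = buildVariant("v", false, /* typedValueSchema */ null);
}
```

Both calls behave identically; only the second one tells the reader what the null is for.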
    /**
     * Validates that a StructType matches the expected unshredded variant schema
     * (two non-null {@code BinaryType} fields: {@code metadata} and {@code value}).
🤖 nit: {@code BinaryType} is Java-flavored Javadoc — the Scala convention in this codebase is to use backticks instead (e.g., `BinaryType`, `metadata`, `value`).
     */
    private def validateVariantStructure(structType: StructType): Unit = {
      val fieldsByName = structType.fields.map(f => f.name -> f).toMap
      val ok = structType.length == 2 &&
🤖 nit: ok is a pretty opaque name here — something like isValid or hasValidShape would make the guard condition and the if (!ok) check a bit easier to parse at a glance.
yihua left a comment
LGTM — targeted fix that extends the existing custom logical type machinery to cover VARIANT, with proper structural validation mirroring the BLOB/VECTOR paths and good regression test coverage.
Addressed code coverage complaints.
yihua left a comment
Thanks for adding the negative test coverage — the new tests exercise all four invalid-variant paths (wrong field count, nullable metadata, wrong value type, wrong field names) and assert the expected IllegalArgumentException, which nicely rounds out the validation logic from the prior review.
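The four invalid-variant paths listed above can be sketched with the same kind of simplified field model. This is a stand-in for the real Spark StructType, so the validator here returns a boolean where the actual code throws IllegalArgumentException:

```java
import java.util.List;

public class InvalidVariantShapes {
    // Minimal stand-in for Spark's StructField.
    record Field(String name, String dataType, boolean nullable) {}

    // Simplified shape check: two non-nullable binary fields named
    // "metadata" and "value".
    static boolean isValidVariantShape(List<Field> fields) {
        if (fields.size() != 2) {
            return false;
        }
        for (String expected : List.of("metadata", "value")) {
            boolean found = fields.stream().anyMatch(f ->
                f.name().equals(expected)
                    && f.dataType().equals("binary")
                    && !f.nullable());
            if (!found) {
                return false;
            }
        }
        return true;
    }

    static final Field METADATA = new Field("metadata", "binary", false);
    static final Field VALUE = new Field("value", "binary", false);

    // The happy path plus the four rejected shapes from the tests.
    static final List<Field> VALID             = List.of(METADATA, VALUE);
    static final List<Field> WRONG_FIELD_COUNT = List.of(METADATA);
    static final List<Field> NULLABLE_METADATA = List.of(new Field("metadata", "binary", true), VALUE);
    static final List<Field> WRONG_VALUE_TYPE  = List.of(METADATA, new Field("value", "string", false));
    static final List<Field> WRONG_FIELD_NAMES = List.of(new Field("meta", "binary", false), VALUE);
}
```

Each of the four malformed shapes trips a different clause of the check, which is what gives the negative tests their coverage value.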
Codecov Report

❌ Patch coverage is

Additional details and impacted files

    @@             Coverage Diff              @@
    ##             master   #18510      +/-   ##
    ============================================
    - Coverage     68.85%   68.84%    -0.02%
    - Complexity    28241    28252       +11
    ============================================
      Files          2460     2464        +4
      Lines        135348   135462      +114
      Branches      16410    16425       +15
    ============================================
    + Hits          93200    93255       +55
    - Misses        34770    34821       +51
    - Partials       7378     7386        +8
Flags with carried forward coverage won't be shown.
Describe the issue this Pull Request addresses
Closes: #18509
Summary and Changelog
Impact
Users can now create tables with VARIANT types using the Spark 4.0 DataFrame API.
Risk Level
Low
Documentation Update
None
Contributor's checklist