fix(schema): Allow nested projection on BLOB and VARIANT columns in p…#18566
Conversation
hudi-agent
left a comment
There was a problem hiding this comment.
🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
Thanks for the contribution! This PR fixes nested field projection on BLOB/VARIANT columns by short-circuiting pruneDataSchemaInternal when the data schema carries a logical-type RECORD (BLOB/VARIANT) but the required schema is plain RECORD — a mismatch produced by Spark's nested pruning dropping the hudi_type metadata. The rationale (BLOB/VARIANT inner layouts are fixed by LogicalType.validate() and partial pruning would violate that contract) is sound, and Spark still prunes at eval time. No correctness issues found. A few style/readability suggestions in the inline comments. Please take a look, and this should be ready for a Hudi committer or PMC member to take it from here. One naming issue in the Scala tests — the VECTOR test title claims to verify metadata preservation but the assertion only checks the id column.
cc @yihua
| } | ||
| } | ||
|
|
||
| test("Test DELETE on VECTOR column preserves custom-type metadata") { |
There was a problem hiding this comment.
🤖 nit: the test name says "preserves custom-type metadata" but the assertion is select id from $tableName — it never reads back the VECTOR column, so nothing about the metadata is actually verified. Could you either rename it to something like "Test DELETE on table with VECTOR column" or add an assertion on the embedding values to match what the BLOB test does?
- Generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
|
Test failures are expected as the error being fixed here can only be triggered before prior errors are fixed. |
|
use this file for merge conflict resolution after #18540 is merged. |
hudi-agent
left a comment
There was a problem hiding this comment.
🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
Thanks for the contribution! This PR fixes the IllegalArgumentException: Data schema is not a record crash on nested projection of BLOB/VARIANT columns by (a) loosening the Scala converter pattern match so non-canonical (pruned) BLOB/VARIANT structs fall through to the plain RECORD path and (b) short-circuiting pruneDataSchemaInternal when the data-side is BLOB/VARIANT to return the full logical-type schema. The fix flow is consistent end-to-end, nullability and recursion paths check out, and the existing shredded-VARIANT / VECTOR code paths aren't regressed. No correctness issues found. A few style/readability suggestions in the inline comments. Please take a look, and this should be ready for a Hudi committer or PMC member to take it from here. One minor inconsistency between the two canonicality helpers worth a look, otherwise the change is clean and well-commented.
cc @yihua
| * nested-pruned by Spark. | ||
| */ | ||
| private def isCanonicalVariantStruct(structType: StructType): Boolean = { | ||
| val fieldsByName = structType.fields.map(f => f.name -> f).toMap |
There was a problem hiding this comment.
🤖 nit: isCanonicalBlobStruct delegates to structType.equals(expectedBlobStructType), but isCanonicalVariantStruct re-implements the check field-by-field instead of delegating to structType.equals(expectedVariantStructType). Could this be simplified to match the blob pattern? That way if expectedVariantStructType ever changes, there's only one place to update rather than two.
- Generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
b15cf56 to
0065072
Compare
hudi-agent
left a comment
There was a problem hiding this comment.
🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
Thanks for iterating on this! The fix for nested projection on BLOB/VARIANT looks sound — the pruner short-circuit, the split between strict ingest-time validation and permissive read-path conversion, and the catalog-alignment helper all hang together well. I traced the INSERT/UPDATE/MERGE paths with the new alignSchemaWithCatalog flow (including the deliberate alignNullability=false for MERGE) and the nullable-everywhere BLOB projection at the Spark type layer against the RFC-100 write-boundary enforcement, and didn't find new correctness concerns beyond what's already been raised in prior rounds. No new issues flagged from this automated pass — a Hudi committer or PMC member can take it from here for a final review. A couple of minor readability nits on the new helpers in HoodieSparkSchemaConverters.
cc @yihua
| private def isCanonicalBlobStruct(structType: StructType): Boolean = | ||
| matchesStructure(structType, expectedBlobStructType, SQLConf.get.caseSensitiveAnalysis) | ||
|
|
||
| private def matchesStructure(source: DataType, expected: DataType, caseSensitive: Boolean): Boolean = |
There was a problem hiding this comment.
🤖 nit: matchesStructure uses positional matching (zip), but that isn't obvious from the name — and isCanonicalVariantStruct right below uses by-name lookup instead. Could you add a brief note (e.g. // Positional: field order is part of the RFC-100 BLOB canonical contract) so a future reader doesn't try to "normalize" this to by-name matching and accidentally break BLOB field-order validation?
- AI-generated; verify before applying. React 👍/👎 to flag quality.
| case ArrayType(elementType, _) => | ||
| ArrayType(withAllFieldsNullable(elementType), containsNull = true) | ||
| case MapType(keyType, valueType, _) => | ||
| MapType(keyType, withAllFieldsNullable(valueType), valueContainsNull = true) |
There was a problem hiding this comment.
🤖 nit: the method is called withAllFieldsNullable but map key types are intentionally left unchanged here. Could you add a short comment like // Map keys cannot be null in Spark — keyType is left as-is so readers don't mistake the omission for a bug?
- AI-generated; verify before applying. React 👍/👎 to flag quality.
|
|
||
| private def sparkTypeForVectorElementType( | ||
| elementType: HoodieSchema.Vector.VectorElementType): DataType = elementType match { | ||
| elementType: HoodieSchema.Vector.VectorElementType): DataType = elementType match { |
|
@voonhous LGTM but can you check this one weird case in case a user would try this (unlikley but sharing below): |
…runeDataSchema
pruneDataSchemaInternal switches on the required schema's HoodieSchemaType
and, for the RECORD case, guards that the data schema is also RECORD.
HoodieSchemaType enumerates BLOB and VARIANT as distinct values from
RECORD even though both use the Avro RECORD physical type, so the guard
throws "Data schema is not a record" whenever:
- the file's on-disk schema carries the blob/variant logical type
(HoodieSchemaType.fromAvro returns BLOB / VARIANT), and
- Spark's nested-schema pruning strips the hudi_type=BLOB / VARIANT
metadata from the StructField before handing it to the reader,
downgrading the required side to a plain RECORD.
This is reachable by any nested field projection on a BLOB or VARIANT
column routed through HoodieFileGroupReaderBasedFileFormat - e.g.
`SELECT payload.reference.external_path FROM t`.
Short-circuit the RECORD case when the data schema is BLOB or VARIANT:
both types' inner layouts are fixed by their LogicalType.validate()
contracts ({type,data,reference} and {metadata,value} respectively),
so partial pruning is not legal. Return the data schema unchanged;
Spark's projection still prunes at eval time.
VECTOR is represented as Avro FIXED and has no inner fields, so it
falls through the default case unchanged. No fix needed there.
Regression coverage:
- TestHoodieSchemaUtils.testPruningPreservesBlobWhenRequiredIsPlainRecord
directly exercises pruneDataSchema with a data schema carrying a BLOB
field and a required schema that prunes down to a plain RECORD (the
exact shape Spark's nested pruning produces).
- TestDeleteFromTable adds "Test DELETE on BLOB column preserves
custom-type metadata" and "Test DELETE on VECTOR column preserves
custom-type metadata". The BLOB case is the end-to-end reproducer; it
projects `payload.reference.external_path` after DELETE and previously
crashed in the HoodieFileGroupReaderBasedFileFormat read path. The
VECTOR case guards against a regression in the FIXED (default) branch.
yihua
left a comment
There was a problem hiding this comment.
LGTM as a stop-gap fix. We should further look into how to simplify this. The schema-handling logic for BLOB and VARIANT has become overwhelmingly complex on Spark now.
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## master #18566 +/- ##
=========================================
Coverage 68.81% 68.81%
- Complexity 28585 28588 +3
=========================================
Files 2492 2492
Lines 137302 137323 +21
Branches 16756 16767 +11
=========================================
+ Hits 94487 94502 +15
Misses 35194 35194
- Partials 7621 7627 +6
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
…runeDataSchema
Describe the issue this Pull Request addresses
Closes: #18565
NOTE: Merge this after #18540, this is merged PR, so rebase this once #18540 is landed.
Nested field projection on a BLOB or VARIANT column throws
IllegalArgumentException: Data schema is not a recordat query planning. Full repro, stack trace, and environment are in the linked issue.Root cause:
HoodieSchemaTypeenumeratesBLOBandVARIANTas distinct enum values fromRECORD, though both use Avro RECORD physically (HoodieSchemaType.java:120-127).pruneDataSchemaInternal'scase RECORDguards withdataSchema.getType() != RECORDand throws when file schema isBLOB/VARIANTbut Spark's pruning drops thehudi_typeStructField metadata, downgrading the required side to plain RECORD.HoodieFileGroupReaderBasedFileFormathits this.Summary and Changelog
Users can now project nested fields of BLOB / VARIANT columns via SQL.
Fix:
HoodieSchemaUtils.pruneDataSchemaInternal: short-circuit the RECORD case when data schema isBLOBorVARIANT. Return the data schema unchanged.{type, data, reference}and VARIANT's{metadata, value}inner layouts are fixed byLogicalType.validate(). Partial pruning violates the contract. Spark's projection still prunes at eval time, so the full-struct read is free in practice (tiny structs).Tests:
TestHoodieSchemaUtils.testPruningPreservesBlobWhenRequiredIsPlainRecorddirectly exercises the pruner with a BLOB data field and a plain-RECORD required schema (the exact shape Spark's nested pruning produces).TestDeleteFromTableadds two Spark SQL regression tests:Test DELETE on BLOB column preserves custom-type metadatais the end-to-end reproducer. Projectspayload.reference.external_pathafter DELETE.Test DELETE on VECTOR column preserves custom-type metadataguards the FIXED branch.Impact
Risk Level
Low.
Documentation Update
none
Contributor's checklist