
fix(schema): Allow nested projection on BLOB and VARIANT columns in p…#18566

Merged
yihua merged 1 commit into
apache:master from
voonhous:fix-#18565
Apr 27, 2026


@voonhous
Member

@voonhous voonhous commented Apr 23, 2026

…runeDataSchema

Describe the issue this Pull Request addresses

Closes: #18565
NOTE: Merge this after #18540. This PR builds on #18540, so rebase it once #18540 has landed.

Nested field projection on a BLOB or VARIANT column throws IllegalArgumentException: Data schema is not a record at query planning. Full repro, stack trace, and environment are in the linked issue.

Root cause:

  • HoodieSchemaType enumerates BLOB and VARIANT as distinct enum values from RECORD, though both use Avro RECORD physically (HoodieSchemaType.java:120-127).
  • pruneDataSchemaInternal's case RECORD branch guards with dataSchema.getType() != RECORD and throws. The guard trips when the file schema is BLOB / VARIANT while Spark's pruning has dropped the hudi_type StructField metadata, downgrading the required side to plain RECORD.
  • Any nested projection on these columns routed through HoodieFileGroupReaderBasedFileFormat hits this.
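
The mismatch described above can be sketched in a few lines of plain Java. This is a simplified model, not the Hudi code: `RecordGuardSketch`, its `SchemaType` enum, and the helper names are hypothetical stand-ins, and the real guard operates on HoodieSchema objects rather than bare enum values.

```java
// Simplified model of the failing guard; all names here are hypothetical stand-ins.
class RecordGuardSketch {
    enum SchemaType { RECORD, BLOB, VARIANT, FIXED }

    /** Strict guard: a RECORD on the required side demands a RECORD on the data side. */
    static void strictRecordGuard(SchemaType required, SchemaType data) {
        if (required == SchemaType.RECORD && data != SchemaType.RECORD) {
            throw new IllegalArgumentException("Data schema is not a record");
        }
    }

    /** Models nested pruning losing the hudi_type tag: BLOB / VARIANT degrade to RECORD. */
    static SchemaType requiredTypeAfterNestedPruning(SchemaType columnType) {
        boolean tagged = columnType == SchemaType.BLOB || columnType == SchemaType.VARIANT;
        return tagged ? SchemaType.RECORD : columnType;
    }

    /** Returns true when projecting a nested field of this column type trips the guard. */
    static boolean nestedProjectionFails(SchemaType fileColumnType) {
        SchemaType required = requiredTypeAfterNestedPruning(fileColumnType);
        try {
            strictRecordGuard(required, fileColumnType);
            return false;
        } catch (IllegalArgumentException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        for (SchemaType t : SchemaType.values()) {
            System.out.println(t + " fails: " + nestedProjectionFails(t));
        }
    }
}
```

In this model only BLOB and VARIANT trip the guard: the file side keeps its logical type while the required side has already been downgraded to RECORD.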

Summary and Changelog

Users can now project nested fields of BLOB / VARIANT columns via SQL.

Fix:

  • HoodieSchemaUtils.pruneDataSchemaInternal: short-circuit the RECORD case when data schema is BLOB or VARIANT. Return the data schema unchanged.
  • Rationale: BLOB's {type, data, reference} and VARIANT's {metadata, value} inner layouts are fixed by LogicalType.validate(), so partial pruning would violate that contract. Spark's projection still prunes at eval time, and because these structs are tiny, reading the full struct costs little in practice.
  • VECTOR is Avro FIXED with no inner fields. Falls through default case, not touched.
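
A minimal sketch of the short-circuit, assuming a toy schema model. `PruneShortCircuitSketch`, `Schema`, and `prune` are hypothetical names for illustration, the STRING / BYTES leaf types are assumptions, and the real pruneDataSchemaInternal handles more cases (nullability, arrays, maps) than this model does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Simplified model of the pruner fix; every name here is a hypothetical stand-in.
class PruneShortCircuitSketch {
    enum SchemaType { RECORD, BLOB, VARIANT, STRING, BYTES }

    static final class Schema {
        final String name;
        final SchemaType type;
        final List<Schema> fields;

        Schema(String name, SchemaType type, Schema... fields) {
            this.name = name;
            this.type = type;
            this.fields = Collections.unmodifiableList(Arrays.asList(fields));
        }

        /** Looks up a direct child by name, or returns null when absent. */
        Schema field(String fieldName) {
            for (Schema f : fields) {
                if (f.name.equals(fieldName)) {
                    return f;
                }
            }
            return null;
        }
    }

    static Schema prune(Schema data, Schema required) {
        if (required.type != SchemaType.RECORD) {
            return data; // leaf on the required side: nothing to prune in this model
        }
        if (data.type == SchemaType.BLOB || data.type == SchemaType.VARIANT) {
            return data; // short-circuit: fixed inner layout, keep the full struct
        }
        if (data.type != SchemaType.RECORD) {
            throw new IllegalArgumentException("Data schema is not a record");
        }
        List<Schema> kept = new ArrayList<>();
        for (Schema req : required.fields) {
            Schema match = data.field(req.name);
            if (match != null) {
                kept.add(prune(match, req));
            }
        }
        return new Schema(data.name, SchemaType.RECORD, kept.toArray(new Schema[0]));
    }

    public static void main(String[] args) {
        Schema blob = new Schema("payload", SchemaType.BLOB,
            new Schema("type", SchemaType.STRING),
            new Schema("data", SchemaType.BYTES),
            new Schema("reference", SchemaType.RECORD,
                new Schema("external_path", SchemaType.STRING)));
        Schema table = new Schema("t", SchemaType.RECORD,
            new Schema("id", SchemaType.STRING), blob);
        // Required side as nested pruning would hand it over: plain RECORDs only.
        Schema required = new Schema("t", SchemaType.RECORD,
            new Schema("payload", SchemaType.RECORD,
                new Schema("reference", SchemaType.RECORD,
                    new Schema("external_path", SchemaType.STRING))));
        Schema pruned = prune(table, required);
        System.out.println(pruned.fields.size() + " top-level field(s); payload keeps "
            + pruned.field("payload").fields.size() + " inner fields");
    }
}
```

The top-level record is still pruned (the id column is dropped), while the BLOB column comes back with its full inner layout, which is the shape the full-column read path already expects.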

Tests:

  • TestHoodieSchemaUtils.testPruningPreservesBlobWhenRequiredIsPlainRecord directly exercises the pruner with a BLOB data field and a plain-RECORD required schema (the exact shape Spark's nested pruning produces).
  • TestDeleteFromTable adds two Spark SQL regression tests:
    • Test DELETE on BLOB column preserves custom-type metadata is the end-to-end reproducer. Projects payload.reference.external_path after DELETE.
    • Test DELETE on VECTOR column preserves custom-type metadata guards the FIXED branch.

Impact

  • User-facing: nested projection on BLOB / VARIANT columns now works. Previously threw at plan time.
  • Public API: none.
  • Performance: none. Pruner runs once per plan on driver. Full-struct read on BLOB / VARIANT adds at most 2 extra small fields through the reader; Spark still prunes before operator output.

Risk Level

Low.

  • Single relaxation inside one guard clause.
  • Returned schema matches what the full-column read path already uses.
  • No storage format, public API, or other reader paths touched.
  • Covered by one Java unit test plus two Spark SQL regression tests.

Documentation Update

none

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

@voonhous voonhous requested review from rahil-c and yihua April 23, 2026 14:12
@github-actions github-actions Bot added the size:M PR with lines of changes in (100, 300] label Apr 23, 2026
Contributor

@hudi-agent hudi-agent left a comment


🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for the contribution! This PR fixes nested field projection on BLOB/VARIANT columns by short-circuiting pruneDataSchemaInternal when the data schema carries a logical-type RECORD (BLOB/VARIANT) but the required schema is plain RECORD — a mismatch produced by Spark's nested pruning dropping the hudi_type metadata. The rationale (BLOB/VARIANT inner layouts are fixed by LogicalType.validate() and partial pruning would violate that contract) is sound, and Spark still prunes at eval time. No correctness issues found. A few style/readability suggestions in the inline comments. Please take a look, and this should be ready for a Hudi committer or PMC member to take it from here. One naming issue in the Scala tests — the VECTOR test title claims to verify metadata preservation but the assertion only checks the id column.

cc @yihua

}
}

test("Test DELETE on VECTOR column preserves custom-type metadata") {
Contributor


🤖 nit: the test name says "preserves custom-type metadata" but the assertion is select id from $tableName — it never reads back the VECTOR column, so nothing about the metadata is actually verified. Could you either rename it to something like "Test DELETE on table with VECTOR column" or add an assertion on the embedding values to match what the BLOB test does?

- Generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

@voonhous
Member Author

Test failures are expected: the error fixed here can only be reached once the earlier errors are fixed.

@voonhous
Member Author

Use this file for merge conflict resolution after #18540 is merged.

HoodieSparkSchemaConverters.scala.txt

Contributor

@hudi-agent hudi-agent left a comment


🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for the contribution! This PR fixes the IllegalArgumentException: Data schema is not a record crash on nested projection of BLOB/VARIANT columns by (a) loosening the Scala converter pattern match so non-canonical (pruned) BLOB/VARIANT structs fall through to the plain RECORD path and (b) short-circuiting pruneDataSchemaInternal when the data-side is BLOB/VARIANT to return the full logical-type schema. The fix flow is consistent end-to-end, nullability and recursion paths check out, and the existing shredded-VARIANT / VECTOR code paths aren't regressed. No correctness issues found. A few style/readability suggestions in the inline comments. Please take a look, and this should be ready for a Hudi committer or PMC member to take it from here. One minor inconsistency between the two canonicality helpers worth a look, otherwise the change is clean and well-commented.

cc @yihua

* nested-pruned by Spark.
*/
private def isCanonicalVariantStruct(structType: StructType): Boolean = {
val fieldsByName = structType.fields.map(f => f.name -> f).toMap
Contributor


🤖 nit: isCanonicalBlobStruct delegates to structType.equals(expectedBlobStructType), but isCanonicalVariantStruct re-implements the check field-by-field instead of delegating to structType.equals(expectedVariantStructType). Could this be simplified to match the blob pattern? That way if expectedVariantStructType ever changes, there's only one place to update rather than two.

- Generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

@github-actions github-actions Bot added size:L PR with lines of changes in (300, 1000] and removed size:M PR with lines of changes in (100, 300] labels Apr 24, 2026
@voonhous voonhous force-pushed the fix-#18565 branch 3 times, most recently from b15cf56 to 0065072 Compare April 24, 2026 20:37
Contributor

@hudi-agent hudi-agent left a comment


🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for iterating on this! The fix for nested projection on BLOB/VARIANT looks sound — the pruner short-circuit, the split between strict ingest-time validation and permissive read-path conversion, and the catalog-alignment helper all hang together well. I traced the INSERT/UPDATE/MERGE paths with the new alignSchemaWithCatalog flow (including the deliberate alignNullability=false for MERGE) and the nullable-everywhere BLOB projection at the Spark type layer against the RFC-100 write-boundary enforcement, and didn't find new correctness concerns beyond what's already been raised in prior rounds. No new issues flagged from this automated pass — a Hudi committer or PMC member can take it from here for a final review. A couple of minor readability nits on the new helpers in HoodieSparkSchemaConverters.

cc @yihua

private def isCanonicalBlobStruct(structType: StructType): Boolean =
matchesStructure(structType, expectedBlobStructType, SQLConf.get.caseSensitiveAnalysis)

private def matchesStructure(source: DataType, expected: DataType, caseSensitive: Boolean): Boolean =
Contributor


🤖 nit: matchesStructure uses positional matching (zip), but that isn't obvious from the name — and isCanonicalVariantStruct right below uses by-name lookup instead. Could you add a brief note (e.g. // Positional: field order is part of the RFC-100 BLOB canonical contract) so a future reader doesn't try to "normalize" this to by-name matching and accidentally break BLOB field-order validation?

- AI-generated; verify before applying. React 👍/👎 to flag quality.
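
To illustrate why the distinction in this nit matters, here is a minimal sketch contrasting positional and by-name matching on field names only. `StructMatchSketch` and both methods are hypothetical; the actual matchesStructure also takes a case-sensitivity flag and works on Spark DataType values, which this sketch leaves out.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;

// Hypothetical illustration: compares field-name lists only, not data types.
class StructMatchSketch {
    /** Positional: the i-th actual name must equal the i-th expected name. */
    static boolean matchesPositional(List<String> actual, List<String> expected) {
        if (actual.size() != expected.size()) {
            return false;
        }
        for (int i = 0; i < expected.size(); i++) {
            if (!actual.get(i).equals(expected.get(i))) {
                return false;
            }
        }
        return true;
    }

    /** By-name: the same set of names in any order. */
    static boolean matchesByName(List<String> actual, List<String> expected) {
        return actual.size() == expected.size()
            && new HashSet<>(actual).equals(new HashSet<>(expected));
    }

    public static void main(String[] args) {
        List<String> canonical = Arrays.asList("type", "data", "reference");
        List<String> reordered = Arrays.asList("data", "type", "reference");
        System.out.println("positional: " + matchesPositional(reordered, canonical));
        System.out.println("by-name:    " + matchesByName(reordered, canonical));
    }
}
```

A reordered struct passes the by-name check but fails the positional one, so switching the BLOB helper to by-name matching would silently stop validating field order.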

case ArrayType(elementType, _) =>
ArrayType(withAllFieldsNullable(elementType), containsNull = true)
case MapType(keyType, valueType, _) =>
MapType(keyType, withAllFieldsNullable(valueType), valueContainsNull = true)
Contributor


🤖 nit: the method is called withAllFieldsNullable but map key types are intentionally left unchanged here. Could you add a short comment like // Map keys cannot be null in Spark — keyType is left as-is so readers don't mistake the omission for a bug?

- AI-generated; verify before applying. React 👍/👎 to flag quality.


private def sparkTypeForVectorElementType(
elementType: HoodieSchema.Vector.VectorElementType): DataType = elementType match {
elementType: HoodieSchema.Vector.VectorElementType): DataType = elementType match {
Collaborator


nit: whitespace

@rahil-c
Collaborator

rahil-c commented Apr 26, 2026

@voonhous LGTM, but can you check this one weird case, in case a user tries it (unlikely, but sharing below):


The validator code
                                                                                      
  (descriptorType, f.dataType) match {
    case (HoodieSchemaType.BLOB,    st: StructType) => validateBlobStructure(st)      
    case (HoodieSchemaType.VARIANT, st: StructType) => validateVariantStructure(st)   
    case _ =>   // <-- silently no-op                                                 
  }                                                                                   
                                                                                      
  The pattern only matches when both the tag says BLOB and the data type is a         
  StructType. Anything else falls into case _ and does nothing.
                                                                                      
  The bug, concretely                                                                 
   
  Suppose a user (or a buggy upstream transform) builds this schema:                  
   
  val blobMetadata = new MetadataBuilder()                                            
    .putString(HoodieSchema.TYPE_METADATA_FIELD, HoodieSchemaType.BLOB.name())        
    .build()                                                                          
                                                                                      
  val schema = new StructType()                                                       
    .add("id",      LongType)
    .add("payload", LongType, nullable = true, metadata = blobMetadata)               
    //              ^^^^^^^^                              ^^^^^^^^^^^^
    //              wrong type                            says "I'm a BLOB"           
                                                                                      
  The user is asserting "payload is a BLOB" via the metadata, but the data type is a  
  LongType, not the canonical BLOB struct.                                            
                                                                                      
  What happens today

  1. validateCustomTypeStructures(schema) runs.
  2. It sees the hudi_type=BLOB tag on payload.
  3. The match tuple is (BLOB, LongType). Neither pattern matches, so it falls into case _ and returns without throwing.
  4. Then convertStructTypeToHoodieSchema runs.
  5. The BLOB case in toHoodieTypeNested is case blobStruct: StructType if metadata.contains(...) && ...isCanonicalBlobStruct(blobStruct) =>, which requires a StructType, so it doesn't match either.
  6. LongType falls through to the normal case LongType => HoodieSchema.create(LONG) arm.
  7. Result: the field is silently written as a plain LONG. The BLOB tag is ignored, no error.
   
  The user thinks they wrote a BLOB column; the table actually has a LONG column.     
   
  The fix                                                                             

  Add an explicit reject for "tag says BLOB/VARIANT but the type is wrong":           
   
  (descriptorType, f.dataType) match {
    case (HoodieSchemaType.BLOB,    st: StructType) => validateBlobStructure(st)
    case (HoodieSchemaType.VARIANT, st: StructType) => validateVariantStructure(st)
    case (HoodieSchemaType.BLOB,    other) =>
      throw new IllegalArgumentException(
        s"Field '${f.name}' is tagged hudi_type=BLOB but has type $other; expected a StructType.")
    case (HoodieSchemaType.VARIANT, other) =>
      throw new IllegalArgumentException(
        s"Field '${f.name}' is tagged hudi_type=VARIANT but has type $other; expected a StructType.")
    case _ =>
  }

  Now the misuse fails fast at the write boundary instead of silently producing the   
  wrong on-disk schema.

@voonhous
Member Author

voonhous commented Apr 26, 2026

> @voonhous LGTM, but can you check this one weird case, in case a user tries it (unlikely, but sharing below):

This should be a new issue and a new PR: one issue, one PR.
Created new issue to track this: #18603

…runeDataSchema

pruneDataSchemaInternal switches on the required schema's HoodieSchemaType
and, for the RECORD case, guards that the data schema is also RECORD.
HoodieSchemaType enumerates BLOB and VARIANT as distinct values from
RECORD even though both use the Avro RECORD physical type, so the guard
throws "Data schema is not a record" whenever:

  - the file's on-disk schema carries the blob/variant logical type
    (HoodieSchemaType.fromAvro returns BLOB / VARIANT), and
  - Spark's nested-schema pruning strips the hudi_type=BLOB / VARIANT
    metadata from the StructField before handing it to the reader,
    downgrading the required side to a plain RECORD.

This is reachable by any nested field projection on a BLOB or VARIANT
column routed through HoodieFileGroupReaderBasedFileFormat - e.g.
`SELECT payload.reference.external_path FROM t`.

Short-circuit the RECORD case when the data schema is BLOB or VARIANT:
both types' inner layouts are fixed by their LogicalType.validate()
contracts ({type,data,reference} and {metadata,value} respectively),
so partial pruning is not legal. Return the data schema unchanged;
Spark's projection still prunes at eval time.

VECTOR is represented as Avro FIXED and has no inner fields, so it
falls through the default case unchanged. No fix needed there.

Regression coverage:

  - TestHoodieSchemaUtils.testPruningPreservesBlobWhenRequiredIsPlainRecord
    directly exercises pruneDataSchema with a data schema carrying a BLOB
    field and a required schema that prunes down to a plain RECORD (the
    exact shape Spark's nested pruning produces).

  - TestDeleteFromTable adds "Test DELETE on BLOB column preserves
    custom-type metadata" and "Test DELETE on VECTOR column preserves
    custom-type metadata". The BLOB case is the end-to-end reproducer; it
    projects `payload.reference.external_path` after DELETE and previously
    crashed in the HoodieFileGroupReaderBasedFileFormat read path. The
    VECTOR case guards against a regression in the FIXED (default) branch.
Contributor

@yihua yihua left a comment


LGTM as a stop-gap fix. We should further look into how to simplify this. The schema-handling logic for BLOB and VARIANT has become overwhelmingly complex on Spark now.

@hudi-bot
Collaborator

CI report:

Bot commands

@hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

@github-actions github-actions Bot added size:M PR with lines of changes in (100, 300] and removed size:L PR with lines of changes in (300, 1000] labels Apr 27, 2026
@codecov-commenter

Codecov Report

❌ Patch coverage is 72.50000% with 11 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.81%. Comparing base (fdf27db) to head (96d21d3).
⚠️ Report is 2 commits behind head on master.

Files with missing lines                                  Patch %   Lines
...e/spark/sql/avro/HoodieSparkSchemaConverters.scala     70.00%    0 Missing and 9 partials ⚠️
...g/apache/hudi/common/schema/HoodieSchemaUtils.java     66.66%    0 Missing and 1 partial ⚠️
...n/scala/org/apache/hudi/HoodieSparkSqlWriter.scala     66.66%    1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff            @@
##             master   #18566   +/-   ##
=========================================
  Coverage     68.81%   68.81%           
- Complexity    28585    28588    +3     
=========================================
  Files          2492     2492           
  Lines        137302   137323   +21     
  Branches      16756    16767   +11     
=========================================
+ Hits          94487    94502   +15     
  Misses        35194    35194           
- Partials       7621     7627    +6     
Flag                          Coverage Δ
common-and-other-modules      44.30% <5.00%> (-0.01%) ⬇️
hadoop-mr-java-client         44.82% <0.00%> (-0.02%) ⬇️
spark-client-hadoop-common    48.41% <0.00%> (-0.02%) ⬇️
spark-java-tests              49.49% <62.50%> (+<0.01%) ⬆️
spark-scala-tests             45.33% <72.50%> (+<0.01%) ⬆️
utilities                     37.90% <27.50%> (-0.02%) ⬇️


Files with missing lines                                  Coverage Δ
.../org/apache/hudi/HoodieSchemaConversionUtils.scala     75.86% <100.00%> (+0.42%) ⬆️
...di/command/AlterHoodieTableAddColumnsCommand.scala     77.77% <100.00%> (ø)
.../command/AlterHoodieTableChangeColumnCommand.scala     80.55% <100.00%> (ø)
...g/apache/hudi/common/schema/HoodieSchemaUtils.java     83.56% <66.66%> (-0.18%) ⬇️
...n/scala/org/apache/hudi/HoodieSparkSqlWriter.scala     78.46% <66.66%> (+0.03%) ⬆️
...e/spark/sql/avro/HoodieSparkSchemaConverters.scala     77.08% <70.00%> (+0.19%) ⬆️

... and 14 files with indirect coverage changes


@yihua yihua merged commit 7ae0fd9 into apache:master Apr 27, 2026
55 of 56 checks passed
@voonhous voonhous deleted the fix-#18565 branch April 27, 2026 10:43

Labels

size:M PR with lines of changes in (100, 300]

Development

Successfully merging this pull request may close these issues.

Nested field projection on BLOB / VARIANT columns throws "Data schema is not a record"

6 participants