fix(schema): Allow nested projection on BLOB and VARIANT columns in p… by voonhous · Pull Request #18566 · apache/hudi

voonhous · 2026-04-23T14:11:00Z

…runeDataSchema

Describe the issue this Pull Request addresses

Closes: #18565
NOTE: Merge this after #18540. This PR is stacked on it, so rebase once #18540 has landed.

Nested field projection on a BLOB or VARIANT column throws IllegalArgumentException: "Data schema is not a record" at query planning time. The full repro, stack trace, and environment are in the linked issue.

Root cause:

HoodieSchemaType enumerates BLOB and VARIANT as distinct enum values from RECORD, though both use Avro RECORD physically (HoodieSchemaType.java:120-127).
pruneDataSchemaInternal's case RECORD guards with dataSchema.getType() != RECORD, so it throws when the file schema is BLOB / VARIANT and Spark's nested pruning has dropped the hudi_type StructField metadata, downgrading the required side to plain RECORD.
Any nested projection on these columns routed through HoodieFileGroupReaderBasedFileFormat hits this.

Summary and Changelog

Users can now project nested fields of BLOB / VARIANT columns via SQL.

Fix:

HoodieSchemaUtils.pruneDataSchemaInternal: short-circuit the RECORD case when data schema is BLOB or VARIANT. Return the data schema unchanged.
Rationale: BLOB's {type, data, reference} and VARIANT's {metadata, value} inner layouts are fixed by LogicalType.validate(), so partial pruning would violate that contract. Spark's projection still prunes at eval time, so the full-struct read costs little in practice (the structs are tiny).
VECTOR is Avro FIXED with no inner fields, so it falls through to the default case and is not touched.
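
The short-circuit described above can be sketched with a toy model. This is a minimal illustration under stated assumptions, not the Hudi implementation: PruneSketch, Node, and Kind are hypothetical stand-ins for HoodieSchema and HoodieSchemaType, the kinds assigned to the inner BLOB fields are placeholders, and the way the real pruner handles missing fields may differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model of the pruner guard. None of these types are Hudi's real classes. */
final class PruneSketch {
  enum Kind { RECORD, BLOB, VARIANT, LONG, STRING, BYTES }

  /** Hypothetical schema node: a name, a kind, and zero or more child fields. */
  static final class Node {
    final String name;
    final Kind kind;
    final List<Node> fields;

    Node(String name, Kind kind, Node... fields) {
      this.name = name;
      this.kind = kind;
      this.fields = Arrays.asList(fields);
    }

    /** Returns the child with the given name, or null if there is none. */
    Node field(String wanted) {
      for (Node f : fields) {
        if (f.name.equals(wanted)) {
          return f;
        }
      }
      return null;
    }
  }

  /** Prunes the data schema down to the fields named by the required schema. */
  static Node prune(Node data, Node required) {
    if (required.kind == Kind.RECORD) {
      // Short-circuit: BLOB / VARIANT layouts are fixed, so return them whole
      // even though the required side was downgraded to a plain RECORD.
      if (data.kind == Kind.BLOB || data.kind == Kind.VARIANT) {
        return data;
      }
      if (data.kind != Kind.RECORD) {
        throw new IllegalArgumentException("Data schema is not a record");
      }
      List<Node> kept = new ArrayList<>();
      for (Node req : required.fields) {
        Node match = data.field(req.name);
        if (match != null) {
          kept.add(prune(match, req));
        }
      }
      return new Node(data.name, Kind.RECORD, kept.toArray(new Node[0]));
    }
    // Non-record kinds (including a FIXED-like VECTOR) pass through unchanged.
    return data;
  }

  public static void main(String[] args) {
    Node blob = new Node("payload", Kind.BLOB,
        new Node("type", Kind.STRING),
        new Node("data", Kind.BYTES),
        new Node("reference", Kind.RECORD, new Node("external_path", Kind.STRING)));
    Node data = new Node("row", Kind.RECORD, new Node("id", Kind.LONG), blob);
    // Shape produced by nested pruning: payload has lost its BLOB tag.
    Node required = new Node("row", Kind.RECORD,
        new Node("payload", Kind.RECORD,
            new Node("reference", Kind.RECORD, new Node("external_path", Kind.STRING))));
    Node pruned = prune(data, required);
    System.out.println(pruned.field("payload") == blob); // BLOB node returned as-is
  }
}
```

Without the short-circuit, the same call would reach the "Data schema is not a record" branch, because the payload node is BLOB on the data side and RECORD on the required side.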

Tests:

TestHoodieSchemaUtils.testPruningPreservesBlobWhenRequiredIsPlainRecord directly exercises the pruner with a BLOB data field and a plain-RECORD required schema (the exact shape Spark's nested pruning produces).
TestDeleteFromTable adds two Spark SQL regression tests:
- "Test DELETE on BLOB column preserves custom-type metadata" is the end-to-end reproducer. It projects payload.reference.external_path after DELETE.
- "Test DELETE on VECTOR column preserves custom-type metadata" guards the FIXED branch.

Impact

User-facing: nested projection on BLOB / VARIANT columns now works. Previously threw at plan time.
Public API: none.
Performance: none. Pruner runs once per plan on driver. Full-struct read on BLOB / VARIANT adds at most 2 extra small fields through the reader; Spark still prunes before operator output.

Risk Level

Low.

Single relaxation inside one guard clause.
Returned schema matches what the full-column read path already uses.
No storage format, public API, or other reader paths touched.
Covered by one Java unit test plus two Spark SQL regression tests.

Documentation Update

none

Contributor's checklist

Read through contributor's guide
Enough context is provided in the sections above
Adequate tests were added if applicable

hudi-agent

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for the contribution! This PR fixes nested field projection on BLOB/VARIANT columns by short-circuiting pruneDataSchemaInternal when the data schema carries a logical-type RECORD (BLOB/VARIANT) but the required schema is plain RECORD — a mismatch produced by Spark's nested pruning dropping the hudi_type metadata. The rationale (BLOB/VARIANT inner layouts are fixed by LogicalType.validate() and partial pruning would violate that contract) is sound, and Spark still prunes at eval time. No correctness issues found. A few style/readability suggestions in the inline comments. Please take a look, and this should be ready for a Hudi committer or PMC member to take it from here. One naming issue in the Scala tests — the VECTOR test title claims to verify metadata preservation but the assertion only checks the id column.

cc @yihua

hudi-agent · 2026-04-23T14:51:30Z

    }
  }
+
+  test("Test DELETE on VECTOR column preserves custom-type metadata") {


🤖 nit: the test name says "preserves custom-type metadata" but the assertion is select id from $tableName — it never reads back the VECTOR column, so nothing about the metadata is actually verified. Could you either rename it to something like "Test DELETE on table with VECTOR column" or add an assertion on the embedding values to match what the BLOB test does?

Generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

voonhous · 2026-04-23T17:04:43Z

Test failures are expected: the error fixed here can only be reached once the earlier errors are fixed.

voonhous · 2026-04-24T17:36:04Z

Use this file for merge conflict resolution after #18540 is merged.

HoodieSparkSchemaConverters.scala.txt

hudi-agent

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for the contribution! This PR fixes the IllegalArgumentException: Data schema is not a record crash on nested projection of BLOB/VARIANT columns by (a) loosening the Scala converter pattern match so non-canonical (pruned) BLOB/VARIANT structs fall through to the plain RECORD path and (b) short-circuiting pruneDataSchemaInternal when the data-side is BLOB/VARIANT to return the full logical-type schema. The fix flow is consistent end-to-end, nullability and recursion paths check out, and the existing shredded-VARIANT / VECTOR code paths aren't regressed. No correctness issues found. A few style/readability suggestions in the inline comments. Please take a look, and this should be ready for a Hudi committer or PMC member to take it from here. One minor inconsistency between the two canonicality helpers worth a look, otherwise the change is clean and well-commented.

cc @yihua

hudi-agent · 2026-04-24T17:43:42Z

+   * nested-pruned by Spark.
+   */
+  private def isCanonicalVariantStruct(structType: StructType): Boolean = {
+    val fieldsByName = structType.fields.map(f => f.name -> f).toMap


🤖 nit: isCanonicalBlobStruct delegates to structType.equals(expectedBlobStructType), but isCanonicalVariantStruct re-implements the check field-by-field instead of delegating to structType.equals(expectedVariantStructType). Could this be simplified to match the blob pattern? That way if expectedVariantStructType ever changes, there's only one place to update rather than two.

Generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

hudi-agent

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for iterating on this! The fix for nested projection on BLOB/VARIANT looks sound — the pruner short-circuit, the split between strict ingest-time validation and permissive read-path conversion, and the catalog-alignment helper all hang together well. I traced the INSERT/UPDATE/MERGE paths with the new alignSchemaWithCatalog flow (including the deliberate alignNullability=false for MERGE) and the nullable-everywhere BLOB projection at the Spark type layer against the RFC-100 write-boundary enforcement, and didn't find new correctness concerns beyond what's already been raised in prior rounds. No new issues flagged from this automated pass — a Hudi committer or PMC member can take it from here for a final review. A couple of minor readability nits on the new helpers in HoodieSparkSchemaConverters.

cc @yihua

hudi-agent · 2026-04-24T22:17:19Z

+  private def isCanonicalBlobStruct(structType: StructType): Boolean =
+    matchesStructure(structType, expectedBlobStructType, SQLConf.get.caseSensitiveAnalysis)
+
+  private def matchesStructure(source: DataType, expected: DataType, caseSensitive: Boolean): Boolean =


🤖 nit: matchesStructure uses positional matching (zip), but that isn't obvious from the name — and isCanonicalVariantStruct right below uses by-name lookup instead. Could you add a brief note (e.g. // Positional: field order is part of the RFC-100 BLOB canonical contract) so a future reader doesn't try to "normalize" this to by-name matching and accidentally break BLOB field-order validation?
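
To illustrate why the distinction matters, here is a minimal sketch of positional versus by-name matching. It is an assumption-laden toy, not the code under review: StructMatchSketch and its methods are hypothetical, and it compares only field names, whereas the matchesStructure helper in the diff takes DataType arguments.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Toy comparison of field-name lists. Hypothetical, not Hudi's helper. */
final class StructMatchSketch {
  private static String norm(String name, boolean caseSensitive) {
    return caseSensitive ? name : name.toLowerCase(Locale.ROOT);
  }

  /** Positional: the i-th source field must match the i-th expected field. */
  static boolean matchesPositional(List<String> source, List<String> expected,
                                   boolean caseSensitive) {
    if (source.size() != expected.size()) {
      return false;
    }
    for (int i = 0; i < source.size(); i++) {
      if (!norm(source.get(i), caseSensitive).equals(norm(expected.get(i), caseSensitive))) {
        return false;
      }
    }
    return true;
  }

  /** By-name: the same set of distinct names must be present, in any order. */
  static boolean matchesByName(List<String> source, List<String> expected,
                               boolean caseSensitive) {
    Set<String> sourceNames = new HashSet<>();
    for (String s : source) {
      sourceNames.add(norm(s, caseSensitive));
    }
    Set<String> expectedNames = new HashSet<>();
    for (String e : expected) {
      expectedNames.add(norm(e, caseSensitive));
    }
    return sourceNames.size() == source.size() && sourceNames.equals(expectedNames);
  }

  public static void main(String[] args) {
    List<String> canonical = Arrays.asList("type", "data", "reference");
    List<String> reordered = Arrays.asList("data", "type", "reference");
    // A reordered struct passes the by-name check but fails the positional one.
    System.out.println(matchesPositional(reordered, canonical, true));
    System.out.println(matchesByName(reordered, canonical, true));
  }
}
```

A reader who "normalizes" the positional check to the by-name form would start accepting reordered structs, which is the regression the suggested comment is meant to prevent.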

AI-generated; verify before applying. React 👍/👎 to flag quality.

hudi-agent · 2026-04-24T22:17:19Z

+    case ArrayType(elementType, _) =>
+      ArrayType(withAllFieldsNullable(elementType), containsNull = true)
+    case MapType(keyType, valueType, _) =>
+      MapType(keyType, withAllFieldsNullable(valueType), valueContainsNull = true)


🤖 nit: the method is called withAllFieldsNullable but map key types are intentionally left unchanged here. Could you add a short comment like // Map keys cannot be null in Spark — keyType is left as-is so readers don't mistake the omission for a bug?

AI-generated; verify before applying. React 👍/👎 to flag quality.

rahil-c · 2026-04-26T16:47:31Z


  private def sparkTypeForVectorElementType(
-      elementType: HoodieSchema.Vector.VectorElementType): DataType = elementType match {
+                                             elementType: HoodieSchema.Vector.VectorElementType): DataType = elementType match {


nit: whitespace

rahil-c · 2026-04-26T17:27:50Z

@voonhous LGTM, but can you check this one odd case, in case a user tries it (unlikely, but sharing below):


The validator code

    (descriptorType, f.dataType) match {
      case (HoodieSchemaType.BLOB,    st: StructType) => validateBlobStructure(st)
      case (HoodieSchemaType.VARIANT, st: StructType) => validateVariantStructure(st)
      case _ =>   // <-- silently no-op
    }

The pattern only matches when both the tag says BLOB and the data type is a StructType. Anything else falls into case _ and does nothing.

The bug, concretely

Suppose a user (or a buggy upstream transform) builds this schema:

    val blobMetadata = new MetadataBuilder()
      .putString(HoodieSchema.TYPE_METADATA_FIELD, HoodieSchemaType.BLOB.name())
      .build()

    val schema = new StructType()
      .add("id",      LongType)
      .add("payload", LongType, nullable = true, metadata = blobMetadata)
      //              ^^^^^^^^                              ^^^^^^^^^^^^
      //              wrong type                            says "I'm a BLOB"

The user is asserting "payload is a BLOB" via the metadata, but the data type is a LongType, not the canonical BLOB struct.

What happens today

1. validateCustomTypeStructures(schema) runs.
2. It sees the hudi_type=BLOB tag on payload.
3. The match tuple is (BLOB, LongType): neither pattern matches → falls into case _ → returns without throwing.
4. Then convertStructTypeToHoodieSchema runs.
5. The BLOB case in toHoodieTypeNested is `case blobStruct: StructType if metadata.contains(...) && ...isCanonicalBlobStruct(blobStruct) =>`, which requires a StructType, so it doesn't match either.
6. LongType falls through to the normal case LongType => HoodieSchema.create(LONG) arm.
7. Result: the field is silently written as a plain LONG. The BLOB tag is ignored, no error.

The user thinks they wrote a BLOB column; the table actually has a LONG column.

The fix

Add an explicit reject for "tag says BLOB/VARIANT but the type is wrong":

    (descriptorType, f.dataType) match {
      case (HoodieSchemaType.BLOB,    st: StructType) => validateBlobStructure(st)
      case (HoodieSchemaType.VARIANT, st: StructType) => validateVariantStructure(st)
      case (HoodieSchemaType.BLOB,    other) =>
        throw new IllegalArgumentException(
          s"Field '${f.name}' is tagged hudi_type=BLOB but has type $other; expected a StructType.")
      case (HoodieSchemaType.VARIANT, other) =>
        throw new IllegalArgumentException(
          s"Field '${f.name}' is tagged hudi_type=VARIANT but has type $other; expected a StructType.")
      case _ =>
    }

Now the misuse fails fast at the write boundary instead of silently producing the wrong on-disk schema.

voonhous · 2026-04-26T18:23:14Z

> @voonhous LGTM but can you check this one weird case in case a user would try this (unlikely but sharing below):

This should go in a new issue and a new PR: one issue, one PR.
Created a new issue to track this: #18603

…runeDataSchema

pruneDataSchemaInternal switches on the required schema's HoodieSchemaType and, for the RECORD case, guards that the data schema is also RECORD. HoodieSchemaType enumerates BLOB and VARIANT as distinct values from RECORD even though both use the Avro RECORD physical type, so the guard throws "Data schema is not a record" whenever:
- the file's on-disk schema carries the blob/variant logical type (HoodieSchemaType.fromAvro returns BLOB / VARIANT), and
- Spark's nested-schema pruning strips the hudi_type=BLOB / VARIANT metadata from the StructField before handing it to the reader, downgrading the required side to a plain RECORD.

This is reachable by any nested field projection on a BLOB or VARIANT column routed through HoodieFileGroupReaderBasedFileFormat - e.g. `SELECT payload.reference.external_path FROM t`.

Short-circuit the RECORD case when the data schema is BLOB or VARIANT: both types' inner layouts are fixed by their LogicalType.validate() contracts ({type,data,reference} and {metadata,value} respectively), so partial pruning is not legal. Return the data schema unchanged; Spark's projection still prunes at eval time.

VECTOR is represented as Avro FIXED and has no inner fields, so it falls through the default case unchanged. No fix needed there.

Regression coverage:
- TestHoodieSchemaUtils.testPruningPreservesBlobWhenRequiredIsPlainRecord directly exercises pruneDataSchema with a data schema carrying a BLOB field and a required schema that prunes down to a plain RECORD (the exact shape Spark's nested pruning produces).
- TestDeleteFromTable adds "Test DELETE on BLOB column preserves custom-type metadata" and "Test DELETE on VECTOR column preserves custom-type metadata". The BLOB case is the end-to-end reproducer; it projects `payload.reference.external_path` after DELETE and previously crashed in the HoodieFileGroupReaderBasedFileFormat read path. The VECTOR case guards against a regression in the FIXED (default) branch.

yihua

LGTM as a stop-gap fix. We should further look into how to simplify this. The schema-handling logic for BLOB and VARIANT has become overwhelmingly complex on Spark now.

hudi-bot · 2026-04-27T07:13:19Z

CI report:

b15cf56 UNKNOWN
0065072 Azure: SUCCESS
96d21d3 Azure: PENDING

codecov-commenter · 2026-04-27T07:44:30Z

Codecov Report

❌ Patch coverage is 72.50000% with 11 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.81%. Comparing base (fdf27db) to head (96d21d3).
⚠️ Report is 2 commits behind head on master.

Files with missing lines	Patch %	Lines
...e/spark/sql/avro/HoodieSparkSchemaConverters.scala	70.00%	0 Missing and 9 partials ⚠️
...g/apache/hudi/common/schema/HoodieSchemaUtils.java	66.66%	0 Missing and 1 partial ⚠️
...n/scala/org/apache/hudi/HoodieSparkSqlWriter.scala	66.66%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff            @@
##             master   #18566   +/-   ##
=========================================
  Coverage     68.81%   68.81%           
- Complexity    28585    28588    +3     
=========================================
  Files          2492     2492           
  Lines        137302   137323   +21     
  Branches      16756    16767   +11     
=========================================
+ Hits          94487    94502   +15     
  Misses        35194    35194           
- Partials       7621     7627    +6

Flag	Coverage Δ
common-and-other-modules	`44.30% <5.00%> (-0.01%)`	⬇️
hadoop-mr-java-client	`44.82% <0.00%> (-0.02%)`	⬇️
spark-client-hadoop-common	`48.41% <0.00%> (-0.02%)`	⬇️
spark-java-tests	`49.49% <62.50%> (+<0.01%)`	⬆️
spark-scala-tests	`45.33% <72.50%> (+<0.01%)`	⬆️
utilities	`37.90% <27.50%> (-0.02%)`	⬇️

Files with missing lines	Coverage Δ
.../org/apache/hudi/HoodieSchemaConversionUtils.scala	`75.86% <100.00%> (+0.42%)`	⬆️
...di/command/AlterHoodieTableAddColumnsCommand.scala	`77.77% <100.00%> (ø)`
.../command/AlterHoodieTableChangeColumnCommand.scala	`80.55% <100.00%> (ø)`
...g/apache/hudi/common/schema/HoodieSchemaUtils.java	`83.56% <66.66%> (-0.18%)`	⬇️
...n/scala/org/apache/hudi/HoodieSparkSqlWriter.scala	`78.46% <66.66%> (+0.03%)`	⬆️
...e/spark/sql/avro/HoodieSparkSchemaConverters.scala	`77.08% <70.00%> (+0.19%)`	⬆️

... and 14 files with indirect coverage changes

voonhous requested review from rahil-c and yihua April 23, 2026 14:12

github-actions Bot added the size:M PR with lines of changes in (100, 300] label Apr 23, 2026

hudi-agent reviewed Apr 23, 2026

voonhous force-pushed the fix-#18565 branch from 1e4a7db to 4cb34d2 Compare April 24, 2026 15:10

voonhous mentioned this pull request Apr 24, 2026

test(schema): Add MOR log-only compaction tests for custom types #18583

Merged

3 tasks

hudi-agent reviewed Apr 24, 2026

voonhous force-pushed the fix-#18565 branch from 4cb34d2 to d84728f Compare April 24, 2026 18:43

github-actions Bot added size:L PR with lines of changes in (300, 1000] and removed size:M PR with lines of changes in (100, 300] labels Apr 24, 2026

voonhous force-pushed the fix-#18565 branch 3 times, most recently from b15cf56 to 0065072 Compare April 24, 2026 20:37

hudi-agent reviewed Apr 24, 2026

hudi-agent mentioned this pull request Apr 24, 2026

[OSS PR #18566] fix(schema): Allow nested projection on BLOB and VARIANT columns in p… hudi-agent/hudi#21

Open

rahil-c reviewed Apr 26, 2026

voonhous mentioned this pull request Apr 26, 2026

Check custom type edge case #18603

Open

rahil-c mentioned this pull request Apr 27, 2026

feat(schema) Ensure spark SQL insert into preserves hudi_type for VEC… #18608

Draft

3 tasks

yihua force-pushed the fix-#18565 branch from 0065072 to 96d21d3 Compare April 27, 2026 06:18

yihua approved these changes Apr 27, 2026

github-actions Bot added size:M PR with lines of changes in (100, 300] and removed size:L PR with lines of changes in (300, 1000] labels Apr 27, 2026

yihua merged commit 7ae0fd9 into apache:master Apr 27, 2026
55 of 56 checks passed

voonhous deleted the fix-#18565 branch April 27, 2026 10:43

voonhous linked an issue May 13, 2026 that may be closed by this pull request

Nested field projection on BLOB / VARIANT columns throws "Data schema is not a record" #18565

Closed
