fix(schema): Handle BLOB and VARIANT in Hive-reader rewriteRecordWithNewSchema #18580

Merged: voonhous merged 1 commit into apache:master from voonhous:fix-#18578 on Apr 29, 2026.

Conversation

@voonhous (Member)


Describe the issue this Pull Request addresses

Closes #18578

HoodieArrayWritableSchemaUtils.rewriteRecordWithNewSchemaInternal switches on newSchema.getType() and only handles the RECORD, ENUM, ARRAY, MAP, and UNION cases; any other type falls through to rewritePrimaryType, which throws for BLOB and VARIANT.

The failure reproduces on the Hive read path every time Hive projects its HMS-derived struct shape onto Hudi's canonical BLOB record:

  • HMS view: record named after the column, type field = plain STRING.
  • Hudi canonical view: record "blob", type = ENUM blob_storage_type, logicalType: "blob".

VECTOR was unaffected only by accident: it maps to Avro FIXED, which rewritePrimaryType already handles.
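For concreteness, the canonical BLOB schema shape at issue can be sketched as Avro JSON. This is an illustrative reconstruction from the description above, not the exact schema Hudi emits: the enum symbols and the data/reference field types are assumptions.

```json
{
  "type": "record",
  "name": "blob",
  "logicalType": "blob",
  "fields": [
    {"name": "type",
     "type": {"type": "enum", "name": "blob_storage_type",
              "symbols": ["INLINE", "REFERENCE"]}},
    {"name": "data", "type": ["null", "bytes"], "default": null},
    {"name": "reference", "type": ["null", "string"], "default": null}
  ]
}
```

The HMS-derived shape Hive projects from differs in exactly the two ways listed above: the record is named after the column, and the type field is a plain "string" instead of the enum.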

Summary and Changelog

Add case BLOB and case VARIANT fallthrough to the existing RECORD body.

  • BLOB's {type, data, reference} and VARIANT's {metadata, value} are pinned by their LogicalType.validate() contracts, so the existing field-by-name iteration in the RECORD body is correct for both.
  • The existing case ENUM already converts STRING to ENUM for BLOB's type field.
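The shape of the change can be illustrated with a minimal stand-alone model. The enum and dispatch method below are simplified stand-ins for Hudi's HoodieSchemaType and rewriteRecordWithNewSchemaInternal, not the real code:

```java
import java.util.List;

public class RewriteDispatchSketch {
  // Simplified stand-in for Hudi's HoodieSchemaType.
  enum SchemaType { RECORD, ENUM, ARRAY, MAP, UNION, BLOB, VARIANT, STRING, INT }

  // Models the switch in rewriteRecordWithNewSchemaInternal: BLOB and VARIANT
  // now fall through into the RECORD branch instead of reaching the throwing
  // default (the rewritePrimaryType path).
  static String dispatch(SchemaType newType) {
    switch (newType) {
      case RECORD:
      case BLOB:    // added: physically stored as an Avro record
      case VARIANT: // added: physically stored as an Avro record
        return "record-body";
      case ENUM:
        return "enum-body";
      case ARRAY:
        return "array-body";
      case MAP:
        return "map-body";
      case UNION:
        return "union-body";
      default:
        throw new IllegalArgumentException(
            "cannot support rewrite value for schema type: " + newType);
    }
  }

  public static void main(String[] args) {
    for (SchemaType t : List.of(SchemaType.BLOB, SchemaType.VARIANT)) {
      System.out.println(t + " -> " + dispatch(t)); // both route to the record body
    }
  }
}
```

Before the change, BLOB and VARIANT landed in the default branch; the fallthrough reuses the RECORD body unchanged.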

Regression tests in TestHoodieArrayWritableSchemaUtils pin the fix as plain unit tests, with no Spark / Hive / Testcontainers dependency:

  • testRewriteBlobToBlobProjectionEquivalentShortCircuits + testRewritePlainRecordToBlobSucceedsAfterFix.
  • Parallel pair for VARIANT.

Each test feeds the exact schema pair seen in the E2E failure signature and asserts the rewrite path now succeeds (pre-fix: throws).

Impact

  • Hive reads of Hudi tables with BLOB or VARIANT columns that go through projectRecord stop crashing.
  • No public API change. No on-disk format change. No config change.
  • Zero runtime cost beyond one extra switch-case entry on a hot path.

Risk Level

low

  • Change is additive: two fallthrough cases into an already-tested branch body.
  • BLOB and VARIANT inner layouts are frozen by BlobLogicalType.validate / VariantLogicalType.validate, so the RECORD body's field-by-name iteration cannot produce a different result for them than for the corresponding plain record.
  • Covered by direct unit tests in the module that owns the changed file.
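The "cannot produce a different result" claim follows from the rewrite strategy itself: the RECORD body matches fields by name, and the validated layouts guarantee the names. A toy Map-based illustration of that strategy (not Hudi's ArrayWritable code; the field names follow the pinned BLOB layout):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class FieldByNameSketch {
  // Rewrites oldRecord into the field set/order of newFields, matching by
  // name -- the strategy the RECORD body applies regardless of whether the
  // target record is plain or BLOB/VARIANT-typed.
  static Map<String, Object> rewriteByName(Map<String, Object> oldRecord,
                                           List<String> newFields) {
    Map<String, Object> out = new LinkedHashMap<>();
    for (String field : newFields) {
      out.put(field, oldRecord.get(field)); // absent fields come through as null
    }
    return out;
  }

  public static void main(String[] args) {
    Map<String, Object> hmsView = new LinkedHashMap<>();
    hmsView.put("type", "INLINE");           // plain STRING in the HMS-derived shape
    hmsView.put("data", new byte[] {1, 2});
    hmsView.put("reference", null);
    // Canonical BLOB layout is pinned to {type, data, reference}.
    Map<String, Object> canonical =
        rewriteByName(hmsView, List.of("type", "data", "reference"));
    System.out.println(canonical.keySet()); // prints [type, data, reference]
  }
}
```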

Documentation Update

none

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

@hudi-bot (Collaborator)

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@hudi-agent (Contributor) left a comment

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for the contribution! This PR adds BLOB and VARIANT fallthrough to the existing RECORD branch in HoodieArrayWritableSchemaUtils.rewriteRecordWithNewSchemaInternal, unblocking Hive reads of tables with these logical types. The change is minimal and additive, and the regression tests pin both the short-circuit path and the plain-record→canonical-BLOB/VARIANT rewrite. No correctness issues found. A few style/readability suggestions in the inline comments. Please take a look, and this should be ready for a Hudi committer or PMC member to take it from here. One minor test-naming nit below — the production fix and the helper utilities are clean.

cc @yihua


  @Test
  void testRewritePlainRecordToBlobSucceedsAfterFix() {
    HoodieSchema oldSchema = HoodieSchemaTestUtils.createPlainBlobRecord("blob_data");

🤖 nit: SucceedsAfterFix in the test name ties the test to PR/bug context that won't mean anything to a future reader — "after which fix?" Could you rename to something that describes the behavior, like testRewritePlainBlobRecordToCanonicalBlobSchema? Same goes for testRewritePlainRecordToVariantSucceedsAfterFix at line 435.


@voonhous (Member, Author) replied:

Addressed

@rahil-c rahil-c requested review from rahil-c and yihua April 26, 2026 19:53
Commit message: fix(schema): Handle BLOB and VARIANT in Hive-reader rewriteRecordWithNewSchema

Fixes issue: apache#18578

HoodieArrayWritableSchemaUtils.rewriteRecordWithNewSchemaInternal switches
on newSchema.getType() and only named RECORD/ENUM/ARRAY/MAP/UNION. BLOB
(apache#18108) and VARIANT (apache#17833) are Hudi logical types physically stored as
Avro records but exposed as distinct HoodieSchemaTypes, so a new schema
typed BLOB/VARIANT fell through to rewritePrimaryType and threw
"cannot support rewrite value for schema type".

This reproduces on the Hive read path whenever Hive projects from its
HMS-derived struct shape (record name = column name, type field = plain
STRING) onto Hudi's canonical BLOB schema (record "blob", type = ENUM
blob_storage_type, logicalType "blob") - the exact signature seen in
ITTestCustomTypeHiveSync#testBlobTypeWithHiveSyncSQL. VECTOR was fine by
accident because it maps to Avro FIXED.

Add case BLOB and case VARIANT fallthrough to the existing RECORD body.
Inner field layouts are fixed by BlobLogicalType.validate /
VariantLogicalType.validate, so field-by-name iteration is correct. The
existing ENUM case at line 137 already handles the STRING -> ENUM
conversion for the BLOB "type" field.

Tests pin the fix without Spark / Hive / Testcontainers - they call
HoodieArrayWritableSchemaUtils.rewriteRecordWithNewSchema directly with
synthetic schemas that mirror the E2E failure signature, for both BLOB
and VARIANT.
@voonhous (Member, Author)

Merging this in after renaming the tests and addressing comments since prior CI has already succeeded.

Test renames = no logic change, so this should be safe.

@voonhous voonhous merged commit 7c2c56e into apache:master Apr 29, 2026
62 checks passed
@codecov-commenter

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 68.06%. Comparing base (edaa168) to head (af364b2).
⚠️ Report is 19 commits behind head on master.

Additional details and impacted files
@@             Coverage Diff              @@
##             master   #18580      +/-   ##
============================================
- Coverage     68.88%   68.06%   -0.82%     
- Complexity    28532    28920     +388     
============================================
  Files          2479     2518      +39     
  Lines        136810   140570    +3760     
  Branches      16660    17416     +756     
============================================
+ Hits          94244    95684    +1440     
- Misses        34982    37030    +2048     
- Partials       7584     7856     +272     
Flag Coverage Δ
common-and-other-modules 44.36% <ø> (-0.07%) ⬇️
hadoop-mr-java-client 44.96% <ø> (+0.20%) ⬆️
spark-client-hadoop-common 48.43% <ø> (-0.05%) ⬇️
spark-java-tests 48.65% <ø> (-0.84%) ⬇️
spark-scala-tests 44.70% <ø> (-0.58%) ⬇️
utilities 37.69% <ø> (-0.30%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
...i/hadoop/utils/HoodieArrayWritableSchemaUtils.java 64.10% <ø> (-1.24%) ⬇️

... and 99 files with indirect coverage changes


rahil-c pushed a commit to rahil-c/hudi that referenced this pull request Apr 29, 2026

Labels

size:M PR with lines of changes in (100, 300]


Development

Successfully merging this pull request may close these issues.

Blob querying via Hive fails (#18578)

6 participants