fix(schema): Handle BLOB and VARIANT in Hive-reader rewriteRecordWithNewSchema #18580

Merged: voonhous merged 1 commit into apache:master from voonhous:fix-#18578 on Apr 29, 2026.

Conversation

@voonhous (Member)


Describe the issue this Pull Request addresses

Closes #18578

HoodieArrayWritableSchemaUtils.rewriteRecordWithNewSchemaInternal switches on newSchema.getType() and only handles the RECORD, ENUM, ARRAY, MAP, and UNION cases; any other type falls through to rewritePrimaryType, which throws for BLOB and VARIANT.

The failure reproduces on the Hive read path every time Hive projects its HMS-derived struct shape onto Hudi's canonical BLOB record:

  • HMS view: record named after the column, type field = plain STRING.
  • Hudi canonical view: record "blob", type = ENUM blob_storage_type, logicalType: "blob".

VECTOR was unaffected only by accident: it maps to Avro FIXED, which rewritePrimaryType already handles.
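For concreteness, the canonical BLOB schema shape at issue can be sketched as Avro JSON. This is an illustrative reconstruction from the description above, not the exact schema Hudi emits: the enum symbols and the data/reference field types are assumptions.

```json
{
  "type": "record",
  "name": "blob",
  "logicalType": "blob",
  "fields": [
    {"name": "type",
     "type": {"type": "enum", "name": "blob_storage_type",
              "symbols": ["INLINE", "REFERENCE"]}},
    {"name": "data", "type": ["null", "bytes"], "default": null},
    {"name": "reference", "type": ["null", "string"], "default": null}
  ]
}
```

The HMS-derived shape Hive projects from differs in exactly the two ways listed above: the record is named after the column, and the type field is a plain "string" instead of the enum.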

Summary and Changelog

Add case BLOB and case VARIANT fallthrough to the existing RECORD body.

  • BLOB's {type, data, reference} and VARIANT's {metadata, value} are pinned by their LogicalType.validate() contracts, so the existing field-by-name iteration in the RECORD body is correct for both.
  • The existing case ENUM already converts STRING to ENUM for BLOB's type field.
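The shape of the change can be illustrated with a minimal stand-alone model. The enum and dispatch method below are simplified stand-ins for Hudi's HoodieSchemaType and rewriteRecordWithNewSchemaInternal, not the real code:

```java
import java.util.List;

public class RewriteDispatchSketch {
  // Simplified stand-in for Hudi's HoodieSchemaType.
  enum SchemaType { RECORD, ENUM, ARRAY, MAP, UNION, BLOB, VARIANT, STRING, INT }

  // Models the switch in rewriteRecordWithNewSchemaInternal: BLOB and VARIANT
  // now fall through into the RECORD branch instead of reaching the throwing
  // default (the rewritePrimaryType path).
  static String dispatch(SchemaType newType) {
    switch (newType) {
      case RECORD:
      case BLOB:    // added: physically stored as an Avro record
      case VARIANT: // added: physically stored as an Avro record
        return "record-body";
      case ENUM:
        return "enum-body";
      case ARRAY:
        return "array-body";
      case MAP:
        return "map-body";
      case UNION:
        return "union-body";
      default:
        throw new IllegalArgumentException(
            "cannot support rewrite value for schema type: " + newType);
    }
  }

  public static void main(String[] args) {
    for (SchemaType t : List.of(SchemaType.BLOB, SchemaType.VARIANT)) {
      System.out.println(t + " -> " + dispatch(t)); // both route to the record body
    }
  }
}
```

Before the change, BLOB and VARIANT landed in the default branch; the fallthrough reuses the RECORD body unchanged.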

Regression tests in TestHoodieArrayWritableSchemaUtils pin the fix as plain unit tests, with no Spark / Hive / Testcontainers dependency:

  • testRewriteBlobToBlobProjectionEquivalentShortCircuits + testRewritePlainRecordToBlobSucceedsAfterFix.
  • Parallel pair for VARIANT.

Each test feeds the exact schema pair seen in the E2E failure signature and asserts the rewrite path now succeeds (pre-fix: throws).

Impact

  • Hive reads of Hudi tables with BLOB or VARIANT columns that go through projectRecord stop crashing.
  • No public API change. No on-disk format change. No config change.
  • Zero runtime cost beyond one extra switch-case entry on a hot path.

Risk Level

low

  • Change is additive: two fallthrough cases into an already-tested branch body.
  • BLOB and VARIANT inner layouts are frozen by BlobLogicalType.validate / VariantLogicalType.validate, so the RECORD body's field-by-name iteration cannot produce a different result for them than for the corresponding plain record.
  • Covered by direct unit tests in the module that owns the changed file.
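The "cannot produce a different result" claim follows from the rewrite strategy itself: the RECORD body matches fields by name, and the validated layouts guarantee the names. A toy Map-based illustration of that strategy (not Hudi's ArrayWritable code; the field names follow the pinned BLOB layout):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class FieldByNameSketch {
  // Rewrites oldRecord into the field set/order of newFields, matching by
  // name -- the strategy the RECORD body applies regardless of whether the
  // target record is plain or BLOB/VARIANT-typed.
  static Map<String, Object> rewriteByName(Map<String, Object> oldRecord,
                                           List<String> newFields) {
    Map<String, Object> out = new LinkedHashMap<>();
    for (String field : newFields) {
      out.put(field, oldRecord.get(field)); // absent fields come through as null
    }
    return out;
  }

  public static void main(String[] args) {
    Map<String, Object> hmsView = new LinkedHashMap<>();
    hmsView.put("type", "INLINE");           // plain STRING in the HMS-derived shape
    hmsView.put("data", new byte[] {1, 2});
    hmsView.put("reference", null);
    // Canonical BLOB layout is pinned to {type, data, reference}.
    Map<String, Object> canonical =
        rewriteByName(hmsView, List.of("type", "data", "reference"));
    System.out.println(canonical.keySet()); // prints [type, data, reference]
  }
}
```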

Documentation Update

none

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

@hudi-bot (Collaborator)

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@hudi-agent (Contributor) left a comment

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for the contribution! This PR adds BLOB and VARIANT fallthrough to the existing RECORD branch in HoodieArrayWritableSchemaUtils.rewriteRecordWithNewSchemaInternal, unblocking Hive reads of tables with these logical types. The change is minimal and additive, and the regression tests pin both the short-circuit path and the plain-record→canonical-BLOB/VARIANT rewrite. No correctness issues found. A few style/readability suggestions in the inline comments. Please take a look, and this should be ready for a Hudi committer or PMC member to take it from here. One minor test-naming nit below — the production fix and the helper utilities are clean.

cc @yihua


  @Test
  void testRewritePlainRecordToBlobSucceedsAfterFix() {
    HoodieSchema oldSchema = HoodieSchemaTestUtils.createPlainBlobRecord("blob_data");

🤖 nit: SucceedsAfterFix in the test name ties the test to PR/bug context that won't mean anything to a future reader — "after which fix?" Could you rename to something that describes the behavior, like testRewritePlainBlobRecordToCanonicalBlobSchema? Same goes for testRewritePlainRecordToVariantSucceedsAfterFix at line 435.


@voonhous (Member, Author) replied:

Addressed

@rahil-c rahil-c requested review from rahil-c and yihua April 26, 2026 19:53
Commit message: fix(schema): Handle BLOB and VARIANT in Hive-reader rewriteRecordWithNewSchema

Fixes issue: apache#18578

HoodieArrayWritableSchemaUtils.rewriteRecordWithNewSchemaInternal switches
on newSchema.getType() and only named RECORD/ENUM/ARRAY/MAP/UNION. BLOB
(apache#18108) and VARIANT (apache#17833) are Hudi logical types physically stored as
Avro records but exposed as distinct HoodieSchemaTypes, so a new schema
typed BLOB/VARIANT fell through to rewritePrimaryType and threw
"cannot support rewrite value for schema type".

This reproduces on the Hive read path whenever Hive projects from its
HMS-derived struct shape (record name = column name, type field = plain
STRING) onto Hudi's canonical BLOB schema (record "blob", type = ENUM
blob_storage_type, logicalType "blob") - the exact signature seen in
ITTestCustomTypeHiveSync#testBlobTypeWithHiveSyncSQL. VECTOR was fine by
accident because it maps to Avro FIXED.

Add case BLOB and case VARIANT fallthrough to the existing RECORD body.
Inner field layouts are fixed by BlobLogicalType.validate /
VariantLogicalType.validate, so field-by-name iteration is correct. The
existing ENUM case at line 137 already handles the STRING -> ENUM
conversion for the BLOB "type" field.

Tests pin the fix without Spark / Hive / Testcontainers - they call
HoodieArrayWritableSchemaUtils.rewriteRecordWithNewSchema directly with
synthetic schemas that mirror the E2E failure signature, for both BLOB
and VARIANT.
@voonhous (Member, Author)

Merging this in after renaming the tests and addressing comments since prior CI has already succeeded.

Test renames = no logic change, so this should be safe.

@voonhous voonhous merged commit 7c2c56e into apache:master Apr 29, 2026
62 checks passed
@codecov-commenter

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 68.06%. Comparing base (edaa168) to head (af364b2).
⚠️ Report is 19 commits behind head on master.

Additional details and impacted files
@@             Coverage Diff              @@
##             master   #18580      +/-   ##
============================================
- Coverage     68.88%   68.06%   -0.82%     
- Complexity    28532    28920     +388     
============================================
  Files          2479     2518      +39     
  Lines        136810   140570    +3760     
  Branches      16660    17416     +756     
============================================
+ Hits          94244    95684    +1440     
- Misses        34982    37030    +2048     
- Partials       7584     7856     +272     
Flag Coverage Δ
common-and-other-modules 44.36% <ø> (-0.07%) ⬇️
hadoop-mr-java-client 44.96% <ø> (+0.20%) ⬆️
spark-client-hadoop-common 48.43% <ø> (-0.05%) ⬇️
spark-java-tests 48.65% <ø> (-0.84%) ⬇️
spark-scala-tests 44.70% <ø> (-0.58%) ⬇️
utilities 37.69% <ø> (-0.30%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
...i/hadoop/utils/HoodieArrayWritableSchemaUtils.java 64.10% <ø> (-1.24%) ⬇️

... and 99 files with indirect coverage changes


rahil-c pushed a commit to rahil-c/hudi that referenced this pull request Apr 29, 2026

Labels

size:M PR with lines of changes in (100, 300]


Development

Successfully merging this pull request may close these issues.

Blob querying via Hive fails (#18578)

6 participants