fix(schema): Handle BLOB and VARIANT in Hive-reader rewriteRecordWith… #18580
Conversation
hudi-agent
left a comment
🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
Thanks for the contribution! This PR adds BLOB and VARIANT fallthrough to the existing RECORD branch in HoodieArrayWritableSchemaUtils.rewriteRecordWithNewSchemaInternal, unblocking Hive reads of tables with these logical types. The change is minimal and additive, and the regression tests pin both the short-circuit path and the plain-record→canonical-BLOB/VARIANT rewrite. No correctness issues found. A few style/readability suggestions in the inline comments. Please take a look, and this should be ready for a Hudi committer or PMC member to take it from here. One minor test-naming nit below — the production fix and the helper utilities are clean.
cc @yihua
```java
@Test
void testRewritePlainRecordToBlobSucceedsAfterFix() {
  HoodieSchema oldSchema = HoodieSchemaTestUtils.createPlainBlobRecord("blob_data");
```
🤖 nit: SucceedsAfterFix in the test name ties the test to PR/bug context that won't mean anything to a future reader — "after which fix?" Could you rename to something that describes the behavior, like testRewritePlainBlobRecordToCanonicalBlobSchema? Same goes for testRewritePlainRecordToVariantSucceedsAfterFix at line 435.
…NewSchema

Fixes issue: apache#18578

HoodieArrayWritableSchemaUtils.rewriteRecordWithNewSchemaInternal switches on newSchema.getType() and only named RECORD/ENUM/ARRAY/MAP/UNION. BLOB (apache#18108) and VARIANT (apache#17833) are Hudi logical types physically stored as Avro records but exposed as distinct HoodieSchemaTypes, so a new schema typed BLOB/VARIANT fell through to rewritePrimaryType and threw "cannot support rewrite value for schema type". This reproduces on the Hive read path whenever Hive projects from its HMS-derived struct shape (record name = column name, type field = plain STRING) onto Hudi's canonical BLOB schema (record "blob", type = ENUM blob_storage_type, logicalType "blob") - the exact signature seen in ITTestCustomTypeHiveSync#testBlobTypeWithHiveSyncSQL. VECTOR was fine by accident because it maps to Avro FIXED.

Add case BLOB and case VARIANT fallthrough to the existing RECORD body. Inner field layouts are fixed by BlobLogicalType.validate / VariantLogicalType.validate, so field-by-name iteration is correct. The existing ENUM case at line 137 already handles the STRING -> ENUM conversion for the BLOB "type" field.

Tests pin the fix without Spark / Hive / Testcontainers - they call HoodieArrayWritableSchemaUtils.rewriteRecordWithNewSchema directly with synthetic schemas that mirror the E2E failure signature, for both BLOB and VARIANT.
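For orientation, here is a minimal, self-contained sketch of the switch shape the commit describes. The type values and the BLOB/VARIANT fallthrough mirror the description above, but the class, method bodies, and error message are simplified stand-ins, not Hudi's actual code:

```java
// Simplified stand-in for HoodieSchemaType; only the values named above.
enum SchemaTypeSketch { RECORD, ENUM, ARRAY, MAP, UNION, BLOB, VARIANT, STRING }

class RewriteSwitchSketch {

  Object rewriteInternal(Object value, SchemaTypeSketch newType) {
    switch (newType) {
      case RECORD:
      case BLOB:    // added by this PR: BLOB is physically an Avro record
      case VARIANT: // added by this PR: VARIANT is physically an Avro record
        return rewriteRecordFields(value);
      case ENUM:    // existing case; handles STRING -> ENUM for BLOB's "type" field
      case ARRAY:
      case MAP:
      case UNION:
        return value; // existing handling, elided in this sketch
      default:
        // Pre-fix, a newSchema typed BLOB or VARIANT landed here and threw.
        return rewritePrimaryType(value, newType);
    }
  }

  Object rewriteRecordFields(Object value) {
    return value; // stand-in for the field-by-name RECORD iteration
  }

  Object rewritePrimaryType(Object value, SchemaTypeSketch type) {
    throw new IllegalArgumentException("cannot support rewrite value for schema type: " + type);
  }
}
```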
Merging this in after renaming the tests and addressing comments since prior CI has already succeeded. Test renames = no logic change, so this should be safe.
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

```
@@             Coverage Diff              @@
##             master   #18580      +/-   ##
============================================
- Coverage     68.88%   68.06%   -0.82%
- Complexity    28532    28920     +388
============================================
  Files          2479     2518      +39
  Lines        136810   140570    +3760
  Branches      16660    17416     +756
============================================
+ Hits          94244    95684    +1440
- Misses        34982    37030    +2048
- Partials      7584     7856     +272
```
Flags with carried forward coverage won't be shown.
Describe the issue this Pull Request addresses
Closes #18578
- `HoodieArrayWritableSchemaUtils.rewriteRecordWithNewSchemaInternal` switches on `newSchema.getType()` and only names `RECORD`, `ENUM`, `ARRAY`, `MAP`, `UNION`.
- BLOB (#18108) and VARIANT (#17833) are Hudi logical types physically stored as Avro records but exposed as distinct `HoodieSchemaType` values, so a `newSchema` typed BLOB or VARIANT fell through to `rewritePrimaryType` and threw `cannot support rewrite value for schema type`.
- Reproduces on the Hive read path every time Hive projects its HMS-derived struct shape onto Hudi's canonical BLOB record (a sketch of the two shapes follows this list):
  - HMS-derived shape: record name = column name, `type` field = plain STRING.
  - Hudi canonical shape: record `"blob"`, `type` = ENUM `blob_storage_type`, `logicalType: "blob"`.
- VECTOR was fine by accident; it maps to Avro FIXED.
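To make the failure signature concrete, here is a sketch of the two shapes built with plain Avro `SchemaBuilder`. Only the shapes come from the description above; the column name, the enum symbols, and the types of the non-`type` fields are assumptions for illustration:

```java
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;

// Sketch of the mismatched schema pair described above. The column name
// "blob_data", the enum symbols, and the "data"/"reference" field types are
// illustrative assumptions, not Hudi's actual definitions.
public class BlobSchemaShapesSketch {

  public static void main(String[] args) {
    // HMS-derived shape: record named after the column, "type" is a plain string.
    Schema hmsShape = SchemaBuilder.record("blob_data").fields()
        .name("type").type().stringType().noDefault()
        .name("data").type().bytesType().noDefault()
        .name("reference").type().stringType().noDefault()
        .endRecord();

    // Hudi's canonical BLOB shape: record "blob", "type" is an ENUM, logicalType "blob".
    Schema canonicalBlob = SchemaBuilder.record("blob")
        .prop("logicalType", "blob")
        .fields()
        .name("type").type().enumeration("blob_storage_type")
            .symbols("INLINE", "REFERENCE").noDefault() // symbols are hypothetical
        .name("data").type().bytesType().noDefault()
        .name("reference").type().stringType().noDefault()
        .endRecord();

    System.out.println(hmsShape.toString(true));
    System.out.println(canonicalBlob.toString(true));
  }
}
```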
Summary and Changelog
- Add `case BLOB` and `case VARIANT` fallthrough to the existing `RECORD` body.
- BLOB's `{type, data, reference}` and VARIANT's `{metadata, value}` layouts are pinned by their `LogicalType.validate()` contracts, so the existing field-by-name iteration in the RECORD body is correct for both.
- The existing `case ENUM` already converts STRING to ENUM for BLOB's `type` field.
- Regression tests in `TestHoodieArrayWritableSchemaUtils` pin the fix as a unit test, with no Spark / Hive / Testcontainers: `testRewriteBlobToBlobProjectionEquivalentShortCircuits` + `testRewritePlainRecordToBlobSucceedsAfterFix`. Each test feeds the exact schema pair seen in the E2E failure signature and asserts that the rewrite path now succeeds (pre-fix: throws). A sketch of the test shape follows this list.
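A sketch of what the plain-record→canonical-BLOB regression test might look like. The schema helpers and the call under test are stubbed here; the real tests use Hudi's schema utilities and call `HoodieArrayWritableSchemaUtils.rewriteRecordWithNewSchema` directly:

```java
import static org.junit.jupiter.api.Assertions.assertDoesNotThrow;

import org.junit.jupiter.api.Test;

// Sketch only: the helpers below are hypothetical stand-ins for the schema
// construction and rewrite call exercised by the real tests.
class RewriteRegressionSketch {

  @Test
  void rewritePlainBlobRecordToCanonicalBlobSchema() {
    Object oldSchema = plainBlobRecordShape(); // HMS-derived: "type" is a plain STRING
    Object newSchema = canonicalBlobShape();   // canonical: "type" is ENUM blob_storage_type
    // Pre-fix this path threw "cannot support rewrite value for schema type";
    // post-fix the rewrite succeeds.
    assertDoesNotThrow(() -> rewriteRecordWithNewSchema(oldSchema, newSchema));
  }

  private Object plainBlobRecordShape() { return new Object(); }

  private Object canonicalBlobShape() { return new Object(); }

  private Object rewriteRecordWithNewSchema(Object oldSchema, Object newSchema) {
    return oldSchema; // stand-in for the production rewrite under test
  }
}
```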
Impact
- Hive reads that project onto canonical BLOB/VARIANT schemas via `projectRecord` stop crashing.

Risk Level
low
- Inner field layouts are fixed by `BlobLogicalType.validate` / `VariantLogicalType.validate`, so the RECORD body's field-by-name iteration cannot produce a different result for them than for the corresponding plain record (a minimal illustration follows).
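For intuition, a minimal illustration (plain Java, not Hudi code) of why name-driven iteration is layout-stable; the field names mirror BLOB's layout, the values are made up:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: the copy below is driven by field *names*, not positions,
// so any record whose field set matches the target layout (as BLOB's
// {type, data, reference} must, per its validate() contract) rewrites exactly
// like the corresponding plain record.
public class FieldByNameRewriteSketch {

  static Map<String, Object> rewriteByName(Map<String, Object> oldRecord, List<String> newFields) {
    Map<String, Object> rewritten = new LinkedHashMap<>();
    for (String field : newFields) {
      rewritten.put(field, oldRecord.get(field)); // lookup by name, independent of position
    }
    return rewritten;
  }

  public static void main(String[] args) {
    Map<String, Object> plainRecord = new LinkedHashMap<>();
    plainRecord.put("type", "INLINE");               // plain STRING in the HMS-derived shape
    plainRecord.put("data", "0xCAFE");               // placeholder payload
    plainRecord.put("reference", "s3://bucket/obj"); // hypothetical reference value
    System.out.println(rewriteByName(plainRecord, List.of("type", "data", "reference")));
  }
}
```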
Documentation Update

none
Contributor's checklist