fix(metadata): Exclude Variant/Blob/Vector from V1 column stats #18695
voonhous wants to merge 1 commit into apache:master
V2 already filters all three types; V1 (used by bloom filters unconditionally and by column/partition stats on table v8) was missing BLOB/VECTOR in the AVRO branch and VECTOR in the SPARK branch, letting indexes silently include columns whose stats are meaningless. Also clarifies the expression-index error message to list VARIANT/BLOB/VECTOR alongside RECORD/ARRAY/MAP.
hudi-agent
left a comment
🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
Thanks for the contribution! This PR closes a real gap by excluding Variant/Blob/Vector columns from V1 column-stats/partition-stats/bloom-filter/expression-index population, matching the V2 behavior. The note about HoodieSchemaType.VECTOR.toAvroType() == FIXED not catching VECTOR via the existing type != FIXED check is a useful catch — the V1 check operates on the HoodieSchemaType enum, so VECTOR did need to be listed explicitly. No issues flagged from this automated pass — a Hudi committer or PMC member can take it from here for a final review.
cc @yihua
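The VECTOR/FIXED point raised in the review can be illustrated with a minimal Java sketch. The class, enums, and method names below are hypothetical stand-ins, not Hudi's actual code; the only fact taken from the PR is that VECTOR maps to the Avro type FIXED while the V1 check compares the logical enum value.

```java
// Hypothetical stand-ins for illustration only; these are NOT Hudi's real classes.
final class VectorGateSketch {

    // Physical Avro-level types (illustrative subset).
    enum AvroType { INT, STRING, FIXED }

    // Logical schema types, each carrying its physical Avro representation.
    enum SchemaType {
        INT(AvroType.INT),
        STRING(AvroType.STRING),
        FIXED(AvroType.FIXED),
        VECTOR(AvroType.FIXED); // per the PR: VECTOR.toAvroType() is FIXED

        private final AvroType avroType;

        SchemaType(AvroType avroType) {
            this.avroType = avroType;
        }

        AvroType toAvroType() {
            return avroType;
        }
    }

    // Shape of the check before the fix: it compares the logical enum value,
    // so VECTOR is a different constant from FIXED and slips through.
    static boolean supportedBefore(SchemaType type) {
        return type != SchemaType.FIXED;
    }

    // Shape of the check after the fix: VECTOR is listed explicitly.
    static boolean supportedAfter(SchemaType type) {
        return type != SchemaType.FIXED && type != SchemaType.VECTOR;
    }

    public static void main(String[] args) {
        System.out.println(SchemaType.VECTOR.toAvroType());     // FIXED
        System.out.println(supportedBefore(SchemaType.VECTOR)); // true
        System.out.println(supportedAfter(SchemaType.VECTOR));  // false
    }
}
```

Comparing on the Avro type instead would also have caught VECTOR in this sketch, but the PR keeps the enum-based check and lists the type explicitly.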
Codecov Report
✅ All modified and coverable lines are covered by tests.

Coverage Diff
              master    #18695    +/-
Coverage      68.07%    68.08%    +0.01%
Complexity    28943     28945     +2
Files         2519      2519
Lines         140664    140664
Branches      17428     17428
Hits          95757     95772     +15
Misses        37043     37031     -12
Partials      7864      7861      -3
Describe the issue this Pull Request addresses
Variant, Blob, and Vector are recently added types. Index code (column stats, partition stats, bloom filters, expression-index column gating) was not taught about them. Stats on these columns are meaningless until proper support lands. Disable index population on these columns by default for now.
Summary and Changelog
V2 already excluded all three types. The gaps were in V1, which is still active for:
- bloom filters (unconditionally)
- column stats and partition stats on table v8
Changes:
- HoodieTableMetadataUtil.isColumnTypeSupportedV1:
  - AVRO branch: now excludes BLOB and VECTOR.
  - SPARK branch: now excludes VECTOR (BLOB and VARIANT were already excluded).
- HoodieIndexUtils: the expression-index error message now lists VARIANT, BLOB, VECTOR alongside RECORD, ARRAY, MAP. Behavior was already correct via HoodieSchemaType.isComplex(); only the message text was stale.
- TestHoodieTableMetadataUtil: new testVariantBlobVectorColumnsAreNotSupportedForV1ColumnStats covers all three types under both AVRO and SPARK record types in V1.
Note: HoodieSchemaType.VECTOR.toAvroType() is FIXED, but the V1 check switches on the HoodieSchemaType enum, so the existing type != FIXED does not catch VECTOR. It must be listed explicitly.
Impact
User-facing: indexes silently skip Variant/Blob/Vector columns instead of indexing garbage. No on-disk format change. No public API change. Secondary index was already protected by its allow-list. Expression index was already blocked behaviorally; only its error wording changed.
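As a rough illustration of what "silently skip" means here, the sketch below filters a column list through a single type predicate. All names (ColumnGateSketch, ColType, indexableColumns, the sample columns) are hypothetical, and the deny-set contains only the three types this PR discusses; Hudi's real isColumnTypeSupported applies further rules not modeled here.

```java
import java.util.ArrayList;
import java.util.EnumSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; not Hudi's actual implementation.
final class ColumnGateSketch {

    // Illustrative subset of column types.
    enum ColType { INT, STRING, VARIANT, BLOB, VECTOR }

    // Assumption: only the three types named in the PR are denied in this sketch.
    private static final Set<ColType> UNSUPPORTED =
            EnumSet.of(ColType.VARIANT, ColType.BLOB, ColType.VECTOR);

    // Single predicate that every column-list builder consults.
    static boolean isColumnTypeSupported(ColType type) {
        return !UNSUPPORTED.contains(type);
    }

    // Keeps only columns whose type passes the predicate; others are
    // dropped without raising an error.
    static List<String> indexableColumns(Map<String, ColType> schema) {
        List<String> columns = new ArrayList<>();
        for (Map.Entry<String, ColType> entry : schema.entrySet()) {
            if (isColumnTypeSupported(entry.getValue())) {
                columns.add(entry.getKey());
            }
        }
        return columns;
    }

    public static void main(String[] args) {
        Map<String, ColType> schema = new LinkedHashMap<>();
        schema.put("id", ColType.INT);
        schema.put("name", ColType.STRING);
        schema.put("payload", ColType.BLOB);
        schema.put("embedding", ColType.VECTOR);
        schema.put("attrs", ColType.VARIANT);
        System.out.println(indexableColumns(schema)); // [id, name]
    }
}
```

Routing every builder through one predicate is what keeps a change like this small: tightening the predicate narrows all dependent indexes at once.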
Risk Level
low
Single chokepoint (isColumnTypeSupported) feeds every column-list builder. V2 path is unchanged. V1 change strictly narrows the supported set; no previously-supported type is now rejected.
Documentation Update
none
Contributor's checklist