fix(metadata): Exclude Variant/Blob/Vector from V1 column stats #18695

Open
voonhous wants to merge 1 commit into apache:master from voonhous:disable-index-for-logical-types

@voonhous (Member) commented May 6, 2026

Describe the issue this Pull Request addresses

Variant, Blob, and Vector are recently added types. Index code (column stats, partition stats, bloom filters, expression-index column gating) was not taught about them. Stats on these columns are meaningless until proper support lands. Disable index population on these columns by default for now.

Summary and Changelog

V2 already excluded all three types. The gaps were in V1, which is still active for:

  • BLOOM_FILTERS (always V1)
  • COLUMN_STATS / PARTITION_STATS / EXPRESSION_INDEX on table version 8

Changes:

  • HoodieTableMetadataUtil.isColumnTypeSupportedV1:
    • AVRO branch now also excludes BLOB and VECTOR.
    • SPARK branch now also excludes VECTOR (BLOB and VARIANT were already excluded).
  • HoodieIndexUtils expression-index error message now lists VARIANT, BLOB, VECTOR alongside RECORD, ARRAY, MAP. Behavior was already correct via HoodieSchemaType.isComplex(); only the message text was stale.
  • TestHoodieTableMetadataUtil: new testVariantBlobVectorColumnsAreNotSupportedForV1ColumnStats covers all three types under both AVRO and SPARK record types in V1.
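
The per-record-type exclusion described in the list above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the actual `HoodieTableMetadataUtil` code: the `V1ColumnTypeSketch` class, the `SchemaType` and `RecordType` enums, the method signature, and the reduced set of types are all assumptions made for the example.

```java
import java.util.EnumSet;
import java.util.Set;

final class V1ColumnTypeSketch {
  // Hypothetical stand-ins for Hudi's schema-type and record-type enums.
  enum SchemaType { INT, LONG, STRING, VARIANT, BLOB, VECTOR }
  enum RecordType { AVRO, SPARK }

  // After the change, both branches exclude the same three logical types.
  // (Per the PR text, AVRO gained BLOB and VECTOR; SPARK gained VECTOR.)
  private static final Set<SchemaType> AVRO_EXCLUDED =
      EnumSet.of(SchemaType.VARIANT, SchemaType.BLOB, SchemaType.VECTOR);
  private static final Set<SchemaType> SPARK_EXCLUDED =
      EnumSet.of(SchemaType.VARIANT, SchemaType.BLOB, SchemaType.VECTOR);

  // Hypothetical signature; the real method may take different arguments.
  static boolean isColumnTypeSupportedV1(SchemaType type, RecordType recordType) {
    switch (recordType) {
      case AVRO:
        return !AVRO_EXCLUDED.contains(type);
      case SPARK:
        return !SPARK_EXCLUDED.contains(type);
      default:
        throw new IllegalArgumentException("Unknown record type: " + recordType);
    }
  }
}
```

Under these assumptions, the new test's expectation amounts to every combination of the three types and the two record types returning `false`, while ordinary primitive types stay supported.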

Note: HoodieSchemaType.VECTOR.toAvroType() is FIXED, but the V1 check switches on the HoodieSchemaType enum, so the existing type != FIXED does not catch VECTOR. It must be listed explicitly.
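
To illustrate the note above, here is a small sketch of why a check on the schema-type enum misses VECTOR even though VECTOR maps to Avro FIXED. The enums and method names below are hypothetical stand-ins, not Hudi's real classes; only the VECTOR-to-FIXED mapping is taken from the PR description.

```java
final class VectorFixedSketch {
  // Hypothetical, reduced model of the Avro type space.
  enum AvroType { INT, STRING, FIXED }

  // Hypothetical, reduced model of the schema-type enum.
  // VECTOR is its own constant but serializes as Avro FIXED.
  enum SchemaType {
    INT(AvroType.INT),
    STRING(AvroType.STRING),
    FIXED(AvroType.FIXED),
    VECTOR(AvroType.FIXED);

    private final AvroType avroType;

    SchemaType(AvroType avroType) {
      this.avroType = avroType;
    }

    AvroType toAvroType() {
      return avroType;
    }
  }

  // Check written against the schema-type enum: VECTOR != FIXED, so VECTOR slips through.
  static boolean supportedWithFixedCheckOnly(SchemaType type) {
    return type != SchemaType.FIXED;
  }

  // Check with VECTOR listed explicitly, as the PR does for V1.
  static boolean supportedWithExplicitVector(SchemaType type) {
    return type != SchemaType.FIXED && type != SchemaType.VECTOR;
  }
}
```

In this model, `supportedWithFixedCheckOnly(SchemaType.VECTOR)` returns `true` because the comparison is between enum constants, not between Avro types, which is why the explicit entry is needed.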

Impact

User-facing: indexes silently skip Variant/Blob/Vector columns instead of indexing garbage. No on-disk format change. No public API change. Secondary index was already protected by its allow-list. Expression index was already blocked behaviorally; only its error wording changed.

Risk Level

low

Single chokepoint (isColumnTypeSupported) feeds every column-list builder. V2 path is unchanged. V1 change strictly narrows the supported set: only Variant, Blob, and Vector columns are newly rejected, and no previously rejected type becomes accepted.
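
The chokepoint and narrowing claims above can be sketched as follows. This is an illustrative model only: `ChokepointSketch`, `columnsToIndex`, the `BEFORE` and `AFTER` predicates, and the reduced `SchemaType` enum are hypothetical, and the "before" predicate for the AVRO branch (only VARIANT excluded) is inferred from the changelog wording rather than taken from the code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

final class ChokepointSketch {
  // Hypothetical, reduced schema-type enum.
  enum SchemaType { INT, STRING, VARIANT, BLOB, VECTOR }

  // Assumed AVRO-branch behavior before the change: only VARIANT excluded.
  static final Predicate<SchemaType> BEFORE =
      t -> t != SchemaType.VARIANT;

  // Behavior after the change: all three logical types excluded.
  static final Predicate<SchemaType> AFTER =
      t -> t != SchemaType.VARIANT && t != SchemaType.BLOB && t != SchemaType.VECTOR;

  // A column-list builder that relies on a single type predicate,
  // so unsupported columns are skipped rather than raising an error.
  static List<String> columnsToIndex(Map<String, SchemaType> schema,
                                     Predicate<SchemaType> supported) {
    List<String> result = new ArrayList<>();
    for (Map.Entry<String, SchemaType> field : schema.entrySet()) {
      if (supported.test(field.getValue())) {
        result.add(field.getKey());
      }
    }
    return result;
  }

  // True when every type accepted by AFTER was already accepted by BEFORE,
  // i.e. the change only removes types from the supported set.
  static boolean onlyNarrows() {
    for (SchemaType t : SchemaType.values()) {
      if (AFTER.test(t) && !BEFORE.test(t)) {
        return false;
      }
    }
    return true;
  }
}
```

Because every builder goes through the same predicate in this model, tightening the predicate is enough to drop the affected columns from every index without touching the builders themselves.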

Documentation Update

none

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

V2 already filters all three types; V1 (used by bloom filters
unconditionally and by column/partition stats on table v8) was
missing BLOB/VECTOR in the AVRO branch and VECTOR in the SPARK
branch, letting indexes silently include columns whose stats
are meaningless. Also clarifies the expression-index error
message to list VARIANT/BLOB/VECTOR alongside RECORD/ARRAY/MAP.

@hudi-agent (Contributor) left a comment

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for the contribution! This PR closes a real gap by excluding Variant/Blob/Vector columns from V1 column-stats/partition-stats/bloom-filter/expression-index population, matching the V2 behavior. The note about HoodieSchemaType.VECTOR.toAvroType() == FIXED not catching VECTOR via the existing type != FIXED check is a useful catch — the V1 check operates on the HoodieSchemaType enum, so VECTOR did need to be listed explicitly. No issues flagged from this automated pass — a Hudi committer or PMC member can take it from here for a final review.

cc @yihua

@hudi-bot (Collaborator) commented May 6, 2026

CI report:

Bot commands

@hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

@codecov-commenter

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 68.08%. Comparing base (91f341f) to head (027a966).

Additional details and impacted files
@@             Coverage Diff              @@
##             master   #18695      +/-   ##
============================================
+ Coverage     68.07%   68.08%   +0.01%     
- Complexity    28943    28945       +2     
============================================
  Files          2519     2519              
  Lines        140664   140664              
  Branches      17428    17428              
============================================
+ Hits          95757    95772      +15     
+ Misses        37043    37031      -12     
+ Partials       7864     7861       -3     
Flag                          Coverage Δ
common-and-other-modules      44.35% <ø> (+<0.01%) ⬆️
hadoop-mr-java-client         44.93% <ø> (-0.01%) ⬇️
spark-client-hadoop-common    48.42% <ø> (-0.01%) ⬇️
spark-java-tests              48.64% <ø> (+<0.01%) ⬆️
spark-scala-tests             44.76% <ø> (+<0.01%) ⬆️
utilities                     37.66% <ø> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown.

Files with missing lines Coverage Δ
...n/java/org/apache/hudi/index/HoodieIndexUtils.java 94.50% <ø> (+1.83%) ⬆️
.../apache/hudi/metadata/HoodieTableMetadataUtil.java 82.36% <ø> (ø)

... and 9 files with indirect coverage changes

