
feat(flink): Support data skipping based on column stats for source V2#18706

Merged
danny0405 merged 1 commit into apache:master from cshuo:support_data_skipping_col_stats_v2
May 9, 2026

Conversation

@cshuo
Collaborator

@cshuo cshuo commented May 8, 2026

Describe the issue this Pull Request addresses

Flink Source V2 did not propagate the pushed-down ColumnStatsProbe into its FileIndex, so column-stats-based data skipping was not applied on the Source V2 path even when filter pushdown and data skipping were enabled.

This PR wires the existing column stats pruning context through Source V2 so that it can use the same FileStatsIndex pruning path already available to the file index. Fixes #18703.

Summary and Changelog

  • Added ColumnStatsProbe to HoodieScanContext.
  • Passed columnStatsProbe from HoodieTableSource into the Source V2 scan context.
  • Passed the scan context's ColumnStatsProbe into FileIndex from HoodieSource.buildFileIndex().
  • Updated TestHoodieSource coverage to validate Source V2 split pruning with partition stats and column stats paths.
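
The wiring listed above can be pictured with a minimal, self-contained sketch. Every class name ending in `Sketch` is a hypothetical stand-in rather than the real Hudi class, and the builder shape is an assumption; only the direction of the hand-off (table source, then scan context, then file index) is taken from this PR.

```java
// Hypothetical stand-in for ColumnStatsProbe: only records which columns the
// pushed-down filters reference.
final class ProbeSketch {
  final String[] referencedCols;

  ProbeSketch(String... referencedCols) {
    this.referencedCols = referencedCols;
  }
}

// Hypothetical stand-in for HoodieScanContext: carries the probe from the
// table source to the Source V2 implementation.
final class ScanContextSketch {
  private final ProbeSketch columnStatsProbe; // null when no filter was pushed down

  private ScanContextSketch(Builder builder) {
    this.columnStatsProbe = builder.columnStatsProbe;
  }

  ProbeSketch getColumnStatsProbe() {
    return columnStatsProbe;
  }

  static Builder builder() {
    return new Builder();
  }

  static final class Builder {
    private ProbeSketch columnStatsProbe;

    Builder columnStatsProbe(ProbeSketch probe) {
      this.columnStatsProbe = probe;
      return this;
    }

    ScanContextSketch build() {
      return new ScanContextSketch(this);
    }
  }
}

// Hypothetical stand-in for FileIndex: it can only prune by column stats
// if a probe was handed to it.
final class FileIndexSketch {
  private final ProbeSketch probe;

  FileIndexSketch(ProbeSketch probe) {
    this.probe = probe;
  }

  boolean canSkipByColumnStats() {
    return probe != null;
  }
}

// Hypothetical stand-in for HoodieSource.buildFileIndex(): forwards the
// scan context's probe instead of dropping it.
final class SourceSketch {
  static FileIndexSketch buildFileIndex(ScanContextSketch scanContext) {
    return new FileIndexSketch(scanContext.getColumnStatsProbe());
  }
}
```

In this sketch, a scan context built without a probe yields a file index that cannot prune by column stats, which mirrors the behavior described in the issue before the fix.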

Impact

  • Enables Source V2 to use the existing column-stats-based file-level pruning when data skipping and metadata column stats are enabled.
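
As a rough illustration of what file-level pruning with column stats means, the sketch below assumes each file carries a min and max value for one numeric column and that the pushed-down predicate has the form `col > threshold`. The names `FileStatsSketch` and `MinMaxPruningSketch` are hypothetical, and the actual pruning path in Hudi handles more predicate types than this.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical per-file statistics for a single numeric column.
final class FileStatsSketch {
  final String fileName;
  final long min;
  final long max;

  FileStatsSketch(String fileName, long min, long max) {
    this.fileName = fileName;
    this.min = min;
    this.max = max;
  }
}

final class MinMaxPruningSketch {
  // Keeps only the files that may contain a row with col > threshold.
  // A file whose max is <= threshold cannot match, so it is skipped
  // without being read.
  static List<String> pruneGreaterThan(List<FileStatsSketch> files, long threshold) {
    List<String> kept = new ArrayList<>();
    for (FileStatsSketch file : files) {
      if (file.max > threshold) {
        kept.add(file.fileName);
      }
    }
    return kept;
  }
}
```

The check is conservative: a kept file may still contain no matching rows, but a skipped file is guaranteed to contain none, so the query result is unchanged.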

Risk Level

low

Documentation Update

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

Contributor

@hudi-agent hudi-agent left a comment


🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for the contribution! This PR wires ColumnStatsProbe from HoodieTableSource through HoodieScanContext into the Source V2 FileIndex, enabling existing column-stats-based data skipping on the V2 path. The change is small and the test updates exercise both the partition stats and column stats pruning paths. No issues flagged from this automated pass — a Hudi committer or PMC member can take it from here for a final review.

cc @yihua

@github-actions github-actions Bot added the size:S PR with lines of changes in (10, 100] label May 8, 2026
@hudi-bot
Collaborator

hudi-bot commented May 8, 2026

CI report:

Bot commands: @hudi-bot supports the following command:
  • @hudi-bot run azure: re-run the last Azure build

@codecov-commenter

Codecov Report

❌ Patch coverage is 50.00000% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 68.08%. Comparing base (47bf4e4) to head (18070b5).

Files with missing lines Patch % Lines
.../java/org/apache/hudi/table/HoodieTableSource.java 0.00% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #18706      +/-   ##
============================================
- Coverage     68.14%   68.08%   -0.07%     
+ Complexity    29077    29040      -37     
============================================
  Files          2522     2522              
  Lines        141177   141179       +2     
  Branches      17514    17514              
============================================
- Hits          96208    96116      -92     
- Misses        37061    37147      +86     
- Partials       7908     7916       +8     
Flag Coverage Δ
common-and-other-modules 44.42% <50.00%> (+<0.01%) ⬆️
hadoop-mr-java-client 45.01% <ø> (+<0.01%) ⬆️
spark-client-hadoop-common 48.35% <ø> (+<0.01%) ⬆️
spark-java-tests 49.00% <ø> (+<0.01%) ⬆️
spark-scala-tests 44.72% <ø> (-0.19%) ⬇️
utilities 37.63% <ø> (+<0.01%) ⬆️


Files with missing lines Coverage Δ
...java/org/apache/hudi/source/HoodieScanContext.java 100.00% <ø> (ø)
...main/java/org/apache/hudi/source/HoodieSource.java 53.62% <100.00%> (+0.68%) ⬆️
.../java/org/apache/hudi/table/HoodieTableSource.java 56.99% <0.00%> (-0.20%) ⬇️

... and 39 files with indirect coverage changes


@danny0405 danny0405 merged commit bab463a into apache:master May 9, 2026
60 of 63 checks passed

Labels

size:S PR with lines of changes in (10, 100]


Development

Successfully merging this pull request may close these issues.

Support data skipping based on column stats for source V2

5 participants