Skip to content

[MINOR] Fix NPE in getInputFileSlices when RO path filter returns empty partition#18639

Open
prashantwason wants to merge 1 commit into
apache:masterfrom
prashantwason:fix-empty-partition-rotablefilter-npe
Open

[MINOR] Fix NPE in getInputFileSlices when RO path filter returns empty partition#18639
prashantwason wants to merge 1 commit into
apache:masterfrom
prashantwason:fix-empty-partition-rotablefilter-npe

Conversation

@prashantwason
Copy link
Copy Markdown
Member

Describe the issue this Pull Request addresses

Closes #18638

When hoodie.datasource.read.file.index.list.file.statuses.using.ro.path.filter=true is set on a COW (or READ_OPTIMIZED) table that contains a partition with no base files, BaseHoodieTableFileIndex.getInputFileSlices throws NullPointerException inside Collectors.uniqKeysMapAccumulator during Spark query planning.

Summary and Changelog

generatePartitionFileSlicesPostROTablePathFilter (added in #18136) builds its result map by iterating over the file list, so a partition with zero files produces no entry. The caller getInputFileSlices then does Collectors.toMap(identity, cache::get)cache::get returns null for the missing partition, and Collectors.toMap rejects null values via Objects.requireNonNull.

This change pre-populates the result map with empty file-slice lists for every input partition before processing files, restoring the contract already honored by filterFiles (the non-RO path), which iterates over partitions rather than over files.

Changes:

  • hudi-common/src/main/java/org/apache/hudi/BaseHoodieTableFileIndex.java: pre-populate partitionToFileSlices in generatePartitionFileSlicesPostROTablePathFilter so empty partitions appear in the returned map.
  • hudi-common/src/test/java/org/apache/hudi/BaseHoodieTableFileIndexTest.java: regression test testGeneratePartitionFileSlicesPostROTablePathFilterIncludesEmptyPartitions exercising a mix of partitions with and without files.

Impact

Fixes NPE on queries against COW (and READ_OPTIMIZED) tables that have empty partitions when the RO path filter listing optimization is enabled. No behavior change when no partitions are empty or when the optimization is disabled (default).

Risk Level

low

The fix is a one-line additive pre-population of a HashMap that downstream code already keys into. It strengthens the existing contract of the method without changing the file-filtering logic or the non-RO code path.

Documentation Update

none

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

…tion

When hoodie.datasource.read.file.index.list.file.statuses.using.ro.path.filter
is enabled on a COW (or READ_OPTIMIZED) table that contains a partition with
zero base files, querying the table throws NullPointerException inside
Collectors.uniqKeysMapAccumulator.

generatePartitionFileSlicesPostROTablePathFilter built its result map by
iterating over the file list, so a partition with no files produced no entry.
The caller getInputFileSlices then did Collectors.toMap(identity, cache::get)
where cache::get returned null for the missing partition, and Collectors.toMap
rejects null values via Objects.requireNonNull.

Pre-populate the result map with empty file-slice lists for every input
partition before processing files. This restores the contract already honored
by filterFiles (the non-RO path), which iterates over partitions and naturally
returns an entry per partition.

Added a regression test in BaseHoodieTableFileIndexTest covering an input
that mixes a partition with files and partitions with none.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

@hudi-agent hudi-agent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for the contribution! This PR fixes an NPE in getInputFileSlices when the RO path filter optimization is enabled and a partition has no base files, by pre-populating the result map with empty file-slice lists for every input partition. No correctness issues found. A few style/readability suggestions in the inline comments. Please take a look, and this should be ready for a Hudi committer or PMC member to take it from here. One minor inconsistency in the test; the production change and its comment are clean.

cc @yihua

@Test
public void testGeneratePartitionFileSlicesPostROTablePathFilterIncludesEmptyPartitions() throws Exception {
BaseHoodieTableFileIndex fileIndex = mock(BaseHoodieTableFileIndex.class,
org.mockito.Mockito.CALLS_REAL_METHODS);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 nit: mock on the line above is already statically imported, so it might be worth adding import static org.mockito.Mockito.CALLS_REAL_METHODS; and just writing CALLS_REAL_METHODS here for consistency.

- AI-generated; verify before applying. React 👍/👎 to flag quality.

@hudi-bot
Copy link
Copy Markdown
Collaborator

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

// Without this, the caller's Collectors.toMap(identity, cache::get) NPEs on empty partitions
// because cache.get returns null and toMap rejects null values. This matches the contract
// already honored by filterFiles, which iterates over partitions rather than over files.
partitions.forEach(p -> partitionToFileSlices.put(p, new ArrayList<>()));
Copy link
Copy Markdown
Contributor

@danny0405 danny0405 Apr 29, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, can we use Collecitons.emptyList instead of new ArrayList<>() ?

@github-actions github-actions Bot added the size:S PR with lines of changes in (10, 100] label Apr 29, 2026
@codecov-commenter
Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 68.07%. Comparing base (642e1d3) to head (8672f5a).
⚠️ Report is 19 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff            @@
##             master   #18639   +/-   ##
=========================================
  Coverage     68.06%   68.07%           
+ Complexity    28919    28911    -8     
=========================================
  Files          2518     2518           
  Lines        140570   140571    +1     
  Branches      17416    17419    +3     
=========================================
+ Hits          95680    95692   +12     
+ Misses        37033    37022   -11     
  Partials       7857     7857           
Flag Coverage Δ
common-and-other-modules 44.37% <100.00%> (+<0.01%) ⬆️
hadoop-mr-java-client 44.84% <0.00%> (+<0.01%) ⬆️
spark-client-hadoop-common 48.43% <0.00%> (+<0.01%) ⬆️
spark-java-tests 48.63% <100.00%> (-0.02%) ⬇️
spark-scala-tests 44.69% <0.00%> (-0.01%) ⬇️
utilities 37.68% <0.00%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
...java/org/apache/hudi/BaseHoodieTableFileIndex.java 89.34% <100.00%> (+5.80%) ⬆️

... and 10 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size:S PR with lines of changes in (10, 100]

Projects

None yet

Development

Successfully merging this pull request may close these issues.

NPE in BaseHoodieTableFileIndex.getInputFileSlices when RO path filter is enabled and a queried partition has no base files

5 participants