refactor(metadata): Replace misused stream reduce with a plain for-loop by voonhous · Pull Request #18532 · apache/hudi

voonhous · 2026-04-19T14:22:07Z

Describe the issue this Pull Request addresses

HoodieTableMetadataUtil.convertMetadataToFilesPartitionRecords aggregated per-partition write stats via writeStats.stream().reduce(new HashMap<>(...), accumulator, CollectionUtils::combine).

The "identity" is a mutable HashMap that the accumulator mutates and returns which is a misuse of Stream.reduce, whose contract assumes the identity is safe to combine with any element as a no-op.

The only reason this works is that the stream is sequential and the method runs on the driver (the caller, HoodieMetadataWriteUtils, then wraps the returned list via context.parallelize(..., 1)).

A plain for-loop expresses the same aggregation directly and is the idiomatic shape for mutable-accumulation sequential code.

Summary and Changelog

Internal readability/idiom cleanup in the metadata-table write path. No behavior change.

Replaced the writeStats.stream().reduce(...) call with an imperative for (HoodieWriteStat stat : writeStats) loop that builds the updatedFilesToSizesMapping HashMap directly.
Same merge semantics (Math::max on per-file size; CDC path/size entries overlaid).
No change to the surrounding partitionToWriteStats.entrySet().stream().map(...) pipeline.

Impact

None. Readability and idiom cleanup only; behavior and allocation shape are materially unchanged.

Risk Level

low

Documentation Update

none

Contributor's checklist

Read through contributor's guide
Enough context is provided in the sections above
Adequate tests were added if applicable

- HoodieTableMetadataUtil.convertMetadataToFilesPartitionRecords aggregated per-partition write stats via writeStats.stream().reduce(new HashMap<>(), accumulator, CollectionUtils::combine). - The "identity" is a mutable HashMap that the accumulator mutates in place - a misuse of Stream.reduce. - It only works because the stream is sequential and the method runs on the driver (HoodieMetadataWriteUtils then wraps the result via context.parallelize(..., 1)). - A plain for-loop expresses the same aggregation directly and is idiomatic for mutable-accumulation sequential code. - No behavior change. No measurable perf impact - readability/idiom cleanup.

hudi-bot · 2026-04-19T16:02:54Z

CI report:

a61f79c Azure: SUCCESS

Bot commands

@hudi-bot supports the following commands:

@hudi-bot run azure re-run the last Azure build

hudi-agent

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for the cleanup! This refactor replaces a misused Stream.reduce (with a mutable HashMap identity) with an equivalent for-loop, preserving the same merge semantics. No issues flagged from this automated pass — a Hudi committer or PMC member can take it from here for a final review.

cc @yihua

danny0405 · 2026-04-20T02:27:06Z

-                        return map;
-                      },
-                      CollectionUtils::combine);
+              HashMap<String, Long> updatedFilesToSizesMapping = new HashMap<>(writeStats.size());


is it possible we use the collect API instead:

Map<String, String> result = instances.stream() .collect(Collectors.toMap( Instance::deriveKey, Instance::deriveValue, (existing, replacement) -> existing // Merge function if keys collide ));

You mean to simplify the entire thing to use Collectors#toMap?

Don't think it's possible...

Each HoodieWriteStat can produce 1 main-file entry plus 0..N CDC entries. A bare #toMap can't express that, we will need #flatMap first to explode each stat into a stream of entries.

There is asymmetric merge semantics, the main path uses Math::max (file sizes monotonically increase, so we keep the largest reported size), while CDC entries use plain put (last write wins). #toMap takes a single merge function, so the two can't be expressed together cleanly if they ever share a key. In prod, IIUC they do not collide, the original code preserves that asymmetry. Changing toMap might override this.

A bare #toMap can't express that, we will need #flatMap first

flatMap sounds good to me, did you ever try this?

Will explore further and do this fix when i have time. Will ping you again for a followup fix.

Am focusing on the unstructured track for now.

codecov-commenter · 2026-05-15T16:40:35Z

Codecov Report

❌ Patch coverage is 69.23077% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.84%. Comparing base (eaaae8a) to head (a61f79c).
⚠️ Report is 86 commits behind head on master.

Files with missing lines	Patch %	Lines
.../apache/hudi/metadata/HoodieTableMetadataUtil.java	69.23%	2 Missing and 2 partials ⚠️

Additional details and impacted files

@@            Coverage Diff            @@
##             master   #18532   +/-   ##
=========================================
  Coverage     68.84%   68.84%           
- Complexity    28323    28336   +13     
=========================================
  Files          2467     2467           
  Lines        135839   135840    +1     
  Branches      16483    16481    -2     
=========================================
+ Hits          93518    93520    +2     
+ Misses        34922    34921    -1     
  Partials       7399     7399

Flag	Coverage Δ
common-and-other-modules	`44.66% <69.23%> (+<0.01%)`	⬆️
hadoop-mr-java-client	`44.77% <53.84%> (+<0.01%)`	⬆️
spark-client-hadoop-common	`48.41% <53.84%> (+<0.01%)`	⬆️
spark-java-tests	`48.93% <69.23%> (-0.01%)`	⬇️
spark-scala-tests	`45.44% <69.23%> (-0.01%)`	⬇️
utilities	`38.19% <53.84%> (+0.01%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines	Coverage Δ
.../apache/hudi/metadata/HoodieTableMetadataUtil.java	`82.14% <69.23%> (-0.07%)`	⬇️

... and 8 files with indirect coverage changes

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

voonhous requested review from danny0405 and yihua April 19, 2026 14:22

github-actions Bot added the size:S PR with lines of changes in (10, 100] label Apr 19, 2026

voonhous changed the title ~~perf(metadata): replace stream reduce with imperative loop for write-…~~ refactor(metadata): replace stream reduce with imperative loop for write-… Apr 19, 2026

voonhous changed the title ~~refactor(metadata): replace stream reduce with imperative loop for write-…~~ refactor(metadata): Replace misused stream reduce with a plain for-loop Apr 19, 2026

hudi-agent reviewed Apr 19, 2026

View reviewed changes

danny0405 reviewed Apr 20, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

refactor(metadata): Replace misused stream reduce with a plain for-loop#18532

refactor(metadata): Replace misused stream reduce with a plain for-loop#18532
voonhous wants to merge 1 commit into
apache:masterfrom
voonhous:perf-metadata-reduce-to-loop

voonhous commented Apr 19, 2026 •

edited

Loading

Uh oh!

hudi-bot commented Apr 19, 2026

Uh oh!

hudi-agent left a comment

Uh oh!

danny0405 Apr 20, 2026

Uh oh!

voonhous Apr 20, 2026

Uh oh!

danny0405 Apr 21, 2026

Uh oh!

voonhous Apr 21, 2026

Uh oh!

codecov-commenter commented May 15, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Conversation

voonhous commented Apr 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Describe the issue this Pull Request addresses

Summary and Changelog

Impact

Risk Level

Documentation Update

Contributor's checklist

Uh oh!

hudi-bot commented Apr 19, 2026

CI report:

Uh oh!

hudi-agent left a comment

Choose a reason for hiding this comment

Uh oh!

danny0405 Apr 20, 2026

Choose a reason for hiding this comment

Uh oh!

voonhous Apr 20, 2026

Choose a reason for hiding this comment

Uh oh!

danny0405 Apr 21, 2026

Choose a reason for hiding this comment

Uh oh!

voonhous Apr 21, 2026

Choose a reason for hiding this comment

Uh oh!

codecov-commenter commented May 15, 2026

Codecov Report

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

voonhous commented Apr 19, 2026 •

edited

Loading