
[HUDI-XXXXX] fix: Parallelize cloud object existence checks in S3EventsHoodieIncrSource#18252

Open
vinishjail97 wants to merge 2 commits into apache:master from vinishjail97:vinishjail97/parallel-exists-check

Conversation


@vinishjail97 vinishjail97 commented Feb 26, 2026

Summary

  • Add EXISTS_CHECK_PARALLELISM config (default 32 threads per Spark task) for concurrent HEAD requests during file existence checks
  • Repartition the distinct() output by totalExecutorCores instead of leaving it in a single partition (the result of upstream Window.orderBy() plus AQE coalescing)
  • Add per-task thread pool in getCloudObjectMetadataPerPartition using CompletableFuture.supplyAsync with explicit ExecutorService
  • Use CompletableFuture.allOf().join() to wait for all futures concurrently before executor shutdown, preventing SdkInterruptedException on interrupted threads
  • Extract a processRow helper; downgrade the per-file INFO log to DEBUG
  • Add tests for parallel exists check, sequential fallback, and size validation

Context

The upstream unpartitioned Window.orderBy() forces all data into a single partition. AQE then coalesces the downstream distinct() output to 1 partition. This means all file exists checks run sequentially in a single thread — for 176K+ files at ~100ms per HEAD request, that's ~5 hours.
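A quick back-of-envelope check of those numbers (illustrative only; the rates and cluster sizes below are assumptions, not measurements from this PR):

```java
// Back-of-envelope check of the latency claim, assuming ~100 ms per HEAD
// request. All numbers are illustrative, not measured values from the PR.
public class ExistsCheckLatency {
  // Wall-clock hours if `files` checks run one at a time on a single thread.
  public static double hoursSequential(int files, double msPerRequest) {
    return files * msPerRequest / 1000.0 / 3600.0;
  }

  // Wall-clock hours with `concurrency` checks in flight at once
  // (e.g. totalExecutorCores * threadsPerTask after the change).
  public static double hoursParallel(int files, double msPerRequest, int concurrency) {
    return hoursSequential(files, msPerRequest) / concurrency;
  }
}
```

For 176,000 files this gives roughly 4.9 hours sequentially; with, say, 32 cores × 32 threads the ideal wall-clock time drops to well under a minute.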

New approach: repartition by totalExecutorCores and run a 32-thread pool within each partition's mapPartitions lambda. This gives totalCores × 32 concurrent HEAD requests across the cluster.
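A minimal sketch of the per-partition pattern described above, with the S3 HEAD call replaced by a stub (names like `filterExisting` and `headRequest` are illustrative, not the PR's actual helpers):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.stream.Collectors;

// Sketch of the per-partition exists check: a bounded thread pool issues
// concurrent (here, simulated) HEAD requests within a single Spark task.
public class ParallelExistsCheck {
  static final int EXISTS_CHECK_PARALLELISM = 32;

  // Returns only the keys whose (simulated) existence check passed.
  public static List<String> filterExisting(List<String> keys) {
    ExecutorService pool = Executors.newFixedThreadPool(EXISTS_CHECK_PARALLELISM);
    try {
      List<CompletableFuture<String>> futures = keys.stream()
          .map(k -> CompletableFuture.supplyAsync(() -> headRequest(k) ? k : null, pool))
          .collect(Collectors.toList());
      // Single fan-in point: wait for everything before leaving the try block,
      // so shutdownNow() in finally never interrupts an in-flight check.
      CompletableFuture.allOf(futures.toArray(new CompletableFuture[0])).join();
      List<String> existing = new ArrayList<>();
      for (CompletableFuture<String> f : futures) {
        String k = f.join(); // already complete; no blocking here
        if (k != null) {
          existing.add(k);
        }
      }
      return existing;
    } finally {
      pool.shutdownNow();
    }
  }

  // Simulated HEAD request; real code would call FileSystem#getFileStatus.
  static boolean headRequest(String key) {
    return !key.endsWith(".deleted");
  }
}
```

In the real source this body would run inside the mapPartitions lambda, so each of the totalExecutorCores partitions gets its own 32-thread pool.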

Bug fix: executor shutdown race

The original implementation returned a lazy iterator() from inside the try block, causing shutdownNow() in finally to fire before any .join() calls executed. This interrupted in-flight S3 getFileStatus threads, leading to SdkInterruptedException → AbortedException → InterruptedIOException → task failure after 4 retries.

Fix: use CompletableFuture.allOf().join() to wait for all futures at a single fan-in point inside the try block, then eagerly collect results before finally fires.
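The fan-in point also centralizes error handling: allOf().join() surfaces the first failure as a CompletionException wrapping the original cause. A hedged sketch of that handling (not the PR's actual code; the class and method names are hypothetical):

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.stream.Collectors;

// Demonstrates fan-in error handling: allOf().join() throws a
// CompletionException whose getCause() is the underlying task failure.
public class FanInErrorHandling {
  public static String checkAll(List<Boolean> outcomes) {
    ExecutorService pool = Executors.newFixedThreadPool(4);
    try {
      List<CompletableFuture<Void>> futures = outcomes.stream()
          .map(ok -> CompletableFuture.runAsync(() -> {
            if (!ok) {
              throw new IllegalStateException("exists check failed");
            }
          }, pool))
          .collect(Collectors.toList());
      try {
        CompletableFuture.allOf(futures.toArray(new CompletableFuture[0])).join();
        return "all-exist";
      } catch (CompletionException e) {
        // Unwrap so callers see the underlying failure, not the wrapper.
        return "failed: " + e.getCause().getMessage();
      }
    } finally {
      pool.shutdownNow();
    }
  }
}
```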

Test plan

  • TestCloudObjectsSelectorCommon — new tests for parallel exists check, sequential fallback, size validation
  • TestS3EventsHoodieIncrSource — verify no regressions
  • Validate on a staging cluster with large file count

…dieIncrSource

Repartition by totalExecutorCores and run a per-task thread pool for
concurrent HEAD requests during file existence checks. Previously all
checks ran sequentially in a single partition due to upstream Window
coalescing, causing ~5hr latency for 176K+ files.

- Add EXISTS_CHECK_PARALLELISM config (default 32 threads per task)
- Repartition distinct() output by totalCores for cluster utilization
- Add thread pool in getCloudObjectMetadataPerPartition using
  CompletableFuture.supplyAsync with explicit ExecutorService
- Extract processRow helper, fix per-file INFO log to DEBUG
- Add tests for parallel and sequential exists check paths
@vinishjail97 vinishjail97 changed the title [HUDI-XXXXX] Parallelize cloud object existence checks in S3EventsHoodieIncrSource fix:Parallelize cloud object existence checks in S3EventsHoodieIncrSource Feb 26, 2026
@vinishjail97 vinishjail97 changed the title fix:Parallelize cloud object existence checks in S3EventsHoodieIncrSource fix: Parallelize cloud object existence checks in S3EventsHoodieIncrSource Feb 26, 2026
@github-actions github-actions bot added the size:M PR with lines of changes in (100, 300] label Feb 26, 2026
…e in parallel exists check

Replace sequential per-future join() with CompletableFuture.allOf().join() to
wait for all futures concurrently inside the try block. This ensures shutdownNow()
only fires after all work is done, preventing interrupted threads when the caller
consumes results. Also adds proper CompletionException handling on the fan-in join.
@vinishjail97 vinishjail97 changed the title fix: Parallelize cloud object existence checks in S3EventsHoodieIncrSource [HUDI-XXXXX] fix: Parallelize cloud object existence checks in S3EventsHoodieIncrSource Mar 2, 2026
@codecov-commenter

Codecov Report

❌ Patch coverage is 0% with 71 lines in your changes missing coverage. Please review.
✅ Project coverage is 57.23%. Comparing base (fb7b1a5) to head (2e92efd).
⚠️ Report is 15 commits behind head on master.

Files with missing lines Patch % Lines
...es/sources/helpers/CloudObjectsSelectorCommon.java 0.00% 65 Missing ⚠️
...pache/hudi/utilities/config/CloudSourceConfig.java 0.00% 6 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #18252      +/-   ##
============================================
- Coverage     57.30%   57.23%   -0.07%     
- Complexity    18561    18602      +41     
============================================
  Files          1945     1948       +3     
  Lines        106256   106732     +476     
  Branches      13131    13199      +68     
============================================
+ Hits          60885    61086     +201     
- Misses        39648    39888     +240     
- Partials       5723     5758      +35     
Flag Coverage Δ
hadoop-mr-java-client 45.21% <ø> (-0.19%) ⬇️
spark-java-tests 47.40% <0.00%> (-0.03%) ⬇️
spark-scala-tests 45.50% <0.00%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown.

Files with missing lines Coverage Δ
...pache/hudi/utilities/config/CloudSourceConfig.java 0.00% <0.00%> (ø)
...es/sources/helpers/CloudObjectsSelectorCommon.java 0.00% <0.00%> (ø)

... and 49 files with indirect coverage changes


@hudi-bot
Collaborator

hudi-bot commented Mar 2, 2026

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

