
[HUDI-XXXXX] fix: Parallelize cloud object existence checks in S3EventsHoodieIncrSource#18252

Open
vinishjail97 wants to merge 2 commits into apache:master from vinishjail97:vinishjail97/parallel-exists-check

Conversation


@vinishjail97 vinishjail97 commented Feb 26, 2026

Summary

  • Add EXISTS_CHECK_PARALLELISM config (default 32 threads per Spark task) for concurrent HEAD requests during file existence checks
  • Repartition the distinct() output by totalExecutorCores instead of leaving it in a single partition (the result of upstream Window.orderBy() plus AQE coalescing)
  • Add per-task thread pool in getCloudObjectMetadataPerPartition using CompletableFuture.supplyAsync with explicit ExecutorService
  • Use CompletableFuture.allOf().join() to wait for all futures concurrently before executor shutdown, preventing SdkInterruptedException on interrupted threads
  • Extract a processRow helper; downgrade the per-file INFO log to DEBUG
  • Add tests for parallel exists check, sequential fallback, and size validation

Context

The upstream unpartitioned Window.orderBy() forces all data into a single partition. AQE then coalesces the downstream distinct() output to 1 partition. This means all file exists checks run sequentially in a single thread — for 176K+ files at ~100ms per HEAD request, that's ~5 hours.
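A quick back-of-envelope check of those numbers (illustrative only; the rates and cluster sizes below are assumptions, not measurements from this PR):

```java
// Back-of-envelope check of the latency claim, assuming ~100 ms per HEAD
// request. All numbers are illustrative, not measured values from the PR.
public class ExistsCheckLatency {
  // Wall-clock hours if `files` checks run one at a time on a single thread.
  public static double hoursSequential(int files, double msPerRequest) {
    return files * msPerRequest / 1000.0 / 3600.0;
  }

  // Wall-clock hours with `concurrency` checks in flight at once
  // (e.g. totalExecutorCores * threadsPerTask after the change).
  public static double hoursParallel(int files, double msPerRequest, int concurrency) {
    return hoursSequential(files, msPerRequest) / concurrency;
  }
}
```

For 176,000 files this gives roughly 4.9 hours sequentially; with, say, 32 cores × 32 threads the ideal wall-clock time drops to well under a minute.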

New approach: repartition by totalExecutorCores and run a 32-thread pool within each partition's mapPartitions lambda. This gives totalCores × 32 concurrent HEAD requests across the cluster.
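A minimal sketch of the per-partition pattern described above, with the S3 HEAD call replaced by a stub (names like `filterExisting` and `headRequest` are illustrative, not the PR's actual helpers):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.stream.Collectors;

// Sketch of the per-partition exists check: a bounded thread pool issues
// concurrent (here, simulated) HEAD requests within a single Spark task.
public class ParallelExistsCheck {
  static final int EXISTS_CHECK_PARALLELISM = 32;

  // Returns only the keys whose (simulated) existence check passed.
  public static List<String> filterExisting(List<String> keys) {
    ExecutorService pool = Executors.newFixedThreadPool(EXISTS_CHECK_PARALLELISM);
    try {
      List<CompletableFuture<String>> futures = keys.stream()
          .map(k -> CompletableFuture.supplyAsync(() -> headRequest(k) ? k : null, pool))
          .collect(Collectors.toList());
      // Single fan-in point: wait for everything before leaving the try block,
      // so shutdownNow() in finally never interrupts an in-flight check.
      CompletableFuture.allOf(futures.toArray(new CompletableFuture[0])).join();
      List<String> existing = new ArrayList<>();
      for (CompletableFuture<String> f : futures) {
        String k = f.join(); // already complete; no blocking here
        if (k != null) {
          existing.add(k);
        }
      }
      return existing;
    } finally {
      pool.shutdownNow();
    }
  }

  // Simulated HEAD request; real code would call FileSystem#getFileStatus.
  static boolean headRequest(String key) {
    return !key.endsWith(".deleted");
  }
}
```

In the real source this body would run inside the mapPartitions lambda, so each of the totalExecutorCores partitions gets its own 32-thread pool.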

Bug fix: executor shutdown race

The original implementation returned a lazy iterator() from inside the try block, causing shutdownNow() in finally to fire before any .join() calls executed. This interrupted in-flight S3 getFileStatus threads, leading to SdkInterruptedException → AbortedException → InterruptedIOException → task failure after 4 retries.

Fix: use CompletableFuture.allOf().join() to wait for all futures at a single fan-in point inside the try block, then eagerly collect results before finally fires.
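The fan-in point also centralizes error handling: allOf().join() surfaces the first failure as a CompletionException wrapping the original cause. A hedged sketch of that handling (not the PR's actual code; the class and method names are hypothetical):

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.stream.Collectors;

// Demonstrates fan-in error handling: allOf().join() throws a
// CompletionException whose getCause() is the underlying task failure.
public class FanInErrorHandling {
  public static String checkAll(List<Boolean> outcomes) {
    ExecutorService pool = Executors.newFixedThreadPool(4);
    try {
      List<CompletableFuture<Void>> futures = outcomes.stream()
          .map(ok -> CompletableFuture.runAsync(() -> {
            if (!ok) {
              throw new IllegalStateException("exists check failed");
            }
          }, pool))
          .collect(Collectors.toList());
      try {
        CompletableFuture.allOf(futures.toArray(new CompletableFuture[0])).join();
        return "all-exist";
      } catch (CompletionException e) {
        // Unwrap so callers see the underlying failure, not the wrapper.
        return "failed: " + e.getCause().getMessage();
      }
    } finally {
      pool.shutdownNow();
    }
  }
}
```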

Test plan

  • TestCloudObjectsSelectorCommon — new tests for parallel exists check, sequential fallback, size validation
  • TestS3EventsHoodieIncrSource — verify no regressions
  • Validate on a staging cluster with large file count

…dieIncrSource

Repartition by totalExecutorCores and run a per-task thread pool for
concurrent HEAD requests during file existence checks. Previously all
checks ran sequentially in a single partition due to upstream Window
coalescing, causing ~5hr latency for 176K+ files.

- Add EXISTS_CHECK_PARALLELISM config (default 32 threads per task)
- Repartition distinct() output by totalCores for cluster utilization
- Add thread pool in getCloudObjectMetadataPerPartition using
  CompletableFuture.supplyAsync with explicit ExecutorService
- Extract processRow helper, fix per-file INFO log to DEBUG
- Add tests for parallel and sequential exists check paths
@vinishjail97 vinishjail97 changed the title [HUDI-XXXXX] Parallelize cloud object existence checks in S3EventsHoodieIncrSource fix:Parallelize cloud object existence checks in S3EventsHoodieIncrSource Feb 26, 2026
@vinishjail97 vinishjail97 changed the title fix:Parallelize cloud object existence checks in S3EventsHoodieIncrSource fix: Parallelize cloud object existence checks in S3EventsHoodieIncrSource Feb 26, 2026
@github-actions github-actions bot added the size:M PR with lines of changes in (100, 300] label Feb 26, 2026
…e in parallel exists check

Replace sequential per-future join() with CompletableFuture.allOf().join() to
wait for all futures concurrently inside the try block. This ensures shutdownNow()
only fires after all work is done, preventing interrupted threads when the caller
consumes results. Also adds proper CompletionException handling on the fan-in join.
@vinishjail97 vinishjail97 changed the title fix: Parallelize cloud object existence checks in S3EventsHoodieIncrSource [HUDI-XXXXX] fix: Parallelize cloud object existence checks in S3EventsHoodieIncrSource Mar 2, 2026
@codecov-commenter

Codecov Report

❌ Patch coverage is 0% with 71 lines in your changes missing coverage. Please review.
✅ Project coverage is 57.23%. Comparing base (fb7b1a5) to head (2e92efd).
⚠️ Report is 15 commits behind head on master.

Files with missing lines Patch % Lines
...es/sources/helpers/CloudObjectsSelectorCommon.java 0.00% 65 Missing ⚠️
...pache/hudi/utilities/config/CloudSourceConfig.java 0.00% 6 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #18252      +/-   ##
============================================
- Coverage     57.30%   57.23%   -0.07%     
- Complexity    18561    18602      +41     
============================================
  Files          1945     1948       +3     
  Lines        106256   106732     +476     
  Branches      13131    13199      +68     
============================================
+ Hits          60885    61086     +201     
- Misses        39648    39888     +240     
- Partials       5723     5758      +35     
Flag Coverage Δ
hadoop-mr-java-client 45.21% <ø> (-0.19%) ⬇️
spark-java-tests 47.40% <0.00%> (-0.03%) ⬇️
spark-scala-tests 45.50% <0.00%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown.

Files with missing lines Coverage Δ
...pache/hudi/utilities/config/CloudSourceConfig.java 0.00% <0.00%> (ø)
...es/sources/helpers/CloudObjectsSelectorCommon.java 0.00% <0.00%> (ø)

... and 49 files with indirect coverage changes


@hudi-bot
Collaborator

hudi-bot commented Mar 2, 2026

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

