[CELEBORN-2331] Parallelize batch open stream client creation by sunchao · Pull Request #3692 · apache/celeborn

sunchao · 2026-05-17T03:22:35Z

Why are the changes needed?

CelebornShuffleReader batches stream-open requests by worker, but it previously created the data client for each worker serially before sending those already-parallel batch requests. When a reducer reads from multiple workers, connection setup for a slow or unavailable worker can delay useful work against the remaining healthy workers.

Parallelizing this setup removes the worker-by-worker wait from the normal path. Because this changes task-side connection scheduling, the optimization also needs an operational fallback that restores the prior behavior without requiring a code rollback.

What changes were proposed in this PR?

The reader now first gathers pending stream-open locations by worker address, then creates one data client per distinct worker concurrently using the existing stream-creator pool. Once client setup completes, it sends the existing BATCH_OPEN_STREAM requests only for workers with an available client, allowing healthy workers to proceed even if another worker fails during setup.
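The three phases described above (group by worker, create clients concurrently, build requests only for workers that have a client) can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `Location`, `Client`, and `buildRequests` are hypothetical stand-ins, and the thread pool is passed in rather than being Celeborn's stream-creator pool.

```scala
import java.util.concurrent.{ConcurrentHashMap, ExecutorService, Future}
import scala.util.control.NonFatal

object ParallelClientSketch {
  // Hypothetical stand-ins for a partition location and a data client.
  final case class Location(host: String, fetchPort: Int, partitionId: Int) {
    def hostPort: String = s"$host:$fetchPort"
  }
  final case class Client(hostPort: String)

  /** Returns, per worker that got a client, the partition ids to open in one batch. */
  def buildRequests(
      locations: Seq[Location],
      createClient: Location => Client,
      pool: ExecutorService): Map[String, Seq[Int]] = {
    // Phase 1: group pending locations by worker address.
    val byWorker: Map[String, Seq[Location]] = locations.groupBy(_.hostPort)

    // Phase 2: submit one setup task per distinct worker before waiting on any.
    val clients = new ConcurrentHashMap[String, Client]()
    val futures: Seq[Future[_]] = byWorker.toSeq.map { case (hostPort, locs) =>
      pool.submit(new Runnable {
        override def run(): Unit = {
          val it = locs.iterator
          var created = false
          // Later locations on the same worker retry after an earlier failure.
          while (!created && it.hasNext) {
            try {
              clients.put(hostPort, createClient(it.next()))
              created = true
            } catch {
              case NonFatal(_) => () // InterruptedException is not NonFatal, so it propagates
            }
          }
        }
      })
    }
    futures.foreach(_.get())

    // Phase 3: only workers with an available client get a batch request.
    byWorker.collect {
      case (hostPort, locs) if clients.containsKey(hostPort) =>
        hostPort -> locs.map(_.partitionId)
    }
  }
}
```

In this sketch a worker whose client could not be created is simply absent from the returned map, so the healthy workers' requests are unaffected.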

The client-creation phase preserves the prior retry behavior for later locations on the same worker when an earlier client attempt fails. It also handles task cancellation explicitly: if the waiting Spark task is interrupted, it restores the interrupt status and cancels unfinished setup work; worker-side interruption is propagated rather than treated as an ordinary retryable failure.
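The cancellation handling on the waiting side might look something like the sketch below. It assumes the setup tasks are plain `java.util.concurrent.Future`s; `awaitAll` is a hypothetical helper name and is not taken from the PR.

```scala
import java.util.concurrent.Future

object InterruptAwareWait {
  /**
   * Waits for every setup future. If the waiting thread is interrupted,
   * cancels the unfinished work, restores the interrupt flag, and returns false.
   * An ExecutionException from a failed task is not handled here and propagates.
   */
  def awaitAll(futures: Seq[Future[_]]): Boolean =
    try {
      futures.foreach(_.get())
      true
    } catch {
      case _: InterruptedException =>
        // Stop setup work that is still queued or running.
        futures.foreach(f => if (!f.isDone) f.cancel(true))
        // Restore the flag so callers further up can still observe the interruption.
        Thread.currentThread().interrupt()
        false
    }
}
```

Restoring the flag matters because catching `InterruptedException` clears it; without the restore, code further up the stack would not see that the task was asked to stop.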

This optimization is controlled by celeborn.client.spark.batch.openStream.parallelClientCreation.enabled, which defaults to true. Setting it to false selects the original serial client-creation and request-building flow, giving deployments a targeted rollback switch if parallel connection setup causes unexpected operational behavior.
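As a rough illustration of how such a switch selects between the two flows, the sketch below reads the key from a plain `Map`. The key name and its default of true come from the description above; the helper names and the map-based lookup are assumptions for illustration and are not Celeborn's configuration API.

```scala
object ParallelCreationSwitch {
  // Key name as given in the PR description.
  val Key = "celeborn.client.spark.batch.openStream.parallelClientCreation.enabled"

  /** Hypothetical lookup: an absent key falls back to the documented default of true. */
  def parallelEnabled(conf: Map[String, String]): Boolean =
    conf.get(Key).map(_.trim.toBoolean).getOrElse(true)

  /** Picks the flow name; real code would dispatch to the two reader paths instead. */
  def chooseFlow(conf: Map[String, String]): String =
    if (parallelEnabled(conf)) "parallel" else "serial"
}
```

For example, `ParallelCreationSwitch.chooseFlow(Map(ParallelCreationSwitch.Key -> "false"))` selects the serial flow, while an empty map selects the parallel one.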

How was this PR tested?

- Unit tests for parallel client setup, failure/retry handling, cancellation on interruption, and the new configuration default and override.
- Configuration documentation generation validation for the new client setting.
- Spotless formatting validation.

Copilot

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

Copilot

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

RexXiong · 2026-05-25T13:20:50Z

Overall: Clean refactoring that parallelizes data-client creation across workers. The separation into group → parallel create → build request is well-structured, and the test coverage is thorough (parallel execution, failure isolation, retry, interruption).

1. Redundant retries with identical connection parameters

createClientsInParallel iterates through all locations for a given hostPort until createClient succeeds. However, all locations at the same hostPort share identical host and fetchPort, so each retry is an identical connection attempt:

while (!clientCreated && locationsIterator.hasNext) {
  val location = locationsIterator.next()
  try {
    clientsByHostPort.put(hostPort, createClient(location))  // same host:port every time
    clientCreated = true
  } catch { ... }
}

For a reducer reading 1000 partitions from a failing worker, this means up to 1000 identical connection attempts (each paying the full connection timeout). Since futures.foreach(_.get()) blocks until ALL futures complete, one slow-failing worker would dominate the entire parallel creation phase — partially defeating the purpose of parallelization.

Consider either:

- Taking only the first location per hostPort for client creation (the original behavior was also 1 attempt)
- Adding a configurable max retry count (e.g., 2-3 attempts per hostPort)
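The second option could be sketched along these lines. This is only an illustration of bounding attempts per hostPort; `createWithBound` and `maxAttempts` are hypothetical names, not an existing Celeborn setting or method.

```scala
import scala.util.control.NonFatal

object BoundedRetrySketch {
  /**
   * Calls `attempt` (passed the 1-based attempt number) until it succeeds
   * or `maxAttempts` is reached. Returns the first success, or None.
   */
  def createWithBound[A](maxAttempts: Int)(attempt: Int => A): Option[A] = {
    var tries = 0
    var result: Option[A] = None
    while (result.isEmpty && tries < maxAttempts) {
      tries += 1
      try {
        result = Some(attempt(tries))
      } catch {
        case NonFatal(_) => () // swallow and retry; interrupts are not NonFatal and propagate
      }
    }
    result
  }
}
```

With a bound like this, a failing worker would cost at most `maxAttempts` connection timeouts regardless of how many locations it hosts.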

2. +1 to SteNicholas's suggestion for a config switch

A celeborn.client.spark.batch.openStream.parallelClientCreation.enabled (or similar) flag would be prudent for a new parallelization path. This allows users to fall back to serial behavior if unexpected issues arise in production. Could default to true.

3. Minor: onClientCreateFailure invoked per failed location

In the original code, excludeFailedFetchLocation was called at most once per failing worker. Now it can be called N times (once per location at that hostPort). While excludeFailedFetchLocation is idempotent (ConcurrentHashMap), the associated logWarning will emit N log lines for the same worker, which could be noisy for large reducers. This ties back to point 1 — bounding retries would also bound log spam.
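Independently of bounding retries, one way to keep the warning volume at one line per worker is to remember which workers have already been reported. The sketch below is an assumption-laden illustration: `FailureLogger` and `recordFailure` are hypothetical names, and the logger is a plain function rather than Celeborn's logging.

```scala
import java.util.concurrent.ConcurrentHashMap

/** Emits at most one warning per hostPort, however many locations fail on it. */
final class FailureLogger(log: String => Unit) {
  // Thread-safe set of workers that have already produced a warning.
  private val warned = ConcurrentHashMap.newKeySet[String]()

  def recordFailure(hostPort: String): Unit =
    // add returns true only for the first insertion of a given hostPort.
    if (warned.add(hostPort)) log(s"Failed to create client for $hostPort")
}
```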

Reviewed with Claude Code

sunchao · 2026-05-25T18:34:06Z

@RexXiong Thanks for reviewing. The requested rollback switch is added in d44fa47 as celeborn.client.spark.batch.openStream.parallelClientCreation.enabled, defaulting to true; setting it to false restores the pre-change serial client-creation and request-building path. For the retry and warning-count point, I kept the existing semantics intentionally: before this PR, if createClient failed for a location, workerRequestMap remained empty and a later location on the same hostPort attempted createClient again, with failure handling per failed attempt. Reducing this to one attempt would be a separate behavioral change rather than preserving the existing reader behavior.
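The pre-change semantics described in this reply can be illustrated with a small sketch of a serial loop in which a failed `createClient` leaves no entry for the worker, so the next location on that hostPort tries again. The types and the attempt counter are hypothetical and only mirror the behavior as described here, not the actual reader code.

```scala
import scala.collection.mutable
import scala.util.control.NonFatal

object SerialFlowSketch {
  final case class Location(host: String, fetchPort: Int, partitionId: Int) {
    def hostPort: String = s"$host:$fetchPort"
  }

  /** Returns the workers that ended up with a request entry and the number of createClient attempts. */
  def serialCreate(
      locations: Seq[Location],
      createClient: Location => Unit): (Set[String], Int) = {
    val workerRequestMap = mutable.LinkedHashMap.empty[String, mutable.ArrayBuffer[Location]]
    var attempts = 0
    locations.foreach { loc =>
      workerRequestMap.get(loc.hostPort) match {
        case Some(pending) =>
          pending += loc // a client already exists for this worker
        case None =>
          attempts += 1
          try {
            createClient(loc)
            workerRequestMap(loc.hostPort) = mutable.ArrayBuffer(loc)
          } catch {
            case NonFatal(_) => () // no entry is stored, so a later location retries
          }
      }
    }
    (workerRequestMap.keySet.toSet, attempts)
  }
}
```

Under these assumptions, three locations on a failing worker and two on a healthy one produce four attempts in total: three failures plus one success.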

SteNicholas · 2026-05-26T02:23:42Z

Thanks. Merged to main (v0.7.0).

[CELEBORN-2331] Parallelize batch open stream client creation

eac81a4

github-actions Bot added module:client module:spark labels May 17, 2026

sunchao marked this pull request as ready for review May 18, 2026 16:16

SteNicholas requested a review from Copilot May 19, 2026 18:13

Copilot started reviewing on behalf of SteNicholas May 19, 2026 18:13

Copilot AI reviewed May 19, 2026

SteNicholas requested a review from Copilot May 21, 2026 02:30

Copilot started reviewing on behalf of SteNicholas May 21, 2026 02:30

Copilot AI reviewed May 21, 2026

Comment thread ...t-spark/spark-3/src/main/scala/org/apache/spark/shuffle/celeborn/CelebornShuffleReader.scala Outdated

Preserve batch open client retry behavior

5f9d650

SteNicholas requested a review from Copilot May 24, 2026 04:31

Copilot started reviewing on behalf of SteNicholas May 24, 2026 04:32

Copilot AI reviewed May 24, 2026

Comment thread ...t-spark/spark-3/src/main/scala/org/apache/spark/shuffle/celeborn/CelebornShuffleReader.scala Outdated

Comment thread ...t-spark/spark-3/src/main/scala/org/apache/spark/shuffle/celeborn/CelebornShuffleReader.scala

[CELEBORN-2331] Preserve interrupt handling during client creation

17d5afa

SteNicholas reviewed May 25, 2026

Comment thread ...t-spark/spark-3/src/main/scala/org/apache/spark/shuffle/celeborn/CelebornShuffleReader.scala Outdated

[CELEBORN-2331] Add parallel client creation switch

d44fa47

github-actions Bot added kind:documentation module:common labels May 25, 2026

SteNicholas approved these changes May 26, 2026

RexXiong approved these changes May 26, 2026

SteNicholas closed this in 759e7b5 May 26, 2026

sunchao mentioned this pull request Jun 25, 2026

[CELEBORN-2371] Bound Spark batch-open client creation retries and stop them on interruption #3746

Open

sunchao wants to merge 4 commits into apache:main from sunchao:dev/chao/codex/port-pr72-to-oss-main

4 participants
