[CELEBORN-1577][BUG] Quota cancel shuffle should use app shuffle id #3662

Open
s0nskar wants to merge 6 commits into apache:main from s0nskar:fix_quota_shuffle_id

@s0nskar
Contributor

s0nskar commented Apr 13, 2026

What changes were proposed in this pull request?

  • Added a new mapping for celebornShuffleId -> appShuffleId
  • cancelAllActiveStages should pass appShuffleId, not celebornShuffleId

Why are the changes needed?

shuffleAllocatedWorkers contains celebornShuffleId values, but cancellation must use appShuffleId because DAGScheduler only understands app shuffle IDs.
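
A minimal sketch of the idea, not the actual patch: `ShuffleIdTracker`, `onShuffleIdGenerated`, and the `cancelStages` callback are hypothetical stand-ins for the relevant `LifecycleManager` state. It only illustrates recording the celebornShuffleId -> appShuffleId mapping when an ID is generated and translating IDs before calling back into Spark.

```scala
import scala.collection.concurrent.TrieMap

// Hypothetical stand-in for the relevant LifecycleManager state; not Celeborn's real class.
class ShuffleIdTracker(cancelStages: Set[Int] => Unit) {
  // celebornShuffleId -> appShuffleId, recorded when a Celeborn shuffle ID is generated.
  private val celebornToApp = TrieMap.empty[Int, Int]
  // Celeborn shuffle IDs that currently have workers allocated (assumed bookkeeping).
  private val active = TrieMap.empty[Int, Unit]

  def onShuffleIdGenerated(appShuffleId: Int, celebornShuffleId: Int): Unit = {
    celebornToApp.put(celebornShuffleId, appShuffleId)
    active.put(celebornShuffleId, ())
  }

  // Translate the active Celeborn IDs to app IDs before invoking the Spark-side callback.
  // IDs without a recorded mapping are skipped rather than passed through untranslated.
  def cancelAllActiveStages(): Unit = {
    val appIds: Set[Int] = active.keys.flatMap(id => celebornToApp.get(id)).toSet
    cancelStages(appIds)
  }
}
```

The sketch assumes that several Celeborn shuffle IDs may map to the same app shuffle ID, so the result is collected into a set and each app shuffle ID is cancelled at most once.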

Does this PR resolve a correctness bug?

No

Does this PR introduce any user-facing change?

No

How was this patch tested?

NA

@s0nskar
Contributor Author

s0nskar commented Apr 13, 2026

cc: @leixm PTAL


Copilot AI left a comment

Pull request overview

This PR updates quota-triggered shuffle/stage cancellation to use Spark’s app shuffle ID (the one understood by DAGScheduler) by tracking a mapping from Celeborn-generated shuffle IDs to app shuffle IDs.

Changes:

  • Introduced a celebornShuffleId -> appShuffleId mapping in LifecycleManager.
  • Populated the mapping when generating new Celeborn shuffle IDs.
  • Updated cancelAllActiveStages to translate active Celeborn shuffle IDs to app shuffle IDs before invoking the Spark cancel callback.

@SteNicholas
Member

SteNicholas commented Apr 13, 2026

@s0nskar, please first resolve the following compilation failure and address Copilot's comments.

Error:  /home/runner/work/celeborn/celeborn/client/src/main/scala/org/apache/celeborn/client/LifecycleManager.scala:2056: inferred type arguments [Integer,Unit] do not conform to method toSet's type parameter bounds [B >: Int,U]
Error:  /home/runner/work/celeborn/celeborn/client/src/main/scala/org/apache/celeborn/client/LifecycleManager.scala:2056: type mismatch;
 found   : Integer => Unit
 required: B => U
[INFO] : Integer => Unit <: B => U?
[INFO] : false
Error: [ERROR] two errors found
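
The failing line itself is not shown in this thread, so the following is only a guess at the pattern. The message says a function of type `Integer => Unit` was supplied where the compiler needed a function over a supertype of `Int`, and `java.lang.Integer` is not a supertype of `scala.Int`. One way to avoid that kind of mismatch is to pin the element type to `Int` and box explicitly at the call boundary. `BoxingSketch`, `cancelAll`, and the `Consumer[Integer]` callback are hypothetical names used for illustration.

```scala
import java.util.function.Consumer

object BoxingSketch {
  // Hypothetical callback shaped like a Java API that accepts boxed integers.
  def cancelAll(appShuffleIds: Iterable[Int], cancel: Consumer[Integer]): Unit = {
    // Fix the element type to Int first so toSet's type parameter is not inferred
    // from the function passed afterwards.
    val unique: Set[Int] = appShuffleIds.toSet
    // Box each Int explicitly when handing it to the Java-style callback.
    unique.foreach(id => cancel.accept(Int.box(id)))
  }
}
```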

@s0nskar
Contributor Author

s0nskar commented Apr 13, 2026

@SteNicholas working on it!

@RexXiong
Contributor

Thanks for the fix. A suggestion:

Encapsulation of mapping cleanup: I noticed that celebornShuffleIdToAppShuffleIdMap.remove() is only added in two places, but unregisterShuffle may be called from other locations as well. Those call sites would miss the mapping cleanup.

Consider moving celebornShuffleIdToAppShuffleIdMap.remove(shuffleId) inside the unregisterShuffle method to ensure the mapping is always cleaned up when a shuffle is unregistered. This provides better encapsulation and avoids potential memory leaks or stale mappings.
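
A minimal sketch of this suggestion, assuming a much simplified registry: `ShuffleRegistry`, `registerShuffle`, and `mappingSize` are hypothetical, and the real `unregisterShuffle` in `LifecycleManager` does considerably more. The point is only that the mapping removal lives inside `unregisterShuffle`, so no call site can forget it.

```scala
import scala.collection.concurrent.TrieMap

// Hypothetical, simplified registry; not the actual LifecycleManager.
class ShuffleRegistry {
  private val registered = TrieMap.empty[Int, Unit]
  private val celebornShuffleIdToAppShuffleIdMap = TrieMap.empty[Int, Int]

  def registerShuffle(celebornShuffleId: Int, appShuffleId: Int): Unit = {
    registered.put(celebornShuffleId, ())
    celebornShuffleIdToAppShuffleIdMap.put(celebornShuffleId, appShuffleId)
  }

  // Every caller goes through this method, so the mapping entry is always
  // removed together with the shuffle itself.
  def unregisterShuffle(celebornShuffleId: Int): Unit = {
    registered.remove(celebornShuffleId)
    celebornShuffleIdToAppShuffleIdMap.remove(celebornShuffleId)
  }

  // Exposed only so the sketch can be checked.
  def mappingSize: Int = celebornShuffleIdToAppShuffleIdMap.size
}
```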

4 participants