[CELEBORN-1577][BUG] Quota cancel shuffle should use app shuffle id #3662
s0nskar wants to merge 6 commits into apache:main
cc: @leixm PTAL
Pull request overview
This PR updates quota-triggered shuffle/stage cancellation to use Spark’s app shuffle ID (the one understood by DAGScheduler) by tracking a mapping from Celeborn-generated shuffle IDs to app shuffle IDs.
Changes:
- Introduced a `celebornShuffleId -> appShuffleId` mapping in `LifecycleManager`.
- Populated the mapping when generating new Celeborn shuffle IDs.
- Updated `cancelAllActiveStages` to translate active Celeborn shuffle IDs to app shuffle IDs before invoking the Spark cancel callback.
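The translation step described in the overview can be sketched roughly as follows. This is a minimal illustration, not the code in the PR: `ShuffleIdTracker`, `registerShuffle`, and the `(Int, String) => Unit` callback shape are hypothetical stand-ins, and only `celebornShuffleId`, `appShuffleId`, and `cancelAllActiveStages` are names taken from this page.

```scala
import scala.collection.concurrent.TrieMap

// Hypothetical stand-in for the relevant part of LifecycleManager.
// cancelCallback is assumed to take (appShuffleId, reason); the real callback type may differ.
class ShuffleIdTracker(cancelCallback: (Int, String) => Unit) {
  // celebornShuffleId -> appShuffleId, filled in when a Celeborn shuffle id is generated.
  private val celebornToAppShuffleId = TrieMap.empty[Int, Int]
  // Celeborn shuffle ids that currently have allocated workers (illustrative bookkeeping).
  private val activeCelebornShuffleIds = TrieMap.empty[Int, Unit]

  def registerShuffle(celebornShuffleId: Int, appShuffleId: Int): Unit = {
    celebornToAppShuffleId.put(celebornShuffleId, appShuffleId)
    activeCelebornShuffleIds.put(celebornShuffleId, ())
  }

  // Translate active Celeborn shuffle ids to app shuffle ids before cancelling,
  // because the scheduler side only understands app shuffle ids.
  def cancelAllActiveStages(reason: String): Unit = {
    val appShuffleIds = activeCelebornShuffleIds.keys.toSeq
      .flatMap(id => celebornToAppShuffleId.get(id)) // skip ids with no known mapping
      .distinct // several Celeborn ids may map to the same app shuffle id
    appShuffleIds.foreach(id => cancelCallback(id, reason))
  }
}
```

In this sketch, several Celeborn shuffle IDs that belong to the same app shuffle trigger a single cancel call, and IDs without a mapping are skipped rather than passed through untranslated.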
client/src/main/scala/org/apache/celeborn/client/LifecycleManager.scala
@s0nskar, you'd better first resolve the compilation failure and address Copilot's comments.
@SteNicholas working on it!
Thanks for the fix. A suggestion: Encapsulation of mapping cleanup: I noticed that Consider moving
What changes were proposed in this pull request?
Why are the changes needed?
`shuffleAllocatedWorkers` contains `celebornShuffleId`, but we need to use `appShuffleId` because DAGScheduler only understands the app shuffle id.
Does this PR resolve a correctness bug?
No
Does this PR introduce any user-facing change?
No
How was this patch tested?
NA