Skip to content

[SPARK-48637][CORE] On-demand shuffle migration peer refresh during decommission#46995

Closed
Ngone51 wants to merge 1 commit intoapache:masterfrom
Ngone51:dc
Closed

[SPARK-48637][CORE] On-demand shuffle migration peer refresh during decommission#46995
Ngone51 wants to merge 1 commit intoapache:masterfrom
Ngone51:dc

Conversation

@Ngone51
Copy link
Member

@Ngone51 Ngone51 commented Jun 16, 2024

What changes were proposed in this pull request?

(This is the first PR that trying to make the BlockManagerDecommissioner event drvien.)

This PR adds functionality for BlockManagerDecommissioner to do the on-demand shuffle migration peer refresh when there is a peer aborted instead of always relying on the periodical refresh thread (i.e., shuffleBlockMigrationRefreshExecutor).

Strictly speaking, in today's solution, when there is queued peers (i.e., ShuffleMigrationRunnable) in shuffleMigrationPool, the peer can also be immediately refilled if there is a peer aborted. So the most useful case for this PR is when there is no queued peers.

Please note that this PR doesn't remove the periodical refresh thread. Because one on-demmand refresh can't guarantee it can always reach the STORAGE_DECOMMISSION_SHUFFLE_MAX_THREADS. (We could probably use a dedicated delayed timer to get rid of the periodical refresh thread totally in the future though.)

Why are the changes needed?

BlockManagerDecommissioner today relies on a periodical thread to refresh the shuffle migration peers and the thread does the refresh every 30s (by default). So when a peer aborted, it can take at most 30s to get a fresh migration peer, which is quite low effeicent.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Add unit test.

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions github-actions bot added the CORE label Jun 16, 2024
@Ngone51 Ngone51 requested a review from dongjoon-hyun June 16, 2024 13:13
@Ngone51
Copy link
Member Author

Ngone51 commented Jun 16, 2024

@holdenk
Copy link
Contributor

holdenk commented Jun 16, 2024

This makes sense to me as an improvement, especially if y'all are seeing failures happen in production often.

@dongjoon-hyun
Copy link
Member

Could you fix the related test failures, @Ngone51 ?

[info] *** 4 TESTS FAILED ***
[error] Failed: Total 4110, Failed 4, Errors 0, Passed 4106, Ignored 10, Canceled 2
[error] Failed tests:
[error] 	org.apache.spark.storage.BlockManagerSuite

@github-actions
Copy link

github-actions bot commented Nov 2, 2024

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Nov 2, 2024
@github-actions github-actions bot closed this Nov 3, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants