[SPARK-44635][CORE] Refresh shuffle locations when decommission happens by ukby1234 · Pull Request #43443 · apache/spark

ukby1234 · 2023-10-18T23:08:51Z

What changes were proposed in this pull request?

When a shuffle fetch fails, we should refresh the map output locations to check whether the shuffle blocks have been migrated to other locations. This increases the stability of Spark programs.

Why are the changes needed?

This improves the stability of Spark programs when executor decommissioning is enabled. It does so by refreshing the map output locations on fetch failures.
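To make the idea concrete, here is a minimal toy model of "refresh on fetch failure", written in plain Python rather than against Spark's internals. Every name in it (`ToyCluster`, `FetchFailed`, `fetch_with_refresh`) is hypothetical and exists only for illustration; it is a sketch of the general technique, not the code in this patch.

```python
from dataclasses import dataclass, field


class FetchFailed(Exception):
    """Raised when a shuffle block cannot be read from the given executor."""


@dataclass
class ToyCluster:
    # Driver-side view: map_id -> executor that currently holds the map output.
    driver_locations: dict = field(default_factory=dict)
    # Executors that are still alive and able to serve blocks.
    live_executors: set = field(default_factory=set)

    def fetch(self, map_id, executor):
        holds_block = self.driver_locations.get(map_id) == executor
        if executor not in self.live_executors or not holds_block:
            raise FetchFailed(f"map {map_id} not available on {executor}")
        return f"block-{map_id}@{executor}"


def fetch_with_refresh(cluster, cached_locations, map_id, refresh_enabled=True):
    """Try the cached location first; on failure, refresh once and retry."""
    stale = cached_locations[map_id]
    try:
        return cluster.fetch(map_id, stale)
    except FetchFailed:
        if not refresh_enabled:
            raise
        refreshed = cluster.driver_locations.get(map_id)
        if refreshed is None or refreshed == stale:
            raise  # the driver knows nothing newer, so the failure stands
        cached_locations[map_id] = refreshed
        return cluster.fetch(map_id, refreshed)


# The reducer caches locations, then exec-1 is decommissioned and its
# shuffle output migrates to exec-2.
cluster = ToyCluster({0: "exec-1"}, {"exec-1", "exec-2"})
cached = dict(cluster.driver_locations)
cluster.driver_locations[0] = "exec-2"
cluster.live_executors.discard("exec-1")

print(fetch_with_refresh(cluster, cached, 0))  # block-0@exec-2
```

With `refresh_enabled=False` the same call re-raises `FetchFailed`, which in this toy model stands in for the behavior without the opt-in.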

Does this PR introduce any user-facing change?

Yes. It introduces the flag spark.storage.decommission.shuffleBlocks.refreshLocationsEnabled so users can opt in to this feature.
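As a rough illustration of opting in, the snippet below builds `--conf` arguments for `spark-submit`. The first three keys are existing Spark decommission settings; the last one is the flag proposed in this PR and is assumed to be recognized only by a build that includes this patch. The helper `to_submit_args` is a hypothetical name used for this example.

```python
# Settings to pass to spark-submit as "--conf key=value".
decommission_conf = {
    "spark.decommission.enabled": "true",
    "spark.storage.decommission.enabled": "true",
    "spark.storage.decommission.shuffleBlocks.enabled": "true",
    # Proposed by this PR only; assumed unavailable without this patch.
    "spark.storage.decommission.shuffleBlocks.refreshLocationsEnabled": "true",
}


def to_submit_args(conf):
    """Flatten a config dict into a list of spark-submit arguments."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


print(" ".join(to_submit_args(decommission_conf)))
```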

How was this patch tested?

Added unit tests. This patch has also been running in our production cluster for 2 months.

Was this patch authored or co-authored using generative AI tooling?

No


mridulm · 2023-10-19T07:18:14Z

There is an ongoing PR #42296 for handling this.

ukby1234 · 2023-10-19T20:53:16Z

I noticed that PR has been stale for a while, and I found a couple of things:

We can't refresh locations in a Netty callback because it will cause deadlocks (the map locations are sent via broadcast variables)
Addressed issues to support fallback storage reads
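The deadlock concern in the first point can be sketched generically. In the toy model below, a single-threaded executor stands in for the network event loop: it runs callbacks and also delivers incoming data. One common way to avoid blocking that thread, and not necessarily what this PR does, is to hand the blocking refresh to a separate pool. All names here (`event_loop`, `refresh_pool`, `refresh_locations`, `on_fetch_failure`) are hypothetical.

```python
import queue
from concurrent.futures import ThreadPoolExecutor

# Stand-in for the single network event-loop thread: it runs callbacks and
# also delivers incoming data.
event_loop = ThreadPoolExecutor(max_workers=1, thread_name_prefix="event-loop")
# Separate pool where blocking work is allowed.
refresh_pool = ThreadPoolExecutor(max_workers=1, thread_name_prefix="refresh")


def refresh_locations():
    """Blocking refresh: waits for data that only the event loop can deliver."""
    inbox = queue.Queue()
    event_loop.submit(inbox.put, {0: "exec-2"})  # delivery runs on the event loop
    return inbox.get(timeout=5)


def on_fetch_failure():
    """Runs on the event loop, so it must return quickly and never block."""
    # Calling refresh_locations() directly here would block the only
    # event-loop thread on data that the same thread has to deliver.
    return refresh_pool.submit(refresh_locations)


# The callback returns a Future immediately; the refresh completes elsewhere.
pending = event_loop.submit(on_fetch_failure).result()
print(pending.result(timeout=5))  # {0: 'exec-2'}

event_loop.shutdown()
refresh_pool.shutdown()
```

If `on_fetch_failure` called `refresh_locations()` inline, the queued delivery could never run, and in this model the wait would only end when the 5-second timeout raises `queue.Empty`.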

Not sure if you want me to work on top of that PR's fork or if it's okay to submit a PR like this.

mridulm · 2023-10-20T03:18:10Z

Unless the other PR has been abandoned, it is better to continue discussion there.
If there are specific issues found with that work, please do review and suggest changes.

On the other hand, if the author has indicated that they are not interested in continuing with it, we can look at taking it over.

github-actions · 2024-01-29T00:18:45Z

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

For SPARK-44635, try to refresh map output locations for migrated shu…

ee83e5a

…ffles

github-actions bot added the CORE label Oct 18, 2023

ukby1234 mentioned this pull request Oct 20, 2023

[SPARK-44635][CORE] Handle shuffle fetch failures in decommissions #42296

Closed

github-actions bot added the Stale label Jan 29, 2024

github-actions bot closed this Jan 30, 2024

ukby1234 wants to merge 1 commit into apache:master from ukby1234:SPARK-44635


2 participants
