
[SPARK-44635][CORE] Refresh shuffle locations when decommission happens#43443

Closed
ukby1234 wants to merge 1 commit into apache:master from ukby1234:SPARK-44635


@ukby1234
Contributor

What changes were proposed in this pull request?

When a shuffle fetch fails, we should refresh the map output locations to check whether the shuffle blocks have been migrated to other locations. This increases the stability of Spark programs.
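The idea can be illustrated with a minimal, self-contained sketch. This is not the code from this patch and does not use Spark's APIs: `fetch_with_refresh`, `FetchFailed`, and the `refresh`/`fetch` callables are hypothetical names chosen for illustration, and the single retry is an assumption.

```python
from typing import Callable, Dict


class FetchFailed(Exception):
    """Raised when a block cannot be read from the given executor."""


def fetch_with_refresh(
    block_id: str,
    cached: Dict[str, str],
    refresh: Callable[[], Dict[str, str]],
    fetch: Callable[[str, str], bytes],
) -> bytes:
    """Fetch a block; on failure, refresh locations and retry once if it moved."""
    old_location = cached[block_id]
    try:
        return fetch(block_id, old_location)
    except FetchFailed:
        # Ask the (hypothetical) tracker for up-to-date locations.
        cached.update(refresh())
        new_location = cached[block_id]
        if new_location == old_location:
            # The block did not migrate, so surface the original failure.
            raise
        return fetch(block_id, new_location)
```

If the refreshed location is unchanged, the original failure propagates, so the caller's existing fetch-failure handling still applies.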

Why are the changes needed?

This improves the stability of Spark programs when executor decommissioning is enabled, by refreshing the map output locations on fetch failures.

Does this PR introduce any user-facing change?

Yes, this introduces the flag spark.storage.decommission.shuffleBlocks.refreshLocationsEnabled so that users can opt in to this feature.

How was this patch tested?

Added unit tests. This patch has also been running in our production cluster for 2 months.

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions github-actions bot added the CORE label Oct 18, 2023
@mridulm
Contributor

mridulm commented Oct 19, 2023

There is an ongoing PR #42296 for handling this.

@ukby1234
Contributor Author

I noticed that PR has been stale for a while, and I found a couple of things:

  1. We can't refresh locations inside a Netty callback because it will cause deadlocks (map locations are sent via broadcast variables)
  2. Addressed issues to support fallback storage reads
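To illustrate the first point, here is a toy model of the deadlock and one way around it. It assumes the blocking refresh needs a reply that is delivered by the same thread the callback runs on. The names `event_loop`, `refresh_pool`, `fetch_locations_via_loop`, and `on_fetch_failure` are hypothetical, and this is neither Netty nor Spark code.

```python
from concurrent.futures import Future, ThreadPoolExecutor

# Single-threaded pool standing in for a network event loop (hypothetical).
event_loop = ThreadPoolExecutor(max_workers=1)
# Separate pool for blocking work such as a location refresh (hypothetical).
refresh_pool = ThreadPoolExecutor(max_workers=1)


def fetch_locations_via_loop(block_id: str) -> dict:
    """Blocking call whose reply is delivered by the event-loop thread."""
    reply = event_loop.submit(lambda: {block_id: "exec-B"})
    return reply.result(timeout=5)


def on_fetch_failure(block_id: str) -> Future:
    """Runs on the event-loop thread, so it must not block on the loop."""
    # Calling fetch_locations_via_loop() directly here would wait on a task
    # queued behind this very callback, so it could never complete.
    return refresh_pool.submit(fetch_locations_via_loop, block_id)


def run_demo() -> dict:
    # Deliver the failure callback on the event-loop thread.
    handoff = event_loop.submit(on_fetch_failure, "block-1").result(timeout=5)
    # The refresh itself completes on the separate pool.
    return handoff.result(timeout=5)
```

Handing the refresh off to a separate pool frees the callback thread immediately, so the reply the refresh is waiting for can still be delivered.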

I'm not sure whether you want me to work on top of that PR's fork or whether it's okay to submit a PR like this.

@mridulm
Contributor

mridulm commented Oct 20, 2023

Unless the other PR has been abandoned, it is better to continue discussion there.
If there are specific issues found with that work, please do review and suggest changes.

On the other hand, if the author has indicated that they are not interested in continuing with it, we can look at taking it over.

@github-actions

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Jan 29, 2024
@github-actions github-actions bot closed this Jan 30, 2024