[SPARK-44635][CORE] Refresh shuffle locations when decommission happens #43443
ukby1234 wants to merge 1 commit into apache:master
Conversation
There is an ongoing PR #42296 for handling this.
I noticed that PR has been stale for a while, and I found a couple of things:
Not sure if you want me to work on top of that PR's fork, or if it's okay to submit a PR like this.
Unless the other PR has been abandoned, it is better to continue the discussion there. On the other hand, if the author has indicated that they are not interested in continuing with it, we can look at taking it over.
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
What changes were proposed in this pull request?
When a shuffle fetch fails, we should refresh the map output locations to check whether the shuffle blocks have been migrated to other locations. This increases the stability of Spark programs.
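For illustration, here is a minimal, self-contained Scala sketch of the idea, not the actual Spark internals: the types and method names (BlockLocation, MapOutputService, RefreshingFetcher, latestLocation) are hypothetical. Instead of failing the stage on the first fetch error, the reader re-queries the driver for the block's current location, since the block may have been migrated off a decommissioned executor.

```scala
// Illustrative sketch only: all names below are hypothetical simplifications
// of the refresh-on-fetch-failure idea, not Spark's real shuffle machinery.

case class BlockLocation(host: String, port: Int)

trait MapOutputService {
  // Asks the driver again for the latest known location of a shuffle block.
  def latestLocation(shuffleId: Int, mapId: Long): BlockLocation
}

class RefreshingFetcher(
    tracker: MapOutputService,
    fetch: BlockLocation => Array[Byte],
    maxRefreshes: Int = 1) {

  def fetchBlock(shuffleId: Int, mapId: Long, cached: BlockLocation): Array[Byte] = {
    var location = cached
    var refreshesLeft = maxRefreshes
    while (true) {
      try {
        return fetch(location)
      } catch {
        case _: java.io.IOException if refreshesLeft > 0 =>
          // The block may have been migrated during decommission: refresh its
          // location from the driver and retry, instead of failing immediately.
          refreshesLeft -= 1
          location = tracker.latestLocation(shuffleId, mapId)
      }
    }
    throw new IllegalStateException("unreachable")
  }
}
```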
Why are the changes needed?
This improves the stability of Spark programs when executor decommissioning is enabled, by refreshing the map output locations on fetch failures.
Does this PR introduce any user-facing change?
Yes. This PR introduces a flag, spark.storage.decommission.shuffleBlocks.refreshLocationsEnabled, so users can opt in to this feature.
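For reference, a sketch of how a user might opt in once this PR is merged. The last key is the new flag proposed here; the other keys are assumed to be the standard decommission settings available in recent Spark releases.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Enable executor and storage decommissioning, then opt in to the new
// location-refresh behavior proposed in this PR (last key is the new flag).
val conf = new SparkConf()
  .set("spark.decommission.enabled", "true")
  .set("spark.storage.decommission.enabled", "true")
  .set("spark.storage.decommission.shuffleBlocks.enabled", "true")
  // New in this PR: refresh map output locations on fetch failure so tasks
  // can find shuffle blocks migrated off decommissioned executors.
  .set("spark.storage.decommission.shuffleBlocks.refreshLocationsEnabled", "true")

val spark = SparkSession.builder().config(conf).getOrCreate()
```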
How was this patch tested?
Added unit tests. This patch has also been running in our production cluster for 2 months.
Was this patch authored or co-authored using generative AI tooling?
No