[WIP][SPARK-44243][CORE] Add a parameter to determine the locality of local shuffle reader#41786

Closed
wankunde wants to merge 1 commit into apache:master from wankunde:local_shuffle_locality

@wankunde
Contributor

What changes were proposed in this pull request?

Follow-up to #40339.

The local shuffle reader performs better when it has preferred locations. If SHUFFLE_REDUCE_LOCALITY_ENABLE is disabled in a query that includes both reduce shuffles and local shuffles, the local shuffle readers cannot get preferred locations.

Add a new parameter, LOCAL_SHUFFLE_LOCALITY_ENABLE, that determines whether to get the preferred locations of the current partitionSpec.
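
The intended gating can be illustrated with a small standalone model. This is only a sketch under stated assumptions, not the code in this patch: `LocalityConf`, `preferred_locations`, and the `is_local_read` argument are hypothetical names, and the sketch assumes that the new flag alone controls local shuffle reads while the existing flag keeps controlling reduce-side reads.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class LocalityConf:
    """Hypothetical stand-in for the two flags discussed in this PR."""

    # Models SHUFFLE_REDUCE_LOCALITY_ENABLE (spark.shuffle.reduceLocality.enabled).
    reduce_locality_enabled: bool
    # Models the proposed LOCAL_SHUFFLE_LOCALITY_ENABLE flag.
    local_shuffle_locality_enabled: bool


def preferred_locations(
    conf: LocalityConf, is_local_read: bool, map_output_hosts: List[str]
) -> List[str]:
    """Return the hosts to prefer for one partition spec, or [] for no preference.

    Assumption: local reads consult only the new flag, and reduce-side
    reads consult only the existing flag.
    """
    if is_local_read:
        enabled = conf.local_shuffle_locality_enabled
    else:
        enabled = conf.reduce_locality_enabled
    if not enabled:
        return []
    # De-duplicate hosts and sort them so the result is deterministic.
    return sorted(set(map_output_hosts))
```

Under this model, a query can turn off reduce-side locality and still give its local shuffle readers preferred locations, which is the case the description above is concerned with.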

Why are the changes needed?

Performance improvement for Spark local shuffle reads.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Existing unit tests.

@mridulm
Contributor

mridulm commented Jun 30, 2023

If you want locality for shuffle, enable spark.shuffle.reduceLocality.enabled. Why introduce another config for this?

@wankunde changed the title from "[SPARK-44243][CORE] Add a parameter to determine the locality of local shuffle reader" to "[WIP][SPARK-44243][CORE] Add a parameter to determine the locality of local shuffle reader" on Jun 30, 2023
@HyukjinKwon
Member

cc @maryannxue FYI

@maryannxue
Contributor

maryannxue commented Jul 3, 2023

What's the use of the new conf? How does it improve locality? Isn't it just enough to do https://github.com/apache/spark/pull/41786/files#diff-a3b15298f97577c1fadcc2d76d015eebd6343e246c6717417d33f3c458847f46L1133?

@wankunde
Contributor Author

wankunde commented Jul 3, 2023

Thanks @mridulm @maryannxue for your review.

Consider a query that contains shuffle A and shuffle B, where the OptimizeSkewedJoin optimization produces many PartialReducerPartitions and shuffle B is read locally. Enabling spark.shuffle.reduceLocality.enabled may then take some extra time to get the preferred locations.
However, our production environment limits the number of PartialReducerPartitions, so I am fine with not making this change.

@wankunde wankunde closed this Jul 5, 2023

4 participants