[WIP] [SPARK-18067] [SQL] SortMergeJoin adds shuffle if join predicates have non partitioned columns #15605
cc @hvanhovell for feedback on the approach. I will polish and add tests if you are fine with this approach.
Test build #67431 has finished for PR 15605 at commit
@tejasapatil this didn't work?
@hvanhovell: It got closed accidentally. There are test case failures that I still have to debug. Happy to hear any comments about the approach.
Test build #67460 has finished for PR 15605 at commit
Most of the tests are failing because The objective of this PR is to make
Fixing that might require changing some core behavior. I want to get feedback or better alternatives before jumping in and putting out a change for review. Let's look at spark/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/EnsureRequirements.scala Line 53 in 39e2bad
The fix would be to track which one of the partitioning in the … I would like to hear opinions about this approach. If there are better alternatives, feel free to share.
@hvanhovell @cloud-fan @gatorsmile @yhuai: Any opinions about the approach described in my previous comment?
This is superseded by #19054. Closing.
What changes were proposed in this pull request?
See https://issues.apache.org/jira/browse/SPARK-18067 for discussion. Putting out a PR to get some feedback about the approach.
Assume that there are two tables with columns `key` and `value`, both hash partitioned over `key`. Assume these are the partitions for the children:
Since we have all the same values of `key` in a given partition, we can evaluate other join predicates like (`tableA.value` = `tableB.value`) right there without needing any shuffle.
What is currently done, i.e. `HashPartitioning(key, value)`, expects rows with the same value of `pmod(hash(key, value))` to be in the same partition and does not take advantage of the fact that we already have rows with the same `key` packed together.
This PR uses `PartitioningCollection` instead of `HashPartitioning` for expected partitioning.
Query:
Before:
After:
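To illustrate the co-partitioning argument above outside of Spark, here is a minimal plain-Python sketch. It is not Spark code and does not use Spark's hash function: the helper names (`partition_by_key`, `join_global`, `join_per_partition`), the sample tables, and the partition count are all hypothetical. It only shows that when both sides are partitioned by `hash(key) % n`, a join on `key` and `value` evaluated partition by partition returns the same rows as a join over the whole data set.

```python
from collections import defaultdict

NUM_PARTITIONS = 4  # assumed partition count, for illustration only


def partition_by_key(rows, num_partitions=NUM_PARTITIONS):
    """Place each (key, value) row in a partition chosen from the key alone."""
    parts = defaultdict(list)
    for key, value in rows:
        # Python's % with a positive modulus is never negative, similar in spirit to pmod.
        parts[hash(key) % num_partitions].append((key, value))
    return parts


def join_global(a, b):
    """Reference join on (key, value) over the full, unpartitioned inputs."""
    return sorted((ka, va) for ka, va in a for kb, vb in b if ka == kb and va == vb)


def join_per_partition(a, b):
    """Join matching partitions only; no row ever moves between partitions."""
    parts_a, parts_b = partition_by_key(a), partition_by_key(b)
    out = []
    for pid, rows_a in parts_a.items():
        out.extend(join_global(rows_a, parts_b.get(pid, [])))
    return sorted(out)


# Hypothetical sample data: (key, value) rows for the two tables.
table_a = [(1, "a"), (1, "b"), (2, "c"), (5, "d")]
table_b = [(1, "a"), (2, "x"), (5, "d"), (6, "a")]

local_result = join_per_partition(table_a, table_b)
global_result = join_global(table_a, table_b)
print(local_result == global_result)
```

Rows that agree on (`key`, `value`) necessarily agree on `key`, so they already sit in the same partition on both sides. That is why the extra predicate on `value` can be checked locally.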
How was this patch tested?
WIP. I need to add tests for:
- … `Shuffle` for such query
- … `PartitioningCollection` and `HashPartitioning` makes sense.