[SPARK-32420][SQL] Add handling for unique key in non-codegen hash join #29216
Test build #126467 has finished for PR 29216 at commit
retest this please
Test build #126501 has finished for PR 29216 at commit
cc @cloud-fan and @sameeragarwal if you guys can help take a look. Thanks!
matches.map(joinRow.withRight(_)).filter(boundCondition)
} else {
Seq.empty
Do we already have a test for inner join?
})
result.setBoolean(0, exists)
joinedRow(current, result)
And test for existenceJoin?
@viirya - existence join is tricky and I will try to add one. But besides testing, I am wondering what you think of this PR. Thanks.
@viirya - I just found that the test "single unique condition (equal) for left Anti join" in ExistenceJoinSuite already covers existence join, so we should be good with existence join. More historical context for existence join is in this PR, just FYI.
@viirya - given that tests already cover all joins, could you review the core logic to help move this forward? Thanks.
thanks, merging to master!
Thanks @cloud-fan and @viirya for review!
What changes were proposed in this pull request?
HashedRelation has two separate code paths for unique key lookup and non-unique key lookup. E.g. in its subclass UnsafeHashedRelation, unique key lookup is more efficient, as it does not have e.g. the extra Iterator[UnsafeRow].hasNext()/next() overhead per row.
BroadcastHashJoinExec already handles unique keys vs non-unique keys separately in the code-gen path. But the non-codegen paths for broadcast hash join and shuffled hash join do not separate them yet, so this PR adds that support.
Why are the changes needed?
Shuffled hash join and non-codegen broadcast hash join still rely on this code path for execution, so this PR will help save CPU when executing these two types of join. Adding codegen for shuffled hash join would be a different topic and I will add it in https://issues.apache.org/jira/browse/SPARK-32421 .
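As a rough illustration of the two lookup paths, the sketch below models a hashed relation with plain Scala collections. SimpleRelation, build, get, getValue, and innerJoin are simplified stand-ins invented for this example, not Spark's actual classes or signatures, and rows are modeled as ordinary tuples rather than UnsafeRow.

```scala
object UniqueKeyJoinSketch {
  // Hypothetical stand-in for a hashed relation: build-side rows grouped by join key.
  final case class SimpleRelation[K, V](buckets: Map[K, Seq[V]]) {
    // True when no key maps to more than one build-side row.
    val keyIsUnique: Boolean = buckets.values.forall(_.size <= 1)

    // Non-unique path: each lookup returns an iterator the caller must drain.
    def get(key: K): Iterator[V] = buckets.getOrElse(key, Seq.empty[V]).iterator

    // Unique path: at most one match, so no iterator is involved.
    def getValue(key: K): Option[V] = buckets.get(key).flatMap(_.headOption)
  }

  // Groups the build-side rows by the key extracted with keyOf.
  def build[K, V](rows: Seq[V])(keyOf: V => K): SimpleRelation[K, V] =
    SimpleRelation(rows.groupBy(keyOf))

  // Inner join that chooses the lookup path once, based on keyIsUnique.
  def innerJoin[K, L, R](stream: Seq[L], relation: SimpleRelation[K, R])(
      keyOf: L => K,
      condition: (L, R) => Boolean): Seq[(L, R)] =
    if (relation.keyIsUnique) {
      stream.flatMap { l =>
        relation.getValue(keyOf(l)).filter(r => condition(l, r)).map(r => (l, r))
      }
    } else {
      stream.flatMap { l =>
        relation.get(keyOf(l)).filter(r => condition(l, r)).map(r => (l, r))
      }
    }
}
```

Both branches return the same rows; the unique-key branch just avoids creating and stepping through an iterator for every streamed row, which is the per-row overhead the description refers to.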
Ran the same query as JoinBenchmark, with this feature enabled and disabled. Verified a 20% wall clock time improvement (also switched the order of the control and test groups to verify that the improvement is not noise).
Does this PR introduce any user-facing change?
No.
How was this patch tested?
OuterJoinSuite to cover left outer and right outer join.
ExistenceJoinSuite to cover left semi join, and existence join.
joinSuite already covered inner join.
ExistenceJoinSuite already covered left anti join, and existence join.
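For readers unfamiliar with what such tests guard against, here is a minimal sketch of the property at stake for a left outer join: when build-side keys are unique, the single-value path should produce the same rows as the iterator path. The helper names below are hypothetical, unmatched rows are padded with None instead of a null row, and none of this is taken from the suites listed above.

```scala
object OuterJoinCheckSketch {
  // Iterator-based path: scan all matches; pad with None when none passes the condition.
  def leftOuterViaIterator[L, R](stream: Seq[L], lookup: L => Iterator[R])(
      condition: (L, R) => Boolean): Seq[(L, Option[R])] =
    stream.flatMap { l =>
      val matched = lookup(l).filter(r => condition(l, r)).toSeq
      if (matched.isEmpty) Seq((l, Option.empty[R]))
      else matched.map(r => (l, Some(r): Option[R]))
    }

  // Single-value path: valid only when each key has at most one build-side row.
  def leftOuterViaSingleValue[L, R](stream: Seq[L], lookup: L => Option[R])(
      condition: (L, R) => Boolean): Seq[(L, Option[R])] =
    stream.map(l => (l, lookup(l).filter(r => condition(l, r))))
}
```

With a build side whose keys are unique, calling both helpers on the same streamed rows and join condition should yield equal sequences.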