[FLINK-2871] support outer join for hash join on build side. #1469
Thanks for the PR! I will shepherd it.
public class HashFullOuterJoinBuildFirstDescriptor extends AbstractJoinDescriptor {

	public HashFullOuterJoinBuildFirstDescriptor(FieldList keys1, FieldList keys2,
			boolean broadcastFirstAllowed, boolean broadcastSecondAllowed, boolean repartitionAllowed) {
Broadcast-Forward shipping strategies are not valid for full outer joins.
Hence, the broadcast parameters can be removed.
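One way to see why broadcasting does not work for a full outer join is sketched below. This is an illustration only, not code from the PR: the class and method names are hypothetical, and plain integer keys stand in for records. When the build side is broadcast, every parallel task sees all build keys but only its own partition of the probe side, so each task reports build keys as unmatched based on incomplete information.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch, not part of the PR.
final class BroadcastOuterJoinSketch {

    /** Build-side keys that found no partner among the probe keys seen by one parallel task. */
    static List<Integer> unmatchedBuildKeys(List<Integer> buildKeys, List<Integer> localProbeKeys) {
        Set<Integer> probed = new HashSet<>(localProbeKeys);
        List<Integer> unmatched = new ArrayList<>();
        for (Integer key : buildKeys) {
            if (!probed.contains(key)) {
                unmatched.add(key);
            }
        }
        return unmatched;
    }

    /**
     * Unmatched build keys emitted overall when the full build side is
     * broadcast to every task and each task probes only its own partition.
     */
    static List<Integer> broadcastBuildSide(List<Integer> buildKeys,
                                            List<List<Integer>> probePartitions) {
        List<Integer> emitted = new ArrayList<>();
        for (List<Integer> partition : probePartitions) {
            emitted.addAll(unmatchedBuildKeys(buildKeys, partition));
        }
        return emitted;
    }
}
```

With build keys 1, 2, 3 and probe partitions [1] and [2], the two tasks together emit 2, 3, 1, 3 as unmatched, although only key 3 is unmatched globally. Repartitioning both inputs on the join key avoids this, because each key is then handled by exactly one task.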
Hi @ChengXiangLi, very nice PR! Sorry for not reviewing it earlier. Although I do not expect major performance implications for inner joins, it would be good to check that to be on the safe side. Thanks, Fabian
// The keys of probe and build sides are overlapped, so there would be none unmatched build elements
// after probe phase.

// create a build input that gives 40000 pairs with 2 values sharing the same key
build input gives 80000 pairs with 40000 distinct key values
The tests look mostly good. Thanks, Fabian
92961bc to c4138e8
I did a simple regression test based on
The inner join performance is not influenced by this PR, which matches my expectation. There is a flag called
public class HashFullOuterJoinBuildFirstDescriptor extends AbstractJoinDescriptor {

	public HashFullOuterJoinBuildFirstDescriptor(FieldList keys1, FieldList keys2,
The repartitionAllowed parameter can be removed as well, as it is the only possible strategy.
Thanks for the update and the regression test @ChengXiangLi!
Thanks for the update @ChengXiangLi! PR is good to merge, IMO.
MutableHashTable: as there are only 9 elements in each bucket, this PR can use 2 bytes to build a BitSet that marks whether each element in that bucket has been probed during the probe phase. After the probe phase, the elements that have not been probed are returned at the end.
REPARTITION_HASH_FIRST.
REPARTITION_HASH_SECOND
REPARTITION_HASH_FIRST or REPARTITION_HASH_SECOND.
BROADCAST_HASH_FIRST.
BROADCAST_HASH_SECOND.
BROADCAST_HASH_FIRST and BROADCAST_HASH_SECOND.
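The per-bucket probe marking described for MutableHashTable could look roughly like the sketch below. It is a standalone illustration under stated assumptions, not the PR's implementation: the class BucketProbeFlags and its methods are hypothetical, and the real hash table stores its flags inside the bucket's memory segment rather than in a Java field. The sketch only shows that 9 flags fit into 2 bytes and how the unprobed entries can be collected after the probe phase.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not the PR's code.
final class BucketProbeFlags {

    // Entries per bucket, as stated in the PR description.
    static final int ENTRIES_PER_BUCKET = 9;

    // 9 flags need 9 bits, which fit into 2 bytes (16 bits).
    private short flags;

    /** Marks the entry in the given slot as matched by a probe record. */
    void markProbed(int slot) {
        checkSlot(slot);
        flags |= (short) (1 << slot);
    }

    /** Returns true if the entry in the given slot was matched during probing. */
    boolean isProbed(int slot) {
        checkSlot(slot);
        return (flags & (1 << slot)) != 0;
    }

    /** Slots in [0, entryCount) that were never probed; these entries are emitted after the probe phase. */
    List<Integer> unprobedSlots(int entryCount) {
        if (entryCount < 0 || entryCount > ENTRIES_PER_BUCKET) {
            throw new IllegalArgumentException("entryCount out of range: " + entryCount);
        }
        List<Integer> result = new ArrayList<>();
        for (int slot = 0; slot < entryCount; slot++) {
            if (!isProbed(slot)) {
                result.add(slot);
            }
        }
        return result;
    }

    private static void checkSlot(int slot) {
        if (slot < 0 || slot >= ENTRIES_PER_BUCKET) {
            throw new IndexOutOfBoundsException("slot out of range: " + slot);
        }
    }
}
```

For a bucket holding three entries where only slot 1 was matched, unprobedSlots(3) yields slots 0 and 2, which correspond to the build-side records an outer join still has to emit.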