Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FLINK-2871] support outer join for hash join on build side. #1469

Closed
wants to merge 4 commits into from

Conversation

ChengXiangLi
Copy link

  1. There are 4 reserved bytes left in bucket header of MutableHashTable, as there are only 9 elements in each bucket, This PR could use 2 bytes to build a BitSet which is used to mark whether elements in that bucket has been probed during probe phase. After probe phase, return the elements which has not been probed at the end.
  2. As build side outer join is supported, we could support more flexible strategy for left outer join, right outer join and full outer join, new supported join types includes:
    • left outer join with REPARTITION_HASH_FIRST.
    • right outer join with REPARTITION_HASH_SECOND
    • full outer join with REPARTITION_HASH_FIRST or REPARTITION_HASH_SECOND.
  3. But there is still some limitations about broadcast hash join, the following join types are still not supported for obviously reason:
    • left outer join with BROADCAST_HASH_FIRST.
    • right outer join with BROADCAST_HASH_SECOND.
    • full outer join with BROADCAST_HASH_FIRST and BROADCAST_HASH_SECOND.

@ChengXiangLi ChengXiangLi changed the title [FLINK-2971] support outer join for hash join on build side. [FLINK-2871] support outer join for hash join on build side. Dec 18, 2015
@fhueske
Copy link
Contributor

fhueske commented Dec 18, 2015

Thanks for the PR! I will shepherd it.

public class HashFullOuterJoinBuildFirstDescriptor extends AbstractJoinDescriptor {

public HashFullOuterJoinBuildFirstDescriptor(FieldList keys1, FieldList keys2,
boolean broadcastFirstAllowed, boolean broadcastSecondAllowed, boolean repartitionAllowed) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Broadcast-Forward shipping strategies are not valid for full outer joins.
Hence, the broadcast parameters can be removed.

@fhueske
Copy link
Contributor

fhueske commented Jan 13, 2016

Hi @ChengXiangLi, very nice PR! Sorry for not reviewing it earlier.
I had only a few minor comments. I did not check the tests yet, but hope to do it in the next days.

Although, I do not expect major performance implications for inner joins, it would be good to check that to be on the safe side.
Have you done any performance regression tests?

Thanks, Fabian

// The keys of probe and build sides are overlapped, so there would be none unmatched build elements
// after probe phase.

// create a build input that gives 40000 pairs with 2 values sharing the same key
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

build input gives 80000 pairs with 40000 distinct key values

@fhueske
Copy link
Contributor

fhueske commented Jan 14, 2016

The tests look mostly good.
I would change the HashTableITCase.testHashWithBuildSideOuterJoin2() to process outer join tuples.
Can you explain the rational for the changes in HashVsSortMiniBenchmark?

Thanks, Fabian

@ChengXiangLi
Copy link
Author

I did simple regression test based on HashVsSortMiniBenchmark, the result looks like:

Test Before After
testBuildFirst 6.63s 6.65s
testBuildSecond 3.7s 3.8s

The inner join performance is not influenced by this PR, which fit into my expectation. There is a flag called buildsideOuterJoin in MutableHashTable, all the extra effort only happens while buildSideOuterJoin is true.


public class HashFullOuterJoinBuildFirstDescriptor extends AbstractJoinDescriptor {

public HashFullOuterJoinBuildFirstDescriptor(FieldList keys1, FieldList keys2,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the repartitionAllowed parameter can be removed as well, as it is the only possible strategy.

@fhueske
Copy link
Contributor

fhueske commented Jan 21, 2016

Thanks for the update and the regression test @ChengXiangLi!
Looks good, only minor changes to do, IMO.

@fhueske
Copy link
Contributor

fhueske commented Jan 26, 2016

Thanks for the update @ChengXiangLi!

PR is good to merge, IMO.
Do you want to do it yourself?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants