[FLINK-8428] [table] Implement stream-stream non-window left outer join #5327

Closed
hequn8128 wants to merge 4 commits

hequn8128
Contributor

What is the purpose of the change

Implement stream-stream non-window left outer join for sql/table-api. A simple design doc can be found here

Brief change log

  • Add left join
    • with non-equal predicates
    • without non-equal predicates
  • Adapt retraction rules to left join. Outer joins will generate retractions.
  • Adapt UpsertTableSink. The table mode of the dynamic table produced by left join is Update Mode, even if the table does not include a key definition.
  • Add inner join test cases that are consistent with the batch test cases.
  • Add left join test cases that are consistent with the batch test cases.

Verifying this change

This change added tests and can be verified as follows:

  • Added integration tests for left join with and without non-equal predicates.
  • Added harness tests for left join with and without non-equal predicates.
  • Added tests for the AccMode generated by left join.
  • Added tests for an UpsertSink that follows a left join.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (no)
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (no)
  • The serializers: (no)
  • The runtime per-record code paths (performance sensitive): (no)
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: (no)
  • The S3 file system connector: (no)

Documentation

  • Does this pull request introduce a new feature? (yes)
  • If yes, how is the feature documented? (already documented)

@hequn8128
Contributor Author

Hi @twalthr, it would be great if you could take a look at the PR. I'm looking forward to finishing outer join (left/right/full) before the end of March. Besides, there are a few PRs planned to optimize inner/outer joins. Thanks :)

@twalthr
Contributor

twalthr commented Jan 31, 2018

Thanks for the reminder @hequn8128. I will review it in the next 2 weeks. If not, feel free to ping me again.

@hequn8128
Contributor Author

Hi @twalthr, looking forward to your review, thanks :-)

Contributor

walterddr left a comment

Thanks for the PR, @hequn8128. Really looking forward to having LEFT JOIN support in datastream soon. I left a few comments and questions.

Best,
Rong

if (lInKeys.isEmpty || rInKeys.isEmpty) {
None
} else {
// Output of inner join must have keys if left and right both contain key(s).
Contributor

remove "inner"?

Contributor Author

Yes, thank you

* @param defaultRow The result row used for output, right side fields will all be null.
* @param out The collector for returning result values.
*/
def collectWithNullRight(leftRow: Row, defaultRow: Row, out: Collector[Row]): Unit = {
Contributor

We can probably reuse this function for RIGHT JOIN as well? Maybe rename it to collectWithNull?

Contributor Author

I was trying to reduce the number of if/else branches as much as possible, since they are inefficient. What do you think?

Contributor

I was thinking about something like

def collectWithDefault(mainRow: Row, defaultRow: Row, out: Collector[Row]): Unit = {
   //...
}

This way mainRow will be leftRow for a LEFT JOIN and rightRow for a RIGHT JOIN. It might be usable in FULL OUTER JOIN as well.

if (rigthKeyNum == 1 && value.change) {
cRowWrapper.setChange(false)
collectWithNullRight(otherSideRow, resultRow, cRowWrapper)
retractFlag = true
Contributor

Is retractFlag used anywhere?

Contributor Author

Thanks, I will remove it :-)

if (!value.change && rigthKeyNum == 0) {
cRowWrapper.setChange(true)
collectWithNullRight(otherSideRow, resultRow, cRowWrapper)
hasReEmittedNullRight = true
Contributor

Same here, is hasReEmittedNullRight used?

@@ -201,18 +202,294 @@ class JoinITCase extends StreamingWithStateTestBase {
// Proctime window output uncertain results, so assert has been ignored here.
}

@Test
def testJoin(): Unit = {
Contributor

Can this be more specific, e.g. inner equality join?

while (otherSideIterator.hasNext) {
val otherSideEntry = otherSideIterator.next()
val otherSideRow = otherSideEntry.getKey
val cntAndExpiredTime = otherSideEntry.getValue
Contributor

cntAndExpiredTime is already defined in the outer scope; maybe rename it to avoid confusion?

Contributor Author

Makes sense.

while (otherSideIterator.hasNext) {
val otherSideEntry = otherSideIterator.next()
val otherSideRow = otherSideEntry.getKey
val cntAndExpiredTime = otherSideEntry.getValue
Contributor

cntAndExpiredTime is already defined in the outer scope; maybe rename it to avoid confusion?

}
// update matched cnt only when left row cnt is changed from 0 to 1. Each time encountered a
// new record from right, leftJoinCnt will also be updated.
if (cntAndExpiredTime.f0 == 1) {
Contributor

Can't this check be triggered if cntAndExpiredTime was updated from 2 ==> 1?

Contributor Author

Right, we should also take value.change into consideration.
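
To illustrate the point raised in this thread, here is a minimal sketch, not the PR's actual code. It assumes that `isAccumulate` plays the role of `value.change` (true for an accumulate message, false for a retract message); the object and method names are hypothetical. A count of 1 alone is ambiguous, because it is reached both by an accumulate (0 to 1) and by a retract (2 to 1), so the change flag is needed to tell the two apart.

```scala
object CountTransitionSketch {
  /**
   * Returns true only when the count has just been raised from 0 to 1.
   * `newCount` is the count after the update; `isAccumulate` is a stand-in
   * for `value.change` and is an assumption of this sketch.
   */
  def becameFirstRow(newCount: Long, isAccumulate: Boolean): Boolean =
    newCount == 1L && isAccumulate

  def main(args: Array[String]): Unit = {
    // Accumulate message: the count went from 0 to 1.
    println(becameFirstRow(newCount = 1L, isAccumulate = true))
    // Retract message: the count went from 2 to 1, so this is not a first row.
    println(becameFirstRow(newCount = 1L, isAccumulate = false))
  }
}
```

Checking the flag together with the count keeps the 2 to 1 case from being treated like the 0 to 1 case.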

override def open(parameters: Configuration): Unit = {
super.open(parameters)

val leftJoinCntDescriptor = new MapStateDescriptor[Row, Long](
Contributor

probably use [Row, Int], to match with the type for count in LeftSideState and RightSideState?

Contributor Author

Long is safer. I will change all count types to Long. What do you think?

Contributor

I think either is fine as long as they are consistent.

@@ -985,4 +1017,232 @@ class JoinHarnessTest extends HarnessTestBase {

testHarness.close()
}

@Test
def testNonWindowLeftJoinWithOutNonEqualPred() {
Contributor

WithOut ==> Without

Contributor Author

hequn8128 left a comment

Hi, @walterddr
Thanks very much for your review and suggestions. I will update it soon.
Best, Hequn

@hequn8128
Contributor Author

Updated the PR according to @walterddr's suggestions.

@hequn8128
Contributor Author

hequn8128 commented Mar 8, 2018

Hi @twalthr @walterddr
The latest update refactors interfaces and functions to make the code friendlier to right/full joins. The code for right/full joins is also ready and available at https://github.com/hequn8128/flink/tree/outerjoin (branch: outerjoin).
@fhueske It would be great if you can also take a look.
Thanks all. Hequn

@fhueske
Contributor

fhueske commented Mar 15, 2018

Thanks @hequn8128! We're pretty busy with the Flink 1.5 release right now.
This will be one of the first features to add once 1.5 is out!

Best, Fabian

Contributor

twalthr left a comment

Thank you very much for this PR @hequn8128 and sorry for the delay! The code is already in very good shape. I only added minor things:

  • I would move the enforceKeys change into a separate PR.
  • We need more tests at certain places because some code paths are never tested.
  • Would be great to add more inline and method comments to maintain the code in the future.

I will run a couple of TPC-H/TPC-DS queries once all feedback has been addressed.

/**
* Whether the [[DataStreamRel]] produces retraction messages.
*/
def producesRetractions: Boolean = false
Contributor

Why isn't producesUpdates enough?

Contributor Author

A join generates retractions if its type is left/right/full. This is different from an agg, which generates retractions if sendsUpdatesAsRetraction(node) && node.producesUpdates is true.

Contributor

Thanks for the explanation.

@@ -60,6 +60,9 @@ class DataStreamJoin(

override def needsUpdatesAsRetraction: Boolean = true

// outer join will generate retractions
override def producesRetractions: Boolean = joinType != JoinRelType.INNER
Contributor

Could you clarify this? An inner join also produces retractions.

Contributor Author

An inner join doesn't produce retractions; left/right/full joins do. For example, a left join will retract the previous non-matched output when a new matching row comes from the right side.

Contributor

Thanks, now I understand the terminology between producing and just forwarding retractions.

isLeft: Boolean): Unit = {

val inputRow = value.row
val (curProcessTime, _) = updateCurrentSide(value, ctx, timerState, currentSideState)
Contributor

We should not use Scala sugar in runtime classes. This might create a tuple object for every processed element.
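
As a rough illustration of this concern (a sketch only; all names below are hypothetical and this is not the code from the PR), the first method returns a tuple that the caller destructures with `val (a, _) = ...`, which may allocate a `Tuple2` on every call. The second method writes into a holder object that is created once and reused, which is one common way to avoid per-record allocation.

```scala
object TupleSugarSketch {
  // Hypothetical reusable holder, filled in place instead of returning a tuple.
  final class TimeInfo {
    var processTime: Long = 0L
    var expireTime: Long = 0L
  }

  // Tuple-returning variant: a new Tuple2 may be allocated on every call.
  def updateWithTuple(now: Long, retentionMs: Long): (Long, Long) =
    (now, now + retentionMs)

  // Holder variant: no new object per call, the caller reads fields of `reuse`.
  def updateInto(now: Long, retentionMs: Long, reuse: TimeInfo): Unit = {
    reuse.processTime = now
    reuse.expireTime = now + retentionMs
  }

  def main(args: Array[String]): Unit = {
    // Pattern-matching sugar on the tuple result.
    val (processTime, _) = updateWithTuple(1000L, 500L)

    // One holder instance shared across all calls.
    val info = new TimeInfo
    updateInto(1000L, 500L, info)
    println(processTime == info.processTime)
  }
}
```

Whether the JIT can remove the tuple allocation depends on the runtime, so per-record code paths often avoid it explicitly.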

cRowWrapper.reset()
cRowWrapper.setCollector(out)
cRowWrapper.setChange(value.change)
cRowWrapper.setEmitCnt(0)
Contributor

Remove this line.

recordFromLeft: Boolean): Unit = {

val inputRow = value.row
val (curProcessTime, _) = updateCurrentSide(value, ctx, timerState, currentSideState)
Contributor

Same object creation issue as above.

* Join current row with other side rows. Preserve current row if there are no matched rows
* from other side.
*/
def preservedJoin(
Contributor

Explain return type.

* return 1 if from right.
*/
def getJoinCntIndex(isInputFromLeft: Boolean): Int = {
if (isInputFromLeft) {
Contributor

You could return the state directly instead of returning the index.

"DataStreamCalc",
binaryNode(
"DataStreamJoin",
"DataStreamScan(true, Acc)",
Contributor

Can you also add a test for a join that consumes from an aggregation?

Contributor Author

testJoin() has covered this case.

genFunction.code,
joinType == JoinRelType.LEFT,
queryConfig)
case JoinRelType.LEFT =>
Contributor

Is there a reason why we don't support right outer joins here?

Contributor Author

I planned to add right join in FLINK-8429. It's OK to add right join in this PR if you prefer.

Contributor

We can also do it as part of FLINK-8429.

}

@Test
def testDataStreamJoinWithAggregation(): Unit = {
Contributor

Can you try to use a consistent naming scheme for the test methods you added? Remove DataStream or Table from the names, and mark inner and outer joins correctly.

Contributor Author

All tests have been renamed, both stream and batch.

@hequn8128
Contributor Author

@twalthr Hi, great to see your review and valuable suggestions. I will update my PR late next week (maybe next weekend). Thanks very much.

@hequn8128
Contributor Author

@twalthr Hi, thanks for your review. I have updated the pr according to your suggestions. Changes mainly include:

  • Remove changes about UpsertSink
  • Refactor test case names and add more tests to cover code paths
  • Add more method comments
  • Add another base class NonWindowOuterJoinWithNonEquiPredicates and move corresponding variables and functions into it.
  • Split CRowWrappingMultiOutputCollector into CRowWrappingMultiOutputCollector and LazyOutputCollector.

Best, Hequn.

@twalthr
Contributor

twalthr commented May 17, 2018

Thanks for the update @hequn8128. The changes look good. I tested your implementation on a cluster with TPC-H data. The results were equal to the batch results and the state clean-up worked. I will merge this :-)

@asfgit asfgit closed this in 8b95ba3 May 17, 2018
cjolif pushed a commit to cjolif/flink that referenced this pull request May 19, 2018
Two different CoProcessFunctions are used to implement left join for performance reasons: one for left join with non-equal predicates, the other for left join without non-equal predicates. The main difference between them is that, for left join without non-equal predicates, left rows can always find matching right rows as long as the join keys are the same.

- Left join with non-equal predicates: Use a mapState to keep track of how many rows (joinCnt) from the right table are matched by the current left row. If joinCnt is 0, output the left row with a NULL right side. If joinCnt changes from 0 to 1, retract the previous NULL-right output and output the matched result. If joinCnt changes from 1 to 0 on receiving a right retract input, retract the previous matched result and output the left row with a NULL right side.
- Left join without non-equal predicates: joinCnt does not need to be counted, because it equals the right state size, so checking the state size is sufficient.

Table Modes:
Left join will generate retractions, so the DataStreamRel node of a left join will work in AccRetract mode. Also, the table mode of the dynamic table produced by left join is Update Mode, even if the table does not include a key definition.

This closes apache#5327.
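
The joinCnt rules in the commit message above can be sketched as a small, self-contained simulation. This is an illustration under stated assumptions, not the merged implementation: it tracks a single left row, uses a plain counter instead of Flink state, and models an output record as a hypothetical `(isAccumulate, left, right)` triple where `None` stands for the NULL right side. It also assumes well-formed input, meaning every retract follows an earlier accumulate.

```scala
import scala.collection.mutable.ArrayBuffer

object LeftJoinCountSketch {
  // (isAccumulate, leftRow, rightRow); None models the NULL-padded right side.
  type Out = (Boolean, String, Option[String])

  /** Tracks one left row and records the changes a left join would emit for it. */
  final class LeftRowState(left: String) {
    private var joinCnt = 0L
    val out: ArrayBuffer[Out] = ArrayBuffer.empty[Out]

    // joinCnt is 0 at first: emit the left row with a NULL right side.
    out += ((true, left, None))

    /** Handles a matching right row; isAccumulate = false means a retract message. */
    def onRight(right: String, isAccumulate: Boolean): Unit = {
      if (isAccumulate) {
        // 0 -> 1: retract the earlier NULL-right output before emitting the match.
        if (joinCnt == 0L) out += ((false, left, None))
        joinCnt += 1
        out += ((true, left, Some(right)))
      } else {
        require(joinCnt > 0, "retract without a prior accumulate")
        joinCnt -= 1
        out += ((false, left, Some(right)))
        // 1 -> 0: no match is left, so emit the NULL-right output again.
        if (joinCnt == 0L) out += ((true, left, None))
      }
    }
  }

  def main(args: Array[String]): Unit = {
    val state = new LeftRowState("l1")
    state.onRight("r1", isAccumulate = true)
    state.onRight("r1", isAccumulate = false)
    state.out.foreach(println)
  }
}
```

For the variant without non-equal predicates, the commit message notes that the counter is unnecessary because the size of the right-side state carries the same information.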
sampathBhat pushed a commit to sampathBhat/flink that referenced this pull request Jul 26, 2018