
[FLINK-6232][Table&Sql] support proctime inner windowed stream join #4266

Closed
wants to merge 1 commit

Conversation

hongyuhong
Contributor

Thanks for contributing to Apache Flink. Before you open your pull request, please take the following check list into consideration.
If your changes take all of the items into account, feel free to open your pull request. For more information and/or questions please refer to the How To Contribute guide.
In addition to going through the list, please provide a meaningful description of your changes.

  • General

    • The pull request references the related JIRA issue ("[FLINK-XXX] Jira title text")
    • The pull request addresses only one issue
    • Each commit in the PR has a meaningful commit message (including the JIRA id)
  • Documentation

    • Documentation has been added for new functionality
    • Old documentation affected by the pull request has been updated
    • JavaDoc for public methods has been added
  • Tests & Build

    • Functionality added by the pull request is covered by tests
    • mvn clean verify has been executed successfully locally or a Travis build has passed

@wuchong
Member

wuchong commented Jul 6, 2017

Hi @hongyuhong, is this the same PR as #3715? To rebase or remove a merge commit, please do not create a new PR; otherwise committers may review an out-of-date PR or lose the review context.

You can force update your repo branch via git push <your-repo-name> flink-6232 --force and close this PR.

Thanks,
Jark

@hongyuhong
Contributor Author

Hi @wuchong, thanks for the reminder. There are still some modifications in the new commit, so I want to keep the older commit for easier reviewing. After the review is finished, I will close PR #3715.

Thanks very much.

@fhueske
Contributor

fhueske commented Jul 13, 2017

Thanks for the update @hongyuhong!
I will take this PR from here. The logic looks very good but I would like to refactor some parts (mainly the WindowJoinUtil).

I will open a new PR with your work and my commit on top, probably later today.
It would be great if you could review and check my PR.

@wuchong your review is of course also highly welcome :-)

Thank you, Fabian

Member

@wuchong wuchong left a comment

Hi @hongyuhong, thanks for your great work! I left some comments below (most are about code style).

Overall looks good to me.

@fhueske , I just finished the review. Sorry for the delay.

Cheers,
Jark

*/
package org.apache.flink.table.runtime.join

import java.math.{BigDecimal => JBigDecimal}
Member

remove unused import


joinType match {
case JoinRelType.INNER =>
isRowTime match {
Member

I think using an if (isRowTime) ... else here is simpler.
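The suggested simplification can be sketched as follows (a minimal illustration with a hypothetical chooseJoinTranslation helper; in the actual code each branch would construct the corresponding join operator):

```scala
// A boolean match:
//   isRowTime match {
//     case true  => ...  // row-time translation
//     case false => ...  // proc-time translation
//   }
// reads more directly as a plain if/else:
def chooseJoinTranslation(isRowTime: Boolean): String =
  if (isRowTime) "row-time join" else "proc-time join"
```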

object WindowJoinUtil {

/**
* Analyze time-condtion to get time boundary for each stream and get the time type
Member

minor typo: condtion -> condition

"two join predicates that bound the time in both directions.")
}

// extract time offset from the time indicator conditon
Member

minor typo: conditon -> condition

val env = StreamExecutionEnvironment.getExecutionEnvironment
val tEnv = TableEnvironment.getTableEnvironment(env)
env.setStateBackend(getStateBackend)
StreamITCase.testResults = mutable.MutableList()
Member

You can simply do StreamITCase.clear instead of this.

val env = StreamExecutionEnvironment.getExecutionEnvironment
val tEnv = TableEnvironment.getTableEnvironment(env)
env.setStateBackend(getStateBackend)
StreamITCase.testResults = mutable.MutableList()
Member

You can simply do StreamITCase.clear instead of this.


class JoinITCase extends StreamingWithStateTestBase {

val data = List(
Member

Looks like the data is never used, can we remove it?

val reduceList = new util.ArrayList[RexNode]()
exprReducer.reduce(rexBuilder, originList, reduceList)

val literals = reduceList.asScala.map(f => f match {
Member

Can be simplified to

val literals = reduceList.asScala.map {
      case literal: RexLiteral =>
        literal.getValue2.asInstanceOf[Long]
      case _ =>
        throw TableException(
          "Time condition may only consist of time attributes, literals, and arithmetic operators.")
    }
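As a runnable illustration of this idiom (using plain Long values in place of Calcite's RexLiteral, which isn't available in a standalone snippet), a partial-function literal replaces the map(f => f match { ... }) form:

```scala
// stand-ins for reduced expressions that are expected to be literals
val reduceList: Seq[Any] = Seq(1000L, 5000L)

// partial-function literal instead of map(f => f match { ... })
val literals = reduceList.map {
  case l: Long => l
  case other => throw new IllegalArgumentException(
    s"Time condition may only consist of time attributes, literals, " +
    s"and arithmetic operators, but found: $other")
}
```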

// If the state has non-expired timestamps, register a new timer.
// Otherwise clean the complete state for this input.
if (nextTimer != 0) {
ctx.timerService.registerProcessingTimeTimer(nextTimer + winSize + 1)
Member

The nextTimer may be neither the smallest nor the greatest timestamp among the non-expired timestamps. Would it be better to register a curTime + winSize + 1 timer?

Contributor

yes, that makes sense to me.
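The agreed-upon fix can be sketched as a small helper (simplified; in the actual operator the returned timestamp would be passed to ctx.timerService.registerProcessingTimeTimer):

```scala
// Register the cleanup timer relative to the current processing time:
// after curTime + winSize no buffered element can still join, so
// curTime + winSize + 1 is a safe cleanup point regardless of which
// timestamps are still pending in state.
def cleanupTimestamp(curTime: Long, winSize: Long): Long =
  curTime + winSize + 1
```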

val oppoUpperTime = curProcessTime + oppoUpperBound

// only when windowsize != 0, we need to store the element
if (winSize != 0) {
Member

I'm not sure about this. For example, a.proctime between b.proctime - 5 and b.proctime. In this case, we will buffer stream a for a window size 5, but will not buffer stream b because the right window size is 0.

Suppose the input elements are [a1, 1], [a2, 2], [b1, 5], [a3, 5]. The first field in the tuple indicates which stream it belongs to. The second field in the tuple is the processing timestamp. The expected result should be a1, b1, a2, b1, a3, b1. But the actual result misses a3, b1, because we didn't buffer the elements from the b stream.

So I think, even if the window size is 0, we still need to store the elements. Of course, we will register a curTime +1 timer to clean the states.

Contributor

I think you are right @wuchong. I'll remove that condition.
OTOH, this is a processing time join which cannot guarantee strict results anyway ;-)
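Jark's counter-example can be replayed with a toy simulation (a sketch only, not the actual operator: both sides are kept buffered, and the predicate is a.proctime between b.proctime - 5 and b.proctime):

```scala
// (stream, id, processing timestamp)
case class Event(stream: Char, id: String, time: Long)

val events = List(
  Event('a', "a1", 1), Event('a', "a2", 2),
  Event('b', "b1", 5), Event('a', "a3", 5))

// a.proctime between b.proctime - 5 and b.proctime
def joins(a: Event, b: Event): Boolean =
  a.time >= b.time - 5 && a.time <= b.time

val as = events.filter(_.stream == 'a')
val bs = events.filter(_.stream == 'b')

// because b1 stays buffered, the late-arriving a3 still finds it
val result = for (a <- as; b <- bs if joins(a, b)) yield (a.id, b.id)
```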

@fhueske
Contributor

fhueske commented Jul 13, 2017

Thanks for the review @wuchong.
I'll address your comments in my upcoming PR.


val config = tableEnv.getConfig

val isLeftAppendOnly = UpdateCheckUtils.isAppendOnly(left)
Member

We should use DataStreamRetractionRules.isAccRetract(input) to check whether the input will produce updates.

Contributor

isAccRetract only checks how updates are encoded but not whether there are updates.
The current approach is correct, IMO.

Member

The following SQL select a, sum(b), a+1 from t1 group by a will be optimized into the following nodes:

DataStreamCalc (AccRetract, producesUpdates=false)
    DataStreamGroupAggregate (AccRetract, producesUpdates=true)
        DataStreamScan (Acc, producesUpdates=false)

The DataStreamCalc is append only, but is in AccRetract mode which means the output contains retraction.

I think we want to check whether the input contains retraction, right?

Contributor

@fhueske fhueske Jul 13, 2017

the UpdateCheckUtils.isAppendOnly recursively checks if any downstream operator produces updates. As soon as any downstream operator produces updates, the given operator has to be able to handle them.

Updates can be encoded as retractions or as key-wise updates. Retraction updates produce two messages. Non-retraction updates produce a single message and require a key to which they relate (CRow.change == true -> insert or update per key, CRow.change == false -> delete on key). Right now, only UpsertTableSinks use non-retraction/keyed updates, but other operators such as unbounded joins will use them as well.

So even if AccRetract is false, the input might produce updates but those updates are differently encoded, i.e., in a single message. The window stream join is not able to handle updates (it ignores the CRow.change flag). Therefore, we must ensure that the inputs do not produce updates.
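The two encodings can be sketched with a simplified stand-in for CRow (the field names here are illustrative; the real class lives in flink-table and wraps a Row plus the change flag):

```scala
// change == true -> insert/update, change == false -> delete/retract
case class CRowLike(key: String, value: Int, change: Boolean)

// Retraction encoding: one logical update is two messages
val retractionUpdate = Seq(
  CRowLike("k", 1, change = false), // retract the old value
  CRowLike("k", 2, change = true))  // emit the new value

// Keyed/upsert encoding: a single message, interpreted per key
val upsertUpdate = Seq(CRowLike("k", 2, change = true))

// An operator that ignores the change flag would treat the retraction
// message as a regular insert, which is why append-only inputs are required.
```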

Member

@wuchong wuchong Jul 13, 2017

Thank you for the explanation, that makes sense to me. But I see that DataStreamOverAggregate and DataStreamGroupWindowAggregate use DataStreamRetractionRules.isAccRetract; is that a misuse?

Contributor

Yes, I think you are right. These checks should also check for updates and not retraction mode.

Maybe it makes sense to integrate the whole append-only/updates check into the decoration rules. Same for the inference of unique keys (the other method in UpdateCheckUtils).

@fhueske
Contributor

fhueske commented Jul 13, 2017

Hi @hongyuhong and @wuchong, I opened a new PR which extends this PR.
Please have a look and give feedback.

@hongyuhong can you close the PRs #3715 and this one?

Thank you, Fabian
