[SPARK-32862][SS] Left semi stream-stream join #30076

c21 · 2020-10-17T02:17:29Z

What changes were proposed in this pull request?

This PR adds support for left semi join in stream-stream joins. The implementation (mostly in StreamingSymmetricHashJoinExec and SymmetricHashJoinStateManager) works as follows:

For each left side input row, check whether there is a match in the right side state store.
- If there is a match, output the left side row, but do not put it in the left side state store (it is no longer needed there).
- If there is no match, output nothing, but put the row in the left side state store with its "matched" field set to false.
For each right side input row, check whether there is a match in the left side state store.
- Among the matching left rows in the state store, output only those whose "matched" field is false, then set "matched" to true on all of them. Emitting each left row only the first time it matches guarantees left semi join semantics.
State store eviction: evict rows below the watermark from the left and right side state stores, same as inner join.
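
The steps above can be sketched with plain in-memory collections. This is only an illustrative model, not the code in this PR: `LeftSemiJoinSketch`, `Row`, `LeftEntry` and `LeftSemiJoiner` are hypothetical names, the state stores are modeled as mutable maps, and watermark-based eviction is left out entirely.

```scala
import scala.collection.mutable

// Hypothetical, simplified model of the algorithm described above.
// None of these names are Spark classes; eviction by watermark is omitted.
object LeftSemiJoinSketch {
  final case class Row(key: Int, value: String)

  // A left row buffered in state, together with its "matched" flag.
  final class LeftEntry(val row: Row, var matched: Boolean)

  final class LeftSemiJoiner {
    private val leftState = mutable.Map.empty[Int, mutable.ArrayBuffer[LeftEntry]]
    private val rightState = mutable.Map.empty[Int, mutable.ArrayBuffer[Row]]

    /** Left input: emit immediately on a match, otherwise buffer as unmatched. */
    def onLeft(row: Row): Seq[Row] =
      if (rightState.get(row.key).exists(_.nonEmpty)) {
        Seq(row) // matched: output the row, no need to store it
      } else {
        leftState.getOrElseUpdate(row.key, mutable.ArrayBuffer.empty) +=
          new LeftEntry(row, matched = false)
        Seq.empty
      }

    /** Right input: emit buffered left rows that match for the first time, then mark them. */
    def onRight(row: Row): Seq[Row] = {
      rightState.getOrElseUpdate(row.key, mutable.ArrayBuffer.empty) += row
      val firstTimeMatches = leftState
        .getOrElse(row.key, mutable.ArrayBuffer.empty[LeftEntry])
        .filterNot(_.matched)
      firstTimeMatches.foreach(entry => entry.matched = true)
      firstTimeMatches.map(_.row).toSeq
    }
  }
}
```

In this sketch, a left row that arrives before any matching right row is emitted by the first matching `onRight` call and by no later one, while a left row that arrives after a matching right row is emitted directly by `onLeft` and never stored.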

Note that a possible follow-up optimization is to evict matched left side rows from the state store earlier, even while they are still above the watermark. However, this needs more changes in SymmetricHashJoinStateManager, so it is left as a follow-up.

Why are the changes needed?

Stream-stream join currently supports inner, left outer and right outer joins (https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingSymmetricHashJoinExec.scala#L166 ). Internally, we see many users running left semi stream-stream joins (not on Spark Structured Streaming). For example: I want to get the ad impressions (join left side) that have a click (join right side), but I don't care how many clicks each ad has (left semi semantics).
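
To make the ad impression example concrete, here is a small batch-style comparison of inner join and left semi join on plain Scala collections. It only illustrates the semantics, not the streaming implementation; `SemiVsInnerSketch`, `Impression` and `Click` are hypothetical names made up for this sketch.

```scala
// Hypothetical batch illustration of left semi vs. inner join semantics.
object SemiVsInnerSketch {
  final case class Impression(adId: Int)
  final case class Click(adId: Int, user: String)

  // Inner join: one output row per matching (impression, click) pair.
  def innerJoin(imps: Seq[Impression], clicks: Seq[Click]): Seq[(Impression, Click)] =
    for (i <- imps; c <- clicks if c.adId == i.adId) yield (i, c)

  // Left semi join: each impression appears at most once, with left columns only,
  // regardless of how many clicks it has.
  def leftSemiJoin(imps: Seq[Impression], clicks: Seq[Click]): Seq[Impression] = {
    val clickedAds = clicks.map(_.adId).toSet
    imps.filter(i => clickedAds.contains(i.adId))
  }
}
```

With two clicks on the same ad, the inner join yields two rows for that impression, while the left semi join yields the impression once.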

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Added unit tests in UnsupportedOperationsSuite.scala and StreamingJoinSuite.scala.

c21 · 2020-10-17T02:17:52Z

cc @cloud-fan and @sameeragarwal if you guys have time to take a look, thanks.

viirya · 2020-10-17T02:40:16Z

cc @HeartSaVioR

SparkQA · 2020-10-17T03:09:39Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34536/

SparkQA · 2020-10-17T03:30:12Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34536/

SparkQA · 2020-10-17T04:43:23Z

Test build #129931 has finished for PR 30076 at commit e5af8e1.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-10-17T04:45:40Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34539/

SparkQA · 2020-10-17T05:07:36Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34539/

SparkQA · 2020-10-17T07:05:01Z

Test build #129934 has finished for PR 30076 at commit ee16690.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

HeartSaVioR · 2020-10-19T07:28:34Z

cc @tdas @zsxwing @jose-torres @gaborgsomogyi

gaborgsomogyi · 2020-10-19T10:58:22Z

I've just picked this up, and from a high-level perspective I see at least 2 things in the PR:

- Test code deduplication (which is good to do)
- Left semi join itself

I suggest splitting them up by creating a JIRA for the test code deduplication.
That would make the PR more concise and easier to review.

HeartSaVioR

Just went through the code except the test suite. It seems OK to me, and I'll look into the test suite in a couple of days.

Btw, I have a feeling that left semi join is different enough from the other joins that it might be worth taking a different path instead of adding exceptions for left semi.

e.g. I guess you'd like to simply remove the left side row from state whenever it gets matched with a right side row, instead of marking it as matched. (This seems to be what you describe as the "follow-up".)

I'm OK with doing it later, assuming it doesn't end up touching public API.

...alyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/UnsupportedOperationsSuite.scala

...src/main/scala/org/apache/spark/sql/execution/streaming/StreamingSymmetricHashJoinExec.scala

...ain/scala/org/apache/spark/sql/execution/streaming/state/SymmetricHashJoinStateManager.scala

...src/main/scala/org/apache/spark/sql/execution/streaming/StreamingSymmetricHashJoinExec.scala

c21

Thanks @HeartSaVioR for review, addressed all comments for now.

...alyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/UnsupportedOperationsSuite.scala

...src/main/scala/org/apache/spark/sql/execution/streaming/StreamingSymmetricHashJoinExec.scala

...ain/scala/org/apache/spark/sql/execution/streaming/state/SymmetricHashJoinStateManager.scala

SparkQA · 2020-10-19T19:03:08Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34617/

SparkQA · 2020-10-19T19:25:22Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34617/

...alyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/UnsupportedOperationsSuite.scala

...src/main/scala/org/apache/spark/sql/execution/streaming/StreamingSymmetricHashJoinExec.scala

...ain/scala/org/apache/spark/sql/execution/streaming/state/SymmetricHashJoinStateManager.scala

viirya · 2020-10-19T19:38:59Z

...ain/scala/org/apache/spark/sql/execution/streaming/state/SymmetricHashJoinStateManager.scala

+    keyWithIndexToValue.getAll(key, numValues).filterNot { keyIdxToValue =>
+      joinOnlyFirstTimeMatchedRow && keyIdxToValue.matched
+    }.map { keyIdxToValue =>


I feel it is easier to read if:

val keyIdxToValues = if (joinOnlyFirstTimeMatchedRow) {
  keyWithIndexToValue.getAll(key, numValues).filter { keyIdxToValue => !keyIdxToValue.matched }
} else {
  keyWithIndexToValue.getAll(key, numValues)
}
keyIdxToValues.map { keyIdxToValue => ... }

I think the current code is more concise and less distracting. If someone feels filterNot could be confusing, we can simply use filter and reverse the condition.
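
The two forms discussed in this thread select the same rows. A minimal sketch of that equivalence follows, assuming the stored values can be modeled as an `Iterator` of a simple case class; `FilterFormsSketch` and this `KeyIdxToValue` are hypothetical stand-ins, not the types in SymmetricHashJoinStateManager.

```scala
// Hypothetical stand-ins for comparing the two styles; not Spark's actual types.
object FilterFormsSketch {
  final case class KeyIdxToValue(index: Long, value: String, matched: Boolean)

  // Style used in the PR: a single chain with the flag folded into filterNot.
  def viaFilterNot(
      rows: Iterator[KeyIdxToValue],
      joinOnlyFirstTimeMatchedRow: Boolean): Iterator[String] =
    rows.filterNot { r => joinOnlyFirstTimeMatchedRow && r.matched }.map(_.value)

  // Suggested alternative: branch on the flag first, then map.
  def viaIfElse(
      rows: Iterator[KeyIdxToValue],
      joinOnlyFirstTimeMatchedRow: Boolean): Iterator[String] = {
    val kept = if (joinOnlyFirstTimeMatchedRow) rows.filter(r => !r.matched) else rows
    kept.map(_.value)
  }
}
```

When the flag is false, `filterNot` rejects nothing, so both versions pass every row through; when it is true, both keep only the rows with `matched == false`. The choice between them is purely about readability.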

SparkQA · 2020-10-19T22:57:26Z

Test build #130010 has finished for PR 30076 at commit 3918727.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HeartSaVioR · 2020-10-20T07:25:55Z

sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingJoinSuite.scala

  }

  after {
    StateStore.stop()
  }

+  protected def setupStream(prefix: String, multiplier: Int): (MemoryStream[Int], DataFrame) = {


It would be very helpful to provide guide comments for tracking the refactoring.
e.g. this is equivalent to StreamingOuterJoinSuite.setupStream, with the signature changed from private to protected so it can be shared.

HeartSaVioR · 2020-10-20T07:27:53Z

sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingJoinSuite.scala

+    val windowed1 = df1.select('key, window('leftTime, "10 second"), 'leftValue)
+    val windowed2 = df2.select('key, window('rightTime, "10 second"), 'rightValue)
+    val joined = windowed1.join(windowed2, Seq("key", "window"), joinType)
+    val select = if (joinType == "left_semi") {


For reviewers: this is equivalent to StreamingOuterJoinSuite.setupWindowedJoin, with the following changes:

- signature changed from private to protected
- conditional select on left_semi vs. others, since only the left side columns are available in left_semi

HeartSaVioR · 2020-10-20T07:32:36Z

sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingJoinSuite.scala

+    (input1, input2, select)
+  }
+
+  protected def setupWindowedJoinWithLeftCondition(joinType: String)


For reviewers: this is extracted from test("left outer early state exclusion on left") / test("right outer early state exclusion on left"), adding a select per join type.

HeartSaVioR · 2020-10-20T07:34:21Z

sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingJoinSuite.scala

+    (leftInput, rightInput, select)
+  }
+
+  protected def setupWindowedJoinWithRightCondition(joinType: String)


For reviewers: this is extracted from test("left outer early state exclusion on right") / test("right outer early state exclusion on right"), adding a select per join type.

HeartSaVioR · 2020-10-20T07:38:50Z

sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingJoinSuite.scala

+    (leftInput, rightInput, select)
+  }
+
+  protected def setupWindowedJoinWithRangeCondition(joinType: String)


For reviewers: this is extracted from test(s"${joinType.replaceAllLiterally("_", " ")} with watermark range condition"), with a conditional select on left_semi vs. others, since only the left side columns are available in left_semi.

sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingJoinSuite.scala

c21

Addressed all comments and the PR is ready for review again, cc @HeartSaVioR , thanks.

...alyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/UnsupportedOperationsSuite.scala

...src/main/scala/org/apache/spark/sql/execution/streaming/StreamingSymmetricHashJoinExec.scala

...ain/scala/org/apache/spark/sql/execution/streaming/state/SymmetricHashJoinStateManager.scala

SparkQA · 2020-10-21T06:40:38Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34681/

SparkQA · 2020-10-21T07:05:02Z

Test build #130072 has finished for PR 30076 at commit 765a233.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-10-21T07:09:11Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34681/

c21 · 2020-10-21T07:36:30Z

retest this please

sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingJoinSuite.scala

SparkQA · 2020-10-21T08:54:14Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34691/

SparkQA · 2020-10-21T09:18:02Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34691/

c21

Addressed all comments.

c21 · 2020-10-21T19:46:22Z

...lyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/UnsupportedOperationChecker.scala

              if (right.isStreaming) {
-                throwError("Left semi/anti joins with a streaming DataFrame/Dataset " +
+                throwError("Left anti joins with a streaming DataFrame/Dataset " +
                    "on the right are not supported")
              }

            // We support streaming left outer joins with static on the right always, and with


@xuanyuanking - updated.

c21 · 2020-10-21T19:55:17Z

...alyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/UnsupportedOperationsSuite.scala

+    streamRelation.join(streamRelation, joinType = LeftSemi,
+      condition = Some(attribute === attribute)),
+    OutputMode.Append(),
+    Seq("watermark in the join keys"))


@xuanyuanking - yeah, I agree adding "without" would be better. I updated it for the left semi join here. A refactoring for all joins (inner, outer, semi, anti, etc.) is needed anyway as a follow-up JIRA (https://issues.apache.org/jira/browse/SPARK-33209), so I want to clean up the other places in a separate PR, e.g. "appropriate range condition" has a similar problem.

c21 · 2020-10-21T19:58:18Z

sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingJoinSuite.scala

+      // right: (1, 10), (2, 5)
+      assertNumStateRows(total = 4, updated = 2),
+      AddData(rightInput, (1, 11)),
+      // No match as left time is too low and left row is already matched.


@HeartSaVioR - sounds good, updated.

c21 · 2020-10-21T20:08:08Z

sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingJoinSuite.scala

+      // states
+      // left: (2, 2L), (4, 4L)
+      //       (left rows with value % 2 != 0 is filtered per [[PushDownLeftSemiAntiJoin]])
+      // right: (2, 2L), (4, 4L)


@HeartSaVioR - updated. I also figured out that the optimization rule should be PushPredicateThroughJoin instead of PushDownLeftSemiAntiJoin, and updated the comment as well.

c21 · 2020-10-21T20:18:09Z

...ain/scala/org/apache/spark/sql/execution/streaming/state/SymmetricHashJoinStateManager.scala

    val numValues = keyToNumValues.get(key)
-    keyWithIndexToValue.getAll(key, numValues).map { keyIdxToValue =>
+    keyWithIndexToValue.getAll(key, numValues).filterNot { keyIdxToValue =>


FYI I created https://issues.apache.org/jira/browse/SPARK-33211 for this followup.

SparkQA · 2020-10-21T21:07:13Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34715/

SparkQA · 2020-10-21T21:29:02Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34715/

HeartSaVioR

LGTM, with only one nit that I just commented on.

SparkQA · 2020-10-21T23:23:50Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34719/

SparkQA · 2020-10-21T23:58:43Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34719/

SparkQA · 2020-10-22T00:50:10Z

Test build #130106 has finished for PR 30076 at commit 9cd222f.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-10-22T03:49:50Z

Test build #130110 has finished for PR 30076 at commit 14871d9.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HeartSaVioR · 2020-10-22T05:41:19Z

I'll leave the PR open for around 2 days to see whether others have additional comments. If there are no further comments, I'll merge this, probably this weekend. cc. @viirya @xuanyuanking

xuanyuanking · 2020-10-22T06:26:26Z

@HeartSaVioR Agreed, posting my LGTM.

viirya · 2020-10-22T16:17:44Z

Will go through this again today.

c21 · 2020-10-23T17:09:54Z

@viirya - wondering if you have any more comments? Thanks.

c21 · 2020-10-25T20:32:07Z

@viirya - gentle ping again, any more comments before merging? Thanks.

viirya

I didn't look at the tests in detail, but the left-semi streaming join part looks okay.

HeartSaVioR · 2020-10-25T23:05:52Z

retest this, please

SparkQA · 2020-10-25T23:53:44Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34846/

SparkQA · 2020-10-26T00:19:30Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34846/

c21 · 2020-10-26T00:29:49Z

retest this please

SparkQA · 2020-10-26T01:16:57Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34848/

SparkQA · 2020-10-26T01:45:35Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34848/

SparkQA · 2020-10-26T03:50:38Z

Test build #130246 has finished for PR 30076 at commit 14871d9.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HeartSaVioR · 2020-10-26T04:32:18Z

Thanks, merging to master.

HeartSaVioR · 2020-10-26T04:34:02Z

Thanks for your contribution! Merged into master.

SparkQA · 2020-10-26T05:16:19Z

Test build #130248 has finished for PR 30076 at commit 14871d9.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

c21 · 2020-10-26T05:38:43Z

Thank you @HeartSaVioR , @xuanyuanking and @viirya for review!

…rtedOperationsSuite

### What changes were proposed in this pull request?

This PR is a followup from #30076 to refactor unit test of stream-stream join in `UnsupportedOperationsSuite`, where we had a lot of duplicated code for stream-stream join unit test, for each join type.

### Why are the changes needed?

Help reduce duplicated code and make it easier for developers to read and add code in the future.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing unit test in `UnsupportedOperationsSuite.scala` (pure refactoring).

Closes #30347 from c21/stream-test.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>

left semi stream-stream join

e5af8e1

c21 changed the title ~~[SPARK-32862][SQL] Left semi stream-stream join~~ [SPARK-32862][SS] Left semi stream-stream join Oct 17, 2020

Fix unit test

ee16690

HeartSaVioR reviewed Oct 19, 2020

Address all comments

3918727

c21 commented Oct 19, 2020

viirya reviewed Oct 19, 2020

HeartSaVioR reviewed Oct 20, 2020

Address all comments

765a233

c21 commented Oct 21, 2020

HeartSaVioR reviewed Oct 21, 2020

sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingJoinSuite.scala

c21 commented Oct 21, 2020

HeartSaVioR approved these changes Oct 21, 2020

Update comment

14871d9

viirya reviewed Oct 25, 2020

HeartSaVioR closed this in d87a0bb Oct 26, 2020

c21 deleted the stream-join branch October 26, 2020 07:27

c21 mentioned this pull request Nov 12, 2020

[SPARK-33209][SS] Refactor unit test of stream-stream join in UnsupportedOperationsSuite #30347

Closed
