[SPARK-32854][SS] Minor code and doc improvement for stream-stream join by c21 · Pull Request #29724 · apache/spark

c21 · 2020-09-11T05:41:20Z

What changes were proposed in this pull request?

Several minor code and documentation improvements for stream-stream join. Specifically:

Stop extending SparkPlan directly, as extending BinaryExecNode is enough.
Return left/right.outputPartitioning for Left/RightOuter in outputPartitioning, as the PartitioningCollection wrapper is unnecessary (similar to the batch joins ShuffledHashJoinExec and SortMergeJoinExec).
Avoid the per-row check for join type (https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingSymmetricHashJoinExec.scala#L486-L492) by creating the method before the loop that reads rows (generateFilteredJoinedRow in storeAndJoinWithOtherSide). A similar optimization (i.e. creating an auxiliary method/variable per join type before iterating over the input rows) has already been done for batch joins (SortMergeJoinExec, ShuffledHashJoinExec).
Minor comment/indentation fixes for better readability.
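
For readers unfamiliar with the outputPartitioning change, here is a minimal, self-contained sketch of the idea. It uses simplified stand-in types rather than Spark's real Partitioning classes, every name inside OutputPartitioningSketch is hypothetical, and the Inner branch (wrapping both sides in a PartitioningCollection) is an assumption about the surrounding code rather than something stated in this PR.

```scala
// Illustrative only: simplified stand-ins, NOT Spark's real classes.
object OutputPartitioningSketch {
  sealed trait JoinType
  case object Inner extends JoinType
  case object LeftOuter extends JoinType
  case object RightOuter extends JoinType
  case object FullOuter extends JoinType

  sealed trait Partitioning
  final case class HashPartitioning(keys: Seq[String], numPartitions: Int)
    extends Partitioning
  final case class PartitioningCollection(partitionings: Seq[Partitioning])
    extends Partitioning

  def outputPartitioning(
      joinType: JoinType,
      left: Partitioning,
      right: Partitioning): Partitioning = joinType match {
    // Assumption: an inner join can report the partitioning of both sides.
    case Inner => PartitioningCollection(Seq(left, right))
    // Outer joins return the preserved side's partitioning directly,
    // without wrapping a single element in a PartitioningCollection.
    case LeftOuter => left
    case RightOuter => right
    case other =>
      throw new IllegalArgumentException(
        s"OutputPartitioningSketch should not take $other as the JoinType")
  }
}
```

With this shape, a left outer join simply reports the left child's partitioning, which mirrors what the description says about the batch join operators.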

Why are the changes needed?

Minor optimization to avoid unnecessary per-row work (the compiler can probably optimize this away, but we can do a better job by avoiding it in the first place). The other comment/indentation fixes improve code readability for future developers.
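
To make the per-row point concrete, the sketch below contrasts checking the join type for every filtered-out row with choosing the handler once before iterating. It is a simplified illustration, not the code in StreamingSymmetricHashJoinExec: Row, JoinedRow, preJoinFilter, and otherSide are hypothetical stand-ins, and only the name generateFilteredJoinedRow is borrowed from the description above.

```scala
// Illustrative only: simplified stand-ins, NOT Spark's real classes.
object HoistedJoinTypeCheckSketch {
  sealed trait JoinType
  case object Inner extends JoinType
  case object LeftOuter extends JoinType

  final case class Row(key: Int, value: String)
  final case class JoinedRow(left: Row, right: Option[Row])

  // Rows that pass the filter are joined against the other side's rows.
  private def matches(row: Row, otherSide: Map[Int, Seq[Row]]): Iterator[JoinedRow] =
    otherSide.getOrElse(row.key, Seq.empty).iterator.map(r => JoinedRow(row, Some(r)))

  // Before: the join type is inspected again for every filtered-out row.
  def joinWithPerRowCheck(
      joinType: JoinType,
      input: Iterator[Row],
      preJoinFilter: Row => Boolean,
      otherSide: Map[Int, Seq[Row]]): Iterator[JoinedRow] = {
    def filtered(row: Row): Iterator[JoinedRow] = joinType match {
      case LeftOuter => Iterator.single(JoinedRow(row, None))
      case _ => Iterator.empty
    }
    input.flatMap(row => if (preJoinFilter(row)) matches(row, otherSide) else filtered(row))
  }

  // After: the handler is selected once, before the loop over input rows.
  def joinWithHoistedCheck(
      joinType: JoinType,
      input: Iterator[Row],
      preJoinFilter: Row => Boolean,
      otherSide: Map[Int, Seq[Row]]): Iterator[JoinedRow] = {
    val generateFilteredJoinedRow: Row => Iterator[JoinedRow] = joinType match {
      case LeftOuter => (row: Row) => Iterator.single(JoinedRow(row, None))
      case _ => (_: Row) => Iterator.empty
    }
    input.flatMap { row =>
      if (preJoinFilter(row)) matches(row, otherSide) else generateFilteredJoinedRow(row)
    }
  }
}
```

Both functions produce the same rows; the second only moves the pattern match out of the per-row path, which is the kind of change this PR describes.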

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing tests in StreamingJoinSuite.scala as no new logic is introduced.

c21 · 2020-09-11T05:41:48Z

cc @cloud-fan and @sameeragarwal if you have time to take a look, thanks.

SparkQA · 2020-09-11T07:05:02Z

Test build #128552 has finished for PR 29724 at commit 069ad73.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

c21 · 2020-09-11T07:06:08Z

retest this please

cloud-fan · 2020-09-11T07:51:24Z

...src/main/scala/org/apache/spark/sql/execution/streaming/StreamingSymmetricHashJoinExec.scala

 * If a timestamp column with event time watermark is present in the join keys or in the input
- * data, then the it uses the watermark figure out which rows in the buffer will not join with
- * and the new data, and therefore can be discarded. Depending on the provided query conditions, we
+ * data, then it uses the watermark figure out which rows in the buffer will not join with


uses the watermark to figure out

@cloud-fan - updated.

cloud-fan · 2020-09-11T07:55:30Z

...src/main/scala/org/apache/spark/sql/execution/streaming/StreamingSymmetricHashJoinExec.scala


    //  Join one side input using the other side's buffered/state rows. Here is how it is done.
    //
-    //  - `leftJoiner.joinWith(rightJoiner)` generates all rows from matching new left input with


I'm not familiar with this part, cc @zsxwing @HeartSaVioR @xuanyuanking

The comment seems to be modified only to replace leftJoiner.joinWith(rightJoiner) with leftSideJoiner.storeAndJoinWithOtherSide(rightSideJoiner), and vice versa for the right side. Other parts aren't modified.

@cloud-fan , @HeartSaVioR - yes, this just updates the comment, because there is no leftJoiner/rightJoiner/joinWith in the file, and the original author (#19271) presumably meant to refer to leftSideJoiner/rightSideJoiner/storeAndJoinWithOtherSide. I think it makes sense to keep the code and the comment consistent here. In any case, this is a minor, comment-only change.

I think the original PR just wanted to use pseudocode to explain; either way is OK with me.

SparkQA · 2020-09-11T12:30:27Z

Test build #128556 has finished for PR 29724 at commit 069ad73.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

c21

Thanks @cloud-fan and @HeartSaVioR , the PR is updated and ready for review again, thanks.

c21 · 2020-09-11T18:06:05Z

...src/main/scala/org/apache/spark/sql/execution/streaming/StreamingSymmetricHashJoinExec.scala

 * If a timestamp column with event time watermark is present in the join keys or in the input
- * data, then the it uses the watermark figure out which rows in the buffer will not join with
- * and the new data, and therefore can be discarded. Depending on the provided query conditions, we
+ * data, then it uses the watermark figure out which rows in the buffer will not join with


@cloud-fan - updated.

c21 · 2020-09-11T18:10:11Z

...src/main/scala/org/apache/spark/sql/execution/streaming/StreamingSymmetricHashJoinExec.scala


    //  Join one side input using the other side's buffered/state rows. Here is how it is done.
    //
-    //  - `leftJoiner.joinWith(rightJoiner)` generates all rows from matching new left input with


@cloud-fan , @HeartSaVioR - yes, this just updates the comment, because there is no leftJoiner/rightJoiner/joinWith in the file, and the original author (#19271) presumably meant to refer to leftSideJoiner/rightSideJoiner/storeAndJoinWithOtherSide. I think it makes sense to keep the code and the comment consistent here. In any case, this is a minor, comment-only change.

xuanyuanking · 2020-09-11T18:49:48Z

...src/main/scala/org/apache/spark/sql/execution/streaming/StreamingSymmetricHashJoinExec.scala


    //  Join one side input using the other side's buffered/state rows. Here is how it is done.
    //
-    //  - `leftJoiner.joinWith(rightJoiner)` generates all rows from matching new left input with


I think the original PR just wanted to use pseudocode to explain; either way is OK with me.

xuanyuanking · 2020-09-11T18:53:56Z

...src/main/scala/org/apache/spark/sql/execution/streaming/StreamingSymmetricHashJoinExec.scala

-        s"${getClass.getSimpleName} should not take $x as the JoinType")
+    case LeftOuter => left.outputPartitioning
+    case RightOuter => right.outputPartitioning
+    case _ => throwBadJoinTypeException()


Nice catch,
nit: let's remove the duplicate error string in https://github.com/apache/spark/pull/29724/files#diff-e9db271d8593f070ba7096e758c8b89dR168 and https://github.com/apache/spark/pull/29724/files#diff-e9db271d8593f070ba7096e758c8b89dR162

@xuanyuanking - sorry, I don't follow how to change this. Are you suggesting a string val for the error message, shared by throwBadJoinTypeException and require(...): val errorMessageForJoinType = s"${getClass.getSimpleName} should not take $joinType as the JoinType", or something else?

Yes, have a string val for the same error message.

@xuanyuanking - sure, updated.
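
For illustration, the suggestion in this thread (one string val reused by both require(...) and throwBadJoinTypeException) could look roughly like the sketch below. StreamingJoinSketch, operatorName, and preservedSide are hypothetical; the PR itself builds the message from getClass.getSimpleName, which is replaced here by a fixed name so the example is self-contained.

```scala
// Illustrative only: a hypothetical operator, NOT Spark's real class.
object SharedErrorMessageSketch {
  sealed trait JoinType
  case object Inner extends JoinType
  case object LeftOuter extends JoinType
  case object RightOuter extends JoinType
  case object FullOuter extends JoinType

  final class StreamingJoinSketch(val joinType: JoinType) {
    // Hypothetical stand-in for getClass.getSimpleName used in the PR.
    private val operatorName = "StreamingJoinSketch"

    // The single error message shared by every join-type check below.
    private val errorMessageForJoinType =
      s"$operatorName should not take $joinType as the JoinType"

    // Constructor-time validation reuses the shared message.
    require(
      joinType == Inner || joinType == LeftOuter || joinType == RightOuter,
      errorMessageForJoinType)

    // Fallback for match expressions reuses the same message.
    private def throwBadJoinTypeException(): Nothing =
      throw new IllegalArgumentException(errorMessageForJoinType)

    // Hypothetical method, present only to show the fallback branch in use.
    def preservedSide: String = joinType match {
      case Inner => "both"
      case LeftOuter => "left"
      case RightOuter => "right"
      case _ => throwBadJoinTypeException()
    }
  }
}
```

Keeping the message in one val means the constructor check and the fallback branch cannot drift apart in wording.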

SparkQA · 2020-09-11T23:11:52Z

Test build #128578 has finished for PR 29724 at commit cfd9cc0.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-09-12T00:11:30Z

Test build #128580 has finished for PR 29724 at commit ed800ee.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

c21 · 2020-09-14T03:10:28Z

Addressed all comments, and the tests have passed. Let me know if anything needs to be changed. Thanks, @cloud-fan.

HeartSaVioR

LGTM

cloud-fan · 2020-09-14T08:49:50Z

thanks, merging to master!

c21 · 2020-09-14T16:57:39Z

Thank you @cloud-fan , @HeartSaVioR and @xuanyuanking for review!

Minor code and doc improvement for stream-stream join

069ad73

probot-autolabeler bot added SQL STRUCTURED STREAMING labels Sep 11, 2020

cloud-fan reviewed Sep 11, 2020

Address comment

cfd9cc0

c21 commented Sep 11, 2020

xuanyuanking reviewed Sep 11, 2020

Address comment for commen error message

ed800ee

HeartSaVioR approved these changes Sep 14, 2020

cloud-fan closed this in 978f531 Sep 14, 2020

c21 deleted the streaming branch September 14, 2020 17:13


5 participants

cloud-fan Sep 11, 2020 •

edited

Loading