[SPARK-32862][SS] Left semi stream-stream join #30076

Closed
wants to merge 6 commits into from

@c21 c21 commented Oct 17, 2020

What changes were proposed in this pull request?

This PR adds support for left semi join in stream-stream joins. The implementation (mostly in StreamingSymmetricHashJoinExec and SymmetricHashJoinStateManager) works as follows:

  • For each left side input row, check whether there is a match in the right side state store.
    • If there is a match, output the left side row but do not put it in the left side state store (it is not needed there).
    • If there is no match, output nothing, but put the row in the left side state store with its "matched" field set to false.
  • For each right side input row, check whether there is a match in the left side state store.
    • Among the matched left rows in the state store, output only those whose "matched" field is false, then set the "matched" field of all matched left rows to true. Outputting a left side row only the first time it is matched guarantees left semi join semantics.
  • State store eviction: evict rows below the watermark from the left and right side state stores, same as inner join.
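
The steps above can be sketched as a toy in-memory model. This is a minimal sketch, not Spark's implementation: `Event`, `LeftSemiJoinSketch`, `onLeft`, `onRight` and `evict` are hypothetical names, matching is on key equality only, and the sketch assumes that right side rows are always buffered in the right side state.

```scala
// Toy in-memory model of the left semi join steps; all names are hypothetical.
final case class Event(key: Int, value: String, eventTime: Long)

final class LeftSemiJoinSketch {
  // A buffered left row plus the "matched" flag kept alongside it in state.
  private final class LeftEntry(val row: Event, var matched: Boolean)

  private var leftState = Vector.empty[LeftEntry]
  private var rightState = Vector.empty[Event]

  // Left input: output on a match (without buffering), else buffer as unmatched.
  def onLeft(row: Event): Seq[Event] =
    if (rightState.exists(_.key == row.key)) {
      Seq(row)
    } else {
      leftState = leftState :+ new LeftEntry(row, matched = false)
      Seq.empty
    }

  // Right input: buffer the row (assumption), then output only the left rows
  // that match for the first time and flag them as matched.
  def onRight(row: Event): Seq[Event] = {
    rightState = rightState :+ row
    val firstTime = leftState.filter(e => e.row.key == row.key && !e.matched)
    firstTime.foreach(e => e.matched = true)
    firstTime.map(_.row)
  }

  // Eviction: drop rows on both sides whose event time is below the watermark.
  def evict(watermark: Long): Unit = {
    leftState = leftState.filter(_.row.eventTime >= watermark)
    rightState = rightState.filter(_.eventTime >= watermark)
  }

  // Total number of rows currently held in both state stores.
  def numStateRows: Int = leftState.size + rightState.size
}
```

With one buffered left row for key 1, the first right row with key 1 makes `onRight` return that left row. A second right row with key 1 returns nothing, because the left row is already flagged as matched.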

Note: a possible followup optimization is to evict matched left side rows from the state store earlier, even while they are still above the watermark. This needs more changes in SymmetricHashJoinStateManager, so it is left as a followup.

Why are the changes needed?

Current stream-stream join supports inner, left outer and right outer join (https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingSymmetricHashJoinExec.scala#L166 ). Internally we see a lot of users running left semi stream-stream joins (not on Spark Structured Streaming), e.g. I want to get the ad impressions (join left side) that have a click (join right side), but I don't care how many clicks each ad has (left semi semantics).
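
To make the ad impression example concrete, here is a small sketch of the difference between inner and left semi semantics, using plain Scala collections rather than Spark. The `Impression` and `Click` types and the sample data are hypothetical. In the DataFrame API the corresponding join type string is "left_semi", as used in the tests of this PR.

```scala
object SemiJoinSemantics {
  // Hypothetical batch data, used only to illustrate the join semantics.
  final case class Impression(adId: Int, impressionTime: Long)
  final case class Click(adId: Int, clickTime: Long)

  val impressions: Seq[Impression] =
    Seq(Impression(1, 100L), Impression(2, 101L), Impression(3, 102L))
  val clicks: Seq[Click] =
    Seq(Click(1, 110L), Click(1, 120L), Click(3, 130L))

  // Inner join: one output row per matching (impression, click) pair,
  // so an ad with two clicks appears twice.
  val inner: Seq[(Impression, Click)] = for {
    i <- impressions
    c <- clicks if c.adId == i.adId
  } yield (i, c)

  // Left semi join: each impression appears at most once, with left side
  // columns only, no matter how many clicks it has.
  private val clickedAdIds: Set[Int] = clicks.map(_.adId).toSet
  val leftSemi: Seq[Impression] =
    impressions.filter(i => clickedAdIds.contains(i.adId))
}
```

Here ad 1 has two clicks, so the inner join contains it twice while the left semi result contains it once. Ad 2 has no click and appears in neither result.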

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Added unit tests in UnsupportedOperationChecker.scala and StreamingJoinSuite.scala.

c21 commented Oct 17, 2020

cc @cloud-fan and @sameeragarwal if you guys have time to take a look, thanks.

@c21 c21 changed the title [SPARK-32862][SQL] Left semi stream-stream join [SPARK-32862][SS] Left semi stream-stream join Oct 17, 2020

viirya commented Oct 17, 2020

cc @HeartSaVioR

SparkQA commented Oct 17, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34536/

SparkQA commented Oct 17, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34536/

SparkQA commented Oct 17, 2020

Test build #129931 has finished for PR 30076 at commit e5af8e1.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Oct 17, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34539/

SparkQA commented Oct 17, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34539/

SparkQA commented Oct 17, 2020

Test build #129934 has finished for PR 30076 at commit ee16690.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gaborgsomogyi
Contributor

I've just picked this up, and from a high-level perspective I see at least two things in the PR:

  • Test code deduplication (which is good to do)
  • Left semi join itself

I suggest splitting them up by creating a JIRA for the test code deduplication.
That would make the PR more concise and easier to review.

@HeartSaVioR HeartSaVioR left a comment

I just went through the code, except for the test suite. It seems OK to me, and I'll look into the test suite in a couple of days.

Btw, I have a feeling that left semi join is different enough from the other joins that it might be worth taking a different path instead of adding exceptions for left semi.

e.g. I guess you'd like to simply remove the left side row from state, instead of marking it as matched, whenever it gets matched with a right side row. (This seems to be what you call the "follow-up".)

I'm OK with doing it later, assuming it doesn't end up touching the public API.

@c21 c21 left a comment

Thanks @HeartSaVioR for review, addressed all comments for now.

SparkQA commented Oct 19, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34617/

SparkQA commented Oct 19, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34617/

Comment on lines 113 to 115
keyWithIndexToValue.getAll(key, numValues).filterNot { keyIdxToValue =>
joinOnlyFirstTimeMatchedRow && keyIdxToValue.matched
}.map { keyIdxToValue =>
Member

I feel it is easier to read if:

val keyIdxToValues = if (joinOnlyFirstTimeMatchedRow) {
  keyWithIndexToValue.getAll(key, numValues).filter { keyIdxToValue =>
    !keyIdxToValue.matched
  }
} else {
  keyWithIndexToValue.getAll(key, numValues) 
}

keyIdxToValues.map { keyIdxToValue =>
  ...
}

Contributor

I think the current code is more concise and less distracting. If someone finds filterNot confusing, we can simply use filter and reverse the condition.
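
For readers comparing the variants discussed in this thread, the sketch below shows that they select the same rows. It is a simplified illustration: `KeyIdxToValue` here is a hypothetical stand-in with only a value and a matched flag, and it works on a `Seq` rather than on whatever `keyWithIndexToValue.getAll` actually returns.

```scala
object FilterEquivalence {
  // Hypothetical stand-in for a state store entry carrying a "matched" flag.
  final case class KeyIdxToValue(value: String, matched: Boolean)

  // Form used in the PR: drop already-matched rows only when the flag is set.
  def withFilterNot(
      rows: Seq[KeyIdxToValue],
      joinOnlyFirstTimeMatchedRow: Boolean): Seq[KeyIdxToValue] =
    rows.filterNot(r => joinOnlyFirstTimeMatchedRow && r.matched)

  // Form suggested in the review: branch on the flag first, then filter.
  def withIfElse(
      rows: Seq[KeyIdxToValue],
      joinOnlyFirstTimeMatchedRow: Boolean): Seq[KeyIdxToValue] =
    if (joinOnlyFirstTimeMatchedRow) rows.filter(r => !r.matched) else rows

  // Alternative from the reply: filter with the reversed condition.
  def withFilter(
      rows: Seq[KeyIdxToValue],
      joinOnlyFirstTimeMatchedRow: Boolean): Seq[KeyIdxToValue] =
    rows.filter(r => !joinOnlyFirstTimeMatchedRow || !r.matched)
}
```

Since all three forms select the same rows, the choice between them is purely about readability.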

SparkQA commented Oct 19, 2020

Test build #130010 has finished for PR 30076 at commit 3918727.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

}

after {
StateStore.stop()
}

protected def setupStream(prefix: String, multiplier: Int): (MemoryStream[Int], DataFrame) = {
Contributor

It would be helpful to provide guide comments for tracking refactors.
e.g. this is equivalent to StreamingOuterJoinSuite.setupStream, with the signature changed from private to protected so it can be shared.

val windowed1 = df1.select('key, window('leftTime, "10 second"), 'leftValue)
val windowed2 = df2.select('key, window('rightTime, "10 second"), 'rightValue)
val joined = windowed1.join(windowed2, Seq("key", "window"), joinType)
val select = if (joinType == "left_semi") {
Contributor

For reviewers: this is equivalent to StreamingOuterJoinSuite.setupWindowedJoin, with two changes:

  1. the signature changes from private to protected
  2. the select is conditional on left_semi vs. other join types, because left_semi exposes only the left side columns

(input1, input2, select)
}

protected def setupWindowedJoinWithLeftCondition(joinType: String)
Contributor

For reviewers: this is extracted from test("left outer early state exclusion on left") / test("right outer early state exclusion on left"), adding a select per join type.

(leftInput, rightInput, select)
}

protected def setupWindowedJoinWithRightCondition(joinType: String)
Contributor

For reviewers: this is extracted from test("left outer early state exclusion on right") / test("right outer early state exclusion on right"), adding a select per join type.

(leftInput, rightInput, select)
}

protected def setupWindowedJoinWithRangeCondition(joinType: String)
Contributor

For reviewers: this is extracted from test(s"${joinType.replaceAllLiterally("_", " ")} with watermark range condition"), with a conditional select on left_semi vs. other join types, because left_semi exposes only the left side columns.

@c21 c21 left a comment

Addressed all comments and the PR is ready for review again, cc @HeartSaVioR , thanks.

SparkQA commented Oct 21, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34681/

SparkQA commented Oct 21, 2020

Test build #130072 has finished for PR 30076 at commit 765a233.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Oct 21, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34681/

c21 commented Oct 21, 2020

retest this please

SparkQA commented Oct 21, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34691/

SparkQA commented Oct 21, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34691/

@c21 c21 left a comment

Addressed all comments.

if (right.isStreaming) {
- throwError("Left semi/anti joins with a streaming DataFrame/Dataset " +
+ throwError("Left anti joins with a streaming DataFrame/Dataset " +
"on the right are not supported")
}

// We support streaming left outer joins with static on the right always, and with
Contributor Author

@xuanyuanking - updated.

streamRelation.join(streamRelation, joinType = LeftSemi,
condition = Some(attribute === attribute)),
OutputMode.Append(),
Seq("watermark in the join keys"))
Contributor Author

@xuanyuanking - yeah, I agree that adding "without" would be better. I updated it for the left semi join here. A refactoring for all joins (inner, outer, semi, anti, etc.) is needed anyway as a followup JIRA (https://issues.apache.org/jira/browse/SPARK-33209), so I want to clean up the other places in a separate PR; e.g. "appropriate range condition" has a similar problem.

// right: (1, 10), (2, 5)
assertNumStateRows(total = 4, updated = 2),
AddData(rightInput, (1, 11)),
// No match as left time is too low and left row is already matched.
Contributor Author

@HeartSaVioR - sounds good, updated.

// states
// left: (2, 2L), (4, 4L)
// (left rows with value % 2 != 0 is filtered per [[PushDownLeftSemiAntiJoin]])
// right: (2, 2L), (4, 4L)
Contributor Author

@HeartSaVioR - updated. I also figured out that the optimization rule should be PushPredicateThroughJoin instead of PushDownLeftSemiAntiJoin, and updated the comment as well.

val numValues = keyToNumValues.get(key)
- keyWithIndexToValue.getAll(key, numValues).map { keyIdxToValue =>
+ keyWithIndexToValue.getAll(key, numValues).filterNot { keyIdxToValue =>
Contributor Author

FYI I created https://issues.apache.org/jira/browse/SPARK-33211 for this followup.

SparkQA commented Oct 21, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34715/

SparkQA commented Oct 21, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34715/

@HeartSaVioR HeartSaVioR left a comment

LGTM, with only one nit that I just commented on.

SparkQA commented Oct 21, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34719/

SparkQA commented Oct 21, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34719/

SparkQA commented Oct 22, 2020

Test build #130106 has finished for PR 30076 at commit 9cd222f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Oct 22, 2020

Test build #130110 has finished for PR 30076 at commit 14871d9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR
Contributor

I'll leave the PR open for around 2 days to see whether others have additional comments. If no further comments come in, I'll probably merge this over the weekend. cc. @viirya @xuanyuanking

@xuanyuanking
Member

@HeartSaVioR Agreed, posting my LGTM.

viirya commented Oct 22, 2020

Will go through this again today.

c21 commented Oct 23, 2020

@viirya - wondering whether you have any more comments? Thanks.

c21 commented Oct 25, 2020

@viirya - gentle ping again, any more comments before merging? Thanks.

@viirya viirya left a comment

I didn't look at the tests in detail, but the left-semi streaming join part looks okay.

@HeartSaVioR
Contributor

retest this, please

SparkQA commented Oct 25, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34846/

SparkQA commented Oct 26, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34846/

c21 commented Oct 26, 2020

retest this please

SparkQA commented Oct 26, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34848/

SparkQA commented Oct 26, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34848/

SparkQA commented Oct 26, 2020

Test build #130246 has finished for PR 30076 at commit 14871d9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR
Contributor

Thanks, merging to master.

@HeartSaVioR
Contributor

Thanks for your contribution! Merged into master.

SparkQA commented Oct 26, 2020

Test build #130248 has finished for PR 30076 at commit 14871d9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

c21 commented Oct 26, 2020

Thank you @HeartSaVioR , @xuanyuanking and @viirya for review!

@c21 c21 deleted the stream-join branch October 26, 2020 07:27
HeartSaVioR pushed a commit that referenced this pull request Nov 17, 2020
…rtedOperationsSuite

### What changes were proposed in this pull request?

This PR is a followup from #30076 to refactor unit test of stream-stream join in `UnsupportedOperationsSuite`, where we had a lot of duplicated code for stream-stream join unit test, for each join type.

### Why are the changes needed?

Help reduce duplicated code and make it easier for developers to read and add code in the future.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing unit test in `UnsupportedOperationsSuite.scala` (pure refactoring).

Closes #30347 from c21/stream-test.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>