Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BEAM-4076] Use beam join api in sql #11041

Merged
merged 12 commits into from Mar 23, 2020

Conversation

reuvenlax
Copy link
Contributor

Please add a meaningful description for your change here


Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Choose reviewer(s) and mention them in a comment (R: @username).
  • Format the pull request title like [BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replace BEAM-XXX with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.
  • Update CHANGES.md with noteworthy changes.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make review process smoother.

Post-Commit Tests Status (on master branch)

Lang SDK Apex Dataflow Flink Gearpump Samza Spark
Go Build Status --- --- Build Status --- --- Build Status
Java Build Status Build Status Build Status
Build Status
Build Status
Build Status
Build Status
Build Status Build Status Build Status
Build Status
Build Status
Python Build Status
Build Status
Build Status
Build Status
--- Build Status
Build Status
Build Status
Build Status
Build Status
--- --- Build Status
XLang --- --- --- Build Status --- --- Build Status

Pre-Commit Tests Status (on master branch)

--- Java Python Go Website
Non-portable Build Status Build Status
Build Status
Build Status Build Status
Portable --- Build Status --- ---

See .test-infra/jenkins/README for trigger phrase, status and link of all Jenkins jobs.

@reuvenlax
Copy link
Contributor Author

@kennknowles This PR still needs some comments and unit tests, but what do you think? It pulls most of the complexity of the join transforms into the generic schema Join transform. A lot of code is deleted. Could've deleted even more code, except it's being used by "lookup" joins.

@reuvenlax
Copy link
Contributor Author

@kennknowles @apilloud

A quick summary:
This switches SQL to use the schema Join API. This should be more efficient in some cases, but more importantly it greatly simplifies the join code in SQL.

As part of this, several bugs and gaps were discovered in the schema APIs. One was fixed in a previous PR, and the rest are in this PR.

This also adds "broadcast" join capability to the schema Join API. This replaces the side-input join functionality that previously was implemented in SQL.

Lookup join remains in SQL. I'm not sure it's worth (i.e. that it's a general-enough use case) trying to pull this into the core Beam transforms.

@apilloud
Copy link
Member

Run Java PreCommit

@apilloud
Copy link
Member

Run SQL Postcommit

Copy link
Member

@apilloud apilloud left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

This change is rather large, it is quite likely I missed some bugs. It would be nice to have some more tests. Also needs a JIRA before merging.

@reuvenlax reuvenlax changed the title Use beam join api in sql [BEAM-4076] Use beam join api in sql Mar 23, 2020
@reuvenlax
Copy link
Contributor Author

@apilloud Added a richer set of unit tests and a JIRA. Will merge once green.

@reuvenlax
Copy link
Contributor Author

Run SQL Postcommit

@reuvenlax reuvenlax merged commit b564239 into apache:master Mar 23, 2020
if (joinType == JoinRelType.LEFT) {
context.output(combineTwoRowsIntoOne(leftRow, rightNullRow, swap, schema));
}
private static FieldAccessDescriptor getJoinColumn(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This actually hardcodes a bad assumption a bit further than it was before: that the join is only on columns. We want to move in the other direction, and allow join conditions to be more general RexNodes, many of which still work for CoGBK and side input lookup joins. This is BEAM-6112.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I noticed this, because in the current codebase on master we could inline and delete SerializableRexNode entirely. So I went looking for how we encoding a full expression to be joined on. I didn't see rules that precomputed all of them (in which case input refs would suffice).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants