-
Notifications
You must be signed in to change notification settings - Fork 3.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-15590: [C++] Add support for joins to the Substrait consumer #13078
Conversation
|
@westonpace adding an initial version of the PR. Please take a look. |
@westonpace not sure why |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like a good start for join support. I wonder how we might start building up some end-to-end tests that use Substrait. Any ideas?
I had this in my mind. The neatest/fastest way is to use Substrait-Python integration. Or we need to write a data module in the C++ test suite to provide end-to-end testing. I used existing datasets from the Parquet test data submodule. I see most of the test cases in Substrait features are not end-to-end tests. May be we can start from now. WDYT? |
return Status::NotImplemented("Outer join type is not supported"); | ||
case 3: | ||
return Status::NotImplemented("Left join type is not supported"); | ||
case 4: | ||
return Status::NotImplemented("Right join type is not supported"); | ||
case 5: | ||
return Status::NotImplemented("Semi join type is not supported"); | ||
case 6: | ||
return Status::NotImplemented("Anti join type is not supported"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is stopping us from supporting these joins? Is it just a matter of mapping to the appropriate Arrow equivalent enum? Or a matter of testing? Is there a follow-up PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mainly, I am not 100% clear on mapping the definitions Substrait-Arrow. I thought about doing it in the follow up PR. Not any special reason stopping doing it for this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Substrait | Arrow |
---|---|
JOIN_TYPE_INNER | INNER |
JOIN_TYPE_OUTER | FULL_OUTER |
JOIN_TYPE_LEFT | LEFT_OUTER |
JOIN_TYPE_RIGHT | RIGHT_OUTER |
JOIN_TYPE_SEMI | LEFT_SEMI |
JOIN_TYPE_ANTI | LEFT_ANTI |
JOIN_TYPE_SINGLE | - |
- | RIGHT_SEMI |
- | RIGHT_ANTI |
JOIN_TYPE_SINGLE
is interesting and I'd almost interpret that as an optional bool that could be added to any existing join. I'll create an issue to seek clarification. Let's add support for the rest in this PR since it should be straightforward enough.
Substrait doesn't have RIGHT_SEMI
or RIGHT_ANTI
and both of these can easily be created by just swapping the inputs so I'm not sure it is critical right now. Although it might help (for performance reasons) to keep the probe/build side separate from the side we want to keep the output for so they may come in handy in the future. I'm honestly not entirely sure. Either way, let's not worry about those for this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
substrait-io/substrait#206 for the JOIN_TYPE_SINGLE
question but let's not block this PR waiting for that to resolve.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@westonpace thank you for the clarification. I will update the PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@westonpace updated the PR.
"Only Support `equal` or `is_not_distinct_from` for join key comparison"); | ||
} | ||
|
||
// TODO: Add Suffix support for Substrait |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Create a follow-up JIRA or link to a Substrait issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Follow up JIRA created: https://issues.apache.org/jira/browse/ARROW-16624
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add the JIRA number in the comment?
@westonpace These two JIRAs [1] [2] will be helpful in finalizing the overall tasks associated with Substrait-Joins [1] https://issues.apache.org/jira/browse/ARROW-16485 Any other things that we should take care in this PR? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a few minor nits
"Only Support `equal` or `is_not_distinct_from` for join key comparison"); | ||
} | ||
|
||
// TODO: Add Suffix support for Substrait |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add the JIRA number in the comment?
adding join type enum Co-authored-by: Weston Pace <weston.pace@gmail.com>
Co-authored-by: Weston Pace <weston.pace@gmail.com>
Co-authored-by: Weston Pace <weston.pace@gmail.com>
Co-authored-by: Weston Pace <weston.pace@gmail.com>
Co-authored-by: Weston Pace <weston.pace@gmail.com>
Co-authored-by: Weston Pace <weston.pace@gmail.com>
Co-authored-by: Weston Pace <weston.pace@gmail.com>
cc @westonpace not sure why this is occurring: https://github.com/apache/arrow/runs/6650523294?check_suite_focus=true#step:6:5132 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The CI failures appear to be unrelated.
Initial Version of Substrait Join Support
This PR doesn't support the complete join functionality, but it include the following features.
This will be a followed by a set of PRs to solve the remaining features [1].
Features included
Todo:
[1]. https://issues.apache.org/jira/browse/ARROW-16485