Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-15590: [C++] Add support for joins to the Substrait consumer #13078

Merged
merged 20 commits into from
Jun 1, 2022

Conversation

vibhatha
Copy link
Collaborator

@vibhatha vibhatha commented May 6, 2022

Initial Version of Substrait Join Support

This PR doesn't support the complete join functionality, but it include the following features.
This will be a followed by a set of PRs to solve the remaining features [1].

Features included

  • Only Support Inner Join (A follow up PR would include the support for other join types)
  • Support Join operations with a single call-expression of types "equal" and "is_not_distinct_from"
  • Test cases to check the basic functionality and limitations

Todo:

  • Fix the Windows CI Issue

[1]. https://issues.apache.org/jira/browse/ARROW-16485

@github-actions
Copy link

github-actions bot commented May 6, 2022

@github-actions
Copy link

github-actions bot commented May 6, 2022

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

@vibhatha
Copy link
Collaborator Author

vibhatha commented May 6, 2022

@westonpace adding an initial version of the PR. Please take a look.

@westonpace westonpace self-requested a review May 10, 2022 14:54
@vibhatha
Copy link
Collaborator Author

@westonpace not sure why appveyor CI is failing...

Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a good start for join support. I wonder how we might start building up some end-to-end tests that use Substrait. Any ideas?

cpp/src/arrow/engine/substrait/serde_test.cc Show resolved Hide resolved
cpp/src/arrow/compute/exec/options.h Outdated Show resolved Hide resolved
cpp/src/arrow/engine/substrait/relation_internal.cc Outdated Show resolved Hide resolved
cpp/src/arrow/engine/substrait/relation_internal.cc Outdated Show resolved Hide resolved
@vibhatha
Copy link
Collaborator Author

This looks like a good start for join support. I wonder how we might start building up some end-to-end tests that use Substrait. Any ideas?

I had this in my mind. The neatest/fastest way is to use Substrait-Python integration. Or we need to write a data module in the C++ test suite to provide end-to-end testing. I used existing datasets from the Parquet test data submodule.
But Python may not have everything exposed, may be we can write one here.

I see most of the test cases in Substrait features are not end-to-end tests. May be we can start from now. WDYT?

Comment on lines 248 to 256
return Status::NotImplemented("Outer join type is not supported");
case 3:
return Status::NotImplemented("Left join type is not supported");
case 4:
return Status::NotImplemented("Right join type is not supported");
case 5:
return Status::NotImplemented("Semi join type is not supported");
case 6:
return Status::NotImplemented("Anti join type is not supported");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is stopping us from supporting these joins? Is it just a matter of mapping to the appropriate Arrow equivalent enum? Or a matter of testing? Is there a follow-up PR?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mainly, I am not 100% clear on mapping the definitions Substrait-Arrow. I thought about doing it in the follow up PR. Not any special reason stopping doing it for this PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Substrait Arrow
JOIN_TYPE_INNER INNER
JOIN_TYPE_OUTER FULL_OUTER
JOIN_TYPE_LEFT LEFT_OUTER
JOIN_TYPE_RIGHT RIGHT_OUTER
JOIN_TYPE_SEMI LEFT_SEMI
JOIN_TYPE_ANTI LEFT_ANTI
JOIN_TYPE_SINGLE -
- RIGHT_SEMI
- RIGHT_ANTI

JOIN_TYPE_SINGLE is interesting and I'd almost interpret that as an optional bool that could be added to any existing join. I'll create an issue to seek clarification. Let's add support for the rest in this PR since it should be straightforward enough.

Substrait doesn't have RIGHT_SEMI or RIGHT_ANTI and both of these can easily be created by just swapping the inputs so I'm not sure it is critical right now. Although it might help (for performance reasons) to keep the probe/build side separate from the side we want to keep the output for so they may come in handy in the future. I'm honestly not entirely sure. Either way, let's not worry about those for this PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

substrait-io/substrait#206 for the JOIN_TYPE_SINGLE question but let's not block this PR waiting for that to resolve.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@westonpace thank you for the clarification. I will update the PR.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@westonpace updated the PR.

cpp/src/arrow/engine/substrait/relation_internal.cc Outdated Show resolved Hide resolved
cpp/src/arrow/engine/substrait/relation_internal.cc Outdated Show resolved Hide resolved
cpp/src/arrow/engine/substrait/relation_internal.cc Outdated Show resolved Hide resolved
"Only Support `equal` or `is_not_distinct_from` for join key comparison");
}

// TODO: Add Suffix support for Substrait
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Create a follow-up JIRA or link to a Substrait issue.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add the JIRA number in the comment?

cpp/src/arrow/engine/substrait/relation_internal.cc Outdated Show resolved Hide resolved
cpp/src/arrow/engine/substrait/relation_internal.cc Outdated Show resolved Hide resolved
cpp/src/arrow/engine/substrait/serde_test.cc Show resolved Hide resolved
@vibhatha vibhatha requested a review from westonpace May 20, 2022 15:17
@vibhatha
Copy link
Collaborator Author

@westonpace These two JIRAs [1] [2] will be helpful in finalizing the overall tasks associated with Substrait-Joins

[1] https://issues.apache.org/jira/browse/ARROW-16485
[2] https://issues.apache.org/jira/browse/ARROW-16624

Any other things that we should take care in this PR?

Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few minor nits

cpp/src/arrow/engine/substrait/relation_internal.cc Outdated Show resolved Hide resolved
cpp/src/arrow/engine/substrait/relation_internal.cc Outdated Show resolved Hide resolved
cpp/src/arrow/engine/substrait/relation_internal.cc Outdated Show resolved Hide resolved
cpp/src/arrow/engine/substrait/relation_internal.cc Outdated Show resolved Hide resolved
cpp/src/arrow/engine/substrait/relation_internal.cc Outdated Show resolved Hide resolved
cpp/src/arrow/engine/substrait/relation_internal.cc Outdated Show resolved Hide resolved
cpp/src/arrow/engine/substrait/relation_internal.cc Outdated Show resolved Hide resolved
"Only Support `equal` or `is_not_distinct_from` for join key comparison");
}

// TODO: Add Suffix support for Substrait
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add the JIRA number in the comment?

vibhatha and others added 3 commits May 30, 2022 12:54
adding join type enum

Co-authored-by: Weston Pace <weston.pace@gmail.com>
Co-authored-by: Weston Pace <weston.pace@gmail.com>
Co-authored-by: Weston Pace <weston.pace@gmail.com>
vibhatha and others added 5 commits May 30, 2022 12:55
Co-authored-by: Weston Pace <weston.pace@gmail.com>
Co-authored-by: Weston Pace <weston.pace@gmail.com>
Co-authored-by: Weston Pace <weston.pace@gmail.com>
Co-authored-by: Weston Pace <weston.pace@gmail.com>
@vibhatha vibhatha requested a review from westonpace May 30, 2022 07:43
@vibhatha
Copy link
Collaborator Author

Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The CI failures appear to be unrelated.

@westonpace westonpace merged commit 2eb2573 into apache:master Jun 1, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants