ARROW-10591: [Rust] Added support for filter of StructArray #8689
Closed
andygrove pushed a commit that referenced this pull request on Nov 21, 2020
This PR is based on top of #8630 and contains a physical node to perform an inner join in DataFusion. This is still a draft, but IMO the design is here and the two tests already pass.

This is co-authored with @andygrove, who contributed to the design of how to perform this operation in the context of DataFusion (see ARROW-9555 for details).

The API used for the computation of the join at the arrow level is briefly discussed in [this document](https://docs.google.com/document/d/1KKuBvfx7uKi-x-tWOL60R1FNjDW8B790zPAv6yAlYcU/edit).

There is still a lot to work on, but I thought it would be a good time to have a first round of discussions, and also to gauge timings with respect to the 3.0 release.

There are two main issues being addressed in this PR:

* How do we perform the join at the partition level: this PR collects all batches from the left, and then issues a stream per partition on the right. Each batch on that stream joins itself with all the ones from the left (N) via a hash. This allows us to compute the hash of each row only once (first all the left rows, then one by one on the right).
* How do we build an array from `N (left)` arrays and a set of indices (matching the hash from the right): this is done using the `MutableArrayData` being worked on in #8630, which incrementally memcopies slots from each of the N arrays based on the index. This implementation is useful because it works for all array types and does not require casting anything to Rust native types (i.e. it operates on `ArrayData`, not specific implementations).

There are still some steps left to have a join in SQL, most notably the whole logical planning, the output_partition logic, the bindings to the SQL and DataFrame APIs, updating the optimizers to handle nodes with 2 children, and a whole battery of tests.

There is also a natural path for the other joins, as it will be a matter of incorporating the work already in PR #8689, which introduces the option to extend the `MutableArrayData` with nulls, the operation required for left and right joins.

Closes #8709 from jorgecarleitao/join2
Authored-by: Jorge C. Leitao <jorgecarleitao@gmail.com>
Signed-off-by: Andy Grove <andygrove73@gmail.com>
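
The two steps described in the commit message above (hash the left side once, then probe with each right batch and gather rows by index) can be sketched roughly as follows. This is a minimal, standard-library-only illustration, not the code from the PR: the `Batch` alias and the `build_left`, `probe`, and `gather` functions are hypothetical names, a batch is reduced to a vector of `i64` join keys, and `gather` clones values one at a time where `MutableArrayData` would memcopy slots of `ArrayData`.

```rust
use std::collections::HashMap;

/// Hypothetical simplification: a batch is just its column of join keys.
type Batch = Vec<i64>;

/// Build side: visit every row of every left batch exactly once and
/// record where each key lives as (batch index, row index).
fn build_left(left: &[Batch]) -> HashMap<i64, Vec<(usize, usize)>> {
    let mut map: HashMap<i64, Vec<(usize, usize)>> = HashMap::new();
    for (batch_idx, batch) in left.iter().enumerate() {
        for (row_idx, key) in batch.iter().enumerate() {
            map.entry(*key).or_default().push((batch_idx, row_idx));
        }
    }
    map
}

/// Probe side: for one right batch, return every matching pair as
/// ((left batch index, left row index), right row index).
fn probe(
    map: &HashMap<i64, Vec<(usize, usize)>>,
    right: &Batch,
) -> Vec<((usize, usize), usize)> {
    let mut out = Vec::new();
    for (right_row, key) in right.iter().enumerate() {
        if let Some(matches) = map.get(key) {
            for &left_pos in matches {
                out.push((left_pos, right_row));
            }
        }
    }
    out
}

/// Gather: build one output column from N left columns and a list of
/// (batch index, row index) positions.
fn gather<T: Clone>(columns: &[Vec<T>], indices: &[(usize, usize)]) -> Vec<T> {
    indices.iter().map(|&(b, r)| columns[b][r].clone()).collect()
}
```

In this sketch the left map is built once and can be reused for every batch arriving on the right stream, which mirrors the "hash each row only once" property the commit message describes.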
github-actions bot added the needs-rebase label (A PR that needs to be rebased by the author) on Nov 25, 2020
GeorgeAp pushed a commit to sirensolutions/arrow that referenced this pull request on Jun 7, 2021
Based on #8630, this adds support to the filter operation for `StructArray`s.
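
As a rough illustration of what filtering a struct-typed column involves, the sketch below applies one boolean mask to the struct's own validity and to each child column, so all children stay aligned. It is an assumption-laden simplification using only the standard library: `StructColumns`, `filter_values`, and `filter_struct` are hypothetical names, children are restricted to `Option<i64>` values, and none of this is the arrow crate's API or the implementation in this PR.

```rust
/// Hypothetical stand-in for a struct array: one validity flag per struct
/// slot plus named child columns, all of the same length.
#[derive(Debug, Clone, PartialEq)]
struct StructColumns {
    validity: Vec<bool>,
    children: Vec<(String, Vec<Option<i64>>)>,
}

/// Keep only the values whose corresponding mask entry is `true`.
fn filter_values<T: Clone>(values: &[T], mask: &[bool]) -> Vec<T> {
    values
        .iter()
        .zip(mask)
        .filter(|(_, keep)| **keep)
        .map(|(v, _)| v.clone())
        .collect()
}

/// Filter a struct by applying the same mask to its validity and to every
/// child column, so row `i` of each child still describes the same struct.
fn filter_struct(input: &StructColumns, mask: &[bool]) -> StructColumns {
    assert_eq!(input.validity.len(), mask.len());
    StructColumns {
        validity: filter_values(&input.validity, mask),
        children: input
            .children
            .iter()
            .map(|(name, col)| (name.clone(), filter_values(col, mask)))
            .collect(),
    }
}
```

The point of the sketch is only that a struct filter reduces to the same row selection repeated over every child, which is why a type-agnostic building block is convenient here.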