
ARROW-10591: [Rust] Added support for filter of StructArray #8689

Closed
wants to merge 5 commits

Conversation

jorgecarleitao (Member)

Based on #8630, this adds StructArray support to the filter operation.
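Conceptually, filtering a StructArray amounts to applying the same boolean mask to every child array so the children stay aligned row by row. A minimal sketch in plain Rust (no arrow crate; `ToyStructArray`, `filter_struct`, and the field names are illustrative stand-ins, not the actual kernel API):

```rust
/// A toy stand-in for a StructArray with two child columns.
struct ToyStructArray {
    ids: Vec<i32>,
    names: Vec<String>,
}

/// Apply one boolean filter mask to every child column, as the filter
/// kernel conceptually does for each field of a StructArray.
fn filter_struct(array: &ToyStructArray, mask: &[bool]) -> ToyStructArray {
    assert_eq!(array.ids.len(), mask.len());
    let keep = |i: &usize| mask[*i];
    let ids = (0..mask.len()).filter(keep).map(|i| array.ids[i]).collect();
    let names = (0..mask.len()).filter(keep).map(|i| array.names[i].clone()).collect();
    ToyStructArray { ids, names }
}

fn main() {
    let array = ToyStructArray {
        ids: vec![1, 2, 3],
        names: vec!["a".into(), "b".into(), "c".into()],
    };
    let filtered = filter_struct(&array, &[true, false, true]);
    println!("{:?} {:?}", filtered.ids, filtered.names); // [1, 3] ["a", "c"]
}
```

The key invariant is that a single mask drives all children, so the filtered struct remains internally consistent.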

@github-actions

andygrove pushed a commit that referenced this pull request Nov 21, 2020
This PR is based on top of #8630 and contains a physical node to perform an inner join in DataFusion.

This is still a draft, but IMO the design is here and the two tests already pass.

This is co-authored with @andygrove, who contributed to the design of how to perform this operation in the context of DataFusion (see ARROW-9555 for details).

The API used for the computation of the join at the arrow level is briefly discussed in [this document](https://docs.google.com/document/d/1KKuBvfx7uKi-x-tWOL60R1FNjDW8B790zPAv6yAlYcU/edit).

There is still a lot to work on, but I thought it would be a good time to have a first round of discussions, and also to gauge timing with respect to the 3.0 release.

There are two main issues being addressed in this PR:

* How do we perform the join at the partition level: this PR collects all batches from the left, and then issues a stream per partition on the right. Each batch on that stream joins itself with all the batches from the left (N) via a hash. This allows us to compute the hash of each row only once (first for all of the left, then one by one on the right).

* How do we build an array from `N (left)` arrays and a set of indices (matching the hash from the right): this is done using the `MutableArrayData` being worked on in #8630, which incrementally memcopies slots from each of the N arrays based on the index. This implementation is useful because it works for all array types and does not require casting anything to Rust native types (i.e. it operates on `ArrayData`, not specific implementations).
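The two points above can be sketched in plain Rust (stdlib only; all names here are illustrative, not the actual DataFusion code): a hash table is built over the left-side keys once, each right-side batch probes it, and the matched left indices are then used to gather values from the left columns, loosely analogous to how `MutableArrayData` memcopies slots by index.

```rust
use std::collections::HashMap;

/// Build key -> row indices over the left side (flattened here), so each
/// left row is hashed exactly once.
fn build_left_index(left_keys: &[i64]) -> HashMap<i64, Vec<usize>> {
    let mut index: HashMap<i64, Vec<usize>> = HashMap::new();
    for (row, key) in left_keys.iter().enumerate() {
        index.entry(*key).or_default().push(row);
    }
    index
}

/// Probe the hash table with one right batch; emit (left_row, right_row)
/// pairs for every match.
fn probe(index: &HashMap<i64, Vec<usize>>, right_keys: &[i64]) -> Vec<(usize, usize)> {
    let mut matches = Vec::new();
    for (right_row, key) in right_keys.iter().enumerate() {
        if let Some(left_rows) = index.get(key) {
            for &left_row in left_rows {
                matches.push((left_row, right_row));
            }
        }
    }
    matches
}

/// Gather left values by matched index, roughly the role MutableArrayData
/// plays when extending the output arrays slot by slot.
fn gather(left_values: &[i64], matches: &[(usize, usize)]) -> Vec<i64> {
    matches.iter().map(|&(l, _)| left_values[l]).collect()
}

fn main() {
    let index = build_left_index(&[1, 2, 2]);
    let matches = probe(&index, &[2, 3]);
    let joined = gather(&[10, 20, 30], &matches);
    println!("{:?} {:?}", matches, joined); // [(1, 0), (2, 0)] [20, 30]
}
```

The real implementation operates on `ArrayData` generically rather than on `i64` columns, which is what makes it type-agnostic.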

There are still some steps left to have a join in SQL, most notably the whole logical planning, the output_partition logic, the bindings to the SQL and DataFrame APIs, updating the optimizers to handle nodes with 2 children, and a whole battery of tests.

There is also a natural path to the other joins, as it will be a matter of incorporating the work already in PR #8689, which introduces the option to extend the `MutableArrayData` with nulls, the operation required for left and right joins.
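The "extend with nulls" operation can be illustrated with a small plain-Rust sketch (illustrative names, not the actual API): for a left join, unmatched left rows must still appear in the output, with the right-side columns padded with nulls, modeled here as `Option`.

```rust
/// For each left row, look up its matched right row (if any) and emit the
/// right-side value, or None (a null slot) when the row had no match.
fn left_join_gather(
    right_values: &[i64],
    matches_per_left: &[Option<usize>], // Some(right_row) if matched, None otherwise
) -> Vec<Option<i64>> {
    matches_per_left
        .iter()
        .map(|m| m.map(|right_row| right_values[right_row]))
        .collect()
}

fn main() {
    let out = left_join_gather(&[7, 8], &[Some(1), None, Some(0)]);
    println!("{:?}", out); // [Some(8), None, Some(7)]
}
```

In the arrow representation the nulls would land in the validity bitmap rather than an `Option`, but the shape of the operation is the same.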

Closes #8709 from jorgecarleitao/join2

Authored-by: Jorge C. Leitao <jorgecarleitao@gmail.com>
Signed-off-by: Andy Grove <andygrove73@gmail.com>
@github-actions github-actions bot added the needs-rebase A PR that needs to be rebased by the author label Nov 25, 2020
@jorgecarleitao jorgecarleitao deleted the filter_struct branch December 2, 2020 17:17
GeorgeAp pushed a commit to sirensolutions/arrow that referenced this pull request Jun 7, 2021
Labels: Component: Rust, needs-rebase (A PR that needs to be rebased by the author)