
ARROW-10591: [Rust] Added support for filter of StructArray #8689

Closed
wants to merge 5 commits

Conversation

jorgecarleitao (Member)

Based on #8630, this adds StructArray support to the filter operation.
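Conceptually, filtering a StructArray amounts to applying the same boolean mask to every child array so the children stay aligned row by row. A minimal sketch in plain Rust (no arrow crate; `ToyStructArray`, `filter_struct`, and the field names are illustrative stand-ins, not the actual kernel API):

```rust
/// A toy stand-in for a StructArray with two child columns.
struct ToyStructArray {
    ids: Vec<i32>,
    names: Vec<String>,
}

/// Apply one boolean filter mask to every child column, as the filter
/// kernel conceptually does for each field of a StructArray.
fn filter_struct(array: &ToyStructArray, mask: &[bool]) -> ToyStructArray {
    assert_eq!(array.ids.len(), mask.len());
    let keep = |i: &usize| mask[*i];
    let ids = (0..mask.len()).filter(keep).map(|i| array.ids[i]).collect();
    let names = (0..mask.len()).filter(keep).map(|i| array.names[i].clone()).collect();
    ToyStructArray { ids, names }
}

fn main() {
    let array = ToyStructArray {
        ids: vec![1, 2, 3],
        names: vec!["a".into(), "b".into(), "c".into()],
    };
    let filtered = filter_struct(&array, &[true, false, true]);
    println!("{:?} {:?}", filtered.ids, filtered.names); // [1, 3] ["a", "c"]
}
```

The key invariant is that a single mask drives all children, so the filtered struct remains internally consistent.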

@github-actions

andygrove pushed a commit that referenced this pull request Nov 21, 2020
This PR is based on top of #8630 and contains a physical node to perform an inner join in DataFusion.

This is still a draft, but IMO the design is here and the two tests already pass.

This is co-authored with @andygrove, who contributed to the design of how to perform this operation in the context of DataFusion (see ARROW-9555 for details).

The API used for the computation of the join at the arrow level is briefly discussed in [this document](https://docs.google.com/document/d/1KKuBvfx7uKi-x-tWOL60R1FNjDW8B790zPAv6yAlYcU/edit).

There is still a lot to work on, but I thought it would be a good time to have a first round of discussions, and also to gauge timing with respect to the 3.0 release.

There are two main issues being addressed in this PR:

* How do we perform the join at the partition level: this PR collects all batches from the left, and then issues a stream per partition on the right. Each batch on that stream joins itself with all the batches from the left (N) via a hash. This allows us to compute the hash of each row only once (first for all of the left, then one by one on the right).

* How do we build an array from `N (left)` arrays and a set of indices (matching the hash from the right): this is done using the `MutableArrayData` being worked on in #8630, which incrementally memcopies slots from each of the N arrays based on the index. This implementation is useful because it works for all array types and does not require casting anything to Rust native types (i.e. it operates on `ArrayData`, not specific implementations).
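The two points above can be sketched in plain Rust (stdlib only; all names here are illustrative, not the actual DataFusion code): a hash table is built over the left-side keys once, each right-side batch probes it, and the matched left indices are then used to gather values from the left columns, loosely analogous to how `MutableArrayData` memcopies slots by index.

```rust
use std::collections::HashMap;

/// Build key -> row indices over the left side (flattened here), so each
/// left row is hashed exactly once.
fn build_left_index(left_keys: &[i64]) -> HashMap<i64, Vec<usize>> {
    let mut index: HashMap<i64, Vec<usize>> = HashMap::new();
    for (row, key) in left_keys.iter().enumerate() {
        index.entry(*key).or_default().push(row);
    }
    index
}

/// Probe the hash table with one right batch; emit (left_row, right_row)
/// pairs for every match.
fn probe(index: &HashMap<i64, Vec<usize>>, right_keys: &[i64]) -> Vec<(usize, usize)> {
    let mut matches = Vec::new();
    for (right_row, key) in right_keys.iter().enumerate() {
        if let Some(left_rows) = index.get(key) {
            for &left_row in left_rows {
                matches.push((left_row, right_row));
            }
        }
    }
    matches
}

/// Gather left values by matched index, roughly the role MutableArrayData
/// plays when extending the output arrays slot by slot.
fn gather(left_values: &[i64], matches: &[(usize, usize)]) -> Vec<i64> {
    matches.iter().map(|&(l, _)| left_values[l]).collect()
}

fn main() {
    let index = build_left_index(&[1, 2, 2]);
    let matches = probe(&index, &[2, 3]);
    let joined = gather(&[10, 20, 30], &matches);
    println!("{:?} {:?}", matches, joined); // [(1, 0), (2, 0)] [20, 30]
}
```

The real implementation operates on `ArrayData` generically rather than on `i64` columns, which is what makes it type-agnostic.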

There are still some steps left to have a join in SQL, most notably the whole logical planning, the output_partition logic, the bindings to the SQL and DataFrame APIs, updating the optimizers to handle nodes with 2 children, and a whole battery of tests.

There is also a natural path to the other joins, as it will be a matter of incorporating the work already in PR #8689, which introduces the option to extend the `MutableArrayData` with nulls, the operation required for left and right joins.
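The "extend with nulls" operation can be illustrated with a small plain-Rust sketch (illustrative names, not the actual API): for a left join, unmatched left rows must still appear in the output, with the right-side columns padded with nulls, modeled here as `Option`.

```rust
/// For each left row, look up its matched right row (if any) and emit the
/// right-side value, or None (a null slot) when the row had no match.
fn left_join_gather(
    right_values: &[i64],
    matches_per_left: &[Option<usize>], // Some(right_row) if matched, None otherwise
) -> Vec<Option<i64>> {
    matches_per_left
        .iter()
        .map(|m| m.map(|right_row| right_values[right_row]))
        .collect()
}

fn main() {
    let out = left_join_gather(&[7, 8], &[Some(1), None, Some(0)]);
    println!("{:?}", out); // [Some(8), None, Some(7)]
}
```

In the arrow representation the nulls would land in the validity bitmap rather than an `Option`, but the shape of the operation is the same.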

Closes #8709 from jorgecarleitao/join2

Authored-by: Jorge C. Leitao <jorgecarleitao@gmail.com>
Signed-off-by: Andy Grove <andygrove73@gmail.com>
@github-actions github-actions bot added the needs-rebase A PR that needs to be rebased by the author label Nov 25, 2020
@jorgecarleitao jorgecarleitao deleted the filter_struct branch December 2, 2020 17:17
GeorgeAp pushed a commit to sirensolutions/arrow that referenced this pull request Jun 7, 2021
Labels: Component: Rust, needs-rebase (A PR that needs to be rebased by the author)