ARROW-11076: [Rust][DataFusion] Refactor usage of right indices in hash join #9048

Closed
wants to merge 8 commits into from

@Dandandan Dandandan commented Dec 30, 2020

This applies some refactoring to `build_batch_from_indices`, which is intended to make further changes easier, e.g. solving https://issues.apache.org/jira/browse/ARROW-11030

FYI @jorgecarleitao @andygrove

@Dandandan Dandandan changed the title ARROW-11076: [Rust][DataFusion] Join right refactor ARROW-11076: [Rust][DataFusion] Refactor usage of right indices in hash join Dec 30, 2020

// Note that we take `.data_ref()` to gather the [ArrayData] of each array.
let (is_primary, arrays) = match primary[0].schema().index_of(field.name()) {
Ok(i) => Ok((true, primary.iter().map(|batch| batch.column(i).data_ref().as_ref()).collect::<Vec<_>>())),
let (is_primary, column_index) = match primary[0].schema().index_of(field.name()) {
Member

Another thing we should consider (in a separate PR) is determining upfront which columns are left/right, to avoid calling `schema.index_of` for each column in each batch. It is a small cost, but we could pay it once upfront.

Contributor Author

Yes, agreed. Anything we can move out of this inner loop is an improvement and avoids slowdowns, e.g. with a large number of columns or smaller batch sizes.

@Dandandan Dandandan Dec 30, 2020

I think in a follow-up PR we could move this code and pass a `column_indices: Vec<usize>` instead.
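The idea discussed in this thread, resolving each output column to its side and index once instead of once per batch, could look roughly like the sketch below. It uses only the standard library and plain field-name slices as a stand-in for Arrow schemas; `JoinSide` and `resolve_column_indices` are hypothetical names for illustration, not part of DataFusion.

```rust
/// Which side of the join an output column is read from (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq)]
enum JoinSide {
    Left,
    Right,
}

/// Resolve every output field to `(side, column index)` a single time.
/// Assumption: schemas are modeled as slices of field names, and the left
/// side is checked first, mirroring the `primary` lookup in the diff above.
fn resolve_column_indices(
    output: &[&str],
    left: &[&str],
    right: &[&str],
) -> Vec<(JoinSide, usize)> {
    output
        .iter()
        .map(|name| match left.iter().position(|f| f == name) {
            Some(i) => (JoinSide::Left, i),
            None => {
                let i = right
                    .iter()
                    .position(|f| f == name)
                    .expect("output field must exist on one side of the join");
                (JoinSide::Right, i)
            }
        })
        .collect()
}
```

The resulting vector can then be passed into the per-batch code, so the inner loop only indexes into columns and never searches a schema by name.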

@andygrove andygrove left a comment

LGTM! This provides insane speedups at SF=100 and I will post numbers shortly.

andygrove commented Dec 30, 2020

For TPC-H q12 at SF=100 and 8 partitions:

| Batch Size | Master | #9043 | #9043 + This PR |
|---|---|---|---|
| 4096 | ??? | ??? | 25.2 s |
| 8192 | 617.5 s | 70.7 s | 15.5 s |
| 16384 | 183.1 s | 46.4 s | 13.5 s |
| 32768 | 59.4 s | 33.3 s | 13.7 s |
| 65536 | 27.5 s | 20.7 s | 13.8 s |
| 131072 | 18.4 s | 18.5 s | 14.0 s |

Thank you @Dandandan this is superb 💯

@Dandandan

Awesome, better than I expected!

GeorgeAp pushed a commit to sirensolutions/arrow that referenced this pull request Jun 7, 2021
…sh join

This applies some refactoring to `build_batch_from_indices` which is supposed to make further changes easier, e.g. solving https://issues.apache.org/jira/browse/ARROW-11030

* Handle the right batch (one) and the left batches (many) differently: for the right batch we can call `take` on it directly. This should be more efficient anyway, and it also makes it possible in the future to build the index array directly instead of doing extra copying.
* Use `indices.len()` for the capacity parameter rather than the number of rows on the left side. This matters at larger sizes (e.g. SF 100); see apache#9036. Rather than estimating the capacity from previous batches, this sets it from the (known) number of resulting rows.
* Reuse "computed" right indices across multiple columns.
* The refactoring makes it easier to apply the changes needed for https://issues.apache.org/jira/browse/ARROW-11030, where we need to remove the n*n work that is done for the build side.
* Locally, the changes don't have a big impact on TPC-H performance at a small scale factor, but I believe they should have a similar effect to apache#9036 at SF=100 by using `indices.len()` rather than the number of rows in the build side.

FYI @jorgecarleitao @andygrove

Closes apache#9048 from Dandandan/join_right_refactor

Authored-by: Heres, Daniel <danielheres@gmail.com>
Signed-off-by: Jorge C. Leitao <jorgecarleitao@gmail.com>
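
The bullets in the commit message above can be illustrated with a simplified, standard-library-only sketch. It is not the DataFusion implementation: columns are modeled as `Vec<i64>`, the local `take` function is a stand-in for Arrow's `take` kernel, and the `(left_batch, left_row, right_row)` index tuple is an assumed shape chosen for illustration.

```rust
/// Simplified stand-in for a `take` kernel: gather `values[i]` for each index.
/// The output capacity is `indices.len()`, the known number of result rows.
fn take(values: &[i64], indices: &[u32]) -> Vec<i64> {
    let mut out = Vec::with_capacity(indices.len());
    for &i in indices {
        out.push(values[i as usize]);
    }
    out
}

/// Right side: a single batch, so the right indices are extracted once and
/// reused for every column.
fn build_right_columns(
    right_columns: &[Vec<i64>],
    indices: &[(usize, usize, u32)], // assumed: (left_batch, left_row, right_row)
) -> Vec<Vec<i64>> {
    let right_indices: Vec<u32> = indices.iter().map(|&(_, _, r)| r).collect();
    right_columns
        .iter()
        .map(|col| take(col, &right_indices))
        .collect()
}

/// Left side: many batches, so each output row is looked up by (batch, row).
/// Capacity again comes from `indices.len()`, not from the left row count.
fn build_left_column(
    left_batches: &[Vec<i64>],
    indices: &[(usize, usize, u32)],
) -> Vec<i64> {
    let mut out = Vec::with_capacity(indices.len());
    for &(batch, row, _) in indices {
        out.push(left_batches[batch][row]);
    }
    out
}
```

Sizing the output by `indices.len()` allocates exactly what the join produces, regardless of how large the build side is.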