ARROW-11076: [Rust][DataFusion] Refactor usage of right indices in hash join #9048

Dandandan · 2020-12-30T16:07:57Z

This refactors `build_batch_from_indices` to make further changes easier, e.g. solving https://issues.apache.org/jira/browse/ARROW-11030

* The single right batch and the (possibly many) left batches are now handled differently: for the right batch we can call `take` on it directly. This should be more efficient anyway, and it also allows building the index array directly in the future instead of doing extra copying.
* Use `indices.len()` for the capacity parameter rather than the number of rows on the left. This matters at larger sizes (e.g. SF 100), see ARROW-11053: [Rust] [DataFusion] Optimize joins with dynamic capacity for output batches #9036. Rather than estimating the capacity from previous batches, this sets it from the (known) number of resulting rows.
* Reuse the "computed" right indices across multiple columns.
* The refactoring makes it easier to apply the changes needed for https://issues.apache.org/jira/browse/ARROW-11030, where we need to remove the n*n work that is done for the build side.
* The changes don't have a big impact on local TPC-H performance at a small scale factor, but I believe they should have an effect similar to ARROW-11053: [Rust] [DataFusion] Optimize joins with dynamic capacity for output batches #9036 at SF=100, because they use `indices.len()` rather than the number of rows in the build side.
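To illustrate the points above, here is a minimal sketch in plain Rust rather than the actual DataFusion code. It assumes columns are simple `Vec<i64>` values and that the join result is a list of hypothetical `(left_row, right_row)` pairs; the names `take` and `build_right_columns` are made up for this example and only mimic the idea of Arrow's `take` kernel. The right indices are computed once and reused for every column, and each output buffer is sized from `indices.len()`.

```rust
/// Gather `column[i]` for every `i` in `indices`.
/// A stand-in for a take kernel; the output is sized from `indices.len()`,
/// i.e. the known number of resulting rows, not the size of the input.
fn take(column: &[i64], indices: &[usize]) -> Vec<i64> {
    let mut out = Vec::with_capacity(indices.len());
    for &i in indices {
        out.push(column[i]);
    }
    out
}

/// Build the output columns that come from the single right batch.
/// `join_indices` holds hypothetical `(left_row, right_row)` pairs.
fn build_right_columns(
    right_batch: &[Vec<i64>],
    join_indices: &[(usize, usize)],
) -> Vec<Vec<i64>> {
    // Extract the right-side row indices once...
    let right_indices: Vec<usize> = join_indices.iter().map(|&(_, r)| r).collect();
    // ...and reuse them for every column of the right batch.
    right_batch
        .iter()
        .map(|column| take(column, &right_indices))
        .collect()
}
```

Because the index vector is shared, the per-column cost is a single gather, regardless of how many columns the right batch has.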

FYI @jorgecarleitao @andygrove

github-actions · 2020-12-30T16:20:20Z

https://issues.apache.org/jira/browse/ARROW-11076

andygrove · 2020-12-30T16:45:02Z

rust/datafusion/src/physical_plan/hash_join.rs

-        // Note that we take `.data_ref()` to gather the [ArrayData] of each array.
-        let (is_primary, arrays) = match primary[0].schema().index_of(field.name()) {
-            Ok(i) => Ok((true, primary.iter().map(|batch| batch.column(i).data_ref().as_ref()).collect::<Vec<_>>())),
+        let (is_primary, column_index) = match primary[0].schema().index_of(field.name()) {


Another thing we should consider (in a separate PR) is determining upfront which columns are left/right, to avoid calling `schema.index_of` for each column in each batch. It is a small cost, but we could pay it once upfront.

Yes, agreed. Anything we can move out of this inner loop is an improvement and avoids slowdowns, e.g. with a large number of columns or smaller batch sizes.

I think in a follow-up PR we could move the code and pass a `column_indices: Vec<usize>` instead.
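A rough sketch of what computing the column indices upfront could look like, using plain string slices in place of Arrow schemas. The `Side` enum and `compute_column_indices` function are hypothetical names for this illustration and are not taken from the DataFusion code base; the follow-up PR may well do this differently.

```rust
/// Which input an output column comes from (hypothetical helper type).
#[derive(Debug, PartialEq)]
enum Side {
    Left,
    Right,
}

/// Resolve, once per join, where each output field lives.
/// Field names stand in for schema lookups such as `schema.index_of`.
fn compute_column_indices(
    output_fields: &[&str],
    left_fields: &[&str],
    right_fields: &[&str],
) -> Result<Vec<(Side, usize)>, String> {
    output_fields
        .iter()
        .map(|name| {
            if let Some(i) = left_fields.iter().position(|f| f == name) {
                Ok((Side::Left, i))
            } else if let Some(i) = right_fields.iter().position(|f| f == name) {
                Ok((Side::Right, i))
            } else {
                Err(format!("field `{}` not found in either input", name))
            }
        })
        .collect()
}
```

The resulting vector can then be passed into the per-batch code, so the inner loop only indexes into columns and never searches a schema by name.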

andygrove

LGTM! This provides insane speedups at SF=100 and I will post numbers shortly.

andygrove · 2020-12-30T16:54:39Z

For TPC-H q12 at SF=100 and 8 partitions:

| Batch Size | Master | #9043 | #9043 + This PR |
| --- | --- | --- | --- |
| 4096 | ??? | ??? | 25.2 s |
| 8192 | 617.5 s | 70.7 s | 15.5 s |
| 16384 | 183.1 s | 46.4 s | 13.5 s |
| 32768 | 59.4 s | 33.3 s | 13.7 s |
| 65536 | 27.5 s | 20.7 s | 13.8 s |
| 131072 | 18.4 s | 18.5 s | 14.0 s |

Thank you @Dandandan this is superb 💯

Dandandan · 2020-12-30T17:06:06Z

Awesome, better than I expected!

@jorgecarleitao

…sh join

This applies some refactoring to `build_batch_from_indices` which is supposed to make further changes easier, e.g. solving https://issues.apache.org/jira/browse/ARROW-11030

* This starts handling right (1) batch and left (many) batches differently as for the right batches we can directly use `take` on it. This should be more efficient anyway, and also allows in the future to build the index array directly instead of doing extra copying.
* Use `indices.len()` for the capacity parameter, rather than the number of rows at the left. This is of impact at larger sizes (e.g. SF 100), see: apache#9036 Rather than estimating it based on previous batches, this does it based on the (known) number of resulting rows.
* Reuse "computed" right indices across multiple columns.
* The refactoring makes it easier to apply changes needed for https://issues.apache.org/jira/browse/ARROW-11030 where we need to remove the n*n work that is done for the build side
* The changes don't have a big impact locally on performance on TPC-H with small scale factor, but I believe it should have a similar effect as apache#9036 on SF=100 by using `indices.len()` rather than the number of rows in the build side.

FYI @jorgecarleitao @andygrove

Closes apache#9048 from Dandandan/join_right_refactor

Authored-by: Heres, Daniel <danielheres@gmail.com>
Signed-off-by: Jorge C. Leitao <jorgecarleitao@gmail.com>

Dandandan added 4 commits December 30, 2020 16:12

Refactor usage of right indices

2f02d1c

Format

3ba0d9b

Use indices.len()

20cb091

Small simplification

7b03103

github-actions bot added Component: Rust - DataFusion Component: Rust labels Dec 30, 2020

Dandandan mentioned this pull request Dec 30, 2020

ARROW-11053: [Rust] [DataFusion] Optimize joins with dynamic capacity for output batches #9036

Closed

Dandandan changed the title ~~ARROW-11076: [Rust][DataFusion] Join right refactor~~ ARROW-11076: [Rust][DataFusion] Refactor usage of right indices in hash join Dec 30, 2020

Dandandan added 2 commits December 30, 2020 17:26

Naming improvement

b9fe4e4

Reuse right indices array

6017e49

andygrove reviewed Dec 30, 2020


andygrove approved these changes Dec 30, 2020


andygrove mentioned this pull request Dec 30, 2020

ARROW-11052: [Rust] [DataFusion] Implement metrics for HashJoinExec #9035

Closed

Dandandan added 2 commits December 30, 2020 20:31

Small simplification, avoid clone

098b171

Remove outdated comment

cde79c7

jorgecarleitao closed this in 25b7302 Dec 31, 2020

Dandandan mentioned this pull request Dec 31, 2020

ARROW-11088: [Rust][DataFusion] Calculate column indices upfront in hash join #9059

Closed

asfimport mentioned this pull request Dec 31, 2020

[Rust][DataFusion] Refactor usage of right indices in hash join #26988

Closed
