
ARROW-11053: [Rust] [DataFusion] Optimize joins with dynamic capacity for output batches #9036

Status: Closed · wants to merge 3 commits
Conversation

andygrove (Member) commented Dec 28, 2020

This builds on #9035.

I am investigating why join performance is so bad with smaller batch sizes (see https://issues.apache.org/jira/browse/ARROW-11030) and this is one optimization that I have found so far that helps a bit.

Prior to this PR, we use the size of the left or right batch to guess the capacity of output batches, which results in significant over-allocation in some cases. For TPC-H q12 at SF=100, I see vectors created with a capacity of ~3,000,000 (the size of the build side of the join) that we then populate with only ~700 entries.

This PR attempts to learn a good capacity based on previously processed batches.
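The idea can be sketched in plain Rust. These names are illustrative only (`CapacityEstimator` is not the actual DataFusion code); the extra 1024 slack mirrors the constant discussed later in this thread, and `fallback` stands in for the old behaviour of using the build-side row count:

```rust
/// Hypothetical sketch: learn the output-batch capacity from previously
/// produced batches instead of sizing it from the build-side row count.
struct CapacityEstimator {
    num_output_batches: usize,
    num_output_rows: usize,
}

impl CapacityEstimator {
    fn new() -> Self {
        Self { num_output_batches: 0, num_output_rows: 0 }
    }

    /// Capacity to pre-allocate for the next output batch. Until at least
    /// one batch has been seen, fall back to the old estimate.
    fn next_capacity(&self, fallback: usize) -> usize {
        if self.num_output_batches == 0 {
            fallback
        } else {
            // Mean observed batch size plus some slack.
            self.num_output_rows / self.num_output_batches + 1024
        }
    }

    /// Record the size of a batch that was just produced.
    fn record(&mut self, rows: usize) {
        self.num_output_batches += 1;
        self.num_output_rows += rows;
    }
}
```

With the numbers from the description, after seeing a few ~700-row batches the estimator would suggest a capacity near 1,700 rather than ~3,000,000.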

Here are query times in seconds at different batch sizes:

| Batch Size | Master (s) | This PR (s) |
|-----------:|-----------:|------------:|
| 16384      | 189.6      | 158.0       |
| 32768      | 61.9       | 47.2        |
| 65536      | 28.2       | 21.4        |
| 131072     | 19.0       | 15.6        |

A review comment on the diff (Contributor):

```rust
other => other,
Some(Ok(batch)) => {
    let start = Instant::now();
    let capacity = if self.num_output_batches == 0 {
```

In #9032 it was benchmarked to be slightly faster, at the moment, to initialize with capacity 0 than to use the correct capacity upfront. I think that is more of a curiosity and will hopefully change, but it might be the case here too. What is the performance if we initialize with 0 everywhere?
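For context, a std-only illustration of the trade-off being asked about (no Arrow involved, and not the benchmark from #9032): `Vec::with_capacity(0)` defers allocation and then grows geometrically, while an exact capacity allocates once up front. Both produce identical contents, so the difference is purely allocation behaviour:

```rust
/// Fill a vector of `n` elements starting from a given initial capacity.
/// With `capacity == 0` the vector allocates lazily and reallocates as it
/// grows; with `capacity >= n` it allocates exactly once.
fn fill(capacity: usize, n: usize) -> Vec<u64> {
    let mut v = Vec::with_capacity(capacity); // capacity 0 => no allocation yet
    for i in 0..n as u64 {
        v.push(i); // may trigger geometric reallocation when under-allocated
    }
    v
}
```

Which variant wins in the join depends on how good the capacity guess is: a huge over-estimate pays for memory it never uses, while capacity 0 pays for reallocations.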

Dandandan (Contributor) commented Dec 29, 2020

An important source of slowness seems to be the (inefficient) creation and use of the ArrayData and MutableArrayData structures. In profiling I see a lot of time in build_extend, freeze, etc.

Changing the piece of code to generate a Vec<&ArrayData> directly gives a ~20% speedup locally on batches of size 1000 on your branch @andygrove :

```rust
let (is_primary, arrays) = match primary[0].schema().index_of(field.name()) {
    Ok(i) => Ok((
        true,
        primary
            .iter()
            .map(|batch| batch.column(i).data_ref().as_ref())
            .collect::<Vec<_>>(),
    )),
    Err(_) => match secondary[0].schema().index_of(field.name()) {
        Ok(i) => Ok((
            false,
            secondary
                .iter()
                .map(|batch| batch.column(i).data_ref().as_ref())
                .collect::<Vec<_>>(),
        )),
        _ => Err(DataFusionError::Internal(format!(
            "During execution, the column {} was not found in either the left or right side of the join",
            field.name()
        ))),
    },
}
.map_err(DataFusionError::into_arrow_external_error)?;
```

Dandandan (Contributor) commented:

Opened a PR with that change here: #9042

Dandandan (Contributor) commented Dec 29, 2020

I think part of a further speed-up could come from moving the construction of the left/build-side Vec<&ArrayData> so that it is created only once, instead of once for each right batch in build_batch_from_indices. Currently, when the batch size is made smaller, the build-side Vec is built more times and also contains more (smaller) batches itself, which could explain (part of) the big, roughly quadratic slowdown on smaller batches.
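A minimal sketch of the suggested hoist, using a plain-Rust stand-in (`Column` here stands in for Arrow's `ArrayData`; none of these names are the real DataFusion API). Since the build side is fixed for the lifetime of the join, its per-column array list can be collected once, outside the per-right-batch loop:

```rust
/// Stand-in for arrow::array::ArrayData in this illustration.
struct Column(Vec<i64>);

/// Gather references to one column across all build-side batches.
/// The point of the suggestion: call this once per join, not once per
/// probe (right) batch, avoiding O(right_batches * left_batches) regathering.
fn build_side_columns(batches: &[Vec<Column>], col: usize) -> Vec<&Column> {
    batches.iter().map(|b| &b[col]).collect()
}
```

The real change would cache the resulting `Vec<&ArrayData>` alongside the build-side state so `build_batch_from_indices` can borrow it instead of rebuilding it.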

andygrove pushed a commit that referenced this pull request Dec 29, 2020
FYI @andygrove @jorgecarleitao

This is related to findings here: #9036
This is a ~20% win on smaller batch sizes, but also has an effect on bigger sizes (about 10% on size of 16k locally)

Query 12 with batch size 4096

PR:

```
Query 12 iteration 0 took 484.0 ms
Query 12 iteration 1 took 482.5 ms
Query 12 iteration 2 took 496.9 ms
Query 12 iteration 3 took 488.1 ms
Query 12 iteration 4 took 504.9 ms
Query 12 iteration 5 took 490.4 ms
Query 12 iteration 6 took 486.8 ms
Query 12 iteration 7 took 499.3 ms
Query 12 iteration 8 took 485.0 ms
Query 12 iteration 9 took 488.8 ms
Query 12 avg time: 490.67 ms

```

Master:
```
Query 12 iteration 0 took 669.6 ms
Query 12 iteration 1 took 764.8 ms
Query 12 iteration 2 took 705.8 ms
Query 12 iteration 3 took 645.1 ms
Query 12 iteration 4 took 640.9 ms
Query 12 iteration 5 took 639.2 ms
Query 12 iteration 6 took 658.9 ms
Query 12 iteration 7 took 643.2 ms
Query 12 iteration 8 took 639.0 ms
Query 12 iteration 9 took 639.2 ms
Query 12 avg time: 664.57 ms
```

Closes #9042 from Dandandan/join_perf

Authored-by: Heres, Daniel <danielheres@gmail.com>
Signed-off-by: Andy Grove <andygrove73@gmail.com>
Dandandan (Contributor) commented Dec 29, 2020

I think this is something we should merge. We can maybe tweak the extra 1024 capacity, but this removes one source of slowness for smaller batches, and I think it also helps make the other sources clearer.

Dandandan (Contributor) commented:
@andygrove I thought more about this: I think we can use indices.len() for the exact required capacity rather than estimating it from previous sizes. I included the change, among other changes, in PR #9048.
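The `indices.len()` idea in a hedged, std-only sketch (a stand-in for Arrow's `take` kernel, not the real implementation): the join produces exactly one output row per gathered index, so the exact capacity is known before any row is written and no estimate is needed:

```rust
/// Simplified take: gather `values` at the given positions.
/// Because one output element is produced per index, `indices.len()` is
/// the exact capacity, with no over- or under-allocation.
fn take(values: &[i64], indices: &[usize]) -> Vec<i64> {
    let mut out = Vec::with_capacity(indices.len()); // exact, known up front
    for &i in indices {
        out.push(values[i]);
    }
    out
}
```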

andygrove (Member, Author) commented:
Closed in favor of #9048

@andygrove andygrove closed this Dec 30, 2020
jorgecarleitao pushed a commit that referenced this pull request Dec 31, 2020
…sh join

This applies some refactoring to `build_batch_from_indices` which is supposed to make further changes easier, e.g. solving https://issues.apache.org/jira/browse/ARROW-11030

* This starts handling the right (single) batch and the left (many) batches differently, as for the right batch we can directly use `take` on it. This should be more efficient anyway, and also allows us in the future to build the index array directly instead of doing extra copying.
* Use `indices.len()` for the capacity parameter, rather than the number of rows on the left. This has an impact at larger sizes (e.g. SF 100), see #9036. Rather than estimating the capacity from previous batches, this uses the (known) number of resulting rows.
* Reuse "computed" right indices across multiple columns.
* The refactoring makes it easier to apply changes needed for https://issues.apache.org/jira/browse/ARROW-11030 where we need to remove the n*n work that is done for the build side
* The changes don't have a big impact locally on performance on TPC-H with small scale factor, but I believe it should have a similar effect as #9036 on SF=100 by using `indices.len()` rather than the number of rows in the build side.
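Two of the bullets above (exact capacity from `indices.len()`, and reusing the computed indices across columns) can be illustrated with a plain-Rust sketch; the names are hypothetical, not the DataFusion API:

```rust
/// Gather every column of a batch using one shared index list.
/// The indices are computed once and reused for all columns, and each
/// output column is pre-allocated to exactly `indices.len()` rows.
fn gather_columns(columns: &[Vec<i64>], indices: &[usize]) -> Vec<Vec<i64>> {
    columns
        .iter()
        .map(|col| {
            let mut out = Vec::with_capacity(indices.len()); // exact size known
            out.extend(indices.iter().map(|&i| col[i])); // same indices reused
            out
        })
        .collect()
}
```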

FYI @jorgecarleitao @andygrove

Closes #9048 from Dandandan/join_right_refactor

Authored-by: Heres, Daniel <danielheres@gmail.com>
Signed-off-by: Jorge C. Leitao <jorgecarleitao@gmail.com>
GeorgeAp pushed commits to sirensolutions/arrow that referenced this pull request Jun 7, 2021 (the same two commit messages as above).