ARROW-10540 [Rust] Improve filtering #8630
Conversation
Thanks for opening a pull request! Could you open an issue for this pull request on JIRA? Could you then also rename the pull request title in the following format?
```diff
-for i in 0..size {
-    builder.append_value(value_fn(i)).unwrap();
+for _ in 0..size {
+    let value = rng.gen::<f32>() < trues_density;
```
using random numbers to generate the filter arrays makes it difficult to control filter selectivity; also, doesn't that make each benchmark run unique? I think that is the opposite of what we want: we want consistent benchmarks with stable, repeatable conditions
I wonder if this is the reason for the inconsistency in the benchmark results. Usually a highly selective filter, where most filter bits are 0s and only a small number of values are selected / copied to the output array, will have the best performance, because most batches of filter bits can be skipped quickly and only a few values are copied to the output array.

This relationship can clearly be seen in the benchmark results listed in the earlier PR #7798.

However, in the benchmark results listed in the description of this PR, in many cases the opposite is true: low-selectivity filter benchmarks achieve much better performance than their high-selectivity counterparts. I wonder what's the reason for that?
Thank you for your comments and for going through the code.

> using random numbers to generate the filter arrays makes it difficult to control filter selectivity

Could you elaborate? Isn't the selectivity controlled above? AFAIK we just do not place the selected entries on regularly-spaced intervals (`i % X`) but according to some seed/randomness.

> also doesn't that make each benchmark run unique which I think is the opposite of what we want - we want consistent benchmarks with stable, repeatable conditions

I agree. Would freezing it with a seed address the concern? My main concern with `if i % 2 == 0` and the like is that these are highly predictable patterns, unlikely in real-world situations. This predictability can make our benchmarks not very informative, as they end up benchmarking speculative execution and other optimizations, not the code.

Another way of looking at it is that our benchmarks need entropy to represent the lack of information that we possess about the underlying distribution of the data (nulls, selectivity, values, etc). Patterns like `i % 2` are super informative (almost no entropy).
> usually a highly selective filter, where most filter bits are 0s and only a small number of values are selected

Sorry, it is just notation: I used "high selectivity" in the text above to mean a lot of 1s (i.e. many items are selected => high selectivity), not a lot of 0s. But I see that you use the opposite meaning.
In the benchmarks above:

* low selectivity -> select 50%
* high selectivity -> do not select 1 in every 8000 entries
* very low selectivity -> select 1 in every 8000 entries
@yordan-pavlov Thanks for reminding me of that; it was on my plate for a while, and it is now addressed in #8635
> I agree. Would freezing it with a seed address the concern? My main concern with if i % 2 == 0 and the like is that these are highly predictable patterns and unlikely in real world situations. This predictability can make our benchmarks not very informative as they are benchmarking speculative execution and other optimizations, not the code (and again, these patterns are unlikely in real-world).
The thread id is xored with the seed, so `thread_rng` doesn't fit a reproducible-benchmarks point of view; check out the PR I've opened, @jorgecarleitao, and tell me what you think.
As @vertexclique mentions, he has provided a good PR related to this conversation: #8635
@jorgecarleitao the filter benchmarks may not simulate real-world use cases, but they are designed to test the code under specific conditions, such as the worst-case scenario of alternating 1s and 0s, where no batch can be skipped and all selected values have to be copied individually; how can this scenario be achieved with a randomly generated filter array?

The other scenarios, which test mostly 0s (best performance, because most filter batches can be skipped and only a small number of selected values have to be copied) and mostly 1s (not as fast, but still faster than the worst case, because filter batches can be checked quickly and most values are copied in slices), should be easier to achieve with random filter arrays, but are they going to be repeatable?
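For reference, the three deterministic scenarios described above can be generated directly (a sketch; the helper names are assumptions, not the benchmark's actual API):

```rust
// Worst case: alternating 1s and 0s, so no batch of filter bits can be skipped.
fn alternating_mask(size: usize) -> Vec<bool> {
    (0..size).map(|i| i % 2 == 0).collect()
}

// Mostly 0s: only one entry in every `period` is selected.
fn sparse_mask(size: usize, period: usize) -> Vec<bool> {
    (0..size).map(|i| i % period == 0).collect()
}

// Mostly 1s: all but one entry in every `period` is selected.
fn dense_mask(size: usize, period: usize) -> Vec<bool> {
    (0..size).map(|i| i % period != 0).collect()
}
```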
@jorgecarleitao once the approach to filter benchmarks has been finalized would it be possible to rearrange the commits so that the same benchmark code can be used to benchmark both the old and new implementations so that we can do a direct comparison?
@yordan-pavlov, that is already the case :) I put any changes to benchmarks in the first commit (as I need to benchmark them myself).
```rust
/// and for a size of `len`.
/// # Panic
/// This function panics if the range is out of bounds, i.e. if `start + len >= array.len()`.
pub fn extend(&mut self, start: usize, end: usize) {
```
I am not sure `extend()` correctly describes what the method does; the idea being that we take / copy a slice (defined by `start` and `end`) into the output array. Would `take()` or `copy()` be a better name?
I was following Rust's notation here, `Vec::extend`, which extends from a slice. The difference here is that we can't use the `&[]` notation, as I could not find an object to pass in this case. So, I tried my best to mimic `[start..end]` (via two arguments). The end is exclusive.

I also played with `push` (which is often used in Rust for a single element).

I do not have strong feelings, though.
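For comparison, the `Vec`-style, end-exclusive semantics being mimicked look like this in plain Rust (an illustration of the naming precedent, not the PR's code):

```rust
// `extend(1, 4)` in the PR's API corresponds to copying `src[1..4]`
// (end exclusive) into the output, much like Vec::extend_from_slice.
fn demo() -> Vec<i32> {
    let src = vec![10, 20, 30, 40, 50];
    let mut dst: Vec<i32> = Vec::new();
    dst.extend_from_slice(&src[1..4]); // copies elements at indices 1, 2, 3
    dst
}
```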
Some thoughts:

Naming: I have seen similar concepts called "masks" (as they are similar to bit masks) -- so perhaps `ArrayDataMask` or `MaskedArrayData`. Or perhaps `ArrayRowSet`.

When I actually read the code of `MutableArrayData`, I realized that it isn't quite the mask concept, but it is similar (intermediate results want to represent "what indexes pass a certain test", eventually copying only those indexes to a new array).

This type of structure might also be useful for performing multiple boolean operations: e.g. when doing `(A > 5) AND (A < 10)`, you can compute the row ids/indexes that pass `A > 5` and then, rather than actually copying those rows just to compare them against `< 10`, you can operate directly on the original copy of `A` (checking only the rows where the mask is true).

I believe many other database systems use "bitsets" to represent this concept (we are using https://crates.io/crates/croaring in https://github.com/influxdata/influxdb_iox for the concept).
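The mask-combination idea above can be sketched in a few lines (plain `Vec<bool>` stands in for a real bitset; the function name is hypothetical):

```rust
// Evaluate (A > 5) AND (A < 10) as two boolean masks, combine them, and only
// then materialize the selected values -- rows are copied exactly once.
fn mask_and_filter(a: &[i32]) -> Vec<i32> {
    let gt5: Vec<bool> = a.iter().map(|&x| x > 5).collect();
    let lt10: Vec<bool> = a.iter().map(|&x| x < 10).collect();
    let mask: Vec<bool> = gt5.iter().zip(&lt10).map(|(g, l)| *g && *l).collect();
    a.iter()
        .zip(&mask)
        .filter(|(_, keep)| **keep)
        .map(|(v, _)| *v)
        .collect()
}
```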
```rust
type ExtendNullBits<'a> = Box<Fn(&mut _MutableArrayData, usize, usize) -> () + 'a>;
// function that extends `[start..start+len]` to the mutable array.
// this is dynamic because different data_types influence how buffers and children are extended.
type Extend<'a> = Box<Fn(&mut _MutableArrayData, usize, usize) -> () + 'a>;
```
Rather than a dynamic function pointer to extend such a structure, I wonder if you could get by with `element_length` and `number_of_elements`.
That unfortunately would only work for primitive buffers. For string arrays, extending an array data requires a complex operation that is fundamentally different from extending a single buffer. For nested types, the operation is recursive on the child data.
This is fundamentally a dynamic operation: we only know what to do when we see which `DataType` the user wants to build an `ArrayData` from. We can see that the `Builders` use a similar approach: they use `dyn Builder` for the same reason.

The builders have an extra complexity associated with the fact that their input type is not uniform: i.e. their API supports extending from a `&[T]` (e.g. `i32` or `i16`), which is the reason why they need to be implemented via a dynamic type, each implementation of which has methods for each type. In `MutableArrayData`, the only "thing" that we extend from is an `ArrayData`, which has a uniform (rust) type but requires a different behavior based on its `data_type` => one function pointer per data-type.
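A stripped-down sketch of the "one function pointer per data-type" pattern (the simplified `DataType` enum and `Vec<u8>` buffer are assumptions for illustration, not the arrow API):

```rust
enum DataType {
    Int32,
    Int16,
}

// closure that extends `[start..start+len]` elements from a source byte buffer
type Extend<'a> = Box<dyn Fn(&mut Vec<u8>, usize, usize) + 'a>;

// The dispatch happens once, when binding to the source data; the returned
// closure then runs with the element width already resolved.
fn build_extend<'a>(data_type: &DataType, values: &'a [u8]) -> Extend<'a> {
    let size = match data_type {
        DataType::Int32 => 4,
        DataType::Int16 => 2,
    };
    Box::new(move |mutable: &mut Vec<u8>, start: usize, len: usize| {
        mutable.extend_from_slice(&values[start * size..(start + len) * size]);
    })
}
```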
```rust
let values = &array.buffers()[0].data()[array.offset() * size_of::<T>()..];
Box::new(
    move |mutable: &mut _MutableArrayData, start: usize, len: usize| {
        let start = start * size_of::<T>();
```
it seems to me that `size_of::<T>` is the part here that is dependent on type -- maybe you could just pass that as a number
I try to use generics for things that are known at compile time and arguments for things that are only known at runtime. Maybe that is not the correct way of thinking?

In this case in particular, the difference would be between `build_extend::<i32>(array)` and `build_extend(array, size_of::<i32>())`.
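The trade-off between the two signatures can be shown side by side (hypothetical helper names; both compute the same byte range, but the generic version fixes the element width at compile time via monomorphization):

```rust
use std::mem::size_of;

// width known at compile time: `size_of::<T>()` is a constant per instantiation
fn byte_range_generic<T>(start: usize, len: usize) -> (usize, usize) {
    (start * size_of::<T>(), (start + len) * size_of::<T>())
}

// width known only at runtime: passed as an ordinary argument
fn byte_range_runtime(start: usize, len: usize, size: usize) -> (usize, usize) {
    (start * size, (start + len) * size)
}
```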
> I try to use generics for things that are known at compile time and arguments for things that are only known at runtime.

I agree that is a good general rule of thumb.

I guess this was a second way of asking if we could avoid dynamic dispatch - clearly the answer was "no" :)
I found the whole of this comment true. There are other approaches to do this, but the main approach is that one. I believe a scratchpad implementation could also solve this problem from a different angle.
Thanks a lot, @alamb and @vertexclique. I agree with the naming issues here, and great insight into those crates. I do not have strong feelings about naming; I tried to be close to what I am familiar with from Rust (e.g. `Vec::extend`).

A bit of background: @yordan-pavlov did an impressive job on #7798 to make filtering performant; it is really hard to get that performance with a generalization. I had to re-write this code at least 3 times to get it to a stage with comparable performance, and even then, you can see that it depends on the filter conditions.

But there is a (IMO good) reason for generalizing the code. Specifically, the use-case I am looking at is to remove most code from the kernels that build arrays out of existing arrays.

Also, note that this goes beyond masking: it supports taking values repeatedly. I see it as a "builder" bound to a specific `ArrayData`.

The reason the performance remains comparable(-ish) to master is that when it binds to an `ArrayData`, it pre-computes as much as possible. For example,

```rust
let values = &array.buffers()[0].data()[array.offset() * size_of::<T>()..];
Box::new(move |mutable: &mut _MutableArrayData, start: usize, len: usize| {
```

instead of

```rust
Box::new(move |mutable: &mut _MutableArrayData, array: &ArrayData, start: usize, len: usize| {
    let values = &array.buffers()[0].data()[array.offset() * size_of::<T>()..];
```

has a meaningful performance impact because it avoids repeating those lookups and checks on every call.

There is a further boost possible that I explored but did not finish; it requires a bit of extra machinery: use

```rust
type Extend<'a> = Box<Fn(usize, usize) -> () + 'a>;
```

and bind the mutable array itself to the closure.

My point is that there are non-trivial optimizations done in this proposal that were borrowed from the excellent work from @yordan-pavlov, and that are required to keep up with the very high bar set by #7798. This draft is trying to leverage that work on two of our kernels, `filter` and `take`.
@jorgecarleitao thank you for this PR; overall I think it's a great idea to reuse the code between the `take` and `filter` kernels if possible - and you have demonstrated how it can be done; we just have to find a way to keep performance at a good level. I have been wondering whether the `BitChunkIterator` from https://github.com/apache/arrow/blob/master/rust/arrow/src/util/bit_chunk_iterator.rs could be used to improve the filter kernel, so I am happy to see you have already done it.
I have rebased against latest master and further improved the code. IMO this is now ready to review: benchmarks are now better (1.1-3x) for all cases where we pre-compute the filter, except for 1 case (very low selectivity), where it is 1.5x worse. For the single-filter case, I know how to improve it: we need to not pre-compute the filter and instead use an iterator. I can continue to investigate this, but IMO this is already an improvement to the code base and performance.
I will try and review this carefully tomorrow.

FYI I was unable to make this sufficiently efficient for the …

This code can be re-used to implement …
```diff
-let sparse_filter_array = create_bool_array(size, |i| matches!(i % 8000, 0));
-let dense_filter_array = create_bool_array(size, |i| !matches!(i % 8000, 0));
+let filter_array = create_bool_array(size, 0.5);
+let sparse_filter_array = create_bool_array(size, 1.0 - 1.0 / 8000.0);
```
the names of the `sparse_filter_array` (intended to contain mostly 0s) and `dense_filter_array` (intended to contain mostly 1s) variables no longer correspond to the values; should they be reversed?
```rust
pub fn extend(&mut self, len: usize) {
    let remaining_capacity = self.capacity - self.len;
    if len > remaining_capacity {
        self.reserve(self.len + len);
```
should this be `self.reserve(self.len + (len - remaining_capacity));`?
I am sorry, I do not follow: my understanding is that we need to `reserve` so that `self.len <= self.capacity` is always true.

If we have a buffer with capacity 64 and `len` 54, and we extend it by 8, we want to ensure that its new capacity is at least 54+8=62, no?

Your formula yields 54 + (8 - (64 - 54)) = 54 + (8 - 10) = 52 (panics with an underflow, I think).

Another way of writing this code would be:

```rust
self.len += len;
if self.len > self.capacity {
    self.reserve(self.len);
}
```
@jorgecarleitao good spot - what I suggested is more obvious when `len > remaining_capacity`: for example, if `len = 256` in your example, the requested capacity would be 54 + (256 - (64 - 54)) = 54 + (256 - 10) = 300. I think it will still work even in your example, because the requested capacity of 52 will be lower than the existing capacity, so no allocation will be performed; none is necessary, because the requested extension of 8 is smaller than the remaining capacity of 10.
`reserve`'s API is to guarantee a capacity of up to the argument, and `extend`'s API is to increase `self.len` by `len`.

If we call `self.reserve(300)`, but then increase the `len` by `256` to `54+256=310`, we will be able to offset the pointer past the allocated memory (310 > 300), no?
you are right, I misunderstood `reserve(x)` to mean that we want to increase capacity by `x`; apologies
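A toy model of the contract the thread converged on (the struct and growth policy are assumptions for illustration): `reserve` takes an absolute capacity, not an additional amount, and `extend` only reserves when the remaining capacity is insufficient.

```rust
struct Buf {
    len: usize,
    capacity: usize,
}

impl Buf {
    /// guarantees `self.capacity >= capacity` (absolute, not additional);
    /// the power-of-two growth here is just an assumed policy
    fn reserve(&mut self, capacity: usize) {
        if capacity > self.capacity {
            self.capacity = capacity.next_power_of_two();
        }
    }

    /// increases `self.len` by `len`, allocating only if needed
    fn extend(&mut self, len: usize) {
        let remaining_capacity = self.capacity - self.len;
        if len > remaining_capacity {
            self.reserve(self.len + len);
        }
        self.len += len;
    }
}
```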
FWIW keeping "filter" as a special case that is very fast is not unreasonable -- it is likely to be one of the most performance-critical pieces of code in analytics systems, so it is naturally likely to end up as the most complex / optimized.
This PR is based on top of #8630 and contains a physical node to perform an inner join in DataFusion. This is still a draft, but IMO the design is here and the two tests already pass. This is co-authored with @andygrove, who contributed to the design of how to perform this operation in the context of DataFusion (see ARROW-9555 for details). The API used for the computation of the join at the arrow level is briefly discussed in [this document](https://docs.google.com/document/d/1KKuBvfx7uKi-x-tWOL60R1FNjDW8B790zPAv6yAlYcU/edit). There is still a lot to work on, but I thought it would be a good time to have a first round of discussions, and also to gauge timings wrt the 3.0 release.

There are two main issues being addressed in this PR:

* How do we perform the join at the partition level: this PR collects all batches from the left, and then issues a stream per part on the right. Each batch on that stream joins itself with all the ones from the left (N) via a hash. This allows us to only compute the hash of a row once (first all of the left, then one by one on the right).
* How do we build an array from `N (left)` arrays and a set of indices (matching the hash from the right): this is done using the `MutableArrayData` being worked on in #8630, which incrementally memcopies slots from each of the N arrays based on the index. This implementation is useful because it works for all array types and does not require casting anything to rust native types (i.e. it operates on `ArrayData`, not specific implementations).

There are still some steps left to have a join in SQL, most notably the whole logical planning, the output_partition logic, the bindings to the SQL and DataFrame APIs, updating the optimizers to handle nodes with 2 children, and a whole battery of tests.

There is also a natural path for the other joins, as it will be a matter of incorporating the work already in PR #8689, which introduces the option to extend the `MutableArrayData` with nulls, the operation required for left and right joins.

Closes #8709 from jorgecarleitao/join2

Authored-by: Jorge C. Leitao <jorgecarleitao@gmail.com>
Signed-off-by: Andy Grove <andygrove73@gmail.com>
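The second bullet above — building an output column from N left batches plus a set of matching indices — has a simple shape. A sketch, simplified to `i32` with hypothetical names (the real code gathers via `MutableArrayData` over `ArrayData`):

```rust
// Gather rows from N left batches by (batch_index, row_index) pairs,
// as produced by the hash probe on the right side.
fn gather(batches: &[Vec<i32>], indices: &[(usize, usize)]) -> Vec<i32> {
    indices.iter().map(|&(batch, row)| batches[batch][row]).collect()
}
```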
## Motivation

The current code-base to `filter` an array is typed, which means that we can't easily generalize it for arbitrary depths. That code is also similar to the code in `take`: both perform a very similar operation: from an array, produce another array by taking elements from it.

## This PR

This is a proposal to introduce a new `struct` (`MutableArrayData`, for the lack of a better name) that is useful to efficiently construct an array taken from another array. The main use-case of this `struct` is to be used in unary operations that are type-independent, such as `filter`, `take` and `sort`.

With this `struct`, this PR:

* introduces `MutableArrayData` to efficiently create `ArrayData` as partial `memcopies` from another `ArrayData`
* supports `List`s of arbitrary depth and types

The gist of this `struct` can be seen in the example reproduced below:

## Design

There are 3 core ideas on this design:

* it operates on `ArrayData`, i.e. it is (rust-)type-independent and thus can be used on a variety of use-cases.

With this in mind, the API has 3 methods: `new`, `extend` and `freeze`. The code in `filter` was reducing memcopies to some extent, but this PR further expands on this idea by copying the largest possible chunk, instead of up to 64 slots.

## Implementation

* iterate in `filter.rs` via the `BitChunkIterator`
* move code from `filter.rs` to a new module (`array/transform/`)
* let the `filter` code be focused on how to efficiently perform the filter (instead of how to build the buffers)

## Benchmarks

```bash
set -e
git checkout 010d260173a9e110901ca67372a4ac379a615a13
cargo bench --bench filter_kernels
git checkout mutable_filter
cargo bench --bench filter_kernels
```
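The description names a `new`/`extend`/`freeze` API but the example itself did not survive extraction. As a stand-in, here is a toy model of that flow (an assumption: the real `MutableArrayData` operates on `ArrayData` and its buffers; this sketch uses a plain `i32` slice to show the shape only):

```rust
struct MutableArray<'a> {
    source: &'a [i32],
    buffer: Vec<i32>,
}

impl<'a> MutableArray<'a> {
    /// binds to a source array and pre-allocates the output
    fn new(source: &'a [i32], capacity: usize) -> Self {
        Self { source, buffer: Vec::with_capacity(capacity) }
    }

    /// memcopies `source[start..end]` (end exclusive) into the output,
    /// copying the largest possible contiguous chunk in one call
    fn extend(&mut self, start: usize, end: usize) {
        self.buffer.extend_from_slice(&self.source[start..end]);
    }

    /// finishes the construction, yielding the immutable result
    fn freeze(self) -> Vec<i32> {
        self.buffer
    }
}
```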