Fix filter UB and add fast path #341

ritchie46 · 2021-05-24T13:14:41Z

This also fixes UB for filter on RecordBatches. Still issue #295.

Besides this, I also added a fast path. We already do a popcount in the filter operation, it seems to me a missed opportunity to not just Arc clone the data when all values are true.

EDIT:

I also made ArrayData::new_empty public. If there is any objection to that I can make it private. IMO this should be public, as I think it should be easy to make an empty container of data structures.

codecov-commenter · 2021-05-24T13:31:12Z

Codecov Report

Merging #341 (e88a3b1) into master (5ac771a) will increase coverage by 0.04%.
The diff coverage is 78.72%.

@@            Coverage Diff             @@
##           master     #341      +/-   ##
==========================================
+ Coverage   82.56%   82.60%   +0.04%     
==========================================
  Files         162      162              
  Lines       44063    44199     +136     
==========================================
+ Hits        36379    36510     +131     
- Misses       7684     7689       +5

Impacted Files	Coverage Δ
arrow/src/compute/kernels/filter.rs	`91.90% <78.26%> (-1.08%)`	⬇️
arrow/src/array/data.rs	`71.47% <100.00%> (ø)`
parquet/src/column/writer.rs	`93.29% <0.00%> (ø)`
arrow/src/array/array_boolean.rs	`90.90% <0.00%> (ø)`
parquet/src/arrow/arrow_writer.rs	`98.11% <0.00%> (+0.03%)`	⬆️
arrow/src/csv/reader.rs	`89.91% <0.00%> (+1.54%)`	⬆️
arrow/src/array/transform/boolean.rs	`84.61% <0.00%> (+7.69%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5ac771a...e88a3b1. Read the comment docs.

Dandandan · 2021-05-24T21:36:36Z

arrow/src/compute/kernels/filter.rs

-    Ok(make_array(data))
+    if iter.filter_count == array.len() {
+        let data = array.data().clone();
+        Ok(make_array(data))


Couldn't this just return array or array.clone()?

dyn Array is a trait object and does not implement Sized

arrow/src/compute/kernels/filter.rs

alamb · 2021-05-25T21:44:44Z

The MIRI failure is unrelated to this PR: #345

alamb

Thanks @ritchie46 -- This code makes sense to me. @nevi-me @Dandandan or @jorgecarleitao any thoughts?

arrow/src/compute/kernels/filter.rs

jorgecarleitao · 2021-05-26T04:12:05Z

Note that this is not undefined behavior as defined in Rust; the warning is just that when nulls exist, the value of the null slot will be used for filtering regardless of its validity. I.e. it is a semantic undefined behavior.

Do we have some benchmarks available? It would be good to verify that the performance improves here.

ritchie46 · 2021-05-26T06:02:17Z

Do we have some benchmarks available? It would be good to verify that the performance improves here.

Gnuplot not found, using plotters backend
all true str 10k        time:   [202.99 ns 203.24 ns 203.59 ns]                             
                        change: [-99.808% -99.807% -99.807%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  5 (5.00%) high mild
  3 (3.00%) high severe

all true int 10k        time:   [187.96 ns 188.03 ns 188.09 ns]                             
                        change: [-98.627% -98.625% -98.623%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 14 outliers among 100 measurements (14.00%)
  1 (1.00%) low severe
  1 (1.00%) high mild
  12 (12.00%) high severe

all false str 10k       time:   [323.02 ns 323.13 ns 323.25 ns]                              
                        change: [-66.262% -64.457% -62.442%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  3 (3.00%) low severe
  2 (2.00%) low mild
  4 (4.00%) high mild
  3 (3.00%) high severe

all false int 10k       time:   [194.24 ns 194.25 ns 194.25 ns]                              
                        change: [-58.900% -58.838% -58.787%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  1 (1.00%) low mild
  8 (8.00%) high mild
  3 (3.00%) high severe

Counting ns :).

Note that this benchmark is sensitive to scale, because the filter algorithm is O(n) and the fast path O(1)

jorgecarleitao · 2021-05-26T06:19:05Z

yeap, that I would expect. :) I was thinking about the case where there is a null mask and non-null values.

jorgecarleitao

Looks great! This has great consequences. E.g. it removes the need to constant remove filters in DataFusion,.

Thanks a lot @ritchie46 !

arrow/src/compute/kernels/filter.rs

Dandandan · 2021-05-26T06:31:36Z

Looks great! This has great consequences. E.g. it removes the need to constant remove filters in DataFusion,.

Thanks a lot @ritchie46 !

Curious, what do you mean by constant remove filters?

rename argument: 'filter' to 'predicate' to reduce name collissions.

jorgecarleitao · 2021-05-26T06:35:13Z

I mean that we no longer need to try to prune expressions of the form 1 == 1: they automatically get converted to trues, which amounts to a O(1) evaluation in the filter. (We may still do it to make the plan cleaner)

ritchie46 · 2021-05-26T06:37:38Z

I mean that we no longer need to try to prune expressions of the form 1 == 1: they automatically get converted to trues, which amounts to a O(1) evaluation in the filter. (We may still do it to make the plan cleaner)

This was the same reason for met doing the PR.

I paid a relatively heavy price for filtering null values on a DataFrame that did not have any nulls (so all filter values were true).

Dandandan · 2021-05-26T06:42:20Z

I mean that we no longer need to try to prune expressions of the form 1 == 1: they automatically get converted to trues, which amounts to a O(1) evaluation in the filter. (We may still do it to make the plan cleaner)

👍
I am not sure the call is O(1), the popcount/bitcount still has to be computed, no? Still faster than the other parts of course.

ritchie46 · 2021-05-26T06:45:25Z

I am not sure the call is O(1), the popcount/bitcount still has to be computed, no? Still faster than the other parts of course.

The call not indeed. Only the fast paths.

alamb · 2021-05-26T20:07:26Z

Thanks @ritchie46 !

* fix ub in filter record_batch * filter fast path * add all false fast path * use new_empty_array * rename filter kernel argument rename argument: 'filter' to 'predicate' to reduce name collissions.

Dandandan · 2021-05-27T10:50:59Z

This PR and benchmark got me thinking - it is faster to clone an array than to create the equivalent null array from scratch (as can be seen in micro benchmarks above).
But we could clone the contents of the array and only create the zeroed null buffer.

It can increase memory usage though - as the contents might be big, for string arrays, which could be deallocated otherwise if it has a reference count of 0. On the other hand - this is something we do too for slice and other kernels.

alamb · 2021-05-27T12:42:53Z

But we could clone the contents of the array and only create the zeroed null buffer.

I think this makes a lot of sense personally (even if it might increase the peak memory usage of the system). Most filter masks are applied immediately (e.g. as part of some filtering operation) so I think so the amount of additional time that a large string array would be held on to is likely minimal

* fix ub in filter record_batch * filter fast path * add all false fast path * use new_empty_array * rename filter kernel argument rename argument: 'filter' to 'predicate' to reduce name collissions. Co-authored-by: Ritchie Vink <ritchie46@gmail.com>

ritchie46 added 2 commits May 24, 2021 14:57

fix ub in filter record_batch

bace8d4

filter fast path

a18608b

ritchie46 changed the title ~~Filter~~ Fix filter UB and add fast path May 24, 2021

Dandandan reviewed May 24, 2021

View reviewed changes

arrow/src/compute/kernels/filter.rs Outdated Show resolved Hide resolved

add all false fast path

c011cff

alamb approved these changes May 25, 2021

View reviewed changes

Dandandan reviewed May 25, 2021

View reviewed changes

arrow/src/compute/kernels/filter.rs Outdated Show resolved Hide resolved

use new_empty_array

78cb966

jorgecarleitao approved these changes May 26, 2021

View reviewed changes

arrow/src/compute/kernels/filter.rs Outdated Show resolved Hide resolved

rename filter kernel argument

e88a3b1

rename argument: 'filter' to 'predicate' to reduce name collissions.

alamb merged commit e85dc98 into apache:master May 26, 2021

alamb mentioned this pull request May 26, 2021

Cherry pick Fix filter UB and add fast path to active_release #372

Merged

alamb mentioned this pull request May 27, 2021

Add changelog and bump version for proposed 4.2.0 release #375

Merged

alamb added the cherry-picked PR that was backported to active release (will be included in maintenance release) label Jun 3, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix filter UB and add fast path #341

Fix filter UB and add fast path #341

ritchie46 commented May 24, 2021 •

edited

codecov-commenter commented May 24, 2021 •

edited

Dandandan May 24, 2021

ritchie46 May 25, 2021

alamb commented May 25, 2021

alamb left a comment

jorgecarleitao commented May 26, 2021

ritchie46 commented May 26, 2021

jorgecarleitao commented May 26, 2021

jorgecarleitao left a comment

Dandandan commented May 26, 2021

jorgecarleitao commented May 26, 2021

ritchie46 commented May 26, 2021 •

edited

Dandandan commented May 26, 2021 •

edited

ritchie46 commented May 26, 2021

alamb commented May 26, 2021

Dandandan commented May 27, 2021 •

edited

alamb commented May 27, 2021

Fix filter UB and add fast path #341

Fix filter UB and add fast path #341

Conversation

ritchie46 commented May 24, 2021 • edited

codecov-commenter commented May 24, 2021 • edited

Codecov Report

Dandandan May 24, 2021

Choose a reason for hiding this comment

ritchie46 May 25, 2021

Choose a reason for hiding this comment

alamb commented May 25, 2021

alamb left a comment

Choose a reason for hiding this comment

jorgecarleitao commented May 26, 2021

ritchie46 commented May 26, 2021

jorgecarleitao commented May 26, 2021

jorgecarleitao left a comment

Choose a reason for hiding this comment

Dandandan commented May 26, 2021

jorgecarleitao commented May 26, 2021

ritchie46 commented May 26, 2021 • edited

Dandandan commented May 26, 2021 • edited

ritchie46 commented May 26, 2021

alamb commented May 26, 2021

Dandandan commented May 27, 2021 • edited

alamb commented May 27, 2021

ritchie46 commented May 24, 2021 •

edited

codecov-commenter commented May 24, 2021 •

edited

ritchie46 commented May 26, 2021 •

edited

Dandandan commented May 26, 2021 •

edited

Dandandan commented May 27, 2021 •

edited