
Use SlicesIterator for ArrayData Equality #3198

Merged (4 commits) on Nov 27, 2022

Conversation

viirya (Member) commented Nov 26, 2022

Which issue does this PR close?

Closes #1600.

Rationale for this change

This mostly improves the performance of long arrays with few nulls, e.g. the added benchmark:

equal_51200             time:   [72.533 µs 72.653 µs 72.810 µs]     
                        change: [-30.087% -30.012% -29.938%] (p = 0.00 < 0.05)
                        Performance has improved.   

For arrays with many nulls, using SlicesIterator actually causes a regression, so I only use SlicesIterator for the cases with fewer nulls.
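The selection logic described above can be sketched roughly as follows. This is a minimal illustration with hypothetical names (`EqStrategy`, `choose_strategy`), not the actual arrow-rs code; it only captures the idea of gating on null density before using the slices-based path:

```rust
// Hypothetical sketch, not the real arrow-rs implementation: pick a
// comparison strategy based on null density, mirroring the PR's idea of
// only using SlicesIterator when nulls are sparse.

const NULL_SLICES_SELECTIVITY_THRESHOLD: f64 = 0.4;

#[derive(Debug, PartialEq)]
enum EqStrategy {
    /// Compare contiguous runs of valid values (SlicesIterator-style).
    Slices,
    /// Fall back to element-by-element comparison.
    PerElement,
}

fn choose_strategy(null_count: usize, len: usize) -> EqStrategy {
    if len == 0 {
        // An empty array has no nulls to skip; the run-based path is trivial.
        return EqStrategy::Slices;
    }
    let null_density = null_count as f64 / len as f64;
    if null_density < NULL_SLICES_SELECTIVITY_THRESHOLD {
        EqStrategy::Slices
    } else {
        EqStrategy::PerElement
    }
}
```

With few nulls the valid values form long contiguous runs, so comparing whole runs amortizes the per-element bookkeeping; with many nulls the runs are short and the run-iteration overhead dominates.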

What changes are included in this PR?

Are there any user-facing changes?

@github-actions github-actions bot added arrow Changes to the arrow crate parquet Changes to the parquet crate labels Nov 26, 2022
tustvold (Contributor) left a comment


Perhaps this could be simplified by using try_for_each_valid_idx?

viirya (Member, Author) commented Nov 27, 2022

Perhaps this could be simplified by using try_for_each_valid_idx?

I tried try_for_each_valid_idx, but the benchmark results are not good:

equal_nulls_512         time:   [1.8361 µs 1.8471 µs 1.8602 µs]
                        change: [+171.23% +172.28% +173.48%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 23 outliers among 100 measurements (23.00%)
  1 (1.00%) low severe
  1 (1.00%) low mild
  3 (3.00%) high mild
  18 (18.00%) high severe

equal_51200             time:   [303.39 µs 303.51 µs 303.67 µs]
                        change: [+191.52% +191.71% +191.91%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 4 outliers among 100 measurements (4.00%)

try_for_each_valid_idx compares each element individually, so it doesn't help on long arrays with few nulls (i.e., equal_51200), and it also causes a significant regression in the many-null case.
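The per-valid-index approach discussed here can be illustrated roughly like this. This is a hedged sketch with a hypothetical helper, not the real try_for_each_valid_idx signature; the point is that it still visits every valid index one at a time:

```rust
// Hypothetical sketch of a per-element fallback: visit every valid
// index individually and compare the corresponding values. Even for a
// nearly all-valid array this does one comparison per element, which is
// why it cannot beat comparing long contiguous runs in bulk.
fn equal_per_valid_idx(lhs: &[i32], rhs: &[i32], valid: &[bool]) -> bool {
    valid
        .iter()
        .enumerate()
        .filter(|(_, &v)| v)
        .all(|(i, _)| lhs[i] == rhs[i])
}
```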

tustvold (Contributor) left a comment


Could we possibly back out the changes to SlicesIterator and just use BitSliceIterator, then I think this is good to go 👍

///
/// Only performant for filters that copy across long contiguous runs
#[derive(Debug)]
pub struct SlicesIterator<'a>(BitSliceIterator<'a>);
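For illustration, a simplified iterator with the assumed semantics of BitSliceIterator — yielding `(start, end)` ranges of contiguous set bits — might look like the following. This is a sketch over a `&[bool]` for clarity; the real iterator works on a packed bitmask in a Buffer:

```rust
// Minimal illustration (assumed semantics, not the arrow-rs code): an
// iterator yielding half-open (start, end) ranges of contiguous set
// bits, similar to what BitSliceIterator provides.
struct BitRuns<'a> {
    bits: &'a [bool],
    pos: usize,
}

impl<'a> Iterator for BitRuns<'a> {
    type Item = (usize, usize);

    fn next(&mut self) -> Option<(usize, usize)> {
        // Skip cleared bits to find the start of the next run.
        while self.pos < self.bits.len() && !self.bits[self.pos] {
            self.pos += 1;
        }
        if self.pos == self.bits.len() {
            return None;
        }
        let start = self.pos;
        // Advance to the end of the run of set bits.
        while self.pos < self.bits.len() && self.bits[self.pos] {
            self.pos += 1;
        }
        Some((start, self.pos))
    }
}
```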
A contributor commented:

Moving this is a breaking change, as it is exposed publicly - https://docs.rs/arrow-select/latest/arrow_select/filter/struct.SlicesIterator.html


impl<'a> SlicesIterator<'a> {
    pub fn new_from_buffer(values: &'a Buffer, offset: usize, len: usize) -> Self {
        Self(BitSliceIterator::new(values, offset, len))
A contributor commented:

At this point why not just use BitSliceIterator?


use super::utils::equal_len;

pub(crate) const NULL_SLICES_SELECTIVITY_THRESHOLD: f64 = 0.4;
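To show how such runs could drive the equality check, here is a hedged sketch (a hypothetical helper, not the PR's code) that compares two value slices only over the valid `(start, end)` runs, skipping null positions entirely:

```rust
// Hypothetical sketch: given precomputed runs of valid indices, compare
// two equal-length value slices only over those runs. Each run is a
// single bulk slice comparison, so few long runs beat per-element work.
fn equal_over_runs(lhs: &[i32], rhs: &[i32], runs: &[(usize, usize)]) -> bool {
    runs.iter()
        .all(|&(start, end)| lhs[start..end] == rhs[start..end])
}
```

Values at positions not covered by any run (i.e., nulls) never get compared, which matches the semantics of null-aware equality.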
A contributor commented:

How did you come up with 0.4, not saying it is a bad choice, just curious

viirya (Member, Author) replied:

It came from benchmarking: equal_nulls_512 has 0.5 null density, and this change doesn't improve it, so I picked 0.4 as the threshold.

@github-actions github-actions bot removed the parquet Changes to the parquet crate label Nov 27, 2022
@tustvold tustvold merged commit f985818 into apache:master Nov 27, 2022
ursabot commented Nov 27, 2022

Benchmark runs are scheduled for baseline = 1daf7d3 and contender = f985818. f985818 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Labels: arrow (Changes to the arrow crate)
Linked issue: ArrayData Equality Slice Iterator
3 participants