Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support the length kernel on Binary Array #1465

Merged
merged 5 commits into from Mar 24, 2022

Conversation

HaoYang670
Copy link
Contributor

@HaoYang670 HaoYang670 commented Mar 20, 2022

Which issue does this PR close?

Closes #1464.

Rationale for this change

The length kernel can work with BinaryArray now!.

What changes are included in this PR?

  1. add length and bit_length functions for Binary Array
  2. rewrite unary_offsets using macro
  3. add tests.

Micro benchmark

bit_length              time:   [589.47 ns 593.35 ns 600.04 ns]                        
                        change: [+1.4956% +1.9389% +2.7873%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 16 outliers among 100 measurements (16.00%)
  1 (1.00%) low severe
  2 (2.00%) low mild
  1 (1.00%) high mild
  12 (12.00%) high severe

length                  time:   [577.85 ns 578.81 ns 579.83 ns]                    
                        change: [+0.3391% +0.5613% +0.7941%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe

No obvious performance penalty.

Are there any user-facing changes?

Some private functions' APIs are changed.

rewrite unary_offset using macro

Signed-off-by: remzi <13716567376yh@gmail.com>
Signed-off-by: remzi <13716567376yh@gmail.com>
Signed-off-by: remzi <13716567376yh@gmail.com>
Signed-off-by: remzi <13716567376yh@gmail.com>
@github-actions github-actions bot added the arrow Changes to the arrow crate label Mar 20, 2022
@codecov-commenter
Copy link

codecov-commenter commented Mar 20, 2022

Codecov Report

Merging #1465 (7183822) into master (d02425d) will increase coverage by 0.01%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #1465      +/-   ##
==========================================
+ Coverage   82.70%   82.71%   +0.01%     
==========================================
  Files         187      187              
  Lines       54169    54255      +86     
==========================================
+ Hits        44801    44878      +77     
- Misses       9368     9377       +9     
Impacted Files Coverage Δ
arrow/src/compute/kernels/length.rs 100.00% <100.00%> (ø)
arrow/src/array/data.rs 82.80% <0.00%> (-0.16%) ⬇️
integration-testing/src/lib.rs 0.00% <0.00%> (ø)
arrow/src/array/array_binary.rs 92.62% <0.00%> (ø)
parquet/src/arrow/schema.rs 85.68% <0.00%> (+0.11%) ⬆️
parquet/src/arrow/arrow_reader.rs 93.79% <0.00%> (+0.22%) ⬆️
parquet_derive/src/parquet_field.rs 65.98% <0.00%> (+0.22%) ⬆️
arrow/src/array/array_dictionary.rs 91.91% <0.00%> (+0.24%) ⬆️
parquet/src/arrow/array_reader.rs 78.72% <0.00%> (+0.45%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d02425d...7183822. Read the comment docs.

@HaoYang670
Copy link
Contributor Author

@alamb Could you please help to review?

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is very nicely done @HaoYang670 🏅

Thank you and sorry for the review delay


#[test]
fn length_test_large_binary() -> Result<()> {
let value: Vec<&[u8]> = vec![b"zero", &[0xff, 0xf8], b"two"];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💯 for non UTF8

@@ -94,18 +111,20 @@ where
.downcast_ref::<GenericStringArray<O>>()
.unwrap();
let bits_in_bytes = O::from_usize(8).unwrap();
unary_offsets_string::<O, _>(array, T::DATA_TYPE, |x| x * bits_in_bytes)
unary_offsets!(array, T::DATA_TYPE, |x| x * bits_in_bytes)
}

/// Returns an array of Int32/Int64 denoting the number of bytes in each string in the array.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// Returns an array of Int32/Int64 denoting the number of bytes in each string in the array.
/// Returns an array of Int32/Int64 denoting the number of bytes in each string/binary in the array.

@@ -115,13 +134,15 @@ pub fn length(array: &dyn Array) -> Result<ArrayRef> {

/// Returns an array of Int32/Int64 denoting the number of bits in each string in the array.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

Copy link
Member

@viirya viirya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thank you. lgtm

simplify the way to get offsets. No performance penalty

Signed-off-by: remzi <13716567376yh@gmail.com>
// this is a 30% improvement over iterating over u8s and building OffsetSize, which
// justifies the usage of `unsafe`.
let slice: &[O] = &unsafe { offsets.typed_data::<O>() }[$array.offset()..];
let slice = $array.value_offsets();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️ that is much nicer

@alamb alamb merged commit 2932da3 into apache:master Mar 24, 2022
@alamb
Copy link
Contributor

alamb commented Mar 24, 2022

Thanks @HaoYang670 and @viirya

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

The length kernel should work with BinaryArrays
4 participants