Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add substring support for binary #1608

Merged
merged 6 commits into from
Apr 25, 2022

Conversation

HaoYang670
Copy link
Contributor

Which issue does this PR close?

Closes #1593 .

Rationale for this change

What changes are included in this PR?

  1. Add substring support for (Large)BinaryArray
  2. Add some tests
  3. fix a bug in the test of string array
  4. update doc

Are there any user-facing changes?

Update some docs

fix some test for string array

Signed-off-by: remzi <13716567376yh@gmail.com>
Signed-off-by: remzi <13716567376yh@gmail.com>
Signed-off-by: remzi <13716567376yh@gmail.com>
Signed-off-by: remzi <13716567376yh@gmail.com>
Signed-off-by: remzi <13716567376yh@gmail.com>
@github-actions github-actions bot added the arrow Changes to the arrow crate label Apr 23, 2022
@@ -291,11 +575,14 @@ mod tests {

cases.into_iter().try_for_each::<_, Result<()>>(
|(array, start, length, expected)| {
let array = StringArray::from(array);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We did not cover LargeStringArray in the past.

@@ -25,7 +26,68 @@ use crate::{
};
use std::cmp::Ordering;

fn generic_substring<OffsetSize: StringOffsetSizeTrait>(
fn binary_substring<OffsetSize: BinaryOffsetSizeTrait>(
Copy link
Contributor Author

@HaoYang670 HaoYang670 Apr 23, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The implementation is similar to utf8_substring except that there is no char boundary checking.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than replicate quite so much code, I wonder if it would be possible to make generic_substring take a function to check char boundaries and and then pass in a function that does nothing

Maybe like(untested):

fn generic_substring<OffsetSize, F>(
    array: &GenericBinaryArray<OffsetSize>,
    start: OffsetSize,
    length: Option<OffsetSize>,
    check_char_boundary: F,
)
where 
  OffsetSize: StringOffsetSizeTrait,
  F: Fn(OffsetSize) -> Result<OffsetSize>
{
...
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am afraid that BinaryArray and StringArray can not share one API because GenericBinaryArray and GenericStringArray are two different types.
Maybe we could use macro to extract some common codes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it -- makes sense -- duplication makes sense to me then

@codecov-commenter
Copy link

codecov-commenter commented Apr 23, 2022

Codecov Report

Merging #1608 (38787ca) into master (fd9cb23) will increase coverage by 0.04%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #1608      +/-   ##
==========================================
+ Coverage   82.95%   83.00%   +0.04%     
==========================================
  Files         193      193              
  Lines       55435    55571     +136     
==========================================
+ Hits        45988    46125     +137     
+ Misses       9447     9446       -1     
Impacted Files Coverage Δ
arrow/src/compute/kernels/substring.rs 100.00% <100.00%> (ø)
parquet_derive/src/parquet_field.rs 66.21% <0.00%> (+0.22%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update fd9cb23...38787ca. Read the comment docs.

Signed-off-by: remzi <13716567376yh@gmail.com>
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @HaoYang670 -- I went through the tests carefully and it looks good to me.

cc @viirya and @jhorstmann

@@ -25,7 +26,68 @@ use crate::{
};
use std::cmp::Ordering;

fn generic_substring<OffsetSize: StringOffsetSizeTrait>(
fn binary_substring<OffsetSize: BinaryOffsetSizeTrait>(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than replicate quite so much code, I wonder if it would be possible to make generic_substring take a function to check char boundaries and and then pass in a function that does nothing

Maybe like(untested):

fn generic_substring<OffsetSize, F>(
    array: &GenericBinaryArray<OffsetSize>,
    start: OffsetSize,
    length: Option<OffsetSize>,
    check_char_boundary: F,
)
where 
  OffsetSize: StringOffsetSizeTrait,
  F: Fn(OffsetSize) -> Result<OffsetSize>
{
...
}

fn with_nulls<T: 'static + Array + PartialEq + From<Vec<Option<&'static str>>>>(
) -> Result<()> {
#[allow(clippy::type_complexity)]
fn with_nulls_generic_binary<O: BinaryOffsetSizeTrait>() -> Result<()> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this follows string version with_nulls, just wondering why in with_nulls_... only edge cases are tested against. without_nulls_... has normal test cases but it doesn't include nulls.

Copy link
Member

@viirya viirya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Thanks @HaoYang670

@alamb alamb merged commit 3e933ea into apache:master Apr 25, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add support for BinaryArray in substring kernel
4 participants