Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add substring support for FixedSizeBinaryArray #1633

Merged
merged 9 commits into from May 5, 2022

Conversation

HaoYang670
Copy link
Contributor

@HaoYang670 HaoYang670 commented May 1, 2022

Which issue does this PR close?

Closes #1618.

What changes are included in this PR?

  1. make the substring kernel support FixedSizeBinaryArray
  2. add more test cases for binary and utf8
  3. add benchmark for FixedSizeBinaryArray

Are there any user-facing changes?

Rename some benchmarks.

Bench Result

substring utf8 (start = 0, length = None)                                                                            
                        time:   [19.811 ms 19.843 ms 19.883 ms]

substring utf8 (start = 1, length = str_len - 1)                                                                            
                        time:   [21.270 ms 21.284 ms 21.299 ms]


substring fixed size binary array                                                                            
                        time:   [19.717 ms 19.753 ms 19.798 ms]

Signed-off-by: remzi <13716567376yh@gmail.com>
Signed-off-by: remzi <13716567376yh@gmail.com>
Signed-off-by: remzi <13716567376yh@gmail.com>
Signed-off-by: remzi <13716567376yh@gmail.com>
Signed-off-by: remzi <13716567376yh@gmail.com>
Signed-off-by: remzi <13716567376yh@gmail.com>
Signed-off-by: remzi <13716567376yh@gmail.com>
@github-actions github-actions bot added the arrow Changes to the arrow crate label May 1, 2022
Comment on lines +100 to +103
let new_len = match length {
Some(len) => len.min(old_len - new_start),
None => old_len - new_start,
};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible that if length is negative? DataType::FixedSizeBinary(negative) seems invalid.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the answer is no for several reasons:

  1. the definition of the public function makes sure that length is always >=0
pub fn substring(array: &dyn Array, start: i64, length: Option<u64>) -> Result<ArrayRef> {
  1. the definition of the data type of FixedSizeBinary seems to allow negative value, although it is somewhat weird:
    /// Opaque binary data of fixed size.
    /// Enum parameter specifies the number of bytes per value.
    FixedSizeBinary(i32),

Although it may be more reasonable to use unsigned int to type the length, in Apache Arrow specification, the length must be i32. https://arrow.apache.org/docs/format/Columnar.html#fixed-size-list-layout (For nested arrays, we always use signed integer to represent the length or offsets.)

A fixed size list type is specified like FixedSizeList<T>[N], where T is any type (primitive or nested) and N is a 32-bit signed integer representing the length of the lists.

What would happen if we give a negative length or negative offsets buffer?
This is a fun game!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any discussion or explanation of why we use signed integer to represent length?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, that's right, I didn't notice that length is u64.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since unsigned integers can be more difficult to work with in some cases (e.g. in the JVM), we recommend preferring signed integers over unsigned integers for representing dictionary indices. Additionally, we recommend avoiding using 64-bit unsigned integer indices unless they are required by an application.

This is some wordings I read in the spec. It is not talking about array length, but I think it might be somehow related to the question.

Copy link
Member

@viirya viirya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me in general. Thanks @HaoYang670.

Comment on lines +307 to +308
// all-nulls array is always identical
(vec![None, None, None], -1, Some(1), vec![None, None, None]),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding test coverage.

Comment on lines 125 to 126
array.data_ref().null_buffer().cloned(),
0,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

0 offset works for the new array value new_values. But null_buffer is cloned, so maybe it has different offset?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much for finding the bug. I have fixed it by getting the bit_slice of null_buffer. Also a corresponding test is added.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As for other arrays' implementations we have the same bug, I will file a follow-up issue to fix it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @HaoYang670

Signed-off-by: remzi <13716567376yh@gmail.com>
Signed-off-by: remzi <13716567376yh@gmail.com>
@codecov-commenter
Copy link

Codecov Report

Merging #1633 (566e0ff) into master (7d00e3c) will increase coverage by 0.05%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #1633      +/-   ##
==========================================
+ Coverage   83.02%   83.08%   +0.05%     
==========================================
  Files         193      193              
  Lines       55577    55772     +195     
==========================================
+ Hits        46145    46338     +193     
- Misses       9432     9434       +2     
Impacted Files Coverage Δ
arrow/src/compute/kernels/substring.rs 100.00% <100.00%> (ø)
arrow/src/array/transform/mod.rs 86.57% <0.00%> (-0.23%) ⬇️
arrow/src/array/array_union.rs 90.63% <0.00%> (-0.09%) ⬇️
parquet_derive/src/parquet_field.rs 66.21% <0.00%> (ø)
parquet/src/arrow/arrow_writer.rs 97.66% <0.00%> (+<0.01%) ⬆️
parquet/src/encodings/encoding.rs 93.56% <0.00%> (+0.18%) ⬆️
arrow/src/array/array_binary.rs 93.26% <0.00%> (+0.28%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7d00e3c...566e0ff. Read the comment docs.

@viirya
Copy link
Member

viirya commented May 4, 2022

I will merge this later today if no more comments.

cc @tustvold @alamb

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have any more comments other than thank you @HaoYang670 and @viirya 👍

@viirya
Copy link
Member

viirya commented May 5, 2022

Thank you @HaoYang670 @alamb ! 😄

@viirya viirya merged commit e51de5e into apache:master May 5, 2022
@HaoYang670
Copy link
Contributor Author

Thanks a lot for your review @viirya @alamb !

@HaoYang670 HaoYang670 deleted the fixed_size_binary_substring branch May 5, 2022 01:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add substring support for FixedSizeBinaryArray
4 participants