Add substring support for FixedSizeBinaryArray #1633
Changes from 7 commits
bc1484d
4065ae6
a22e7cc
3373b6c
e73a737
c7e3624
08bf094
6906fc0
566e0ff
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -86,6 +86,52 @@ fn binary_substring<OffsetSize: BinaryOffsetSizeTrait>( | |
Ok(make_array(data)) | ||
} | ||
|
||
fn fixed_size_binary_substring( | ||
array: &FixedSizeBinaryArray, | ||
old_len: i32, | ||
start: i32, | ||
length: Option<i32>, | ||
) -> Result<ArrayRef> { | ||
let new_start = if start >= 0 { | ||
start.min(old_len) | ||
} else { | ||
(old_len + start).max(0) | ||
}; | ||
let new_len = match length { | ||
Some(len) => len.min(old_len - new_start), | ||
None => old_len - new_start, | ||
}; | ||
|
||
// build value buffer | ||
let num_of_elements = array.len(); | ||
let values = array.value_data(); | ||
let data = values.as_slice(); | ||
let mut new_values = MutableBuffer::new(num_of_elements * (new_len as usize)); | ||
(0..num_of_elements) | ||
.map(|idx| { | ||
let offset = array.value_offset(idx); | ||
( | ||
(offset + new_start) as usize, | ||
(offset + new_start + new_len) as usize, | ||
) | ||
}) | ||
.for_each(|(start, end)| new_values.extend_from_slice(&data[start..end])); | ||
|
||
let array_data = unsafe { | ||
ArrayData::new_unchecked( | ||
DataType::FixedSizeBinary(new_len), | ||
num_of_elements, | ||
None, | ||
array.data_ref().null_buffer().cloned(), | ||
0, | ||
Thank you very much for finding the bug. I have fixed it by getting the
As for the other arrays' implementations, they have the same bug; I will file a follow-up issue to fix it.
Thank you @HaoYang670 |
||
vec![new_values.into()], | ||
vec![], | ||
) | ||
}; | ||
|
||
Ok(make_array(array_data)) | ||
} | ||
|
||
/// substring by byte | ||
fn utf8_substring<OffsetSize: StringOffsetSizeTrait>( | ||
array: &GenericStringArray<OffsetSize>, | ||
|
@@ -219,6 +265,15 @@ pub fn substring(array: &dyn Array, start: i64, length: Option<u64>) -> Result<A | |
start as i32, | ||
length.map(|e| e as i32), | ||
), | ||
DataType::FixedSizeBinary(old_len) => fixed_size_binary_substring( | ||
array | ||
.as_any() | ||
.downcast_ref::<FixedSizeBinaryArray>() | ||
.expect("a fixed size binary is expected"), | ||
*old_len, | ||
start as i32, | ||
length.map(|e| e as i32), | ||
), | ||
DataType::LargeUtf8 => utf8_substring( | ||
array | ||
.as_any() | ||
|
@@ -249,6 +304,8 @@ mod tests { | |
#[allow(clippy::type_complexity)] | ||
fn with_nulls_generic_binary<O: BinaryOffsetSizeTrait>() -> Result<()> { | ||
let cases: Vec<(Vec<Option<&[u8]>>, i64, Option<u64>, Vec<Option<&[u8]>>)> = vec![ | ||
// all-nulls array is always identical | ||
(vec![None, None, None], -1, Some(1), vec![None, None, None]), | ||
Comment on lines +310 to +311: Thanks for adding test coverage. |
||
// identity | ||
( | ||
vec![Some(b"hello"), None, Some(&[0xf8, 0xf9, 0xff, 0xfa])], | ||
|
@@ -318,6 +375,8 @@ mod tests { | |
#[allow(clippy::type_complexity)] | ||
fn without_nulls_generic_binary<O: BinaryOffsetSizeTrait>() -> Result<()> { | ||
let cases: Vec<(Vec<&[u8]>, i64, Option<u64>, Vec<&[u8]>)> = vec![ | ||
// empty array is always identical | ||
(vec![b"", b"", b""], 2, Some(1), vec![b"", b"", b""]), | ||
// increase start | ||
( | ||
vec![b"hello", b"", &[0xf8, 0xf9, 0xff, 0xfa]], | ||
|
@@ -453,8 +512,265 @@ mod tests { | |
without_nulls_generic_binary::<i64>() | ||
} | ||
|
||
#[test] | ||
#[allow(clippy::type_complexity)] | ||
fn with_nulls_fixed_size_binary() -> Result<()> { | ||
let cases: Vec<(Vec<Option<&[u8]>>, i64, Option<u64>, Vec<Option<&[u8]>>)> = vec![ | ||
// all-nulls array is always identical | ||
(vec![None, None, None], 3, Some(2), vec![None, None, None]), | ||
// increase start | ||
( | ||
vec![Some(b"cat"), None, Some(&[0xf8, 0xf9, 0xff])], | ||
0, | ||
None, | ||
vec![Some(b"cat"), None, Some(&[0xf8, 0xf9, 0xff])], | ||
), | ||
( | ||
vec![Some(b"cat"), None, Some(&[0xf8, 0xf9, 0xff])], | ||
1, | ||
None, | ||
vec![Some(b"at"), None, Some(&[0xf9, 0xff])], | ||
), | ||
( | ||
vec![Some(b"cat"), None, Some(&[0xf8, 0xf9, 0xff])], | ||
2, | ||
None, | ||
vec![Some(b"t"), None, Some(&[0xff])], | ||
), | ||
( | ||
vec![Some(b"cat"), None, Some(&[0xf8, 0xf9, 0xff])], | ||
3, | ||
None, | ||
vec![Some(b""), None, Some(&[])], | ||
), | ||
( | ||
vec![Some(b"cat"), None, Some(&[0xf8, 0xf9, 0xff])], | ||
10, | ||
None, | ||
vec![Some(b""), None, Some(b"")], | ||
), | ||
// increase start negatively | ||
( | ||
vec![Some(b"cat"), None, Some(&[0xf8, 0xf9, 0xff])], | ||
-1, | ||
None, | ||
vec![Some(b"t"), None, Some(&[0xff])], | ||
), | ||
( | ||
vec![Some(b"cat"), None, Some(&[0xf8, 0xf9, 0xff])], | ||
-2, | ||
None, | ||
vec![Some(b"at"), None, Some(&[0xf9, 0xff])], | ||
), | ||
( | ||
vec![Some(b"cat"), None, Some(&[0xf8, 0xf9, 0xff])], | ||
-3, | ||
None, | ||
vec![Some(b"cat"), None, Some(&[0xf8, 0xf9, 0xff])], | ||
), | ||
( | ||
vec![Some(b"cat"), None, Some(&[0xf8, 0xf9, 0xff])], | ||
-10, | ||
None, | ||
vec![Some(b"cat"), None, Some(&[0xf8, 0xf9, 0xff])], | ||
), | ||
// increase length | ||
( | ||
vec![Some(b"cat"), None, Some(&[0xf8, 0xf9, 0xff])], | ||
1, | ||
Some(1), | ||
vec![Some(b"a"), None, Some(&[0xf9])], | ||
), | ||
( | ||
vec![Some(b"cat"), None, Some(&[0xf8, 0xf9, 0xff])], | ||
1, | ||
Some(2), | ||
vec![Some(b"at"), None, Some(&[0xf9, 0xff])], | ||
), | ||
( | ||
vec![Some(b"cat"), None, Some(&[0xf8, 0xf9, 0xff])], | ||
1, | ||
Some(3), | ||
vec![Some(b"at"), None, Some(&[0xf9, 0xff])], | ||
), | ||
( | ||
vec![Some(b"cat"), None, Some(&[0xf8, 0xf9, 0xff])], | ||
-3, | ||
Some(1), | ||
vec![Some(b"c"), None, Some(&[0xf8])], | ||
), | ||
( | ||
vec![Some(b"cat"), None, Some(&[0xf8, 0xf9, 0xff])], | ||
-3, | ||
Some(2), | ||
vec![Some(b"ca"), None, Some(&[0xf8, 0xf9])], | ||
), | ||
( | ||
vec![Some(b"cat"), None, Some(&[0xf8, 0xf9, 0xff])], | ||
-3, | ||
Some(3), | ||
vec![Some(b"cat"), None, Some(&[0xf8, 0xf9, 0xff])], | ||
), | ||
( | ||
vec![Some(b"cat"), None, Some(&[0xf8, 0xf9, 0xff])], | ||
-3, | ||
Some(4), | ||
vec![Some(b"cat"), None, Some(&[0xf8, 0xf9, 0xff])], | ||
), | ||
]; | ||
|
||
cases.into_iter().try_for_each::<_, Result<()>>( | ||
|(array, start, length, expected)| { | ||
let array = FixedSizeBinaryArray::try_from_sparse_iter(array.into_iter()) | ||
.unwrap(); | ||
let result = substring(&array, start, length)?; | ||
assert_eq!(array.len(), result.len()); | ||
let result = result | ||
.as_any() | ||
.downcast_ref::<FixedSizeBinaryArray>() | ||
.unwrap(); | ||
let expected = | ||
FixedSizeBinaryArray::try_from_sparse_iter(expected.into_iter()) | ||
.unwrap(); | ||
assert_eq!(&expected, result,); | ||
Ok(()) | ||
}, | ||
)?; | ||
|
||
Ok(()) | ||
} | ||
|
||
#[test] | ||
#[allow(clippy::type_complexity)] | ||
fn without_nulls_fixed_size_binary() -> Result<()> { | ||
let cases: Vec<(Vec<&[u8]>, i64, Option<u64>, Vec<&[u8]>)> = vec![ | ||
// empty array is always identical | ||
(vec![b"", b"", &[]], 3, Some(2), vec![b"", b"", &[]]), | ||
// increase start | ||
( | ||
vec![b"cat", b"dog", &[0xf8, 0xf9, 0xff]], | ||
0, | ||
None, | ||
vec![b"cat", b"dog", &[0xf8, 0xf9, 0xff]], | ||
), | ||
( | ||
vec![b"cat", b"dog", &[0xf8, 0xf9, 0xff]], | ||
1, | ||
None, | ||
vec![b"at", b"og", &[0xf9, 0xff]], | ||
), | ||
( | ||
vec![b"cat", b"dog", &[0xf8, 0xf9, 0xff]], | ||
2, | ||
None, | ||
vec![b"t", b"g", &[0xff]], | ||
), | ||
( | ||
vec![b"cat", b"dog", &[0xf8, 0xf9, 0xff]], | ||
3, | ||
None, | ||
vec![b"", b"", &[]], | ||
), | ||
( | ||
vec![b"cat", b"dog", &[0xf8, 0xf9, 0xff]], | ||
10, | ||
None, | ||
vec![b"", b"", b""], | ||
), | ||
// increase start negatively | ||
( | ||
vec![b"cat", b"dog", &[0xf8, 0xf9, 0xff]], | ||
-1, | ||
None, | ||
vec![b"t", b"g", &[0xff]], | ||
), | ||
( | ||
vec![b"cat", b"dog", &[0xf8, 0xf9, 0xff]], | ||
-2, | ||
None, | ||
vec![b"at", b"og", &[0xf9, 0xff]], | ||
), | ||
( | ||
vec![b"cat", b"dog", &[0xf8, 0xf9, 0xff]], | ||
-3, | ||
None, | ||
vec![b"cat", b"dog", &[0xf8, 0xf9, 0xff]], | ||
), | ||
( | ||
vec![b"cat", b"dog", &[0xf8, 0xf9, 0xff]], | ||
-10, | ||
None, | ||
vec![b"cat", b"dog", &[0xf8, 0xf9, 0xff]], | ||
), | ||
// increase length | ||
( | ||
vec![b"cat", b"dog", &[0xf8, 0xf9, 0xff]], | ||
1, | ||
Some(1), | ||
vec![b"a", b"o", &[0xf9]], | ||
), | ||
( | ||
vec![b"cat", b"dog", &[0xf8, 0xf9, 0xff]], | ||
1, | ||
Some(2), | ||
vec![b"at", b"og", &[0xf9, 0xff]], | ||
), | ||
( | ||
vec![b"cat", b"dog", &[0xf8, 0xf9, 0xff]], | ||
1, | ||
Some(3), | ||
vec![b"at", b"og", &[0xf9, 0xff]], | ||
), | ||
( | ||
vec![b"cat", b"dog", &[0xf8, 0xf9, 0xff]], | ||
-3, | ||
Some(1), | ||
vec![b"c", b"d", &[0xf8]], | ||
), | ||
( | ||
vec![b"cat", b"dog", &[0xf8, 0xf9, 0xff]], | ||
-3, | ||
Some(2), | ||
vec![b"ca", b"do", &[0xf8, 0xf9]], | ||
), | ||
( | ||
vec![b"cat", b"dog", &[0xf8, 0xf9, 0xff]], | ||
-3, | ||
Some(3), | ||
vec![b"cat", b"dog", &[0xf8, 0xf9, 0xff]], | ||
), | ||
( | ||
vec![b"cat", b"dog", &[0xf8, 0xf9, 0xff]], | ||
-3, | ||
Some(4), | ||
vec![b"cat", b"dog", &[0xf8, 0xf9, 0xff]], | ||
), | ||
]; | ||
|
||
cases.into_iter().try_for_each::<_, Result<()>>( | ||
|(array, start, length, expected)| { | ||
let array = | ||
FixedSizeBinaryArray::try_from_iter(array.into_iter()).unwrap(); | ||
let result = substring(&array, start, length)?; | ||
assert_eq!(array.len(), result.len()); | ||
let result = result | ||
.as_any() | ||
.downcast_ref::<FixedSizeBinaryArray>() | ||
.unwrap(); | ||
let expected = | ||
FixedSizeBinaryArray::try_from_iter(expected.into_iter()).unwrap(); | ||
assert_eq!(&expected, result,); | ||
Ok(()) | ||
}, | ||
)?; | ||
|
||
Ok(()) | ||
} | ||
|
||
fn with_nulls_generic_string<O: StringOffsetSizeTrait>() -> Result<()> { | ||
let cases = vec![ | ||
// all-nulls array is always identical | ||
(vec![None, None, None], 0, None, vec![None, None, None]), | ||
// identity | ||
( | ||
vec![Some("hello"), None, Some("word")], | ||
|
@@ -523,6 +839,8 @@ mod tests { | |
|
||
fn without_nulls_generic_string<O: StringOffsetSizeTrait>() -> Result<()> { | ||
let cases = vec![ | ||
// empty array is always identical | ||
(vec!["", "", ""], 0, None, vec!["", "", ""]), | ||
// increase start | ||
( | ||
vec!["hello", "", "word"], | ||
|
Is it possible for `length` to be negative? `DataType::FixedSizeBinary(negative)` seems invalid.

I think the answer is no for several reasons:
- `length` is always >= 0.
- `FixedSizeBinary` seems to allow a negative value, although it is somewhat weird:
- Although it may be more reasonable to use an unsigned int to type the `length`, in the Apache Arrow specification the `length` must be `i32`. https://arrow.apache.org/docs/format/Columnar.html#fixed-size-list-layout (For nested arrays, we always use signed integers to represent the length or offsets.)

What would happen if we give a negative length or a negative offsets buffer?
This is a fun game!
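
The start/length clamping from `fixed_size_binary_substring` can be checked on its own. Below is a minimal sketch, assuming a hypothetical helper `clamp_window` (not part of the PR) that copies the `new_start` / `new_len` computation from the diff. It suggests that `new_len` stays within `0..=old_len` whenever the incoming length is non-negative, and that a negative length would pass through unchanged.

```rust
/// Hypothetical helper (not in the PR): mirrors the `new_start` / `new_len`
/// computation of `fixed_size_binary_substring` and returns both values.
fn clamp_window(old_len: i32, start: i32, length: Option<i32>) -> (i32, i32) {
    // A non-negative start is capped at the element width; a negative start
    // counts back from the end and is floored at 0.
    let new_start = if start >= 0 {
        start.min(old_len)
    } else {
        (old_len + start).max(0)
    };
    // Bytes left after `new_start`; never negative when `old_len >= 0`.
    let remaining = old_len - new_start;
    // `min` only bounds the length from above, so a negative `len`
    // would be returned as-is.
    let new_len = match length {
        Some(len) => len.min(remaining),
        None => remaining,
    };
    (new_start, new_len)
}
```

With `old_len = 3`, `clamp_window(3, 10, None)` gives `(3, 0)` and `clamp_window(3, -10, None)` gives `(0, 3)`, in line with the test cases in the diff. `clamp_window(3, 0, Some(-1))` gives `(0, -1)`, so non-negativity of the result depends on the caller passing a non-negative length.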

Is there any discussion or explanation of why we use signed integer to represent length?

Ah, that's right, I didn't notice that `length` is `u64`.
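
One related detail may be worth checking: the dispatch in the diff converts the `u64` length with `length.map(|e| e as i32)`, and in Rust an `as` cast from `u64` to `i32` truncates rather than saturates. The sketch below only illustrates that language behavior; `length_as_i32` and `length_saturating` are hypothetical names, and the saturating variant is one possible alternative, not what the PR does.

```rust
use std::convert::TryFrom;

/// Hypothetical: the plain `as` cast, as used at the call site in the diff.
/// Values above `i32::MAX` wrap around and can become negative.
fn length_as_i32(length: u64) -> i32 {
    length as i32
}

/// Hypothetical alternative: saturate at `i32::MAX` instead of wrapping.
/// The later `min(old_len - new_start)` would still cap the result.
fn length_saturating(length: u64) -> i32 {
    i32::try_from(length).unwrap_or(i32::MAX)
}
```

For example, `length_as_i32(u64::MAX)` is `-1`, while `length_saturating(u64::MAX)` is `i32::MAX`; both return small values such as `2` unchanged.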

This is some wording I read in the spec. It is not about array length, but I think it may be related to the question.