Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add substring support for FixedSizeBinaryArray #1633

Merged
merged 9 commits into from
May 5, 2022
Merged
Show file tree
Hide file tree
Changes from 7 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
17 changes: 11 additions & 6 deletions arrow/benches/string_kernels.rs
Original file line number Diff line number Diff line change
Expand Up @@ -25,22 +25,27 @@ use arrow::array::*;
use arrow::compute::kernels::substring::substring;
use arrow::util::bench_util::*;

fn bench_substring(arr: &StringArray, start: i64, length: Option<u64>) {
fn bench_substring(arr: &dyn Array, start: i64, length: Option<u64>) {
substring(criterion::black_box(arr), start, length).unwrap();
}

fn add_benchmark(c: &mut Criterion) {
let size = 65536;
let str_len = 1000;
let val_len = 1000;

let arr_string = create_string_array_with_len::<i32>(size, 0.0, str_len);
let arr_string = create_string_array_with_len::<i32>(size, 0.0, val_len);
let arr_fsb = create_fsb_array(size, 0.0, val_len);

c.bench_function("substring (start = 0, length = None)", |b| {
c.bench_function("substring utf8 (start = 0, length = None)", |b| {
b.iter(|| bench_substring(&arr_string, 0, None))
});

c.bench_function("substring (start = 1, length = str_len - 1)", |b| {
b.iter(|| bench_substring(&arr_string, 1, Some((str_len - 1) as u64)))
c.bench_function("substring utf8 (start = 1, length = str_len - 1)", |b| {
b.iter(|| bench_substring(&arr_string, 1, Some((val_len - 1) as u64)))
});

c.bench_function("substring fixed size binary array", |b| {
b.iter(|| bench_substring(&arr_fsb, 1, Some((val_len - 1) as u64)))
});
}

Expand Down
318 changes: 318 additions & 0 deletions arrow/src/compute/kernels/substring.rs
Original file line number Diff line number Diff line change
Expand Up @@ -86,6 +86,52 @@ fn binary_substring<OffsetSize: BinaryOffsetSizeTrait>(
Ok(make_array(data))
}

fn fixed_size_binary_substring(
array: &FixedSizeBinaryArray,
old_len: i32,
start: i32,
length: Option<i32>,
) -> Result<ArrayRef> {
let new_start = if start >= 0 {
start.min(old_len)
} else {
(old_len + start).max(0)
};
let new_len = match length {
Some(len) => len.min(old_len - new_start),
None => old_len - new_start,
};
Comment on lines +100 to +103
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible that if length is negative? DataType::FixedSizeBinary(negative) seems invalid.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the answer is no for several reasons:

  1. the definition of the public function makes sure that length is always >=0
pub fn substring(array: &dyn Array, start: i64, length: Option<u64>) -> Result<ArrayRef> {
  1. the definition of the data type of FixedSizeBinary seems to allow negative value, although it is somewhat weird:
    /// Opaque binary data of fixed size.
    /// Enum parameter specifies the number of bytes per value.
    FixedSizeBinary(i32),

Although it may be more reasonable to use unsigned int to type the length, in Apache Arrow specification, the length must be i32. https://arrow.apache.org/docs/format/Columnar.html#fixed-size-list-layout (For nested arrays, we always use signed integer to represent the length or offsets.)

A fixed size list type is specified like FixedSizeList<T>[N], where T is any type (primitive or nested) and N is a 32-bit signed integer representing the length of the lists.

What would happen if we give a negative length or negative offsets buffer?
This is a fun game!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any discussion or explanation of why we use signed integer to represent length?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, that's right, I didn't notice that length is u64.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since unsigned integers can be more difficult to work with in some cases (e.g. in the JVM), we recommend preferring signed integers over unsigned integers for representing dictionary indices. Additionally, we recommend avoiding using 64-bit unsigned integer indices unless they are required by an application.

This is some wordings I read in the spec. It is not talking about array length, but I think it might be somehow related to the question.


// build value buffer
let num_of_elements = array.len();
let values = array.value_data();
let data = values.as_slice();
let mut new_values = MutableBuffer::new(num_of_elements * (new_len as usize));
(0..num_of_elements)
.map(|idx| {
let offset = array.value_offset(idx);
(
(offset + new_start) as usize,
(offset + new_start + new_len) as usize,
)
})
.for_each(|(start, end)| new_values.extend_from_slice(&data[start..end]));

let array_data = unsafe {
ArrayData::new_unchecked(
DataType::FixedSizeBinary(new_len),
num_of_elements,
None,
array.data_ref().null_buffer().cloned(),
0,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

0 offset works for the new array value new_values. But null_buffer is cloned, so maybe it has different offset?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much for finding the bug. I have fixed it by getting the bit_slice of null_buffer. Also a corresponding test is added.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As for other arrays' implementations we have the same bug, I will file a follow-up issue to fix it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @HaoYang670

vec![new_values.into()],
vec![],
)
};

Ok(make_array(array_data))
}

/// substring by byte
fn utf8_substring<OffsetSize: StringOffsetSizeTrait>(
array: &GenericStringArray<OffsetSize>,
Expand Down Expand Up @@ -219,6 +265,15 @@ pub fn substring(array: &dyn Array, start: i64, length: Option<u64>) -> Result<A
start as i32,
length.map(|e| e as i32),
),
DataType::FixedSizeBinary(old_len) => fixed_size_binary_substring(
array
.as_any()
.downcast_ref::<FixedSizeBinaryArray>()
.expect("a fixed size binary is expected"),
*old_len,
start as i32,
length.map(|e| e as i32),
),
DataType::LargeUtf8 => utf8_substring(
array
.as_any()
Expand Down Expand Up @@ -249,6 +304,8 @@ mod tests {
#[allow(clippy::type_complexity)]
fn with_nulls_generic_binary<O: BinaryOffsetSizeTrait>() -> Result<()> {
let cases: Vec<(Vec<Option<&[u8]>>, i64, Option<u64>, Vec<Option<&[u8]>>)> = vec![
// all-nulls array is always identical
(vec![None, None, None], -1, Some(1), vec![None, None, None]),
Comment on lines +310 to +311
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding test coverage.

// identity
(
vec![Some(b"hello"), None, Some(&[0xf8, 0xf9, 0xff, 0xfa])],
Expand Down Expand Up @@ -318,6 +375,8 @@ mod tests {
#[allow(clippy::type_complexity)]
fn without_nulls_generic_binary<O: BinaryOffsetSizeTrait>() -> Result<()> {
let cases: Vec<(Vec<&[u8]>, i64, Option<u64>, Vec<&[u8]>)> = vec![
// empty array is always identical
(vec![b"", b"", b""], 2, Some(1), vec![b"", b"", b""]),
// increase start
(
vec![b"hello", b"", &[0xf8, 0xf9, 0xff, 0xfa]],
Expand Down Expand Up @@ -453,8 +512,265 @@ mod tests {
without_nulls_generic_binary::<i64>()
}

#[test]
#[allow(clippy::type_complexity)]
fn with_nulls_fixed_size_binary() -> Result<()> {
let cases: Vec<(Vec<Option<&[u8]>>, i64, Option<u64>, Vec<Option<&[u8]>>)> = vec![
// all-nulls array is always identical
(vec![None, None, None], 3, Some(2), vec![None, None, None]),
// increase start
(
vec![Some(b"cat"), None, Some(&[0xf8, 0xf9, 0xff])],
0,
None,
vec![Some(b"cat"), None, Some(&[0xf8, 0xf9, 0xff])],
),
(
vec![Some(b"cat"), None, Some(&[0xf8, 0xf9, 0xff])],
1,
None,
vec![Some(b"at"), None, Some(&[0xf9, 0xff])],
),
(
vec![Some(b"cat"), None, Some(&[0xf8, 0xf9, 0xff])],
2,
None,
vec![Some(b"t"), None, Some(&[0xff])],
),
(
vec![Some(b"cat"), None, Some(&[0xf8, 0xf9, 0xff])],
3,
None,
vec![Some(b""), None, Some(&[])],
),
(
vec![Some(b"cat"), None, Some(&[0xf8, 0xf9, 0xff])],
10,
None,
vec![Some(b""), None, Some(b"")],
),
// increase start negatively
(
vec![Some(b"cat"), None, Some(&[0xf8, 0xf9, 0xff])],
-1,
None,
vec![Some(b"t"), None, Some(&[0xff])],
),
(
vec![Some(b"cat"), None, Some(&[0xf8, 0xf9, 0xff])],
-2,
None,
vec![Some(b"at"), None, Some(&[0xf9, 0xff])],
),
(
vec![Some(b"cat"), None, Some(&[0xf8, 0xf9, 0xff])],
-3,
None,
vec![Some(b"cat"), None, Some(&[0xf8, 0xf9, 0xff])],
),
(
vec![Some(b"cat"), None, Some(&[0xf8, 0xf9, 0xff])],
-10,
None,
vec![Some(b"cat"), None, Some(&[0xf8, 0xf9, 0xff])],
),
// increase length
(
vec![Some(b"cat"), None, Some(&[0xf8, 0xf9, 0xff])],
1,
Some(1),
vec![Some(b"a"), None, Some(&[0xf9])],
),
(
vec![Some(b"cat"), None, Some(&[0xf8, 0xf9, 0xff])],
1,
Some(2),
vec![Some(b"at"), None, Some(&[0xf9, 0xff])],
),
(
vec![Some(b"cat"), None, Some(&[0xf8, 0xf9, 0xff])],
1,
Some(3),
vec![Some(b"at"), None, Some(&[0xf9, 0xff])],
),
(
vec![Some(b"cat"), None, Some(&[0xf8, 0xf9, 0xff])],
-3,
Some(1),
vec![Some(b"c"), None, Some(&[0xf8])],
),
(
vec![Some(b"cat"), None, Some(&[0xf8, 0xf9, 0xff])],
-3,
Some(2),
vec![Some(b"ca"), None, Some(&[0xf8, 0xf9])],
),
(
vec![Some(b"cat"), None, Some(&[0xf8, 0xf9, 0xff])],
-3,
Some(3),
vec![Some(b"cat"), None, Some(&[0xf8, 0xf9, 0xff])],
),
(
vec![Some(b"cat"), None, Some(&[0xf8, 0xf9, 0xff])],
-3,
Some(4),
vec![Some(b"cat"), None, Some(&[0xf8, 0xf9, 0xff])],
),
];

cases.into_iter().try_for_each::<_, Result<()>>(
|(array, start, length, expected)| {
let array = FixedSizeBinaryArray::try_from_sparse_iter(array.into_iter())
.unwrap();
let result = substring(&array, start, length)?;
assert_eq!(array.len(), result.len());
let result = result
.as_any()
.downcast_ref::<FixedSizeBinaryArray>()
.unwrap();
let expected =
FixedSizeBinaryArray::try_from_sparse_iter(expected.into_iter())
.unwrap();
assert_eq!(&expected, result,);
Ok(())
},
)?;

Ok(())
}

#[test]
#[allow(clippy::type_complexity)]
fn without_nulls_fixed_size_binary() -> Result<()> {
let cases: Vec<(Vec<&[u8]>, i64, Option<u64>, Vec<&[u8]>)> = vec![
// empty array is always identical
(vec![b"", b"", &[]], 3, Some(2), vec![b"", b"", &[]]),
// increase start
(
vec![b"cat", b"dog", &[0xf8, 0xf9, 0xff]],
0,
None,
vec![b"cat", b"dog", &[0xf8, 0xf9, 0xff]],
),
(
vec![b"cat", b"dog", &[0xf8, 0xf9, 0xff]],
1,
None,
vec![b"at", b"og", &[0xf9, 0xff]],
),
(
vec![b"cat", b"dog", &[0xf8, 0xf9, 0xff]],
2,
None,
vec![b"t", b"g", &[0xff]],
),
(
vec![b"cat", b"dog", &[0xf8, 0xf9, 0xff]],
3,
None,
vec![b"", b"", &[]],
),
(
vec![b"cat", b"dog", &[0xf8, 0xf9, 0xff]],
10,
None,
vec![b"", b"", b""],
),
// increase start negatively
(
vec![b"cat", b"dog", &[0xf8, 0xf9, 0xff]],
-1,
None,
vec![b"t", b"g", &[0xff]],
),
(
vec![b"cat", b"dog", &[0xf8, 0xf9, 0xff]],
-2,
None,
vec![b"at", b"og", &[0xf9, 0xff]],
),
(
vec![b"cat", b"dog", &[0xf8, 0xf9, 0xff]],
-3,
None,
vec![b"cat", b"dog", &[0xf8, 0xf9, 0xff]],
),
(
vec![b"cat", b"dog", &[0xf8, 0xf9, 0xff]],
-10,
None,
vec![b"cat", b"dog", &[0xf8, 0xf9, 0xff]],
),
// increase length
(
vec![b"cat", b"dog", &[0xf8, 0xf9, 0xff]],
1,
Some(1),
vec![b"a", b"o", &[0xf9]],
),
(
vec![b"cat", b"dog", &[0xf8, 0xf9, 0xff]],
1,
Some(2),
vec![b"at", b"og", &[0xf9, 0xff]],
),
(
vec![b"cat", b"dog", &[0xf8, 0xf9, 0xff]],
1,
Some(3),
vec![b"at", b"og", &[0xf9, 0xff]],
),
(
vec![b"cat", b"dog", &[0xf8, 0xf9, 0xff]],
-3,
Some(1),
vec![b"c", b"d", &[0xf8]],
),
(
vec![b"cat", b"dog", &[0xf8, 0xf9, 0xff]],
-3,
Some(2),
vec![b"ca", b"do", &[0xf8, 0xf9]],
),
(
vec![b"cat", b"dog", &[0xf8, 0xf9, 0xff]],
-3,
Some(3),
vec![b"cat", b"dog", &[0xf8, 0xf9, 0xff]],
),
(
vec![b"cat", b"dog", &[0xf8, 0xf9, 0xff]],
-3,
Some(4),
vec![b"cat", b"dog", &[0xf8, 0xf9, 0xff]],
),
];

cases.into_iter().try_for_each::<_, Result<()>>(
|(array, start, length, expected)| {
let array =
FixedSizeBinaryArray::try_from_iter(array.into_iter()).unwrap();
let result = substring(&array, start, length)?;
assert_eq!(array.len(), result.len());
let result = result
.as_any()
.downcast_ref::<FixedSizeBinaryArray>()
.unwrap();
let expected =
FixedSizeBinaryArray::try_from_iter(expected.into_iter()).unwrap();
assert_eq!(&expected, result,);
Ok(())
},
)?;

Ok(())
}

fn with_nulls_generic_string<O: StringOffsetSizeTrait>() -> Result<()> {
let cases = vec![
// all-nulls array is always identical
(vec![None, None, None], 0, None, vec![None, None, None]),
// identity
(
vec![Some("hello"), None, Some("word")],
Expand Down Expand Up @@ -523,6 +839,8 @@ mod tests {

fn without_nulls_generic_string<O: StringOffsetSizeTrait>() -> Result<()> {
let cases = vec![
// empty array is always identical
(vec!["", "", ""], 0, None, vec!["", "", ""]),
// increase start
(
vec!["hello", "", "word"],
Expand Down