Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve doc string for substring kernel #1529

Merged
merged 3 commits into from
Apr 10, 2022
Merged
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
29 changes: 27 additions & 2 deletions arrow/src/compute/kernels/substring.rs
Original file line number Diff line number Diff line change
Expand Up @@ -94,8 +94,33 @@ fn generic_substring<OffsetSize: StringOffsetSizeTrait>(
Ok(make_array(data))
}

/// Returns an ArrayRef with a substring starting from `start` and with optional length `length` of each of the elements in `array`.
/// `start` can be negative, in which case the start counts from the end of the string.
/// Returns an ArrayRef with substrings of all the elements in `array`.
///
/// # Arguments
///
/// * `start` - The start index of all substrings.
/// If `start >= 0`, then count from the start of the string,
/// otherwise count from the end of the string.
///
/// * `length`(option) - The length of all substrings.
/// If `length` is `None`, then the substring is from `start` to the end of the string.
///
/// Attention: Both `start` and `length` are counted by byte, not by char.
///
/// # Warning
///
/// This function **might** return in invalid utf-8 format if strings contain chars whose codepoint > `0x007F`.
HaoYang670 marked this conversation as resolved.
Show resolved Hide resolved
/// ## Example of getting an invalid substring
/// ```
/// # use arrow::array::StringArray;
/// # use arrow::compute::kernels::substring::substring;
/// let array = StringArray::from(vec![Some("E=mc²")]);
/// let result = substring(&array, -1, &None).unwrap();
/// let result = result.as_any().downcast_ref::<StringArray>().unwrap();
/// assert_eq!(result.value(0).as_bytes(), &[0x00B2]); // invalid utf-8 format
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤔 Is it worth a ticket to add utf8 validation to this kernel? It should be straightforward, but will likely slow it down.

Copy link
Contributor Author

@HaoYang670 HaoYang670 Apr 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am afraid the time complexity will be changed. Currently, the time complexity is O(length of offset buffer). Adding validation might increase the time complexity to O(length of offset buffer + length of value buffer) because we have to read every byte in the value buffer in the worst case:
https://github.com/apache/arrow-rs/blob/master/arrow/src/array/data.rs#L1090-L1112

Copy link
Contributor Author

@HaoYang670 HaoYang670 Apr 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By the way, is it necessary to mark substring unsafe?
Besides, we could implement a safe version of substring and also substring_by_char:

name Safety Performance
unsafe substring_unckecked user should make sure valid utf8 substrings can be gotten with start and length fast
substring_checked None slow
substring_by_char None slow

Users should always make sure all values of the input array are in valid utf-8 format.

This will improve the safety of the substring kernel, but break the public API.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, Arrow2 has implemented substring for binary array. I am not sure whether we need to do it.

Copy link
Contributor Author

@HaoYang670 HaoYang670 Apr 10, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could also add substring support for ListArray, FixedSizeListArray and FixedSizeBinaryArray

Copy link
Contributor

@alamb alamb Apr 10, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@HaoYang670 I think these are all good ideas (making safe / unsafe versions, and adding substring for binary etc). I would be happy to create tickets for them if that would help

If we are going to change the substring kernel, I suggest we do the following to follow Rust's safe-by-default mantra:

name Safety Performance Notes
unsafe substring_bytes_unchecked unsafe fast what is currently called substring - user should make sure valid utf8 substrings can be gotten with start and length
substring_bytes safe slow Implement substring by bytes - throws error of invalid utf-8 sequence is created
substring safe slow Implement substring by char, ensuring all substrings are utf-8

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @alamb . I have filed as epic issue: #1531 .
Please review.

/// ```
///
/// # Error
/// this function errors when the passed array is not a \[Large\]String array.
pub fn substring(
array: &dyn Array,
Expand Down