The `substring` kernel panics when chars > U+0x007F #1478

HaoYang670 · 2022-03-23T12:50:27Z

Describe the bug
The substring kernel only works correctly on chars that are encoded as 1 byte in UTF-8. If the string contains a char that requires more than 1 byte, the function can panic.

To Reproduce
Steps to reproduce the behavior:
Take the string "E=mc²" with start index = -1 and length = None.
The expected result is "²".
However, I got:

thread 'compute::kernels::substring::tests::without_nulls_string' panicked at 'byte index 2 is out of bounds of `�`', library/core/src/fmt/mod.rs:2160:30

The reason is that the char ² is encoded as 0xC2 0xB2 in UTF-8. When we try to get the last char of the string, what we actually get is the byte sequence [0xB2], which is not valid UTF-8.
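To illustrate the point above without touching the kernel itself, here is a minimal sketch in plain Rust. The helper `last_bytes_as_str` is hypothetical (it is not part of arrow-rs); it only mimics a byte-based "start = -n" slice and then checks whether the result is valid UTF-8.

```rust
use std::str::Utf8Error;

/// Hypothetical helper, not an arrow-rs API: take the last `n` bytes of `s`
/// and try to interpret them as UTF-8, the way a byte-based substring would.
fn last_bytes_as_str(s: &str, n: usize) -> Result<&str, Utf8Error> {
    let bytes = s.as_bytes();
    // Clamp so that asking for more bytes than exist returns the whole string.
    let start = bytes.len().saturating_sub(n);
    // Validation fails when the slice starts in the middle of a multi-byte char.
    std::str::from_utf8(&bytes[start..])
}
```

With "E=mc²" (6 bytes, 5 chars), `last_bytes_as_str(s, 1)` returns an error because the slice is the lone continuation byte 0xB2, while `last_bytes_as_str(s, 2)` returns "²".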

Expected behavior
I think there are three ways to fix the bug:
1. (easy) Update the doc of the substring function to explain that we only support 1-byte UTF-8 chars, and that start and length are counted in bytes.
2. (a little difficult) Check in the substring function that the string array contains only 1-byte UTF-8 chars (the highest-order bit of every byte is 0).
3. (difficult, and the API will be changed) Slice based on characters, not bytes.
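Options 2 and 3 could look roughly like the sketch below. Both functions are hypothetical and operate on a single `&str` rather than on a `StringArray`. The treatment of negative `start` and of out-of-range values (clamping) is an assumption made for illustration, not the kernel's documented behavior, and "character" here means a Unicode scalar value, not a grapheme.

```rust
/// Option 2 (sketch): true when every char occupies exactly one byte,
/// i.e. the highest-order bit of every byte is 0.
fn is_single_byte_only(s: &str) -> bool {
    s.is_ascii()
}

/// Option 3 (sketch): substring with `start` and `length` counted in chars.
/// A negative `start` counts from the end; out-of-range values are clamped.
fn substring_by_char(s: &str, start: i64, length: Option<u64>) -> &str {
    // Byte offset of every char boundary, plus the end of the string.
    let boundaries: Vec<usize> = s
        .char_indices()
        .map(|(i, _)| i)
        .chain(std::iter::once(s.len()))
        .collect();
    let char_count = boundaries.len() - 1;

    // Resolve the start position in chars, clamped to the string.
    let start_char = if start >= 0 {
        (start as usize).min(char_count)
    } else {
        char_count.saturating_sub(start.unsigned_abs() as usize)
    };
    // Resolve the end position in chars, clamped to the string.
    let end_char = match length {
        Some(len) => start_char.saturating_add(len as usize).min(char_count),
        None => char_count,
    };
    // Both offsets are char boundaries, so this slice cannot panic.
    &s[boundaries[start_char]..boundaries[end_char]]
}
```

For "E=mc²", `substring_by_char(s, -1, None)` yields "²". The cost is an extra pass over the string to find the boundaries, which is the performance concern with character-based slicing.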


HaoYang670 · 2022-03-23T12:53:12Z

@alamb please review. Thank you!

alamb · 2022-03-23T20:04:17Z

Hi @HaoYang670 -- I suggest

1. Leave the implementation in terms of bytes
2. Clarify the documentation (your suggestion 1)
3. Verify that the StringArray created by substring contains only valid utf8 data and throw a more specific error message if it does not.

I am not sure if the code already does 3 or not.
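A sketch of what suggestion 3 might look like for a single value is shown here. The function name, the `String` error type, and the clamping of out-of-range offsets are all assumptions for illustration; the real kernel works on offset buffers and would presumably return an `ArrowError`. Because the input is already valid UTF-8, checking that both slice ends fall on char boundaries is enough to guarantee that the output is valid too.

```rust
/// Hypothetical byte-based substring that reports an error instead of
/// producing invalid UTF-8. `start` and `length` are counted in bytes.
fn substring_by_byte(s: &str, start: i64, length: Option<u64>) -> Result<&str, String> {
    let len = s.len();
    // Resolve a possibly negative start, clamped to the string.
    let begin = if start >= 0 {
        (start as usize).min(len)
    } else {
        len.saturating_sub(start.unsigned_abs() as usize)
    };
    let end = match length {
        Some(l) => begin.saturating_add(l as usize).min(len),
        None => len,
    };
    // A slice of valid UTF-8 is valid iff both ends are char boundaries.
    if !s.is_char_boundary(begin) || !s.is_char_boundary(end) {
        return Err(format!(
            "byte range {}..{} does not fall on UTF-8 char boundaries",
            begin, end
        ));
    }
    Ok(&s[begin..end])
}
```

With "E=mc²", `substring_by_byte(s, -1, None)` returns the error rather than panicking later, while `substring_by_byte(s, -2, None)` returns "²".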

The challenge with proper unicode support, from my perspective, is that it will likely be slower and require a new dependency (to identify the unicode graphemes). There is a unicode-aware implementation of substr in the datafusion repo, I believe contributed by @ovr.

https://github.com/apache/arrow-datafusion/blob/eb5a18a427bb718bffbf477c8fdf0230bb0a6242/datafusion-physical-expr/src/unicode_expressions.rs#L413-L441

Another possibility is to add an optional feature flag to arrow-rs for "unicode" string support and base the behavior on that flag. But that sounds a little overcomplicated.

HaoYang670 · 2022-03-26T13:23:57Z

In the substring kernel of CUDF, I find that there is some code to calculate the number of bytes of each char. Maybe we could introduce it to our code in the future.
https://github.com/rapidsai/cudf/blob/branch-22.06/cpp/src/strings/substring.cu#L90-L96
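The general idea of deriving a char's byte length from its first byte can be sketched in Rust as follows. This is not a port of the CUDF code linked above; `utf8_char_width` is a hypothetical name, and the ranges simply follow the UTF-8 lead-byte bit patterns.

```rust
/// Hypothetical helper: number of bytes in the UTF-8 sequence that starts
/// with `first_byte`, or 0 if the byte cannot start a sequence.
fn utf8_char_width(first_byte: u8) -> usize {
    match first_byte {
        0x00..=0x7F => 1, // 0xxxxxxx: single-byte (ASCII) char
        0xC0..=0xDF => 2, // 110xxxxx: lead byte of a 2-byte char
        0xE0..=0xEF => 3, // 1110xxxx: lead byte of a 3-byte char
        0xF0..=0xF7 => 4, // 11110xxx: lead byte of a 4-byte char
        _ => 0,           // continuation byte (10xxxxxx) or invalid lead byte
    }
}
```

For ² (0xC2 0xB2), the lead byte 0xC2 gives a width of 2 and the continuation byte 0xB2 gives 0, so a byte offset pointing at 0xB2 can be recognized as not being the start of a char.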

alamb · 2022-03-27T10:37:40Z

> In the substring kernel of CUDF, I find that there is some code to calculate the number of bytes of each char. Maybe we could introduce it to our code in the future.

Being able to calculate unicode characters / graphemes without bringing in a new dependency would be great.
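As a rough sketch of what is possible with the standard library alone, the number of Unicode scalar values in valid UTF-8 data can be counted by skipping continuation bytes. The function name is hypothetical. Note that this counts code points, not graphemes: a grapheme made of several code points (for example a base letter plus a combining accent) would still be counted as more than one, and handling that would most likely need segmentation tables from an external crate.

```rust
/// Hypothetical helper: count Unicode scalar values in valid UTF-8 bytes
/// by counting every byte that is not a continuation byte (10xxxxxx).
fn count_chars(utf8: &[u8]) -> usize {
    utf8.iter().filter(|&&b| (b & 0xC0) != 0x80).count()
}
```

For the 6 bytes of "E=mc²" this gives 5, matching `str::chars().count()`.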

HaoYang670 · 2022-03-28T06:56:15Z

Great, I will create an issue!

HaoYang670 added the bug label Mar 23, 2022

HaoYang670 mentioned this issue Mar 28, 2022

Support calculate length by chars for StringArray #1493

Closed

HaoYang670 mentioned this issue Apr 8, 2022

Improve doc string for substring kernel #1529

Merged

alamb closed this as completed in #1529 Apr 10, 2022

HaoYang670 mentioned this issue Apr 12, 2022

Mark the current substring function as unsafe and rename it. #1541

Closed

alamb added the arrow Changes to the arrow crate label Apr 15, 2022
