Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The substring kernel panics when chars > U+0x007F #1478

Closed
HaoYang670 opened this issue Mar 23, 2022 · 5 comments · Fixed by #1529
Closed

The substring kernel panics when chars > U+0x007F #1478

HaoYang670 opened this issue Mar 23, 2022 · 5 comments · Fixed by #1529
Labels
arrow Changes to the arrow crate bug

Comments

@HaoYang670
Copy link
Contributor

HaoYang670 commented Mar 23, 2022

Describe the bug
The substring kernel can only work on chars that are encoded as 1 byte in utf-8 standard. If the string contains a char that requires more than 1 byte, the function will panic.

To Reproduce
Steps to reproduce the behavior:
Give a string "E=mc²", start index = -1, length = None.
the expected result is "²".
However, I got:

thread 'compute::kernels::substring::tests::without_nulls_string' panicked at 'byte index 2 is out of bounds of `�`', library/core/src/fmt/mod.rs:2160:30

The reason is that the char ² is encoded as 0xC2 0xB2 in utf8 standard. When we tried to get the last char in string, what we really get is a byte sequence [0xB2] which is invalid in utf-8 standard.

Expected behavior
I think there are three ways to fix the bug:
1.(easy) Update the doc of the substring function to explain we only support 1-byte utf-8 chars. Also explain that start and length are counted in bytes.
2.(a little difficult) check the string array only contains 1-byte utf-8 chars (the highest-order bit is 0) in the substring function.
3.(difficult, and the API will be changed) Intercept based on characters, not bytes.

Additional context
Add any other context about the problem here.

@HaoYang670 HaoYang670 added the bug label Mar 23, 2022
@HaoYang670
Copy link
Contributor Author

@alamb please review. Thank you!

@alamb
Copy link
Contributor

alamb commented Mar 23, 2022

Hi @HaoYang670 -- I suggest

  1. Leave the implementation in terms of bytes
  2. Clarifying the documentation (your suggestion 1)
  3. Verify that the StringArray created by substring contains only valid utf8 data and throwing a more specific error message if it does not.

I am not sure if the code already does 3 or not.

The challenge with proper unicode support, from my perspective, is that it will likely be slower and require a new dependency (to identify the unicode graphemes). There is a unicode aware implementation of substr in the datafusion repo I believe contributed by @ovr.

https://github.com/apache/arrow-datafusion/blob/eb5a18a427bb718bffbf477c8fdf0230bb0a6242/datafusion-physical-expr/src/unicode_expressions.rs#L413-L441

Another possibility is to add an optional feature flag to arrow-rs for "unicode" string support and base the behavior on that flag. But that sounds a little over complicated

@HaoYang670
Copy link
Contributor Author

In the substring kernel of CUDF, I find that there is some code to calculate the number of bytes of each char. Maybe we could introduce it to our code in the future.
https://github.com/rapidsai/cudf/blob/branch-22.06/cpp/src/strings/substring.cu#L90-L96

@alamb
Copy link
Contributor

alamb commented Mar 27, 2022

In the substring kernel of CUDF, I find that there is some code to calculate the number of bytes of each char. Maybe we could introduce it to our code in the future.

Being able to calculate unicode characters / graphemes without bringing in a new dependency would be great

@HaoYang670
Copy link
Contributor Author

Great, I will create an issue!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate bug
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants