Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++] Mixed support for binary types in regex functions #29497

Closed
asfimport opened this issue Sep 3, 2021 · 7 comments
Closed

[C++] Mixed support for binary types in regex functions #29497

asfimport opened this issue Sep 3, 2021 · 7 comments

Comments

@asfimport
Copy link
Collaborator

asfimport commented Sep 3, 2021

The functions count_substring, count_substring_regex, find_substring, and find_substring_regex all accept binary types but the function extract_regex, match_substring, match_substring_regex, match_like, starts_with, ends_with, split_pattern, and split_pattern_regex do not.

They should either all accept binary types or none should.

Reporter: Weston Pace / @westonpace
Assignee: Eduardo Ponce / @edponce

Related issues:

PRs and other links:

Note: This issue was originally created as ARROW-13879. Please see the migration documentation for further details.

@asfimport
Copy link
Collaborator Author

Eduardo Ponce / @edponce:
Adding "true" binary support to Arrow (ie. not encoded as ASCII/UTF-8) seems to be more involved than what I initially expected. Most of the kernels listed in this PR are implemented in such a way that they could in theory work with non-encoded binary data, but the fact that std::string and std::string_view are used in certain parts to hold the binary data prevents "true" binary support because of they are treated as null-terminating strings. Therefore, binary data containing a NULL byte (0x00) will yield incorrect results.

Two possible solutions are:

  • to consider BinaryTypes different from StringTypes where the former always requires an explicit length and the latter can depend on null-terminating strings

  • to treat strings as non-null-terminating sequence of bytes and wrap them in a string-like container that uses an explicit size instead of depending on a NULL value

    For this issue's PR, I am adding partial binary support to all the kernels since a null byte is not supported.

@asfimport
Copy link
Collaborator Author

Eduardo Ponce / @edponce:
Python supports zeroed bytes:

>>> s = b'\x84\x00\x2b'
>>> len(s)
3

@asfimport
Copy link
Collaborator Author

Eduardo Ponce / @edponce:
Not sure if this is of importance here, but the protobuf types map both string and bytes to C++ string.

@asfimport
Copy link
Collaborator Author

David Li / @lidavidm:
std::string can contain null. Some place is constructing it from a pointer instead of a pointer + length if that's the case.

@asfimport
Copy link
Collaborator Author

Eduardo Ponce / @edponce:
Well, the issue is that string_view is used to encapsulate binary data in certain parts and string_view.length() is used to get the size but it stops at the first null byte, thus providing an incorrect size.

@asfimport
Copy link
Collaborator Author

David Li / @lidavidm:
length() is only used again if you construct from a pointer instead of a pointer + length - so this is still an issue with usage and not with string_view.

@asfimport
Copy link
Collaborator Author

David Li / @lidavidm:
Issue resolved by pull request 11233
#11233

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant