-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-12835: [C++][Python][R] Implement case-insensitive match using RE2 #10369
Conversation
Bikeshedding here, but maybe we should rename the C++ and Python argument to |
ignore_case is also what Python's regex engine calls it at least (though not RE2), so I've renamed it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you give an example where re2 doesn't follow proper Unicode semantics?
}; | ||
|
||
template <typename Type, typename Matcher> | ||
struct MatchSubstring { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand why two MatchSubstring
and MatchSubstringImpl
classes. It seems one should be sufficient?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's only one of each. I moved them around in this PR, but it's the same as before.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An example of RE2's Unicode handling is here: google/re2#262
That said I don't think it's too big a deal for us.
}; | ||
|
||
template <typename Type, typename Matcher> | ||
struct MatchSubstring { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's only one of each. I moved them around in this PR, but it's the same as before.
It depends what you mean. The fact that |
It is a bit of a bummer but I think it's also not 'unexpected' in that other systems (languages, etc) probably make the same tradeoff, hence why I thought it was worth a note in the docs, but wasn't worth implementing a full solution using utf8proc. |
Rebased to check for CI. |
This uses RE2 to implement a case-insensitive substring search. Originally, I implemented this using utf8proc, but then found it was about an order of magnitude slower than RE2. (This isn't an apples-to-apples comparison; utf8proc does it more 'properly' and handles more Unicode corners.) So I switched to just doing it with RE2 instead, especially since the utf8proc approach was complicated. (You can still see it in the original commit here if you're curious.) Closes apache#10369 from lidavidm/arrow-12835 Authored-by: David Li <li.davidm96@gmail.com> Signed-off-by: Antoine Pitrou <antoine@python.org>
This uses RE2 to implement a case-insensitive substring search.
Originally, I implemented this using utf8proc, but then found it was about an order of magnitude slower than RE2. (This isn't an apples-to-apples comparison; utf8proc does it more 'properly' and handles more Unicode corners.) So I switched to just doing it with RE2 instead, especially since the utf8proc approach was complicated. (You can still see it in the original commit here if you're curious.)