ARROW-12835: [C++][Python][R] Implement case-insensitive match using RE2 #10369

lidavidm · 2021-05-20T20:38:56Z

This uses RE2 to implement a case-insensitive substring search.

Originally, I implemented this using utf8proc, but then found it was about an order of magnitude slower than RE2. (This isn't an apples-to-apples comparison; utf8proc does it more 'properly' and handles more Unicode corners.) So I switched to just doing it with RE2 instead, especially since the utf8proc approach was complicated. (You can still see it in the original commit here if you're curious.)
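For readers unfamiliar with the approach, here is a minimal sketch of the general idea: escape the literal pattern, then let a regex engine do the case-insensitive comparison. It uses Python's standard `re` module as a stand-in for RE2 and is not the code in this PR; the helper name `match_substring_sketch` is hypothetical.

```python
import re


def match_substring_sketch(values, pattern, ignore_case=False):
    """Return one bool per value: does it contain `pattern` as a literal substring?

    Hypothetical helper for illustration; Python's `re` stands in for RE2.
    """
    if not ignore_case:
        # Plain substring search needs no regex at all.
        return [pattern in v for v in values]
    # Escape regex metacharacters so the pattern is matched literally,
    # then compile it with case-insensitive matching enabled.
    regex = re.compile(re.escape(pattern), re.IGNORECASE)
    return [regex.search(v) is not None for v in values]


print(match_substring_sketch(["Arrow", "ARROW.cc", "parquet"], "arrow"))
print(match_substring_sketch(["Arrow", "ARROW.cc", "parquet"], "arrow", ignore_case=True))
# The "." is escaped, so it does not act as a wildcard.
print(match_substring_sketch(["a.b", "AXB"], "a.b", ignore_case=True))
```

Escaping matters because the user supplies a plain substring, not a regex; without it, characters such as `.` or `(` would change the meaning of the search.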

r/R/dplyr-functions.R

r/src/compute.cpp

ianmcook · 2021-05-20T21:42:03Z

Bikeshedding here, but maybe we should rename the C++ and Python argument to ignore_case instead of case_insensitive. That is a little easier to understand without getting mixed up in double negatives.

github-actions · 2021-05-20T22:05:22Z

https://issues.apache.org/jira/browse/ARROW-12835

lidavidm · 2021-05-20T22:57:03Z

ignore_case is also what Python's regex engine calls it at least (though not RE2), so I've renamed it.

pitrou

Could you give an example where re2 doesn't follow proper Unicode semantics?

pitrou · 2021-05-26T14:26:03Z

cpp/src/arrow/compute/kernels/scalar_string.cc

 };

+template <typename Type, typename Matcher>
+struct MatchSubstring {


I don't understand why there are two classes, MatchSubstring and MatchSubstringImpl. It seems one should be sufficient?

lidavidm

An example of RE2's Unicode handling is here: google/re2#262

That said I don't think it's too big a deal for us.

lidavidm · 2021-05-26T17:13:52Z

cpp/src/arrow/compute/kernels/scalar_string.cc

 };

+template <typename Type, typename Matcher>
+struct MatchSubstring {


There's only one of each. I moved them around in this PR, but it's the same as before.

pitrou · 2021-05-26T17:29:35Z

> That said I don't think it's too big a deal for us.

It depends what you mean. The fact that ß and ss don't match is a bit of a bummer for German text, for example. I don't know what the intended use case is.

lidavidm · 2021-05-26T18:42:10Z

> > That said I don't think it's too big a deal for us.
>
> It depends what you mean. The fact that ß and ss don't match is a bit of a bummer for German text, for example. I don't know what the intended use case is.

It is a bit of a bummer but I think it's also not 'unexpected' in that other systems (languages, etc) probably make the same tradeoff, hence why I thought it was worth a note in the docs, but wasn't worth implementing a full solution using utf8proc.
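To make the tradeoff concrete, here is a small sketch contrasting regex-style case-insensitive matching with full Unicode case folding. It uses Python's standard library as a stand-in (`re` instead of RE2, `str.casefold()` instead of utf8proc), so it illustrates the ß/ss behaviour discussed above and is not a test of either library.

```python
import re

haystack = "Straße"
needle = "strasse"

# Regex-style case-insensitive matching compares characters one-to-one,
# so the single character "ß" never matches the two characters "ss".
regex_hit = re.search(re.escape(needle), haystack, re.IGNORECASE) is not None

# Full Unicode case folding maps "ß" to "ss" before comparing,
# which is the more 'proper' (and more expensive) behaviour.
folded_hit = needle.casefold() in haystack.casefold()

print(regex_hit, folded_hit)  # False True
```

Folding both strings up front is what makes the second approach slower and more involved: the folded text can differ in length from the original, so offsets no longer line up with the input.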

pitrou · 2021-06-01T16:16:50Z

Rebased to check for CI.

This uses RE2 to implement a case-insensitive substring search.

Originally, I implemented this using utf8proc, but then found it was about an order of magnitude slower than RE2. (This isn't an apples-to-apples comparison; utf8proc does it more 'properly' and handles more Unicode corners.) So I switched to just doing it with RE2 instead, especially since the utf8proc approach was complicated. (You can still see it in the original commit here if you're curious.)

Closes apache#10369 from lidavidm/arrow-12835

Authored-by: David Li <li.davidm96@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>

ianmcook reviewed May 20, 2021

r/R/dplyr-functions.R (outdated, resolved)

ianmcook reviewed May 20, 2021

r/src/compute.cpp (outdated, resolved)

github-actions bot added Component: C++ Component: Python Component: R labels May 20, 2021

ianmcook mentioned this pull request May 21, 2021

ARROW-12717: [C++][Python] Add find_substring kernel #10353

Closed

lidavidm force-pushed the arrow-12835 branch from d4aa588 to 8985309 Compare May 25, 2021 14:14

pitrou reviewed May 26, 2021

lidavidm commented May 26, 2021

pitrou approved these changes May 26, 2021

ARROW-12835: [C++][Python][R] Implement case-insensitive substring match

4e7c02c

pitrou force-pushed the arrow-12835 branch from 8985309 to 4e7c02c Compare June 1, 2021 16:16

pitrou closed this in b3e9da8 Jun 1, 2021

asfimport mentioned this pull request Jun 24, 2021

[C++] Implement case insensitive match in match_substring(_regex) and match_like #28569

Closed
