Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-12835: [C++][Python][R] Implement case-insensitive match using RE2 #10369

Closed
wants to merge 1 commit into from

Conversation

lidavidm
Copy link
Member

This uses RE2 to implement a case-insensitive substring search.

Originally, I implemented this using utf8proc, but then found it was about an order of magnitude slower than RE2. (This isn't an apples-to-apples comparison; utf8proc does it more 'properly' and handles more Unicode corners.) So I switched to just doing it with RE2 instead, especially since the utf8proc approach was complicated. (You can still see it in the original commit here if you're curious.)

r/R/dplyr-functions.R Outdated Show resolved Hide resolved
r/src/compute.cpp Outdated Show resolved Hide resolved
@ianmcook
Copy link
Member

Bikeshedding here, but maybe we should rename the C++ and Python argument to ignore_case instead of case_insensitive which is a little easier to understand without getting mixed up in double negatives

@github-actions
Copy link

@lidavidm
Copy link
Member Author

ignore_case is also what Python's regex engine calls it at least (though not RE2), so I've renamed it.

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you give an example where re2 doesn't follow proper Unicode semantics?

};

template <typename Type, typename Matcher>
struct MatchSubstring {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why two MatchSubstring and MatchSubstringImpl classes. It seems one should be sufficient?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's only one of each. I moved them around in this PR, but it's the same as before.

Copy link
Member Author

@lidavidm lidavidm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An example of RE2's Unicode handling is here: google/re2#262

That said I don't think it's too big a deal for us.

};

template <typename Type, typename Matcher>
struct MatchSubstring {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's only one of each. I moved them around in this PR, but it's the same as before.

@pitrou
Copy link
Member

pitrou commented May 26, 2021

That said I don't think it's too big a deal for us.

It depends what you mean. The fact that ß and ss don't match is a bit of a bummer for German text, for example. I don't know what the intended use case is.

@lidavidm
Copy link
Member Author

That said I don't think it's too big a deal for us.

It depends what you mean. The fact that ß and ss don't match is a bit of a bummer for German text, for example. I don't know what the intended use case is.

It is a bit of a bummer but I think it's also not 'unexpected' in that other systems (languages, etc) probably make the same tradeoff, hence why I thought it was worth a note in the docs, but wasn't worth implementing a full solution using utf8proc.

@pitrou
Copy link
Member

pitrou commented Jun 1, 2021

Rebased to check for CI.

@pitrou pitrou closed this in b3e9da8 Jun 1, 2021
michalursa pushed a commit to michalursa/arrow that referenced this pull request Jun 13, 2021
This uses RE2 to implement a case-insensitive substring search.

Originally, I implemented this using utf8proc, but then found it was about an order of magnitude slower than RE2. (This isn't an apples-to-apples comparison; utf8proc does it more 'properly' and handles more Unicode corners.) So I switched to just doing it with RE2 instead, especially since the utf8proc approach was complicated. (You can still see it in the original commit here if you're curious.)

Closes apache#10369 from lidavidm/arrow-12835

Authored-by: David Li <li.davidm96@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants