Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix like regex escaping #1085

Merged
merged 4 commits into from
Dec 22, 2021
Merged

Fix like regex escaping #1085

merged 4 commits into from
Dec 22, 2021

Conversation

Dandandan
Copy link
Contributor

@Dandandan Dandandan commented Dec 22, 2021

Which issue does this PR close?

Closes #1069

Rationale for this change

Fixes like not to handle regex syntax like .* and . but escapes it.

What changes are included in this PR?

Are there any user-facing changes?

@github-actions github-actions bot added the arrow Changes to the arrow crate label Dec 22, 2021
@@ -303,7 +303,7 @@ where
/// use arrow::compute::like_utf8;
///
/// let strings = StringArray::from(vec!["Arrow", "Arrow", "Arrow", "Ar"]);
/// let patterns = StringArray::from(vec!["A%", "B%", "A.", "A."]);
/// let patterns = StringArray::from(vec!["A%", "B%", "A.", "A_"]);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lol - even the doctest here was relying on a non-escaped .

@codecov-commenter
Copy link

codecov-commenter commented Dec 22, 2021

Codecov Report

Merging #1085 (437782d) into master (5a32391) will increase coverage by 0.00%.
The diff coverage is 100.00%.

❗ Current head 437782d differs from pull request most recent head 595c07b. Consider uploading reports for the commit 595c07b to get more accurate results
Impacted file tree graph

@@           Coverage Diff           @@
##           master    #1085   +/-   ##
=======================================
  Coverage   82.28%   82.28%           
=======================================
  Files         168      168           
  Lines       49281    49289    +8     
=======================================
+ Hits        40549    40559   +10     
+ Misses       8732     8730    -2     
Impacted Files Coverage Δ
arrow/src/compute/kernels/comparison.rs 93.34% <100.00%> (+0.06%) ⬆️
parquet/src/encodings/encoding.rs 93.71% <0.00%> (+0.19%) ⬆️
arrow/src/datatypes/datatype.rs 66.80% <0.00%> (+0.42%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5a32391...595c07b. Read the comment docs.

@jwdeitch
Copy link
Contributor

thanks for the PR! looks good!
I think there may be an outstanding escaping issue in where right-hand side is '%%%' and you want to match on '(literal%)_(literal%)'.... but this problem may be outside the scope of this fix. :)

let pat = right.value(i);
let re = if let Some(ref regex) = map.get(pat) {
let pat = escape(right.value(i));
let re = if let Some(ref regex) = map.get(&pat) {
regex
} else {
let re_pattern = pat.replace("%", ".*").replace("_", ".");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we still need this pat.replace call?

Copy link
Contributor Author

@Dandandan Dandandan Dec 22, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, to make a proper regex out of it (instead of literally matching on the percentage or underscore character).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But we could minimize some borrow / cloning by moving the escape... So now it looks more similar to the other places.

regex
} else {
let re_pattern = pat.replace("%", ".*").replace("_", ".");
let re_pattern = escape(pat).replace("%", ".*").replace("_", ".");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Less code and fewer bugs -- now that is the sign of a great fix ❤️

Nice work -- looks great to me @Dandandan .

@alamb
Copy link
Contributor

alamb commented Dec 22, 2021

I think there may be an outstanding escaping issue in where right-hand side is '%%%' and you want to match on '(literal%)_(literal%)'.... but this problem may be outside the scope of this fix. :)

We can file a follow on ticket, perhaps. If someone can whip up a reproducer that would be awesome 🤞

@alamb alamb merged commit e8cc39e into apache:master Dec 22, 2021
@alamb
Copy link
Contributor

alamb commented Dec 22, 2021

Filed #1087 to track the possible escaping problems

alamb pushed a commit that referenced this pull request Dec 22, 2021
* Fix like regex escaping

* Fix like regex escaping

* Fix doctest

* Simplify
alamb added a commit that referenced this pull request Dec 22, 2021
* Fix like regex escaping

* Fix like regex escaping

* Fix doctest

* Simplify

Co-authored-by: Daniël Heres <danielheres@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Arrow comparison LIKE/ILIKE/NLIKE kernels do not escape all special characters
4 participants