Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix escaped like wildcards in like_utf8 / nlike_utf8 kernels #2258

Conversation

daniel-martinez-maqueda-sap
Copy link
Contributor

Which issue does this PR close?

Closes #415

Rationale for this change

It is explained in the Issues linked above.

What changes are included in this PR?

Added a new function that replaces the like wildcards '%' and '_' for the regex counterparts before executing them. It also takes into account that the wildcards can be escaped, in that case, it does remove the escape characters and leaves the wildcards so that they are matched against the raw character.

This is implemented iterating over all the characters of the pattern to figure out when it needs to be transformed or not.

Are there any user-facing changes?

N/A

Added a new function that replaces the like wildcards '%' and '_' for
the regex counterparts before executing them. It also takes into account
that the wildcards can be escaped, in that case, it does remove the
escape characters and leaves the wildcards so that they are matched
against the raw character.

This is implemented iterating over all the characters of the pattern to
figure out when it needs to be transformed or not.
@github-actions github-actions bot added the arrow Changes to the arrow crate label Aug 1, 2022
@codecov-commenter
Copy link

codecov-commenter commented Aug 2, 2022

Codecov Report

Merging #2258 (f3dd390) into master (3032a52) will decrease coverage by 0.14%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #2258      +/-   ##
==========================================
- Coverage   82.29%   82.15%   -0.15%     
==========================================
  Files         243      248       +5     
  Lines       62443    63537    +1094     
==========================================
+ Hits        51387    52196     +809     
- Misses      11056    11341     +285     
Impacted Files Coverage Δ
arrow/src/compute/kernels/comparison.rs 94.07% <100.00%> (+0.17%) ⬆️
parquet/src/arrow/array_reader/struct_array.rs 84.84% <0.00%> (-1.71%) ⬇️
arrow/src/datatypes/schema.rs 69.41% <0.00%> (-1.01%) ⬇️
object_store/src/path/mod.rs 90.00% <0.00%> (-0.91%) ⬇️
...row/src/array/builder/string_dictionary_builder.rs 90.64% <0.00%> (-0.72%) ⬇️
...uet/src/arrow/array_reader/complex_object_array.rs 93.35% <0.00%> (-0.68%) ⬇️
parquet/src/encodings/rle.rs 91.41% <0.00%> (-0.50%) ⬇️
parquet/src/column/writer/mod.rs 92.85% <0.00%> (-0.15%) ⬇️
arrow/src/array/builder/generic_binary_builder.rs 92.52% <0.00%> (-0.14%) ⬇️
parquet/src/encodings/levels.rs 94.23% <0.00%> (-0.03%) ⬇️
... and 45 more

📣 Codecov can now indicate which changes are the most critical in Pull Requests. Learn more

@alamb alamb requested review from Dandandan and removed request for jhorstmann August 2, 2022 13:14
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @daniel-martinez-maqueda-sap -- this looks like an improvement to me for sure. I left some suggestions but I think this PR is already better than master and I think it could be merged as is

arrow/Cargo.toml Show resolved Hide resolved
arrow/src/compute/kernels/comparison.rs Outdated Show resolved Hide resolved
@@ -342,7 +377,7 @@ pub fn nlike_utf8_scalar<OffsetSize: OffsetSizeTrait>(
result.append(!left.value(i).ends_with(&right[1..]));
}
} else {
let re_pattern = escape(right).replace('%', ".*").replace('_', ".");
let re_pattern = replace_like_wildcards(right)?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really like how you have refactored this code 👍

arrow/src/compute/kernels/comparison.rs Show resolved Hide resolved
arrow/src/compute/kernels/comparison.rs Outdated Show resolved Hide resolved
arrow/src/compute/kernels/comparison.rs Show resolved Hide resolved
@@ -298,10 +298,15 @@ pub fn like_utf8_scalar<OffsetSize: OffsetSizeTrait>(
Ok(BooleanArray::from(data))
}

fn replace_like_wildcards(text: &str) -> Result<String> {
/// Transforms a like `pattern` to a regex compatible pattern. To achieve that, it does:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️

@alamb alamb changed the title Fix escaped like wildcards Fix escaped like wildcards in like_utf8 / nlike_utf8 kernels Aug 3, 2022
@daniel-martinez-maqueda-sap daniel-martinez-maqueda-sap requested review from alamb and jhorstmann and removed request for jhorstmann August 3, 2022 13:26
@tustvold tustvold merged commit f78d2e6 into apache:master Aug 3, 2022
@ursabot
Copy link

ursabot commented Aug 3, 2022

Benchmark runs are scheduled for baseline = 22185fd and contender = f78d2e6. f78d2e6 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

MazterQyou pushed a commit to cube-js/arrow-rs that referenced this pull request Nov 30, 2022
…che#2258)

* Fix escaped like wildcards

Added a new function that replaces the like wildcards '%' and '_' for
the regex counterparts before executing them. It also takes into account
that the wildcards can be escaped, in that case, it does remove the
escape characters and leaves the wildcards so that they are matched
against the raw character.

This is implemented iterating over all the characters of the pattern to
figure out when it needs to be transformed or not.

* Rewrite logic with peek after PR feedback

* Simplifly logic

* Add documentation and refactor string creation in tests

* Add small fix and cargo fmt
MazterQyou pushed a commit to cube-js/arrow-rs that referenced this pull request Nov 30, 2022
…che#2258)

* Fix escaped like wildcards

Added a new function that replaces the like wildcards '%' and '_' for
the regex counterparts before executing them. It also takes into account
that the wildcards can be escaped, in that case, it does remove the
escape characters and leaves the wildcards so that they are matched
against the raw character.

This is implemented iterating over all the characters of the pattern to
figure out when it needs to be transformed or not.

* Rewrite logic with peek after PR feedback

* Simplifly logic

* Add documentation and refactor string creation in tests

* Add small fix and cargo fmt
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

like / nlike_utf8 do not allow escaping %
6 participants