Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-13879: [C++] Mixed support for binary types in regex functions #11233

Conversation

edponce
Copy link
Contributor

@edponce edponce commented Sep 25, 2021

This PR extends variable-width binary types support for string functions:

  • find_substring[_regex]
  • count_substring[_regex]
  • match_substring[_regex]
  • split_pattern[_regex]
  • replace_substring[_regex]
  • match_like
  • starts/ends_with
  • extract_regex

Also, updates several scalar string kernel/function registrations.

@github-actions
Copy link

@github-actions
Copy link

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

@edponce
Copy link
Contributor Author

edponce commented Sep 27, 2021

@lidavidm Since you are working on ARROW-13878 which is related, please review this PR. This PR only adds variable-width binary support.

Copy link
Member

@lidavidm lidavidm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general I don't think we should add binary support to utf8_ or ascii_ functions, as they make assumptions about input encoding that don't hold for binary values. (The user can perform a safe or unsafe cast if needed.) Also, we need to make sure we actually test with non-UTF8 values and we should make sure RE2 handles this properly.

cpp/src/arrow/compute/kernels/scalar_string.cc Outdated Show resolved Hide resolved
cpp/src/arrow/compute/kernels/scalar_string.cc Outdated Show resolved Hide resolved
cpp/src/arrow/compute/kernels/codegen_internal.h Outdated Show resolved Hide resolved
cpp/src/arrow/compute/kernels/scalar_string_test.cc Outdated Show resolved Hide resolved
@edponce edponce force-pushed the ARROW-13879-Mixed-support-for-binary-types-in-regex- branch from 2c38f33 to c91af07 Compare October 2, 2021 01:09
@edponce edponce marked this pull request as draft October 2, 2021 07:05
@edponce edponce force-pushed the ARROW-13879-Mixed-support-for-binary-types-in-regex- branch 2 times, most recently from d0e414e to 050182e Compare October 13, 2021 19:15
Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good.

cpp/src/arrow/compute/kernels/scalar_string.cc Outdated Show resolved Hide resolved
cpp/src/arrow/compute/kernels/scalar_string.cc Outdated Show resolved Hide resolved
cpp/src/arrow/compute/kernels/scalar_string.cc Outdated Show resolved Hide resolved
cpp/src/arrow/compute/kernels/scalar_string.cc Outdated Show resolved Hide resolved
cpp/src/arrow/testing/gtest_util.h Outdated Show resolved Hide resolved
cpp/src/arrow/testing/gtest_util.h Outdated Show resolved Hide resolved
@edponce
Copy link
Contributor Author

edponce commented Oct 14, 2021

@pitrou If you get a chance, please take a look at this PR. It is a draft mainly because it is missing additional tests, pending review fixes, and updates to comments/docs.

@edponce edponce marked this pull request as ready for review October 15, 2021 03:08
@edponce edponce force-pushed the ARROW-13879-Mixed-support-for-binary-types-in-regex- branch from 879fde1 to dc257e5 Compare October 15, 2021 03:14
Copy link
Member

@lidavidm lidavidm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, overall this looks good, I left a few small comments.

cpp/src/arrow/compute/kernels/scalar_string.cc Outdated Show resolved Hide resolved
cpp/src/arrow/compute/kernels/scalar_string_test.cc Outdated Show resolved Hide resolved
@edponce edponce force-pushed the ARROW-13879-Mixed-support-for-binary-types-in-regex- branch from dc257e5 to 8e26f2e Compare October 18, 2021 15:35
@edponce edponce requested a review from lidavidm October 18, 2021 16:10
Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some pretty minor nits but this looks good to me.

cpp/src/arrow/compute/kernels/scalar_string.cc Outdated Show resolved Hide resolved
cpp/src/arrow/compute/kernels/scalar_string.cc Outdated Show resolved Hide resolved
cpp/src/arrow/compute/kernels/scalar_string.cc Outdated Show resolved Hide resolved
cpp/src/arrow/compute/kernels/scalar_string.cc Outdated Show resolved Hide resolved
cpp/src/arrow/compute/kernels/scalar_string_test.cc Outdated Show resolved Hide resolved
@westonpace
Copy link
Member

@github-actions autotune

@westonpace
Copy link
Member

It appears my suggestions introduced some lines too long. Can you run clang-format? I tried autotune but it didn't work for some reason.

@edponce edponce force-pushed the ARROW-13879-Mixed-support-for-binary-types-in-regex- branch from 35640e6 to 2ffacd6 Compare October 22, 2021 18:19
@lidavidm lidavidm closed this in be665ef Oct 25, 2021
@ElenaHenderson
Copy link
Contributor

Benchmark runs are scheduled for baseline = e7158c6 and contender = be665ef. be665ef is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Scheduled] ec2-t3-xlarge-us-east-2
[Scheduled] ursa-i9-9960x
[Scheduled] ursa-thinkcentre-m75q
Supported benchmarks:
ursa-i9-9960x: langs = Python, R, JavaScript
ursa-thinkcentre-m75q: langs = C++, Java
ec2-t3-xlarge-us-east-2: cloud = True

ViniciusSouzaRoque pushed a commit to s1mbi0se/arrow that referenced this pull request Nov 3, 2021
This PR extends variable-width binary types support for string functions:
* find_substring[_regex]
* count_substring[_regex]
* match_substring[_regex]
* split_pattern[_regex]
* replace_substring[_regex]
* match_like
* starts/ends_with
* extract_regex

Also, updates several scalar string kernel/function registrations.

Closes apache#11233 from edponce/ARROW-13879-Mixed-support-for-binary-types-in-regex-

Authored-by: Eduardo Ponce <edponce00@gmail.com>
Signed-off-by: David Li <li.davidm96@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants