-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-9160: [C++] Implement contains for exact matches #7593
Conversation
This currently fails for chunked arrays. I thought that they should be handled by the kernel framework automatically but it seems, they aren't.
|
@xhochy chunked arrays should be handled automatically by the function executors. I will take a look. |
Implemented Knuth-Morris-Pratt, got roughly a 2x speedup in the benchmark.
|
@xhochy I'm fixing a couple issues with the implementation:
|
I just opened https://issues.apache.org/jira/browse/ARROW-9285 -- it should be easy to check if a kernel has mistakenly replaced a preallocated data buffer (which may be a slice of a larger output that is being populated) |
I also propose to rename the function from "contains_exact" to "utf8_contains". Ideas:
@maartenbreddels @pitrou any opinions about this? |
I think prefixes make sense. We will have sometime similar kernel names that act quite different depending on the types they work on. I would differentiate in the string/binary case the kernels between the categories:
With that, I would rename the kernel here to |
You match a regex, you don't contain it. |
That is one of the name clashes, we already have a match kernel. |
|
I like the prefixing by
(or s/contains/match 😄 ) There might be kernels that work on binary data, but do not work well with utf8, e.g. a Regarding case insensitive variants, I think we should expose functions to do normalization (eg replace the single codepoint ë by the letter e and the diaeresis combining character(the dots above the ë)), case folding, and removal of combining characters. That allows users to remove e.g. the diaeresis from ë to do pattern matching without diacritics. |
While at the difficult topic of naming, is there a conversion (agreed or emerging) for naming the functors/ArrayKernelExec implementations? I see in
|
Could you elaborate? Why is this not a problem with the lower/upper kernels? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a nice addition. A couple comments.
cpp/src/arrow/compute/api_scalar.h
Outdated
// String functions | ||
|
||
struct ARROW_EXPORT ContainsExactOptions : public FunctionOptions { | ||
explicit ContainsExactOptions(std::string pattern = "") : pattern(pattern) {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think defaulting to the empty string makes sense here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is needed to be able to construct an instance of it in Cython. I can remove this and switch the Cython instantiation to a pointer.
python/pyarrow/compute.py
Outdated
@@ -113,6 +113,24 @@ def func(left, right): | |||
multiply = _simple_binary_function('multiply') | |||
|
|||
|
|||
def contains_exact(array, pattern): | |||
""" | |||
Check whether a pattern occurs as part of the values of the array. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's rather vague. Does contains_exact(pa.array([1,2,3]), pa.array([1,2]))
return true? I don't think so.
The data preallocation is only for fixed size outputs (eg boolean, integers, floating point, etc). |
} | ||
} else { | ||
const auto& input = checked_cast<const BaseBinaryScalar&>(*batch[0].scalar()); | ||
auto result = checked_pointer_cast<BooleanScalar>(MakeNullScalar(out->type())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can move this line into then clause before line 327.
This seems reasonable to me |
"binary_contains_exact" still works for utf8 data, though. |
OK, do you have an idea for a prefix for a function that works either on BinaryArray or StringArray? This could also be handled with aliases so we could have |
String is a subclass of binary right? So I expect binary_contains_exact to work on strings right? |
Right, I was just mentioned that "binary" isn't exclusive of "utf8" in this particular instance. |
Other types (e.g. list) will need to have a "contains" operation, so I think it's more clear to have |
Adressed all comments and CI is happy, too. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1. I'll fix the benchmark function rename and then merge this
@@ -56,6 +57,11 @@ static void AsciiUpper(benchmark::State& state) { | |||
UnaryStringBenchmark(state, "ascii_upper"); | |||
} | |||
|
|||
static void BinaryContainsExact(benchmark::State& state) { | |||
BinaryContainsExactOptions options("abac"); | |||
UnaryStringBenchmark(state, "contains_exact", &options); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like one rename missed here
Follow function rename in benchmarks
Closes apache#7593 from xhochy/ARROW-9160 Lead-authored-by: Uwe L. Korn <uwe.korn@quantco.com> Co-authored-by: Wes McKinney <wesm@apache.org> Signed-off-by: Wes McKinney <wesm@apache.org>
No description provided.