ARROW-9160: [C++] Implement contains for exact matches #7593

xhochy · 2020-06-30T12:53:46Z

No description provided.

xhochy · 2020-06-30T13:09:35Z

This currently fails for chunked arrays. I thought that they should be handled by the kernel framework automatically but it seems, they aren't.

[----------] 1 test from TestStringKernels/1, where TypeParam = arrow::LargeStringType
[ RUN      ] TestStringKernels/1.ContainsExact
../src/arrow/testing/gtest_util.cc:149: Failure
Failed
Got:
  [
    [
      false,
      null,
      false
    ]
  ]
Expected:
  [
    [
      true,
      null
    ],
    [
      false
    ]
  ]
../src/arrow/testing/gtest_util.cc:149: Failure
Failed
Got:
  [
    [
      false,
      false,
      false,
      null,
      false
    ]
  ]
Expected:
  [
    [
      true,
      false
    ],
    [
      true,
      null,
      false
    ]
  ]
[  FAILED  ] TestStringKernels/1.ContainsExact, where TypeParam = arrow::LargeStringType (2 ms)

@wesm @pitrou Any pointers what I'm missing?

github-actions · 2020-06-30T13:22:18Z

https://issues.apache.org/jira/browse/ARROW-9160

wesm · 2020-06-30T14:58:46Z

@xhochy chunked arrays should be handled automatically by the function executors. I will take a look.

xhochy · 2020-06-30T17:31:21Z

Implemented Knuth-Morris-Pratt, got roughly a 2x speedup in the benchmark.

Before:
ContainsExact   53673203 ns     53323692 ns           13 bytes_per_second=297.061M/s items_per_second=19.6644M/s
After:
ContainsExact   30150889 ns     29982087 ns           23 bytes_per_second=528.329M/s items_per_second=34.9734M/s

wesm · 2020-06-30T21:11:34Z

@xhochy I'm fixing a couple issues with the implementation:

The function executor allocates memory for you unless you explicitly disable it. The idea is that you don't want to do allocations in-kernel unless it is necessary. The preallocation settings affect how execution with chunked arrays work
The slice offset of the Datum* out parameter must be respected (it may be non-zero -- the executor does not preserve the chunk layout of ChunkedArrays and will merge chunks to make larger output arrays -- this can be configured in ExecContext)

cpp/src/arrow/compute/kernels/scalar_string.cc

wesm · 2020-06-30T21:15:51Z

I just opened https://issues.apache.org/jira/browse/ARROW-9285 -- it should be easy to check if a kernel has mistakenly replaced a preallocated data buffer (which may be a slice of a larger output that is being populated)

wesm · 2020-06-30T21:25:07Z

I also propose to rename the function from "contains_exact" to "utf8_contains". ~~I'm pushing that change here shortly~~. Or should we call it "binary_contains" / "string_contains" since it will work with any base binary types? I'll hold off on making any changes until there is consensus about the name. I think it would be useful to put string functions in a "pseudo-namespace" so they appear next to each other when sorted alphabetically.

Ideas:

string_contains
string_lower_utf8
string_lower_ascii
...

@maartenbreddels @pitrou any opinions about this?

xhochy · 2020-07-01T05:11:52Z

I think prefixes make sense. We will have sometime similar kernel names that act quite different depending on the types they work on. I would differentiate in the string/binary case the kernels between the categories:

binary_: works on any binary data
ascii_: works on string data but is limited ASCII-only characters
utf8_: works on string data and respects UTF8-encoding or even makes use of an UTF8-symbol table (i.e. utf8proc).

With that, I would rename the kernel here to binary_contains_exact. The exact suffix is important for me as I intent to implement also ut8_contains_case_insensitve, binary_contains_regex (or should this be string/utf8?), utf8_contains_regex_case_insensitve.

pitrou · 2020-07-01T07:27:15Z

You match a regex, you don't contain it.

xhochy · 2020-07-01T07:28:58Z

You match a regex, you don't contain it.

That is one of the name clashes, we already have a match kernel.

pitrou · 2020-07-01T07:29:36Z

match_regex then? :-)

maartenbreddels · 2020-07-01T07:48:34Z

I like the prefixing by string. I'm a big fan of ordering 'words' in snake or camel casing for good tab completion and alphabetic ordering, so I agree with @wesm 's proposal. As a user, I first think of the type I work with ('string'), next, what I want to do with it (upper/lower casing). Tab-completion/ordering would then reveal 2 variants, very intuitive.
To make this concrete:

string_lower_utf8
string_lower_ascii
binary_contains_exact
string_contains_regex_ascii
string_contains_regex_utf8

(or s/contains/match 😄 )

There might be kernels that work on binary data, but do not work well with utf8, e.g. a binary_slice, which would slice each binary-string on a byte basis. It could cut a multibyte encoded codepoint to result in an invalid utf8 string. I'd say that's acceptable, we could check that depending on the type we get it.

Regarding case insensitive variants, I think we should expose functions to do normalization (eg replace the single codepoint ë by the letter e and the diaeresis combining character(the dots above the ë)), case folding, and removal of combining characters. That allows users to remove e.g. the diaeresis from ë to do pattern matching without diacritics.

maartenbreddels · 2020-07-01T08:08:16Z

While at the difficult topic of naming, is there a conversion (agreed or emerging) for naming the functors/ArrayKernelExec implementations? I see in scalar_string.cc we use Transform, which is probably redundant (as I understand scalar kernels basically map/transform elementwise). But I think it would be useful to include the type name, e.g.

Utf8ToUtf8 (which is now Utf8Transform) (hence Utf8Upper : Utf8ToUtf8<Type, Utf8Upper<Type>>)
Utf8ToBool / Utf8Predicate (which is now StringBoolTransform)
Utf8ToScalar
BinaryToBool

maartenbreddels · 2020-07-01T08:31:16Z

The function executor allocates memory for you unless you explicitly disable it. The idea is that you don't want to do allocations in-kernel unless it is necessary. The preallocation settings affect how execution with chunked arrays work

Could you elaborate? Why is this not a problem with the lower/upper kernels?

pitrou

This is a nice addition. A couple comments.

cpp/src/arrow/compute/kernels/scalar_string.cc

pitrou · 2020-07-01T12:28:12Z

cpp/src/arrow/compute/api_scalar.h

+// String functions
+
+struct ARROW_EXPORT ContainsExactOptions : public FunctionOptions {
+  explicit ContainsExactOptions(std::string pattern = "") : pattern(pattern) {}


I don't think defaulting to the empty string makes sense here.

This is needed to be able to construct an instance of it in Cython. I can remove this and switch the Cython instantiation to a pointer.

pitrou · 2020-07-01T12:34:15Z

python/pyarrow/compute.py

@@ -113,6 +113,24 @@ def func(left, right):
 multiply = _simple_binary_function('multiply')


+def contains_exact(array, pattern):
+    """
+    Check whether a pattern occurs as part of the values of the array.


That's rather vague. Does contains_exact(pa.array([1,2,3]), pa.array([1,2])) return true? I don't think so.

wesm · 2020-07-01T14:25:12Z

Could you elaborate? Why is this not a problem with the lower/upper kernels?

The data preallocation is only for fixed size outputs (eg boolean, integers, floating point, etc).

kiszk · 2020-07-01T17:16:20Z

cpp/src/arrow/compute/kernels/scalar_string.cc

+    }
+  } else {
+    const auto& input = checked_cast<const BaseBinaryScalar&>(*batch[0].scalar());
+    auto result = checked_pointer_cast<BooleanScalar>(MakeNullScalar(out->type()));


We can move this line into then clause before line 327.

wesm · 2020-07-02T14:17:41Z

I think prefixes make sense. We will have sometime similar kernel names that act quite different depending on the types they work on. I would differentiate in the string/binary case the kernels between the categories:

binary_: works on any binary data

ascii_: works on string data but is limited ASCII-only characters

utf8_: works on string data and respects UTF8-encoding or even makes use of an UTF8-symbol table (i.e. utf8proc).

With that, I would rename the kernel here to binary_contains_exact. The exact suffix is important for me as I intent to implement also ut8_contains_case_insensitve, binary_contains_regex (or should this be string/utf8?), utf8_contains_regex_case_insensitve.

This seems reasonable to me

pitrou · 2020-07-02T14:24:29Z

"binary_contains_exact" still works for utf8 data, though.

wesm · 2020-07-02T14:33:39Z

OK, do you have an idea for a prefix for a function that works either on BinaryArray or StringArray? This could also be handled with aliases so we could have utf8_contains_exact as an alias for the same function

maartenbreddels · 2020-07-02T14:42:12Z

String is a subclass of binary right? So I expect binary_contains_exact to work on strings right?

pitrou · 2020-07-02T14:52:16Z

Right, I was just mentioned that "binary" isn't exclusive of "utf8" in this particular instance.
However, I'm not sure why we need prefixes. "contains" seems sufficient to me :-)

wesm · 2020-07-02T14:55:42Z

Other types (e.g. list) will need to have a "contains" operation, so I think it's more clear to have binary_contains and list_contains than simply contains.

xhochy · 2020-07-03T06:37:26Z

Adressed all comments and CI is happy, too.

wesm

+1. I'll fix the benchmark function rename and then merge this

wesm · 2020-07-03T12:17:15Z

cpp/src/arrow/compute/kernels/scalar_string_benchmark.cc

@@ -56,6 +57,11 @@ static void AsciiUpper(benchmark::State& state) {
  UnaryStringBenchmark(state, "ascii_upper");
 }

+static void BinaryContainsExact(benchmark::State& state) {
+  BinaryContainsExactOptions options("abac");
+  UnaryStringBenchmark(state, "contains_exact", &options);


Looks like one rename missed here

Follow function rename in benchmarks

Closes apache#7593 from xhochy/ARROW-9160 Lead-authored-by: Uwe L. Korn <uwe.korn@quantco.com> Co-authored-by: Wes McKinney <wesm@apache.org> Signed-off-by: Wes McKinney <wesm@apache.org>

xhochy force-pushed the ARROW-9160 branch from 0ad92d9 to fc1b5cf Compare June 30, 2020 17:32

xhochy marked this pull request as ready for review June 30, 2020 17:33

xhochy mentioned this pull request Jun 30, 2020

🚀 String Super-Issue xhochy/fletcher#121

Closed

wesm reviewed Jun 30, 2020

View reviewed changes

cpp/src/arrow/compute/kernels/scalar_string.cc Show resolved Hide resolved

pitrou reviewed Jul 1, 2020

View reviewed changes

kiszk reviewed Jul 1, 2020

View reviewed changes

wesm approved these changes Jul 3, 2020

View reviewed changes

ARROW-9160: [C++] Implement contains for exact matches

4cf1520

xhochy and others added 6 commits July 3, 2020 07:18

Fix offsetted arrays

9dfb218

Add benchmark, implement KMP

1e31947

Expose kernel in Python

2a88379

Do not allocate memory in kernel. Respect offset of output array

2fa458b

Improve docstring, rename to binary_contains_exact

6f80852

Make result_value more local in scope

34d77a9

wesm force-pushed the ARROW-9160 branch from 45edb07 to eb787c7 Compare July 3, 2020 12:22

Fix cython compilation

4e587ad

Follow function rename in benchmarks

wesm force-pushed the ARROW-9160 branch from eb787c7 to 4e587ad Compare July 3, 2020 12:23

wesm closed this in d42c23b Jul 3, 2020

wesm deleted the ARROW-9160 branch July 3, 2020 12:25

xhochy mentioned this pull request Jul 8, 2020

Use Knuth-Morris-Pratt algorithm for pattern matching functions xhochy/fletcher#142

Closed

jorisvandenbossche mentioned this pull request Jul 14, 2020

ARROW-9390: [C++][Doc] Review compute function names #7755

Closed

maartenbreddels mentioned this pull request Sep 25, 2020

ARROW-9991: [C++] Split kernels for strings/binary #8271

Closed

asfimport mentioned this pull request Jun 3, 2021

[C++] Implement string/binary contains for exact matches #25266

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ARROW-9160: [C++] Implement contains for exact matches #7593

ARROW-9160: [C++] Implement contains for exact matches #7593

xhochy commented Jun 30, 2020

xhochy commented Jun 30, 2020 •

edited

Loading

github-actions bot commented Jun 30, 2020

wesm commented Jun 30, 2020 •

edited

Loading

xhochy commented Jun 30, 2020

wesm commented Jun 30, 2020

wesm commented Jun 30, 2020

wesm commented Jun 30, 2020 •

edited

Loading

xhochy commented Jul 1, 2020

pitrou commented Jul 1, 2020

xhochy commented Jul 1, 2020

pitrou commented Jul 1, 2020

maartenbreddels commented Jul 1, 2020 •

edited

Loading

maartenbreddels commented Jul 1, 2020

maartenbreddels commented Jul 1, 2020

pitrou left a comment

pitrou Jul 1, 2020

xhochy Jul 2, 2020

pitrou Jul 1, 2020

wesm commented Jul 1, 2020

kiszk Jul 1, 2020

wesm commented Jul 2, 2020

pitrou commented Jul 2, 2020

wesm commented Jul 2, 2020 •

edited

Loading

maartenbreddels commented Jul 2, 2020

pitrou commented Jul 2, 2020

wesm commented Jul 2, 2020

xhochy commented Jul 3, 2020

wesm left a comment

wesm Jul 3, 2020

ARROW-9160: [C++] Implement contains for exact matches #7593

ARROW-9160: [C++] Implement contains for exact matches #7593

Conversation

xhochy commented Jun 30, 2020

xhochy commented Jun 30, 2020 • edited Loading

github-actions bot commented Jun 30, 2020

wesm commented Jun 30, 2020 • edited Loading

xhochy commented Jun 30, 2020

wesm commented Jun 30, 2020

wesm commented Jun 30, 2020

wesm commented Jun 30, 2020 • edited Loading

xhochy commented Jul 1, 2020

pitrou commented Jul 1, 2020

xhochy commented Jul 1, 2020

pitrou commented Jul 1, 2020

maartenbreddels commented Jul 1, 2020 • edited Loading

maartenbreddels commented Jul 1, 2020

maartenbreddels commented Jul 1, 2020

pitrou left a comment

Choose a reason for hiding this comment

pitrou Jul 1, 2020

Choose a reason for hiding this comment

xhochy Jul 2, 2020

Choose a reason for hiding this comment

pitrou Jul 1, 2020

Choose a reason for hiding this comment

wesm commented Jul 1, 2020

kiszk Jul 1, 2020

Choose a reason for hiding this comment

wesm commented Jul 2, 2020

pitrou commented Jul 2, 2020

wesm commented Jul 2, 2020 • edited Loading

maartenbreddels commented Jul 2, 2020

pitrou commented Jul 2, 2020

wesm commented Jul 2, 2020

xhochy commented Jul 3, 2020

wesm left a comment

Choose a reason for hiding this comment

wesm Jul 3, 2020

Choose a reason for hiding this comment

xhochy commented Jun 30, 2020 •

edited

Loading

wesm commented Jun 30, 2020 •

edited

Loading

wesm commented Jun 30, 2020 •

edited

Loading

maartenbreddels commented Jul 1, 2020 •

edited

Loading

wesm commented Jul 2, 2020 •

edited

Loading