Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-9160: [C++] Implement contains for exact matches #7593

Closed
wants to merge 8 commits into from

Conversation

xhochy
Copy link
Member

@xhochy xhochy commented Jun 30, 2020

No description provided.

@xhochy
Copy link
Member Author

xhochy commented Jun 30, 2020

This currently fails for chunked arrays. I thought that they should be handled by the kernel framework automatically but it seems, they aren't.

[----------] 1 test from TestStringKernels/1, where TypeParam = arrow::LargeStringType
[ RUN      ] TestStringKernels/1.ContainsExact
../src/arrow/testing/gtest_util.cc:149: Failure
Failed
Got:
  [
    [
      false,
      null,
      false
    ]
  ]
Expected:
  [
    [
      true,
      null
    ],
    [
      false
    ]
  ]
../src/arrow/testing/gtest_util.cc:149: Failure
Failed
Got:
  [
    [
      false,
      false,
      false,
      null,
      false
    ]
  ]
Expected:
  [
    [
      true,
      false
    ],
    [
      true,
      null,
      false
    ]
  ]
[  FAILED  ] TestStringKernels/1.ContainsExact, where TypeParam = arrow::LargeStringType (2 ms)

@wesm @pitrou Any pointers what I'm missing?

@github-actions
Copy link

@wesm
Copy link
Member

wesm commented Jun 30, 2020

@xhochy chunked arrays should be handled automatically by the function executors. I will take a look.

@xhochy
Copy link
Member Author

xhochy commented Jun 30, 2020

Implemented Knuth-Morris-Pratt, got roughly a 2x speedup in the benchmark.

Before:
ContainsExact   53673203 ns     53323692 ns           13 bytes_per_second=297.061M/s items_per_second=19.6644M/s
After:
ContainsExact   30150889 ns     29982087 ns           23 bytes_per_second=528.329M/s items_per_second=34.9734M/s

@xhochy xhochy marked this pull request as ready for review June 30, 2020 17:33
@wesm
Copy link
Member

wesm commented Jun 30, 2020

@xhochy I'm fixing a couple issues with the implementation:

  • The function executor allocates memory for you unless you explicitly disable it. The idea is that you don't want to do allocations in-kernel unless it is necessary. The preallocation settings affect how execution with chunked arrays work
  • The slice offset of the Datum* out parameter must be respected (it may be non-zero -- the executor does not preserve the chunk layout of ChunkedArrays and will merge chunks to make larger output arrays -- this can be configured in ExecContext)

@wesm
Copy link
Member

wesm commented Jun 30, 2020

I just opened https://issues.apache.org/jira/browse/ARROW-9285 -- it should be easy to check if a kernel has mistakenly replaced a preallocated data buffer (which may be a slice of a larger output that is being populated)

@wesm
Copy link
Member

wesm commented Jun 30, 2020

I also propose to rename the function from "contains_exact" to "utf8_contains". I'm pushing that change here shortly. Or should we call it "binary_contains" / "string_contains" since it will work with any base binary types? I'll hold off on making any changes until there is consensus about the name. I think it would be useful to put string functions in a "pseudo-namespace" so they appear next to each other when sorted alphabetically.

Ideas:

  • string_contains
  • string_lower_utf8
  • string_lower_ascii
  • ...

@maartenbreddels @pitrou any opinions about this?

@xhochy
Copy link
Member Author

xhochy commented Jul 1, 2020

I think prefixes make sense. We will have sometime similar kernel names that act quite different depending on the types they work on. I would differentiate in the string/binary case the kernels between the categories:

  • binary_: works on any binary data
  • ascii_: works on string data but is limited ASCII-only characters
  • utf8_: works on string data and respects UTF8-encoding or even makes use of an UTF8-symbol table (i.e. utf8proc).

With that, I would rename the kernel here to binary_contains_exact. The exact suffix is important for me as I intent to implement also ut8_contains_case_insensitve, binary_contains_regex (or should this be string/utf8?), utf8_contains_regex_case_insensitve.

@pitrou
Copy link
Member

pitrou commented Jul 1, 2020

You match a regex, you don't contain it.

@xhochy
Copy link
Member Author

xhochy commented Jul 1, 2020

You match a regex, you don't contain it.

That is one of the name clashes, we already have a match kernel.

@pitrou
Copy link
Member

pitrou commented Jul 1, 2020

match_regex then? :-)

@maartenbreddels
Copy link
Contributor

maartenbreddels commented Jul 1, 2020

I like the prefixing by string. I'm a big fan of ordering 'words' in snake or camel casing for good tab completion and alphabetic ordering, so I agree with @wesm 's proposal. As a user, I first think of the type I work with ('string'), next, what I want to do with it (upper/lower casing). Tab-completion/ordering would then reveal 2 variants, very intuitive.
To make this concrete:

  • string_lower_utf8
  • string_lower_ascii
  • binary_contains_exact
  • string_contains_regex_ascii
  • string_contains_regex_utf8

(or s/contains/match 😄 )

There might be kernels that work on binary data, but do not work well with utf8, e.g. a binary_slice, which would slice each binary-string on a byte basis. It could cut a multibyte encoded codepoint to result in an invalid utf8 string. I'd say that's acceptable, we could check that depending on the type we get it.

Regarding case insensitive variants, I think we should expose functions to do normalization (eg replace the single codepoint ë by the letter e and the diaeresis combining character(the dots above the ë)), case folding, and removal of combining characters. That allows users to remove e.g. the diaeresis from ë to do pattern matching without diacritics.

@maartenbreddels
Copy link
Contributor

While at the difficult topic of naming, is there a conversion (agreed or emerging) for naming the functors/ArrayKernelExec implementations? I see in scalar_string.cc we use Transform, which is probably redundant (as I understand scalar kernels basically map/transform elementwise). But I think it would be useful to include the type name, e.g.

  • Utf8ToUtf8 (which is now Utf8Transform) (hence Utf8Upper : Utf8ToUtf8<Type, Utf8Upper<Type>>)
  • Utf8ToBool / Utf8Predicate (which is now StringBoolTransform)
  • Utf8ToScalar
  • BinaryToBool

@maartenbreddels
Copy link
Contributor

  • The function executor allocates memory for you unless you explicitly disable it. The idea is that you don't want to do allocations in-kernel unless it is necessary. The preallocation settings affect how execution with chunked arrays work

Could you elaborate? Why is this not a problem with the lower/upper kernels?

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a nice addition. A couple comments.

// String functions

struct ARROW_EXPORT ContainsExactOptions : public FunctionOptions {
explicit ContainsExactOptions(std::string pattern = "") : pattern(pattern) {}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think defaulting to the empty string makes sense here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is needed to be able to construct an instance of it in Cython. I can remove this and switch the Cython instantiation to a pointer.

@@ -113,6 +113,24 @@ def func(left, right):
multiply = _simple_binary_function('multiply')


def contains_exact(array, pattern):
"""
Check whether a pattern occurs as part of the values of the array.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's rather vague. Does contains_exact(pa.array([1,2,3]), pa.array([1,2])) return true? I don't think so.

@wesm
Copy link
Member

wesm commented Jul 1, 2020

Could you elaborate? Why is this not a problem with the lower/upper kernels?

The data preallocation is only for fixed size outputs (eg boolean, integers, floating point, etc).

}
} else {
const auto& input = checked_cast<const BaseBinaryScalar&>(*batch[0].scalar());
auto result = checked_pointer_cast<BooleanScalar>(MakeNullScalar(out->type()));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can move this line into then clause before line 327.

@wesm
Copy link
Member

wesm commented Jul 2, 2020

I think prefixes make sense. We will have sometime similar kernel names that act quite different depending on the types they work on. I would differentiate in the string/binary case the kernels between the categories:

  • binary_: works on any binary data
  • ascii_: works on string data but is limited ASCII-only characters
  • utf8_: works on string data and respects UTF8-encoding or even makes use of an UTF8-symbol table (i.e. utf8proc).

With that, I would rename the kernel here to binary_contains_exact. The exact suffix is important for me as I intent to implement also ut8_contains_case_insensitve, binary_contains_regex (or should this be string/utf8?), utf8_contains_regex_case_insensitve.

This seems reasonable to me

@pitrou
Copy link
Member

pitrou commented Jul 2, 2020

"binary_contains_exact" still works for utf8 data, though.

@wesm
Copy link
Member

wesm commented Jul 2, 2020

OK, do you have an idea for a prefix for a function that works either on BinaryArray or StringArray? This could also be handled with aliases so we could have utf8_contains_exact as an alias for the same function

@maartenbreddels
Copy link
Contributor

String is a subclass of binary right? So I expect binary_contains_exact to work on strings right?

@pitrou
Copy link
Member

pitrou commented Jul 2, 2020

Right, I was just mentioned that "binary" isn't exclusive of "utf8" in this particular instance.
However, I'm not sure why we need prefixes. "contains" seems sufficient to me :-)

@wesm
Copy link
Member

wesm commented Jul 2, 2020

Other types (e.g. list) will need to have a "contains" operation, so I think it's more clear to have binary_contains and list_contains than simply contains.

@xhochy
Copy link
Member Author

xhochy commented Jul 3, 2020

Adressed all comments and CI is happy, too.

Copy link
Member

@wesm wesm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1. I'll fix the benchmark function rename and then merge this

@@ -56,6 +57,11 @@ static void AsciiUpper(benchmark::State& state) {
UnaryStringBenchmark(state, "ascii_upper");
}

static void BinaryContainsExact(benchmark::State& state) {
BinaryContainsExactOptions options("abac");
UnaryStringBenchmark(state, "contains_exact", &options);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like one rename missed here

Follow function rename in benchmarks
@wesm wesm closed this in d42c23b Jul 3, 2020
@wesm wesm deleted the ARROW-9160 branch July 3, 2020 12:25
sgnkc pushed a commit to sgnkc/arrow that referenced this pull request Jul 6, 2020
Closes apache#7593 from xhochy/ARROW-9160

Lead-authored-by: Uwe L. Korn <uwe.korn@quantco.com>
Co-authored-by: Wes McKinney <wesm@apache.org>
Signed-off-by: Wes McKinney <wesm@apache.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants