Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement like/ilike etc for StringViewArray #5931

Merged
merged 5 commits into from
Jun 24, 2024

Conversation

XiangpengHao
Copy link
Contributor

@XiangpengHao XiangpengHao commented Jun 21, 2024

Which issue does this PR close?

Part of apache/datafusion#11024 and #5374

Rationale for this change

Currently, the arrow-string crate does not support ops (like, ilike, etc.) against string view types. This pr implements them in a generic way.

What changes are included in this PR?

(1) Added a StringArrayType (could have a better name, or placed at a better location) trait
https://github.com/apache/arrow-rs/compare/master...XiangpengHao:string-view-like?expand=1#diff-14974bf6baecb88fa03010ef05c5927627a26bdeb74c3fe186ac2f9527bdca45R151

Arrays (currently, StringArray, LargeStringArray, StringViewArray) implement this trait will automatically gain the ability to all kinds of string operations (like, ilike, nilike, contains, starts_with, ends_with)

This also means that we did not try to be smart when handling StringViewArray, i.e., does not leverage the inlined 4 bytes to shortcut some comparison. I plan to leave this as a future work as it can significantly complicate the code. (I also doubt that we can gain any benefit)

(2) Overhauled the related test so that they no longer use the depreciated kernels and I believe now they look easier to reason.

(3) Since we are not smart at handling view types, I didn't make special test cases for StringViewArray, i.e., I did not deliberately include test cases with long/short strings. We rely on the correctness of the StringViewArray iterator, which has already been tested elsewhere.

Are there any user-facing changes?

@XiangpengHao XiangpengHao changed the title Implement like/ilike etc for StringViewArry Implement like/ilike etc for StringViewArray Jun 21, 2024
@github-actions github-actions bot added the arrow Changes to the arrow crate label Jun 21, 2024
Copy link
Member

@wjones127 wjones127 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does not leverage the inlined 4 bytes to shortcut some comparison. I plan to leave this as a future work as it can significantly complicate the code. (I also doubt that we can gain any benefit)

I agree we should only start adding specialized paths if we determine it is worth it. starts_with might be, especially if the parameter is short enough to always be able to use the prefix. We can decide based on benchmarks which are appropriate.

arrow-string/src/like.rs Show resolved Hide resolved
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @XiangpengHao -- this is a very clever PR (adds new features + test coverage while removing net lines of code)

(3) Since we are not smart at handling view types, I didn't make special test cases for StringViewArray, i.e., I did not deliberately include test cases with long/short strings. We rely on the correctness of the StringViewArray iterator, which has already been tested elsewhere.

Even though you are right that the implementation is the same, I still think it is important to add test coverage so that if the implementations change in the future the code will be better

For example I can easily imagine someone wanting to add the 'starts_with' optimization suggested by @wjones127

I think like also has several prefix optimizations (for example, special casing matches like prefix%).

So I think in order to complete this PR we need:

  1. Test coverage for short/long strings
  2. Benchmarks on this branch (I am running these now) to ensure no regressions


/// Returns true if all data within this array is ASCII
pub fn is_ascii(&self) -> bool {
// Alternative (but incorrect): directly check the underlying buffers
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

arrow-string/src/like.rs Show resolved Hide resolved
arrow-string/src/like.rs Outdated Show resolved Hide resolved
@alamb
Copy link
Contributor

alamb commented Jun 21, 2024

I made a PR with some benchmarks for like here: #5936

What I suggest we do is get this PR in with good test coverage, and ensure no performance regressions, and then file at ticket to continue to improve the performance

@alamb
Copy link
Contributor

alamb commented Jun 21, 2024

Benchmarks show no change:

++ critcmp master string-view-like
group                                              master                                 string-view-like
-----                                              ------                                 ----------------
ilike_utf8 scalar complex                          1.00    313.1±0.89µs        ? ?/sec    1.06    331.9±1.14µs        ? ?/sec
ilike_utf8 scalar contains                         1.02   1503.1±5.61µs        ? ?/sec    1.00   1473.2±4.67µs        ? ?/sec
ilike_utf8 scalar ends with                        1.00    234.0±0.47µs        ? ?/sec    1.13    264.2±0.74µs        ? ?/sec
ilike_utf8 scalar equals                           1.00    248.8±0.64µs        ? ?/sec    1.00    248.3±0.47µs        ? ?/sec
ilike_utf8 scalar starts with                      1.00    238.6±0.42µs        ? ?/sec    1.40    333.5±0.92µs        ? ?/sec
ilike_utf8_scalar_dyn dictionary[10] string[4])    1.00     49.1±0.14µs        ? ?/sec    1.00     49.1±0.08µs        ? ?/sec
like_utf8 scalar complex                           1.00    294.0±1.01µs        ? ?/sec    1.06    311.6±0.89µs        ? ?/sec
like_utf8 scalar contains                          1.02    357.5±1.27µs        ? ?/sec    1.00    350.2±2.50µs        ? ?/sec
like_utf8 scalar ends with                         1.00    209.1±0.35µs        ? ?/sec    1.00    209.3±0.44µs        ? ?/sec
like_utf8 scalar equals                            1.00    218.7±0.26µs        ? ?/sec    1.01    221.1±0.73µs        ? ?/sec
like_utf8 scalar starts with                       1.01    233.0±1.51µs        ? ?/sec    1.00    230.8±0.32µs        ? ?/sec
like_utf8_scalar_dyn dictionary[10] string[4])     1.00     49.0±0.11µs        ? ?/sec    1.00     49.0±0.11µs        ? ?/sec
nilike_utf8 scalar complex                         1.00    312.4±1.07µs        ? ?/sec    1.06    331.2±1.59µs        ? ?/sec
nilike_utf8 scalar contains                        1.02   1501.0±5.94µs        ? ?/sec    1.00   1477.4±8.60µs        ? ?/sec
nilike_utf8 scalar ends with                       1.00    234.2±1.02µs        ? ?/sec    1.13    264.0±0.25µs        ? ?/sec
nilike_utf8 scalar equals                          1.00    248.6±0.69µs        ? ?/sec    1.00    248.3±0.36µs        ? ?/sec
nilike_utf8 scalar starts with                     1.00    239.0±0.48µs        ? ?/sec    1.40    334.4±6.81µs        ? ?/sec
nlike_utf8 scalar complex                          1.00    294.7±1.31µs        ? ?/sec    1.06    312.0±1.26µs        ? ?/sec
nlike_utf8 scalar contains                         1.00    348.2±0.85µs        ? ?/sec    1.03    357.5±0.74µs        ? ?/sec
nlike_utf8 scalar ends with                        1.00    209.3±0.40µs        ? ?/sec    1.00    209.2±0.29µs        ? ?/sec
nlike_utf8 scalar equals                           1.00    218.9±0.67µs        ? ?/sec    1.01    222.0±0.93µs        ? ?/sec
nlike_utf8 scalar starts with                      1.01    232.8±0.39µs        ? ?/sec    1.00    230.8±0.32µs        ? ?/sec

@XiangpengHao
Copy link
Contributor Author

ilike_utf8 scalar starts with 1.00 238.6±0.42µs ? ?/sec 1.40 333.5±0.92µs ? ?/sec
nilike_utf8 scalar starts with 1.00 239.0±0.48µs ? ?/sec 1.40 334.4±6.81µs ? ?/sec

Looks like two starts with get much slower... I'll take closer look 👀

@XiangpengHao
Copy link
Contributor Author

I can't reproduce the results (I get many fluctuations on my machine). Looking at the code changes, there's no reason the new code could be slower. Can you benchmark them again? @alamb

@XiangpengHao
Copy link
Contributor Author

Added more tests to cover longer string cases

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reran the benchmarks and I still see a slow down with ilike/nilike and a scalar 🤔

ilike_utf8 scalar contains                         1.00   1466.0±4.48µs        ? ?/sec    1.00   1469.5±5.21µs        ? ?/sec    1.11   1622.9±5.84µs        ? ?/sec
alamb@aal-dev:~/arrow-rs$ critcmp  master master-2 string-view-like
group                                              master                                 master-2                               string-view-like
-----                                              ------                                 --------                               ----------------
ilike_utf8 scalar complex                          1.02    310.8±1.01µs        ? ?/sec    1.02    311.3±0.93µs        ? ?/sec    1.00    303.9±1.51µs        ? ?/sec
ilike_utf8 scalar contains                         1.00   1466.0±4.48µs        ? ?/sec    1.00   1469.5±5.21µs        ? ?/sec    1.11   1622.9±5.84µs        ? ?/sec
ilike_utf8 scalar ends with                        1.00    246.7±0.52µs        ? ?/sec    1.00    246.7±0.43µs        ? ?/sec    1.00    246.9±0.70µs        ? ?/sec
ilike_utf8 scalar equals                           1.00    196.9±0.29µs        ? ?/sec    1.00    197.2±0.41µs        ? ?/sec    1.26    248.8±1.06µs        ? ?/sec
ilike_utf8 scalar starts with                      1.02    286.9±2.29µs        ? ?/sec    1.00    282.4±0.71µs        ? ?/sec    1.00    282.3±0.45µs        ? ?/sec
ilike_utf8_scalar_dyn dictionary[10] string[4])    1.00     49.0±0.15µs        ? ?/sec    1.00     49.0±0.12µs        ? ?/sec    1.00     49.0±0.14µs        ? ?/sec
like_utf8 scalar complex                           1.03    293.6±1.22µs        ? ?/sec    1.04    295.1±1.53µs        ? ?/sec    1.00    283.9±1.50µs        ? ?/sec
like_utf8 scalar contains                          1.00    347.2±1.41µs        ? ?/sec    1.00    347.9±0.72µs        ? ?/sec    1.01    349.6±0.85µs        ? ?/sec
like_utf8 scalar ends with                         1.00    209.1±0.56µs        ? ?/sec    1.00    209.2±1.13µs        ? ?/sec    1.00    209.1±0.26µs        ? ?/sec
like_utf8 scalar equals                            1.01    220.7±0.95µs        ? ?/sec    1.01    220.7±0.73µs        ? ?/sec    1.00    217.6±0.40µs        ? ?/sec
like_utf8 scalar starts with                       1.00    231.8±1.11µs        ? ?/sec    1.00    231.4±1.47µs        ? ?/sec    1.01    232.8±0.32µs        ? ?/sec
like_utf8_scalar_dyn dictionary[10] string[4])     1.00     49.1±0.15µs        ? ?/sec    1.00     49.0±0.12µs        ? ?/sec    1.00     49.1±0.16µs        ? ?/sec
nilike_utf8 scalar complex                         1.03    313.1±2.25µs        ? ?/sec    1.03    312.3±6.04µs        ? ?/sec    1.00    304.4±2.76µs        ? ?/sec
nilike_utf8 scalar contains                        1.00   1466.8±7.42µs        ? ?/sec    1.00   1470.9±7.23µs        ? ?/sec    1.11   1622.7±7.95µs        ? ?/sec
nilike_utf8 scalar ends with                       1.00    246.8±0.58µs        ? ?/sec    1.00    246.6±0.51µs        ? ?/sec    1.00    246.9±0.53µs        ? ?/sec
nilike_utf8 scalar equals                          1.00    197.0±0.45µs        ? ?/sec    1.00    197.1±0.44µs        ? ?/sec    1.26    248.7±0.65µs        ? ?/sec
nilike_utf8 scalar starts with                     1.01    286.2±0.54µs        ? ?/sec    1.00    282.4±0.95µs        ? ?/sec    1.00    282.2±0.29µs        ? ?/sec
nlike_utf8 scalar complex                          1.04    293.4±1.22µs        ? ?/sec    1.04    293.9±3.01µs        ? ?/sec    1.00    283.2±1.52µs        ? ?/sec
nlike_utf8 scalar contains                         1.00    346.9±0.55µs        ? ?/sec    1.00    347.9±0.76µs        ? ?/sec    1.01    349.7±0.67µs        ? ?/sec
nlike_utf8 scalar ends with                        1.00    209.5±2.23µs        ? ?/sec    1.00    208.9±0.22µs        ? ?/sec    1.00    209.2±0.82µs        ? ?/sec
nlike_utf8 scalar equals                           1.01    220.4±0.80µs        ? ?/sec    1.02    221.3±0.94µs        ? ?/sec    1.00    217.8±0.67µs        ? ?/sec
nlike_utf8 scalar starts with                      1.00    231.5±0.35µs        ? ?/sec    1.00    231.7±4.16µs        ? ?/sec    1.01    233.0±0.42µs        ? ?/sec

array: &GenericStringArray<O>,
negate: bool,
) -> BooleanArray {
pub fn evaluate_array<'i, T>(&self, array: T, negate: bool) -> BooleanArray
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Eventually I think we should consider making some special variant of this function for StringViews (that can take advantage of the inlined views)

Maybe we could add some method to ArrayAccessor for quickly checking prefix-equality 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That could work.
I think this pr is to give a reasonable baseline for string view implementation and unblock the string view for DF. And if we care enough later, we do need to copy-paste a lot of the code and make specialized comparsion for StringViewArray.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#5951 tracks potential optimizations

vec!["arrow", "arrow", "arrow", "arrow", "arrow", "arrows", "arrow", "arrow"],
vec![
"arrow",
"arrow_long_string_more than 12 bytes",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@XiangpengHao
Copy link
Contributor Author

I think the regression is related to the inline art discussed (and fixed) in #5922 (comment)

We can wait that pr to merge and see if anything changes.

@XiangpengHao
Copy link
Contributor Author

I've updated the pr to include the changes from #5922.

I benchmarked again and find some performance increase some decrease, but they all within 10%.

@alamb
Copy link
Contributor

alamb commented Jun 24, 2024

I agree -- I reran the benchmarks and see some report got faster / slower

++ critcmp master string-view-like
group                                              master                                 string-view-like
-----                                              ------                                 ----------------
ilike_utf8 scalar complex                          1.03    310.6±1.65µs        ? ?/sec    1.00    300.1±1.33µs        ? ?/sec
ilike_utf8 scalar contains                         1.00  1562.6±22.82µs        ? ?/sec    1.05   1648.0±5.32µs        ? ?/sec
ilike_utf8 scalar ends with                        1.00   238.4±26.45µs        ? ?/sec    1.04    247.3±0.47µs        ? ?/sec
ilike_utf8 scalar equals                           1.08    214.3±0.73µs        ? ?/sec    1.00    197.6±0.34µs        ? ?/sec
ilike_utf8 scalar starts with                      1.02   256.9±26.49µs        ? ?/sec    1.00    252.8±0.65µs        ? ?/sec
ilike_utf8_scalar_dyn dictionary[10] string[4])    1.00     49.1±0.18µs        ? ?/sec    1.00     49.0±0.19µs        ? ?/sec
like_utf8 scalar complex                           1.04    291.6±0.74µs        ? ?/sec    1.00    280.9±0.94µs        ? ?/sec
like_utf8 scalar contains                          1.00    347.0±0.46µs        ? ?/sec    1.00    347.8±1.66µs        ? ?/sec
like_utf8 scalar ends with                         1.00    209.1±0.84µs        ? ?/sec    1.05    218.7±0.92µs        ? ?/sec
like_utf8 scalar equals                            1.00    194.6±0.36µs        ? ?/sec    1.00    195.3±1.36µs        ? ?/sec
like_utf8 scalar starts with                       1.00    232.7±0.32µs        ? ?/sec    1.00    231.9±0.63µs        ? ?/sec
like_utf8_scalar_dyn dictionary[10] string[4])     1.00     49.1±0.32µs        ? ?/sec    1.00     48.9±0.09µs        ? ?/sec
nilike_utf8 scalar complex                         1.04    311.1±1.24µs        ? ?/sec    1.00    299.4±0.96µs        ? ?/sec
nilike_utf8 scalar contains                        1.00   1552.6±5.49µs        ? ?/sec    1.06   1645.1±4.68µs        ? ?/sec
nilike_utf8 scalar ends with                       1.00    235.9±8.61µs        ? ?/sec    1.05    247.1±0.44µs        ? ?/sec
nilike_utf8 scalar equals                          1.09    214.2±0.86µs        ? ?/sec    1.00    197.4±0.40µs        ? ?/sec
nilike_utf8 scalar starts with                     1.00    253.8±6.14µs        ? ?/sec    1.00    252.7±0.33µs        ? ?/sec
nlike_utf8 scalar complex                          1.04    290.7±0.90µs        ? ?/sec    1.00    280.5±0.71µs        ? ?/sec
nlike_utf8 scalar contains                         1.00    346.9±0.52µs        ? ?/sec    1.00    347.4±0.58µs        ? ?/sec
nlike_utf8 scalar ends with                        1.00    209.1±0.40µs        ? ?/sec    1.05    218.5±0.36µs        ? ?/sec
nlike_utf8 scalar equals                           1.00    194.6±0.38µs        ? ?/sec    1.00    194.7±0.42µs        ? ?/sec
nlike_utf8 scalar starts with                      1.00    232.8±1.20µs        ? ?/sec    1.00    231.8±0.33µs        ? ?/sec

@alamb
Copy link
Contributor

alamb commented Jun 24, 2024

does not leverage the inlined 4 bytes to shortcut some comparison. I plan to leave this as a future work as it can significantly complicate the code. (I also doubt that we can gain any benefit)

I agree we should only start adding specialized paths if we determine it is worth it. starts_with might be, especially if the parameter is short enough to always be able to use the prefix. We can decide based on benchmarks which are appropriate.

I filed #5951 to track potentially improving the performance

@alamb
Copy link
Contributor

alamb commented Jun 24, 2024

Thanks again everyone for all the help

@alamb alamb merged commit 66bada5 into apache:master Jun 24, 2024
26 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants