
Data analysts should be able to use Text.contains to check for substring using various matcher techniques. #3285

Merged
merged 20 commits into from
Feb 22, 2022

Conversation

radeusgd
Member

@radeusgd radeusgd commented Feb 17, 2022

Pull Request Description

Important Notes

Checklist

Please include the following checklist in your PR:

  • The documentation has been updated if necessary.
  • All code conforms to the Scala, Java, and Rust style guides.
  • All code has been tested:
    • Unit tests have been written where possible.
    • If GUI codebase was changed: Enso GUI was tested when built using BOTH ./run dist and ./run watch.

@radeusgd radeusgd force-pushed the wip/radeusgd/text-contains-181265795 branch from 8c076c0 to 03e6e12 Compare February 18, 2022 10:27
@radeusgd radeusgd marked this pull request as ready for review February 18, 2022 12:02
@radeusgd radeusgd requested a review from 4e6 as a code owner February 18, 2022 12:02
@radeusgd radeusgd added the CI: Ready to merge This PR is eligible for automatic merge label Feb 18, 2022
@radeusgd radeusgd self-assigned this Feb 18, 2022
@radeusgd radeusgd removed the CI: Ready to merge This PR is eligible for automatic merge label Feb 18, 2022
Comment on lines 221 to 225
## TODO what do we do with that?? Since the standard decomposition
splits 'ś' into 's+{accent}', 'ś'.contains 's', but I don't think
this is the expected behaviour...
"Cześć".contains 's' . should_be_true
'Czes\u{301}c\u{301}'.contains 's' . should_be_true
Member Author

I've found a non-trivial and quite problematic edge case: since we perform normalization, an accented letter is represented as the unaccented letter plus a combining accent. Thus, if we simply normalize and then run a naive Java contains, it finds the s (which is merely part of the representation of this grapheme).
I don't think this is something we want, because logically s is not contained in ś (although visually, in a way, it is).
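For illustration, the effect can be reproduced with Python's standard unicodedata module (an assumption here: the behaviour is analogous, since both sides follow standard Unicode NFD decomposition):

```python
import unicodedata

# The standard NFD decomposition splits 'ś' into 's' + combining acute
# accent (U+0301), so after normalizing, a naive substring check finds
# a bare 's' that is really just part of the 'ś' grapheme:
decomposed = unicodedata.normalize("NFD", "Cześć")
print("s" in "Cześć")     # False: the composed text has no standalone 's'
print("s" in decomposed)  # True: 's' matches part of the decomposed 'ś'
```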

Member Author

@radeusgd radeusgd Feb 18, 2022

I've added an analogous test and there it works as expected (i.e. ś does not contain s in regex matching mode). That's an argument for making this work correctly in exact matching mode - we want to be consistent.
Also that's reassuring, because if Regex didn't support this properly and we still wanted that property - it could have been very hard to 'fix' the Regex implementation.

EDIT:
I was wrong. Regex does work in the direction "ś" . contains 's\u{301}' and also correctly handles 's\u{301}' . contains 'ś'. But it actually does return True for 's\u{301}' . contains 's' - contrary to what we'd expect. At this point I'm not sure what to do...

Member Author

One solution could be to write the contains search manually, using a BreakIterator, ensuring that it looks at whole grapheme clusters correctly. This is likely to be slightly slower than what we have now (an ICU Normalizer2 preprocessing step + Java contains), but it may be the only way to retain correctness.

...

After a long and deep dive into ICU4J API, I've found StringSearch which should do what we need in an efficient manner. Will try it out.

Interestingly, it allows setting a locale. I have no idea how changing the locale can influence the search in general, but I'm noting it as something we may want to explore (although I'd set up a separate chore task for that instead of digging into it right now - that's up to discussion).
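The manual grapheme-aware search could be sketched roughly as follows - a minimal Python illustration, not the actual BreakIterator/StringSearch implementation; it uses a simplified segmentation (base character plus trailing combining marks) rather than full UAX #29 rules:

```python
import unicodedata

def graphemes(s):
    # Simplified cluster segmentation: a base character plus any
    # following combining marks. Full UAX #29 segmentation (what
    # BreakIterator provides) handles many more cases.
    s = unicodedata.normalize("NFD", s)
    clusters = []
    for ch in s:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

def cluster_contains(haystack, needle):
    # Search whole grapheme clusters, so a bare 's' never matches
    # inside an 'ś'.
    h, n = graphemes(haystack), graphemes(needle)
    return any(h[i:i + len(n)] == n for i in range(len(h) - len(n) + 1))

print(cluster_contains("Cześć", "s"))              # False: 's' is only part of 'ś'
print(cluster_contains("Cześć", "ś"))              # True
print(cluster_contains("Czes\u0301c\u0301", "ś"))  # True: normalization unifies the forms
```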

@radeusgd
Member Author

Comparison of the Normalizer+Java.contains vs StringSearch implementations (full results sheet):

| Test | Java Contains | ICU StringSearch |
| --- | --- | --- |
| Text.contains exact | 25.3 | 107.8 |
| Text.contains case-insensitive | 64 | 143.1 |
| Text.contains exact regex | 188.7 | 184.6 |
| Text.contains case-insensitive regex | 344.8 | 343.8 |
| Text.contains const-width regex | 419.4 | 497.2 |
| Text.contains wildcard regex | 119.4 | 114.4 |

Only the first two rows are relevant - the Regex implementation did not change, so any differences there are due solely to measurement uncertainty.

We can see that, unfortunately, StringSearch is 2-4x slower. Still, I don't think we can get a better solution that handles the edge cases correctly - it is unlikely we can find something that is both correct and at the same time faster than the ICU implementation. A significant part of this cost is likely the additional logic needed to handle the edge cases correctly, which is unavoidable if we want this (simply more complex) behaviour.

@radeusgd radeusgd force-pushed the wip/radeusgd/text-contains-181265795 branch from e7dfca0 to 3dfd999 Compare February 18, 2022 23:12
Comment on lines +241 to +244
"Straße" . contains "ss" . should_be_false
"Strasse" . contains "ß" . should_be_false
"Straße" . contains "ss" (Text_Matcher Case_Insensitive.new) . should_be_true
"Strasse" . contains "ß" (Text_Matcher Case_Insensitive.new) . should_be_true
Member Author

Documenting this slightly peculiar case: due to how we handle case-insensitive operations (tolower+toupper), and given that the uppercase variant of ß is SS, ß and ss get collated in case-insensitive mode.

More generally (also shown in tests here, just different place), currently in Enso: "ß".equals_ignore_case "ss" == false.

Not sure if this is good or bad:

  1. It seems bad, because the difference that got collated is not exactly a case difference.
  2. OTOH, it seems natural that these two symbols mean the same thing so under a less strict equality they may be equated.

However, in Java "ß".equalsIgnoreCase("ss") == false.

Moreover, it's really a different kind of difference - scharfes S is more like a ligature; in a similar spirit, maybe æ should also be collated with ae, etc.

So I'd lean towards trying to get rid of this collation, but I'm not exactly sure how to do that efficiently: the ICU normalizer we use for equals_ignore_case supports case folding, but does not accept a locale. It seems the only way to handle casing with a locale is through the to_lower_case and to_upper_case methods. Interestingly, how does Java get away with this? Its equalsIgnoreCase processes the strings character by character (not even by grapheme clusters!), and since the proper upper-case of ß is SS, which takes two characters, Character.toUpperCase simply returns ß unchanged (it is incapable of returning two characters). So Java gets this right because it handles characters at a lower level that is too limited to ever encounter the issue.
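As a side note, the collation described here can be reproduced with Unicode full case folding, which Python's str.casefold implements, while per-character lowercasing (analogous to Java's character-by-character equalsIgnoreCase) does not trigger it:

```python
# Unicode full case folding maps 'ß' to 'ss', so folded comparison
# collates the two spellings:
print("Straße".casefold() == "Strasse".casefold())  # True
# Per-character lowercasing leaves 'ß' alone, so the strings differ:
print("Straße".lower() == "Strasse".lower())        # False
# The proper uppercase of 'ß' takes two characters:
print("Straße".upper())                             # STRASSE
```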

Quick solutions that come to mind:

  1. Use the ICU's case folding that is not locale aware, possibly adding an if for the Turkish locale which seems to be toggleable in ICU (maybe that's the only difference between all locales so we don't need others?).
  2. Use BreakIterator and implement this manually.

(2) is likely to be slower, so we probably don't want it (although we may need a benchmark to be sure). (1) could be incorrect, which would be bad - unless Turkish really is the only locale with different case handling; that's possible, but we'd need to research it, possibly asking a linguist.

I think it may make most sense to create a separate task to explore this, especially check if (1) is viable as it would be our best shot. For now I'd just live with this collation - but open to discussion if this should be resolved before merging this.

Member Author

@radeusgd radeusgd Feb 21, 2022

Turns out Swift's caseInsensitiveCompare also compares scharfes s as equal to ss, so I guess we can keep the current behaviour for now.

Would probably need someone knowing German linguistics very well to understand if collating these two in case-insensitive mode makes sense or not.

@@ -256,6 +256,7 @@ spec =
"Cześć" . contains 's\u{301}' Regex_Matcher.new . should_be_true
'Czes\u{301}c\u{301}' . contains 's\u{301}' Regex_Matcher.new . should_be_true
'Czes\u{301}c\u{301}' . contains 'ś' Regex_Matcher.new . should_be_true
'Czes\u{301}c\u{301}' . contains 's' Regex_Matcher.new . should_be_false
Member Author

Unfortunately, this test fails...

So Regex only works well with Unicode normalization to some extent: it does correctly find ś in s\u{301} and vice versa, and it correctly does not find s in ś. But it incorrectly (according to what I'd expect) finds s in s\u{301}. This is quite inconsistent. Maybe it should actually be reported as a bug in the Regex implementation - we already got one bug accepted there, so maybe this one would be accepted too. I'm not sure it will be considered a bug, but the behaviour is inconsistent - I don't think the results should depend on whether the string is normalized or not.

Not sure if simple workarounds exist for this - we could normalize the text before passing it to the engine, but normalization splits ś into s\u{301} (IIRC), so that would make it even worse (though at least consistent, irrespective of whether the input was normalized).
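A quick Python sketch of why the normalize-first workaround backfires (assuming an analogous code-point-based regex engine):

```python
import re
import unicodedata

# Without normalization, the composed 'ś' is a single code point,
# so a bare 's' pattern does not match:
print(bool(re.search("s", "\u015b")))                               # False
# NFD normalization splits 'ś' into 's' + U+0301, so after
# normalizing, the bare 's' matches inside every 'ś':
print(bool(re.search("s", unicodedata.normalize("NFD", "\u015b")))) # True
```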

Member Author

Turns out Swift has a similar problem - it actually handles these cases even worse than our Regex does.

This seems to be a widely known issue with Regex implementations (https://www.regular-expressions.info/unicode.html), with no known implementations that do better in this case.

Will document this nuance in contains docstring and add unit tests showing it so that we are aware of it, but I expect we can't do much more than that.

@radeusgd radeusgd force-pushed the wip/radeusgd/text-contains-181265795 branch 2 times, most recently from a5ca22a to fcdee07 Compare February 21, 2022 13:15
@radeusgd radeusgd force-pushed the wip/radeusgd/text-contains-181265795 branch from fcdee07 to 264fe31 Compare February 22, 2022 13:28
@radeusgd radeusgd added the CI: Ready to merge This PR is eligible for automatic merge label Feb 22, 2022
@mergify mergify bot merged commit ae9d515 into develop Feb 22, 2022
@mergify mergify bot deleted the wip/radeusgd/text-contains-181265795 branch February 22, 2022 15:41