
Data analysts should be able to use Text.contains to check for substring using various matcher techniques. #3285

Merged
merged 20 commits into from
Feb 22, 2022

Conversation

radeusgd
Member

@radeusgd radeusgd commented Feb 17, 2022

Pull Request Description

Important Notes

Checklist

Please include the following checklist in your PR:

  • The documentation has been updated if necessary.
  • All code conforms to the Scala, Java, and Rust style guides.
  • All code has been tested:
    • Unit tests have been written where possible.
    • If GUI codebase was changed: Enso GUI was tested when built using BOTH ./run dist and ./run watch.

@radeusgd radeusgd force-pushed the wip/radeusgd/text-contains-181265795 branch from 8c076c0 to 03e6e12 Compare February 18, 2022 10:27
@radeusgd radeusgd marked this pull request as ready for review February 18, 2022 12:02
@radeusgd radeusgd requested a review from 4e6 as a code owner February 18, 2022 12:02
@radeusgd radeusgd added the CI: Ready to merge This PR is eligible for automatic merge label Feb 18, 2022
@radeusgd radeusgd self-assigned this Feb 18, 2022
@radeusgd radeusgd removed the CI: Ready to merge This PR is eligible for automatic merge label Feb 18, 2022
Comment on lines 221 to 225
## TODO what do we do with that?? Since the standard decomposition
splits 'ś' into 's+{accent}', 'ś'.contains 's', but I don't think
this is the expected behaviour...
"Cześć".contains 's' . should_be_true
'Czes\u{301}c\u{301}'.contains 's' . should_be_true
Member Author

I've found a non-trivial and quite problematic edge case: since we perform normalization, an accented letter is represented as the unaccented letter plus a combining accent. Thus, if we simply normalize and then run a naive Java contains, it finds the s (which is merely part of the representation of this grapheme).
I don't think this is something we want, because logically s is not contained in ś (although visually, in a way, it is).
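For illustration, the effect can be reproduced with Python's standard unicodedata module (an assumption here: the behaviour is analogous, since both sides follow standard Unicode NFD decomposition):

```python
import unicodedata

# The standard NFD decomposition splits 'ś' into 's' + combining acute
# accent (U+0301), so after normalizing, a naive substring check finds
# a bare 's' that is really just part of the 'ś' grapheme:
decomposed = unicodedata.normalize("NFD", "Cześć")
print("s" in "Cześć")     # False: the composed text has no standalone 's'
print("s" in decomposed)  # True: 's' matches part of the decomposed 'ś'
```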

Member Author

@radeusgd radeusgd Feb 18, 2022

I've added an analogous test and there it works as expected (i.e. ś does not contain s in regex matching mode). That's an argument for making this work correctly in exact matching mode - we want to be consistent.
Also that's reassuring, because if Regex didn't support this properly and we still wanted that property - it could have been very hard to 'fix' the Regex implementation.

EDIT:
I was wrong. Regex does work in the direction "ś" . contains 's\u{301}' and also correctly handles 's\u{301}' . contains 'ś'. But it actually does return True for 's\u{301}' . contains 's' - contrary to what we'd expect. At this point I'm not sure what to do...

Member Author

One solution could be to write the contains search manually, using a BreakIterator, ensuring that it looks at whole grapheme clusters correctly. This is likely to be slightly slower than what we have now (an ICU Normalizer2 preprocessing step + Java contains), but it may be the only way to retain correctness.

...

After a long and deep dive into ICU4J API, I've found StringSearch which should do what we need in an efficient manner. Will try it out.

Interestingly, it allows setting a locale. I have no idea how changing the locale can influence the search in general, but I'm noting it as something we may want to explore (although I'd set up a separate chore task for that instead of digging into it right now - that's up to discussion).
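The manual grapheme-aware search could be sketched roughly as follows - a minimal Python illustration, not the actual BreakIterator/StringSearch implementation; it uses a simplified segmentation (base character plus trailing combining marks) rather than full UAX #29 rules:

```python
import unicodedata

def graphemes(s):
    # Simplified cluster segmentation: a base character plus any
    # following combining marks. Full UAX #29 segmentation (what
    # BreakIterator provides) handles many more cases.
    s = unicodedata.normalize("NFD", s)
    clusters = []
    for ch in s:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

def cluster_contains(haystack, needle):
    # Search whole grapheme clusters, so a bare 's' never matches
    # inside an 'ś'.
    h, n = graphemes(haystack), graphemes(needle)
    return any(h[i:i + len(n)] == n for i in range(len(h) - len(n) + 1))

print(cluster_contains("Cześć", "s"))              # False: 's' is only part of 'ś'
print(cluster_contains("Cześć", "ś"))              # True
print(cluster_contains("Czes\u0301c\u0301", "ś"))  # True: normalization unifies the forms
```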

@radeusgd
Member Author

Comparison of the Normalizer+Java.contains vs StringSearch implementations (full results sheet):

| Test | Java Contains | ICU StringSearch |
| --- | --- | --- |
| Text.contains exact | 25.3 | 107.8 |
| Text.contains case-insensitive | 64 | 143.1 |
| Text.contains exact regex | 188.7 | 184.6 |
| Text.contains case-insensitive regex | 344.8 | 343.8 |
| Text.contains const-width regex | 419.4 | 497.2 |
| Text.contains wildcard regex | 119.4 | 114.4 |

Only the first two rows are relevant - the Regex implementation did not change, so any differences there are due solely to measurement uncertainty.

We can see that, unfortunately, StringSearch is 2-4x slower. Still, I don't think we can get a better solution that handles the edge cases correctly - it is unlikely we can find something that is both correct and at the same time faster than the ICU implementation. A significant part of this cost is likely the additional logic needed to handle the edge cases correctly, which is unavoidable if we want this (simply more complex) behaviour.

@radeusgd radeusgd force-pushed the wip/radeusgd/text-contains-181265795 branch from e7dfca0 to 3dfd999 Compare February 18, 2022 23:12
Comment on lines +241 to +244
"Straße" . contains "ss" . should_be_false
"Strasse" . contains "ß" . should_be_false
"Straße" . contains "ss" (Text_Matcher Case_Insensitive.new) . should_be_true
"Strasse" . contains "ß" (Text_Matcher Case_Insensitive.new) . should_be_true
Member Author

Documenting this slightly peculiar case: due to how we handle case-insensitive operations (tolower+toupper), and given that the uppercase variant of ß is SS, ß and ss get collated in case-insensitive mode.

More generally (also shown in tests here, just different place), currently in Enso: "ß".equals_ignore_case "ss" == false.

Not sure if this is good or bad:

  1. It seems bad, because the difference that got collated is not exactly a case difference.
  2. OTOH, it seems natural that these two symbols mean the same thing so under a less strict equality they may be equated.

However, in Java "ß".equalsIgnoreCase("ss") == false.

Moreover, it's really a different kind of difference - scharfes S is more like a ligature; in a similar spirit, maybe æ should also be collated with ae, etc.

So I'd lean towards trying to get rid of this collation, but I'm not exactly sure how to do that efficiently: the ICU normalizer we use for equals_ignore_case supports case folding, but does not accept a locale. It seems the only way to handle casing with a locale is through the to_lower_case and to_upper_case methods. Interestingly, how does Java get away with this? Its equalsIgnoreCase processes the strings character by character (not even by grapheme clusters!), and since the proper upper-case of ß is SS, which takes two characters, Character.toUpperCase simply returns ß unchanged (it is incapable of returning two characters). So Java gets this right because it handles characters at a lower level that is too limited to ever encounter the issue.
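As a side note, the collation described here can be reproduced with Unicode full case folding, which Python's str.casefold implements, while per-character lowercasing (analogous to Java's character-by-character equalsIgnoreCase) does not trigger it:

```python
# Unicode full case folding maps 'ß' to 'ss', so folded comparison
# collates the two spellings:
print("Straße".casefold() == "Strasse".casefold())  # True
# Per-character lowercasing leaves 'ß' alone, so the strings differ:
print("Straße".lower() == "Strasse".lower())        # False
# The proper uppercase of 'ß' takes two characters:
print("Straße".upper())                             # STRASSE
```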

Quick solutions that come to mind:

  1. Use the ICU's case folding that is not locale aware, possibly adding an if for the Turkish locale which seems to be toggleable in ICU (maybe that's the only difference between all locales so we don't need others?).
  2. Use BreakIterator and implement this manually.

(2) is likely to be slower, so we probably don't want it (although we may need a benchmark to be sure). (1) could be incorrect, which would be bad - unless Turkish really is the only locale with different case handling; that's possible, but we'd need to research it, possibly asking a linguist.

I think it may make most sense to create a separate task to explore this, especially check if (1) is viable as it would be our best shot. For now I'd just live with this collation - but open to discussion if this should be resolved before merging this.

Member Author

@radeusgd radeusgd Feb 21, 2022

Turns out Swift's caseInsensitiveCompare also compares scharfes s as equal to ss, so I guess we can keep the current behaviour for now.

Would probably need someone knowing German linguistics very well to understand if collating these two in case-insensitive mode makes sense or not.

@@ -256,6 +256,7 @@ spec =
"Cześć" . contains 's\u{301}' Regex_Matcher.new . should_be_true
'Czes\u{301}c\u{301}' . contains 's\u{301}' Regex_Matcher.new . should_be_true
'Czes\u{301}c\u{301}' . contains 'ś' Regex_Matcher.new . should_be_true
'Czes\u{301}c\u{301}' . contains 's' Regex_Matcher.new . should_be_false
Member Author

Unfortunately, this test fails...

So Regex only works well with Unicode normalization to some extent: it does correctly find ś in s\u{301} and vice versa, and it correctly does not find s in ś. But it incorrectly (according to what I'd expect) finds s in s\u{301}. This is quite inconsistent. Maybe it should actually be reported as a bug in the Regex implementation - we already got one bug accepted there, so maybe this one would be accepted too. I'm not sure it will be considered a bug, but the behaviour is inconsistent - I don't think the results should depend on whether the string is normalized or not.

Not sure if simple workarounds exist for this - we could normalize the text before passing it to the engine, but normalization splits ś into s\u{301} (IIRC), so that would make it even worse (though at least consistent, irrespective of whether the input was normalized).
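A quick Python sketch of why the normalize-first workaround backfires (assuming an analogous code-point-based regex engine):

```python
import re
import unicodedata

# Without normalization, the composed 'ś' is a single code point,
# so a bare 's' pattern does not match:
print(bool(re.search("s", "\u015b")))                               # False
# NFD normalization splits 'ś' into 's' + U+0301, so after
# normalizing, the bare 's' matches inside every 'ś':
print(bool(re.search("s", unicodedata.normalize("NFD", "\u015b")))) # True
```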

Member Author

Turns out Swift has a similar problem - it actually handles these cases even worse than our Regex does.

This seems to be a widely known issue with Regex implementations (https://www.regular-expressions.info/unicode.html), with no known implementations that do better in this case.

Will document this nuance in contains docstring and add unit tests showing it so that we are aware of it, but I expect we can't do much more than that.

@radeusgd radeusgd force-pushed the wip/radeusgd/text-contains-181265795 branch 2 times, most recently from a5ca22a to fcdee07 Compare February 21, 2022 13:15
@radeusgd radeusgd force-pushed the wip/radeusgd/text-contains-181265795 branch from fcdee07 to 264fe31 Compare February 22, 2022 13:28
@radeusgd radeusgd added the CI: Ready to merge This PR is eligible for automatic merge label Feb 22, 2022
@mergify mergify bot merged commit ae9d515 into develop Feb 22, 2022
@mergify mergify bot deleted the wip/radeusgd/text-contains-181265795 branch February 22, 2022 15:41