
Ensure that Text.compare_to compares strings according to grapheme clusters #3282

Merged
merged 21 commits into from
Feb 17, 2022

Conversation

radeusgd
Member

@radeusgd radeusgd commented Feb 17, 2022

Pull Request Description

https://www.pivotaltracker.com/story/show/181175238

Important Notes

Checklist

Please include the following checklist in your PR:

  • The documentation has been updated if necessary.
  • All code conforms to the Scala, Java, and Rust style guides.
  • All code has been tested:
    • Unit tests have been written where possible.
    • If GUI codebase was changed: Enso GUI was tested when built using BOTH ./run dist and ./run watch.

@radeusgd
Member Author

radeusgd commented Feb 17, 2022

Two approaches were implemented and compared against the old implementation.
For context, the old implementation is sometimes faster because it relied on simpler logic (which was incorrect for our case!).

The first approach uses Normalizer.compare, which seems to be the go-to method for Unicode-aware comparison in ICU. The downside is that it involves a preprocessing step which checks whether the input is in FCD form (and we cannot assume all texts will be, so the check cannot be skipped), which requires scanning the whole text. So the complexity of this approach is O(N+M), where N and M are the lengths of the texts being compared.
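In spirit, this path amounts to comparing the canonically normalized forms of both inputs. A minimal Python sketch of that behaviour, using the standard library's NFD normalization as a stand-in for ICU's FCD machinery (the function name is ours, not ICU's):

```python
import unicodedata

def normalized_compare(a, b):
    # Normalize both inputs in full before comparing. This up-front
    # pass is the O(N+M) preprocessing cost described above; NFD here
    # stands in for ICU's FCD check-and-normalize step.
    na = unicodedata.normalize("NFD", a)
    nb = unicodedata.normalize("NFD", b)
    # Return -1 / 0 / 1, like a classic compare_to.
    return (na > nb) - (na < nb)
```

For example, `normalized_compare("\u00e9", "e\u0301")` returns 0, because the precomposed and decomposed forms of é are canonically equivalent.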

An alternative approach was tried which uses BreakIterator.getCharacterInstance() to iterate over grapheme clusters and compares single clusters using Normalizer.compare. The complexity of this method is optimal: O(D), where D is the position of the first difference. However, it involves the overhead of using the iterator, creating substrings and, in general, calling Normalizer.compare in a much less tight loop. Thus its constant factor is pretty big, so in cases where D and N+M are close to each other, it can be a few times slower.
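A rough Python sketch of this idea, approximating a grapheme cluster as a base character plus its trailing combining marks (the real implementation uses ICU's BreakIterator.getCharacterInstance(); this splitter and the function names are simplifications of ours):

```python
import unicodedata
from itertools import zip_longest

def graphemes(s):
    # Crude stand-in for BreakIterator.getCharacterInstance(): treat a
    # base character plus any following combining marks as one cluster.
    cluster = ""
    for ch in s:
        if cluster and unicodedata.combining(ch) == 0:
            yield cluster
            cluster = ""
        cluster += ch
    if cluster:
        yield cluster

def grapheme_compare(a, b):
    # Compare cluster by cluster, normalizing only the clusters that
    # are actually visited -- O(D) in the position of the first
    # difference, but with a bigger constant than a single tight scan.
    for ga, gb in zip_longest(graphemes(a), graphemes(b), fillvalue=""):
        na = unicodedata.normalize("NFD", ga)
        nb = unicodedata.normalize("NFD", gb)
        if na != nb:
            return -1 if na < nb else 1
    return 0
```

The per-cluster substring creation and normalization calls are exactly the overhead the benchmark numbers below reflect.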

The results are shown below (GSheet showing all results). The ASCII test checks strings which contain only ASCII characters (but are still Unicode-encoded) and the Unicode test checks strings which contain non-trivial grapheme clusters, like ę or added accent marks as separate codepoints.

| Test | Old | Normalizer | BreakIterator |
| --- | --- | --- | --- |
| strcmp Unicode very short | 60.5 | 45.4 | 133.4 |
| strcmp Unicode medium | 18.7 | 7.6 | 10 |
| strcmp Unicode big - random | 231.9 | 105.9 | 0.1 |
| strcmp Unicode big - early difference | 232.9 | 105.9 | 0.3 |
| strcmp Unicode big - late difference | 1000.4 | 730.8 | 2055.3 |
| strcmp ASCII very short | 34.4 | 33.6 | 112.2 |
| strcmp ASCII medium | 2.9 | 2.6 | 9.5 |
| strcmp ASCII big - random | 26.9 | 27.7 | 0.1 |
| strcmp ASCII big - early difference | 25.9 | 26.7 | 0.1 |
| strcmp ASCII big - late difference | 30 | 59.5 | 1082.3 |

We can see that for short strings and long strings where the difference shows up at the end, the BreakIterator approach can be from 2x to even almost 20x slower. On the other hand, for long strings where the difference occurs early, it can even be more than 200x faster (because it can terminate very early, whereas Normalizer must always do a full scan of the string).

Since we are going to be comparing short strings most of the time, we decided to go for the Normalizer approach for now.

I will create a task to investigate a hybrid approach that switches to the BreakIterator for longer strings: in a long string it is quite likely that a difference appears early, so while the pessimistic case is far slower than Normalizer on these examples too, in practice it may well be faster for many use cases. We can also investigate improving the loop itself, but this may require porting some ICU code, which can be brittle.
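A sketch of what such a hybrid could look like. Everything here is hypothetical: the threshold value would need benchmarking, the names are ours, and Python's NFD normalization again stands in for the ICU machinery:

```python
import unicodedata

THRESHOLD = 256  # hypothetical cut-off; the real value needs benchmarking

def _order(na, nb):
    return (na > nb) - (na < nb)

def full_scan_compare(a, b):
    # Normalizer-style path: normalize everything up front, O(N+M)
    # but with a small constant factor.
    return _order(unicodedata.normalize("NFD", a),
                  unicodedata.normalize("NFD", b))

def early_exit_compare(a, b):
    # BreakIterator-style path: skip the common prefix first, then
    # compare only from the first differing region, roughly O(D).
    i = 0
    limit = min(len(a), len(b))
    while i < limit and a[i] == b[i]:
        i += 1
    # Back up so the differing suffixes start on a base character;
    # otherwise normalizing the suffixes would not be canonical.
    while 0 < i < limit and (unicodedata.combining(a[i])
                             or unicodedata.combining(b[i])):
        i -= 1
    return full_scan_compare(a[i:], b[i:])

def hybrid_compare(a, b):
    # Short inputs: small constant wins. Long inputs: bet that the
    # first difference appears early and terminate there.
    if max(len(a), len(b)) < THRESHOLD:
        return full_scan_compare(a, b)
    return early_exit_compare(a, b)
```

This dispatch keeps the Normalizer behaviour for the common short-string case while capping the worst case of full scans over long, mostly-identical strings.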

@radeusgd
Member Author

An alternative to improving the BreakIterator performance or the hybrid approach, also worth considering after a deeper analysis of Normalizer.compare, is the following:

The O(N+M) cost of Normalizer.compare is incurred by ensuring that the input strings are in FCD form [1] [2]. The Normalizer does allow skipping this check if we know the input is FCD, and then its runtime is O(D) with a pretty small constant. Since our strings will be coming from various places, we cannot guarantee that all inputs will be FCD. Most inputs should be in FCD form, however: all English text, and text in languages like Polish where each letter carries at most one accent, will be FCD; only very non-standard scripts or perhaps complex mathematical symbols will not be.

So an optimization we could do is to cache whether we know that a given string is in FCD form. Our Text instance already contains opaque mutable state (for example, whether it is represented by ropes or stored compactly as a String), so we can add one more integer. We could even cache not just whether the string is FCD, but also how big a prefix of it is, to make the FCD normalization faster (see how the check and normalization work in Normalizer.compare for details). Then the FCD check would happen only once per FCD string, essentially giving us O(D) comparison complexity: the first FCD check is amortized into the cost of constructing the string (although the check would not happen upon construction, but only on demand when it is needed).
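A sketch of that caching scheme in Python, using unicodedata.is_normalized with NFD as a stand-in for ICU's FCD check (the class and field names are illustrative, not Enso's actual internals):

```python
import unicodedata

class Text:
    """Illustrative wrapper caching a 'known to be normalized' flag."""

    def __init__(self, s):
        self._s = s
        self._normalized_flag = None  # unknown until first needed

    def _is_normalized(self):
        # The O(N) check runs at most once per Text and is cached, so
        # repeated comparisons of normalized strings pay only the O(D)
        # comparison cost afterwards.
        if self._normalized_flag is None:
            self._normalized_flag = unicodedata.is_normalized("NFD", self._s)
        return self._normalized_flag

    def compare_to(self, other):
        a, b = self._s, other._s
        # Only non-normalized inputs pay the normalization cost; the
        # stored text itself is never replaced, so accessors like
        # utf_8 keep returning the original byte representation.
        if not self._is_normalized():
            a = unicodedata.normalize("NFD", a)
        if not other._is_normalized():
            b = unicodedata.normalize("NFD", b)
        return (a > b) - (a < b)
```

Note that the normalized form is computed transiently inside compare_to and thrown away, which is exactly the "cache the flag, not the text" property the next paragraph argues for.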

What we cannot do is cache the FCD form by replacing the stored text, because this would make accessors like utf_8 return different results after a comparison (normalizing changes the byte representation, obviously), and that is unacceptable. So with this approach, non-FCD strings would not gain any performance improvement (but if implemented correctly, the performance would be no worse). Still, given that in practice most strings are FCD, this optimization should improve performance a lot. Thus I would suggest this is a better way to go than trying to squeeze better performance out of the BreakIterator, which is still likely to have greater overhead.

One more avenue that could be explored is caching the FCD form next to the regular string, but this would make non-FCD Texts take 2x more space, which can be quite costly for large strings. This is a classic memory vs. time trade-off. I don't think this particular optimization is worth it: the cache would live as long as the Text itself (which can be quite long), and the benefit in comparison speed isn't that large, so it is probably better to take the time complexity hit here (especially as it only affects non-FCD strings anyway).

@radeusgd radeusgd force-pushed the wip/radeusgd/text-compare-to-graphemes-181175238 branch from 82d1600 to 3051ebe Compare February 17, 2022 13:30
@radeusgd radeusgd marked this pull request as ready for review February 17, 2022 13:42
@jdunkerley jdunkerley enabled auto-merge (squash) February 17, 2022 14:08
Member

@jdunkerley jdunkerley left a comment


Can we organise the Bench_Text a little differently, please?

@radeusgd radeusgd self-assigned this Feb 17, 2022
@radeusgd radeusgd enabled auto-merge (squash) February 17, 2022 14:34
@radeusgd radeusgd force-pushed the wip/radeusgd/text-compare-to-graphemes-181175238 branch from fb6f47a to 4c3d251 Compare February 17, 2022 15:13
@radeusgd radeusgd force-pushed the wip/radeusgd/text-compare-to-graphemes-181175238 branch from 4c3d251 to fc36de9 Compare February 17, 2022 15:42
@radeusgd radeusgd added the CI: Ready to merge This PR is eligible for automatic merge label Feb 17, 2022
@mergify mergify bot merged commit 14f5727 into develop Feb 17, 2022
@mergify mergify bot deleted the wip/radeusgd/text-compare-to-graphemes-181175238 branch February 17, 2022 17:09