
Ensure that Text.compare_to compares strings according to grapheme clusters #3282

Merged
merged 21 commits into from
Feb 17, 2022

Conversation

radeusgd
Member

@radeusgd radeusgd commented Feb 17, 2022

Pull Request Description

https://www.pivotaltracker.com/story/show/181175238

Important Notes

Checklist

Please include the following checklist in your PR:

  • The documentation has been updated if necessary.
  • All code conforms to the Scala, Java, and Rust style guides.
  • All code has been tested:
    • Unit tests have been written where possible.
    • If GUI codebase was changed: Enso GUI was tested when built using BOTH ./run dist and ./run watch.

@radeusgd
Member Author

radeusgd commented Feb 17, 2022

Two approaches were implemented and compared against the old implementation.
For context, the old implementation is sometimes faster because it relied on simpler logic (which was incorrect for our case!).

The first approach uses Normalizer.compare, which seems to be the go-to method for Unicode-aware comparison in ICU. The downside is that it involves a preprocessing step which checks whether the input is in FCD form (and we cannot assume all texts will be, so the check cannot be skipped), which requires scanning the whole text. So the complexity of this approach is O(N+M), where N and M are the lengths of the texts being compared.
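In spirit, this path amounts to comparing the canonically normalized forms of both inputs. A minimal Python sketch of that behaviour, using the standard library's NFD normalization as a stand-in for ICU's FCD machinery (the function name is ours, not ICU's):

```python
import unicodedata

def normalized_compare(a, b):
    # Normalize both inputs in full before comparing. This up-front
    # pass is the O(N+M) preprocessing cost described above; NFD here
    # stands in for ICU's FCD check-and-normalize step.
    na = unicodedata.normalize("NFD", a)
    nb = unicodedata.normalize("NFD", b)
    # Return -1 / 0 / 1, like a classic compare_to.
    return (na > nb) - (na < nb)
```

For example, `normalized_compare("\u00e9", "e\u0301")` returns 0, because the precomposed and decomposed forms of é are canonically equivalent.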

An alternative approach was tried which uses BreakIterator.getCharacterInstance() to iterate over grapheme clusters and compares single clusters using Normalizer.compare. The complexity of this method is optimal: O(D), where D is the position of the first difference. However, it involves the overhead of using the iterator, creating substrings and, in general, calling Normalizer.compare in a much less tight loop. Thus its constant factor is pretty big, so in cases where D and N+M are close to each other, it can be a few times slower.
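A rough Python sketch of this idea, approximating a grapheme cluster as a base character plus its trailing combining marks (the real implementation uses ICU's BreakIterator.getCharacterInstance(); this splitter and the function names are simplifications of ours):

```python
import unicodedata
from itertools import zip_longest

def graphemes(s):
    # Crude stand-in for BreakIterator.getCharacterInstance(): treat a
    # base character plus any following combining marks as one cluster.
    cluster = ""
    for ch in s:
        if cluster and unicodedata.combining(ch) == 0:
            yield cluster
            cluster = ""
        cluster += ch
    if cluster:
        yield cluster

def grapheme_compare(a, b):
    # Compare cluster by cluster, normalizing only the clusters that
    # are actually visited -- O(D) in the position of the first
    # difference, but with a bigger constant than a single tight scan.
    for ga, gb in zip_longest(graphemes(a), graphemes(b), fillvalue=""):
        na = unicodedata.normalize("NFD", ga)
        nb = unicodedata.normalize("NFD", gb)
        if na != nb:
            return -1 if na < nb else 1
    return 0
```

The per-cluster substring creation and normalization calls are exactly the overhead the benchmark numbers below reflect.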

The results are shown below (GSheet showing all results). The ASCII test checks strings which contain only ASCII characters (but are still Unicode-encoded) and the Unicode test checks strings which contain non-trivial grapheme clusters, like ę or added accent marks as separate codepoints.

| Test | Old | Normalizer | BreakIterator |
| --- | --- | --- | --- |
| strcmp Unicode very short | 60.5 | 45.4 | 133.4 |
| strcmp Unicode medium | 18.7 | 7.6 | 10 |
| strcmp Unicode big - random | 231.9 | 105.9 | 0.1 |
| strcmp Unicode big - early difference | 232.9 | 105.9 | 0.3 |
| strcmp Unicode big - late difference | 1000.4 | 730.8 | 2055.3 |
| strcmp ASCII very short | 34.4 | 33.6 | 112.2 |
| strcmp ASCII medium | 2.9 | 2.6 | 9.5 |
| strcmp ASCII big - random | 26.9 | 27.7 | 0.1 |
| strcmp ASCII big - early difference | 25.9 | 26.7 | 0.1 |
| strcmp ASCII big - late difference | 30 | 59.5 | 1082.3 |

We can see that for short strings and long strings where the difference shows up at the end, the BreakIterator approach can be from 2x to even almost 20x slower. On the other hand, for long strings where the difference occurs early, it can even be more than 200x faster (because it can terminate very early, whereas Normalizer must always do a full scan of the string).

Since we are going to be comparing short strings most of the time, we decided to go for the Normalizer approach for now.

I will create a task to investigate a hybrid approach that switches to the BreakIterator for longer strings: in a long string it is quite likely that a difference appears early, so while the pessimistic case is far slower than Normalizer on these examples too, in practice it may well be faster for many use cases. We can also investigate improving the loop itself, but this may require porting some ICU code, which can be brittle.
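A sketch of what such a hybrid could look like. Everything here is hypothetical: the threshold value would need benchmarking, the names are ours, and Python's NFD normalization again stands in for the ICU machinery:

```python
import unicodedata

THRESHOLD = 256  # hypothetical cut-off; the real value needs benchmarking

def _order(na, nb):
    return (na > nb) - (na < nb)

def full_scan_compare(a, b):
    # Normalizer-style path: normalize everything up front, O(N+M)
    # but with a small constant factor.
    return _order(unicodedata.normalize("NFD", a),
                  unicodedata.normalize("NFD", b))

def early_exit_compare(a, b):
    # BreakIterator-style path: skip the common prefix first, then
    # compare only from the first differing region, roughly O(D).
    i = 0
    limit = min(len(a), len(b))
    while i < limit and a[i] == b[i]:
        i += 1
    # Back up so the differing suffixes start on a base character;
    # otherwise normalizing the suffixes would not be canonical.
    while 0 < i < limit and (unicodedata.combining(a[i])
                             or unicodedata.combining(b[i])):
        i -= 1
    return full_scan_compare(a[i:], b[i:])

def hybrid_compare(a, b):
    # Short inputs: small constant wins. Long inputs: bet that the
    # first difference appears early and terminate there.
    if max(len(a), len(b)) < THRESHOLD:
        return full_scan_compare(a, b)
    return early_exit_compare(a, b)
```

This dispatch keeps the Normalizer behaviour for the common short-string case while capping the worst case of full scans over long, mostly-identical strings.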

@radeusgd
Member Author

An alternative to improving the BreakIterator performance or the hybrid approach, also worth considering after a deeper analysis of Normalizer.compare, is the following:

The O(N+M) cost of Normalizer.compare is incurred by ensuring that the input strings are in FCD form [1] [2]. The Normalizer does allow skipping this check if we know the input is FCD, and then its runtime is O(D) with a pretty small constant. Since our strings will be coming from various places, we cannot guarantee that all inputs will be FCD. Most inputs should be in FCD form, however: all English text, and text in languages like Polish where each letter carries at most one accent, will be FCD; only very non-standard scripts or perhaps complex mathematical symbols will not be.

So an optimization we could do is to cache whether we know that a given string is in FCD form. Our Text instance already contains opaque mutable state (for example, whether it is represented by ropes or stored compactly as a String), so we can add one more integer. We could even cache not just whether the string is FCD, but also how big a prefix of it is, to make the FCD normalization faster (see how the check and normalization work in Normalizer.compare for details). Then the FCD check would happen only once per FCD string, essentially giving us O(D) comparison complexity: the first FCD check is amortized into the cost of constructing the string (although the check would not happen upon construction, but only on demand when it is needed).
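A sketch of that caching scheme in Python, using unicodedata.is_normalized with NFD as a stand-in for ICU's FCD check (the class and field names are illustrative, not Enso's actual internals):

```python
import unicodedata

class Text:
    """Illustrative wrapper caching a 'known to be normalized' flag."""

    def __init__(self, s):
        self._s = s
        self._normalized_flag = None  # unknown until first needed

    def _is_normalized(self):
        # The O(N) check runs at most once per Text and is cached, so
        # repeated comparisons of normalized strings pay only the O(D)
        # comparison cost afterwards.
        if self._normalized_flag is None:
            self._normalized_flag = unicodedata.is_normalized("NFD", self._s)
        return self._normalized_flag

    def compare_to(self, other):
        a, b = self._s, other._s
        # Only non-normalized inputs pay the normalization cost; the
        # stored text itself is never replaced, so accessors like
        # utf_8 keep returning the original byte representation.
        if not self._is_normalized():
            a = unicodedata.normalize("NFD", a)
        if not other._is_normalized():
            b = unicodedata.normalize("NFD", b)
        return (a > b) - (a < b)
```

Note that the normalized form is computed transiently inside compare_to and thrown away, which is exactly the "cache the flag, not the text" property the next paragraph argues for.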

What we cannot do is cache the FCD form by replacing the stored text, because this would make accessors like utf_8 return different results after a comparison (normalizing changes the byte representation, obviously), and that is unacceptable. So with this approach, non-FCD strings would not gain any performance improvement (but if implemented correctly, the performance would be no worse). Still, given that in practice most strings are FCD, this optimization should improve performance a lot. Thus I would suggest this is a better way to go than trying to squeeze better performance out of the BreakIterator, which is still likely to have greater overhead.

One more avenue that could be explored is caching the FCD form next to the regular string, but this would make non-FCD Texts take 2x more space, which can be quite costly for large strings. This is a classic memory vs. time trade-off. I don't think this particular optimization is worth it: the cache would live as long as the Text itself (which can be quite long), and the benefit in comparison speed isn't that large, so it is probably better to take the time complexity hit here (especially as it only affects non-FCD strings anyway).

@radeusgd radeusgd force-pushed the wip/radeusgd/text-compare-to-graphemes-181175238 branch from 82d1600 to 3051ebe Compare February 17, 2022 13:30
@radeusgd radeusgd marked this pull request as ready for review February 17, 2022 13:42
@jdunkerley jdunkerley enabled auto-merge (squash) February 17, 2022 14:08
Member

@jdunkerley jdunkerley left a comment


Can we organise the Bench_Text a little differently, please?

@radeusgd radeusgd self-assigned this Feb 17, 2022
@radeusgd radeusgd enabled auto-merge (squash) February 17, 2022 14:34
@radeusgd radeusgd force-pushed the wip/radeusgd/text-compare-to-graphemes-181175238 branch from fb6f47a to 4c3d251 Compare February 17, 2022 15:13
@radeusgd radeusgd force-pushed the wip/radeusgd/text-compare-to-graphemes-181175238 branch from 4c3d251 to fc36de9 Compare February 17, 2022 15:42
@radeusgd radeusgd added the CI: Ready to merge This PR is eligible for automatic merge label Feb 17, 2022
@mergify mergify bot merged commit 14f5727 into develop Feb 17, 2022
@mergify mergify bot deleted the wip/radeusgd/text-compare-to-graphemes-181175238 branch February 17, 2022 17:09