Ensure that Text.compare_to compares strings according to grapheme clusters #3282
Two approaches were implemented and compared against the old implementation. The first approach uses [...]. An alternative approach was also tried, which relies on the [...].

The results are shown below (GSheet showing all results). The ASCII test checks strings that contain only ASCII characters (but are still Unicode-encoded), and the Unicode test checks strings that contain non-trivial grapheme clusters, like [...].
We can see that for short strings and long strings where the difference shows up at the end, the [...]. Since we are going to be comparing short strings most of the time, we decided to go for the [...].

I will create a task to investigate a hybrid approach that switches to the BreakIterator for longer strings. In a long string a difference is quite likely to appear early, so although the pessimistic case is far slower than Normalizer on these examples too, in practice the BreakIterator may well be faster for many use cases. We can also investigate improving the loop, but this may require porting some ICU code, which can be brittle.
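To make the trade-off between the two strategies concrete, here is a minimal sketch of both. It is not the code from this PR: it uses the JDK's `java.text.Normalizer` and `java.text.BreakIterator` as stand-ins for the ICU classes, and the class and method names (`GraphemeCompareSketch`, `compareNormalized`, `compareByClusters`) are hypothetical. The first method pays the full normalization cost up front; the second walks cluster by cluster and can stop at the first difference.

```java
import java.text.BreakIterator;
import java.text.Normalizer;

// Hypothetical sketch, not the PR's implementation.
final class GraphemeCompareSketch {
    // Strategy A: normalize both strings completely, then compare.
    // Cost is proportional to the full length of both inputs.
    static int compareNormalized(String a, String b) {
        String na = Normalizer.normalize(a, Normalizer.Form.NFD);
        String nb = Normalizer.normalize(b, Normalizer.Form.NFD);
        return Integer.signum(na.compareTo(nb));
    }

    // Strategy B: iterate over character boundaries in both strings and
    // compare one cluster at a time, returning at the first difference.
    static int compareByClusters(String a, String b) {
        BreakIterator ia = BreakIterator.getCharacterInstance();
        BreakIterator ib = BreakIterator.getCharacterInstance();
        ia.setText(a);
        ib.setText(b);
        int startA = ia.first();
        int startB = ib.first();
        while (true) {
            int endA = ia.next();
            int endB = ib.next();
            boolean doneA = endA == BreakIterator.DONE;
            boolean doneB = endB == BreakIterator.DONE;
            if (doneA || doneB) {
                // The string that runs out of clusters first sorts first.
                return doneA == doneB ? 0 : (doneA ? -1 : 1);
            }
            // Normalize only the current cluster so that canonically
            // equivalent clusters compare as equal.
            String ca = Normalizer.normalize(a.substring(startA, endA), Normalizer.Form.NFD);
            String cb = Normalizer.normalize(b.substring(startB, endB), Normalizer.Form.NFD);
            int c = ca.compareTo(cb);
            if (c != 0) {
                return Integer.signum(c);
            }
            startA = endA;
            startB = endB;
        }
    }
}
```

In this sketch the cluster-based variant does more work per character (boundary detection plus a substring per cluster), which is in line with it being slower in the pessimistic case, while it avoids touching the tail of a long string once a difference has been found.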
An alternative to improving the BreakIterator performance or the hybrid approach is also worth considering, after a deeper analysis of [...].

The O(N+M) cost of [...].

So an optimization that we could do is to cache whether the input string is known to be in FCD form. Our Text instance already contains opaque mutable state (such as whether it is represented by ropes or stored compactly as a String), so we can add one more integer. We could cache not only whether the string is in FCD form but also how long its FCD prefix is, to make the FCD normalization faster (see how the check + normalization work in [...]).

What we cannot do is cache the FCD form by replacing the stored text, because this would make accessors like [...].

One more avenue that could be explored is to cache the FCD form next to the regular string, but this would make non-FCD Texts take 2x more space, which can be quite costly for large strings. This is a classic memory vs. time trade-off. I don't think this particular optimization is worth it, because the cache would live as long as the Text itself (which can be quite long) and the benefit in comparison speed isn't that large. It is probably better to take the time complexity hit here, especially as it only affects non-FCD strings anyway.
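The caching idea can be sketched as follows. This is only an illustration under loud assumptions: the JDK has no FCD check, so the sketch substitutes an NFD check via `java.text.Normalizer.isNormalized`, and the class `CachedTextSketch` with its fields and methods is hypothetical and does not reflect the real Text class. What it shows is the shape of the optimization: one extra integer of lazily filled state, with the stored string left untouched so that accessors keep returning the original text.

```java
import java.text.Normalizer;

// Hypothetical sketch; NFD stands in for FCD, which the JDK does not expose.
final class CachedTextSketch {
    private static final int UNKNOWN = 0;
    private static final int NORMALIZED = 1;
    private static final int NOT_NORMALIZED = 2;

    private final String value;
    // The single extra integer of cached state, filled in on first use.
    private int normalizedState = UNKNOWN;

    CachedTextSketch(String value) {
        this.value = value;
    }

    // Accessors always see the original, unmodified text.
    String raw() {
        return value;
    }

    private boolean isNormalized() {
        if (normalizedState == UNKNOWN) {
            boolean ok = Normalizer.isNormalized(value, Normalizer.Form.NFD);
            normalizedState = ok ? NORMALIZED : NOT_NORMALIZED;
        }
        return normalizedState == NORMALIZED;
    }

    // Strings already known to be normalized skip the normalization pass;
    // others are normalized on every comparison (the time-for-memory choice).
    private String comparisonForm() {
        return isNormalized() ? value : Normalizer.normalize(value, Normalizer.Form.NFD);
    }

    int compareTo(CachedTextSketch other) {
        return Integer.signum(comparisonForm().compareTo(other.comparisonForm()));
    }
}
```

Only the flag is cached here, not the normalized copy, which matches the conclusion above: strings that are not in normalized form pay the normalization cost on each comparison instead of doubling their memory footprint.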
Force-pushed from 82d1600 to 3051ebe.
Can we organise the Bench_Text a little differently, please?
Force-pushed from fb6f47a to 4c3d251.
Force-pushed from 4c3d251 to fc36de9.
Pull Request Description
https://www.pivotaltracker.com/story/show/181175238
Important Notes
Checklist
Please include the following checklist in your PR:
./run dist and ./run watch.