unicode: switch to uucode grapheme break to (mostly) match unicode spec#9680
unicode: switch to uucode grapheme break to (mostly) match unicode spec#9680mitchellh merged 12 commits intoghostty-org:mainfrom
Conversation
|
Rebase again due to merge, thanks :) |
Done! |
|
This looks awesome. I want to run a |
Sounds good. I haven't gotten to run |
|
@mitchellh I see the same thing. for reference for others, this is main:
So, this PR drops the score from 106 to 98. And then if I run ucs-detect on jacobsandlund#1, built on top of this PR, I see:
Dropping the score again to 91. I'll do some more investigation, but ucs-detect is based on python wcwidth, which is treating too many categories of characters as zero width: https://github.com/jquast/wcwidth/blob/915166f9453098a56e87a7fb69e697696cefe206/bin/update-tables.py#L149-L155
Here you can see the jacobsandlund#1 result of printing a couple of those graphemes, and selecting it correctly groups the cluster, and it displays wide as it should:
and here's the result on main:
just out of curiosity, i changed the uucode
Interestingly, still down 1 point from main. But, I think we can't treat |
I agree with this, we may want to also loop in @jquast too who can maybe provide some opinions. 😄 |
This reminds me... this looks very similar to the overlapping characters seen in #5637 — maybe it wasn't a font shaping/layout problem but rather that we simply weren't assigning the correct cell widths for these characters? |
|
I'm still investigating this, but I'll share my findings thus far. I have a branch on top of jacobsandlund#1 that compares the cumulative I run Then I do some Here's the diff as a gist, with additions being graphemes that are now wrong with the And some lines, inline: -\u{11341} → 𑍁 x: 15.263850999996066
\u{11341}\u{0}\u{11302} → 𑍁 x: 31.586641117930412
\u{11341}\u{0}\u{11303} → 𑍁 x: 25.883497582748532
-\u{11343} → 𑍃 x: 11.315520860254765
-\u{1134d} → 𑍍 x: 16.005108382552862
-\u{1134d} → 𑍍 x: 6.4292732160538435
+\u{11342} → 𑍂 x: 20.558546589687467
\u{1703} → ᜃ x: 28.2716427154541
\u{1704} → ᜄ x: 24.0481351146698
\u{1704}\u{1714} → ᜄ᜔ x: 24.0481351146698
@@ -1044,9 +1157,12 @@
\u{1711} → ᜑ x: 30.13918008995056
\u{1711}\u{1712} → ᜑᜒ x: 30.139180089950557
\u{1780} → ក x: 20.769738256931305
+\u{1780} → ក x: 31.30658107995987
\u{1780}\u{17b7} → កិ x: 20.769738256931305
\u{1780}\u{17bb} → កុ x: 20.769738256931305
\u{1780}\u{17bb}\u{17c6} → កុំ x: 20.769738256931305
+\u{1780}\u{17be} → កើ x: 33.02894961833954
+\u{1780}\u{17c1} → កេ x: 33.02894961833954
\u{1780}\u{17c4} → កោ x: 43.5657924413681
\u{1780}\u{17c6} → កំ x: 20.769738256931305
\u{1780}\u{17cb} → ក់ x: 20.769738256931305
@@ -1058,11 +1174,17 @@
\u{1780}\u{17d2}\u{178a}\u{17c5} → ក្ដៅ x: 43.5657924413681
\u{1780}\u{17d2}\u{179a}\u{17c4} → ក្រោ x: 53.90000367164612
\u{1781} → ខ x: 20.769738256931305
+\u{1781} → ខ x: 31.30658107995987
+\u{1781}\u{17b7} → ខិ x: 20.769738256931305
\u{1781}\u{17bb} → ខុ x: 20.769738256931305
\u{1781}\u{17c6} → ខំ x: 20.769738256931305
\u{1781}\u{17d2} → ខ្ x: 20.769738256931305
+\u{1781}\u{17d2} → ខ្ x: 31.30658107995987
\u{1782} → គ x: 20.769738256931305
+\u{1782} → គ x: 31.30658107995987
+\u{1782}\u{17b6}\u{17c6} → គាំ x: 31.30658107995987
\u{1782}\u{17b7} → គិ x: 20.769738256931305
+\u{1782}\u{17c1} → គេ x: 33.02894961833954
\u{1782}\u{17c4} → គោ x: 43.5657924413681
\u{1782}\u{17c6} → គំ x: 20.769738256931305I'll keep looking a little closer, but I'll also reach out and comment on jquast/wcwidth#155 |
|
Here's the result of my investigation, as a comment on the |
|
CI isn't running for some reason, but you've always been diligent so I'm going to trust you on this. Thank you. |








This PR builds on #9678
so the diff from there is included here (it's not possible to stack PRs unless it's a PR against my own fork)--review that one first!This PR updates the
graphemeBreakcalculation to useuucode'scomputeGraphemeBreakNoControl, which has tests in uucode that confirm it passes theGraphemeBreakTest.txt(minus some exceptions).Note that the
grapheme_break(andgrapheme_break_no_control) property inuucodeincorporatesemoji_modifierandemoji_modifier_base, diverging from UAX #29 but matching UTS #51. See this comment in uucode for details.The
grapheme_break_no_controlproperty andcomputeGraphemeBreakNoControlboth assumecontrol,cr, andlfhave been filtered out, matching the current grapheme break logic in Ghostty.This PR keeps the
Precompute.datalogic mostly equivalent, since theuucodeprecomputedGraphemeBreaklacks benchmarks in theuucoderepository (it was benchmarked in the original PR addinguucodeto Ghostty). Note however, that due tographeme_breakbeing one bit larger thangrapheme_boundary_classand the newBreakStatealso being one bit larger, the state jumps up by a factor of 8 (u10 -> u13), to 8KB.Benchmarks
I benchmarked the oldmainversion versus this PR for+grapheme-breakand surprisingly this PR is 2% faster (?). Looking at the assembly though, I'm thinking something else might be causing that. Once I get to the bottom of that I'll remove the below TODO and include the benchmark results here.When seeing the speedup with
data.txtand maybe a tiny speedup on English wiki, I was surprised given the 1KB -> 8KB tables. Here's what AI said when I asked it to inspect the assembly: https://ampcode.com/threads/T-979b1743-19e7-47c9-8074-9778b4b2a61e, and here's what it said when I asked it to predict the faster version: https://ampcode.com/threads/T-3291dcd3-7a21-4d24-a192-7b3f6e18cd31It looks like two loads got reordered and that put the load that depended on stage1 -> stage2 -> stage3 second, "hiding memory latency". So that makes the new one faster when looking up the
grapheme_breakproperty. These gains go away with the Japanese and Arabic benchmarks, which spend more time processing utf8, and may even have more grapheme clusters too.with data.txt (200 MB ghostty-gen random utf8)
with English wiki dump
with Japanese wiki dump
with Arabic wiki dump
TODO:
uucode'sprecomputedGraphemeBreakwhich uses a 1445 byte table since it uses a dense table (indexed using multiplication instead of bitCast, though, which did show up in the initial benchmarks from deps: Replace ziglyph with uucode #8757 a small amount.)AI was used in some of the uucode changes in #9678 (Amp--primarily for tests), but everything was carefully vetted and much of it done by hand. This PR was made without AI with the exception of consulting AI about whether the "Prepend + ASCII" scenario is common (hopefully it's right about that being uncommon).