unicode: switch to uucode grapheme break to (mostly) match unicode spec #9680

Merged
mitchellh merged 12 commits into ghostty-org:main from jacobsandlund:grapheme-break on Jan 20, 2026

@jacobsandlund jacobsandlund commented Nov 24, 2025

This PR builds on #9678 so the diff from there is included here (it's not possible to stack PRs unless it's a PR against my own fork)--review that one first!

This PR updates the graphemeBreak calculation to use uucode's computeGraphemeBreakNoControl, which has tests in uucode that confirm it passes the GraphemeBreakTest.txt (minus some exceptions).

Note that the grapheme_break (and grapheme_break_no_control) property in uucode incorporates emoji_modifier and emoji_modifier_base, diverging from UAX #29 but matching UTS #51. See this comment in uucode for details.

The grapheme_break_no_control property and computeGraphemeBreakNoControl both assume control, cr, and lf have been filtered out, matching the current grapheme break logic in Ghostty.

This PR keeps the Precompute.data logic mostly equivalent, since the uucode precomputedGraphemeBreak lacks benchmarks in the uucode repository (it was benchmarked in the original PR adding uucode to Ghostty). Note, however, that because grapheme_break is one bit larger than grapheme_boundary_class and the new BreakState is also one bit larger, the table grows by a factor of 8 (key width u10 -> u13), from 1KB to 8KB.
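
The factor of 8 follows from the key width. Here is a minimal sketch of the arithmetic, assuming the table is indexed by a bit-packed key holding one state and two property values, with one byte per entry. The field widths (2 and 4 bits before, 3 and 5 bits after) are inferred from the "one bit larger" remarks above and are not taken from the Ghostty source; `packed_table_bytes` is a hypothetical helper.

```python
def packed_table_bytes(state_bits: int, prop_bits: int, entry_bytes: int = 1) -> int:
    """Bytes needed for a table indexed by a key packing (state, prop, prop)."""
    key_bits = state_bits + 2 * prop_bits  # one state plus two property values
    return (1 << key_bits) * entry_bytes


# Assumed field widths (not confirmed against the Ghostty source).
old_bytes = packed_table_bytes(state_bits=2, prop_bits=4)  # 10-bit key
new_bytes = packed_table_bytes(state_bits=3, prop_bits=5)  # 13-bit key

print(old_bytes, new_bytes, new_bytes // old_bytes)  # 1024 8192 8
```

Each of the three fields gains one bit, so the key gains three bits and the table has 2^3 = 8 times as many entries.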

Benchmarks

I benchmarked the old main version versus this PR for +grapheme-break and surprisingly this PR is 2% faster (?). Looking at the assembly though, I'm thinking something else might be causing that. Once I get to the bottom of that I'll remove the below TODO and include the benchmark results here.

When seeing the speedup with data.txt and maybe a tiny speedup on English wiki, I was surprised given the 1KB -> 8KB tables. Here's what AI said when I asked it to inspect the assembly: https://ampcode.com/threads/T-979b1743-19e7-47c9-8074-9778b4b2a61e, and here's what it said when I asked it to predict the faster version: https://ampcode.com/threads/T-3291dcd3-7a21-4d24-a192-7b3f6e18cd31

It looks like two loads got reordered and that put the load that depended on stage1 -> stage2 -> stage3 second, "hiding memory latency". So that makes the new one faster when looking up the grapheme_break property. These gains go away with the Japanese and Arabic benchmarks, which spend more time processing utf8, and may even have more grapheme clusters too.

with data.txt (200 MB ghostty-gen random utf8)

[screenshot: CleanShot 2025-11-26 at 08 42 03@2x]

with English wiki dump

[screenshot: CleanShot 2025-11-26 at 08 43 15@2x]

with Japanese wiki dump

[screenshot: CleanShot 2025-11-26 at 08 43 49@2x]

with Arabic wiki dump

[screenshot: CleanShot 2025-11-26 at 08 44 25@2x]

TODO:

  • Take a closer look at the assembly and understand why this PR (8 KB vs 1 KB table) is faster on my machine.
  • (edit: checking this off because it seems unnecessary) If this turns out to be unacceptably slower, one possibility is to switch to uucode's precomputedGraphemeBreak, which uses a dense 1445-byte table. It is indexed using multiplication instead of bitCast, though, and that cost showed up by a small amount in the initial benchmarks from #8757 (deps: Replace ziglyph with uucode).
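
The trade-off between the dense table and the bit-packed one can be sketched as follows. This is only an illustration: the counts of 5 states and 17 property values are assumptions chosen because 5 × 17 × 17 = 1445 matches the size quoted above (the real counts are not stated in this thread), and `dense_index` and `packed_index` are hypothetical helpers, not uucode or Ghostty functions.

```python
from itertools import product

N_STATES = 5   # assumption: number of distinct break states
N_PROPS = 17   # assumption: number of distinct grapheme_break values


def dense_index(state: int, gb1: int, gb2: int) -> int:
    """Row-major index: no wasted slots, but needs multiplications."""
    return (state * N_PROPS + gb1) * N_PROPS + gb2


def packed_index(state: int, gb1: int, gb2: int, prop_bits: int = 5) -> int:
    """Bit-packed index: shifts and ORs only, but pads each field to a power of two."""
    return (state << (2 * prop_bits)) | (gb1 << prop_bits) | gb2


dense_size = N_STATES * N_PROPS * N_PROPS  # entries in the dense table
packed_size = 1 << (3 + 2 * 5)             # entries in a 13-bit packed table

# Both schemes give every (state, gb1, gb2) triple its own slot.
all_keys = list(product(range(N_STATES), range(N_PROPS), range(N_PROPS)))
dense_indices = {dense_index(*k) for k in all_keys}
packed_indices = {packed_index(*k) for k in all_keys}

print(dense_size, packed_size)  # 1445 8192
```

Under these assumptions the packed table leaves most of its 8192 slots unused, which is the price paid for computing the index with a plain reinterpretation of bits.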

AI was used in some of the uucode changes in #9678 (Amp--primarily for tests), but everything was carefully vetted and much of it done by hand. This PR was made without AI with the exception of consulting AI about whether the "Prepend + ASCII" scenario is common (hopefully it's right about that being uncommon).

@jacobsandlund jacobsandlund requested review from a team as code owners November 24, 2025 14:56
@jacobsandlund jacobsandlund changed the title unicode: switch to uucode grapheme break to (mostly) match unicode 16 spec unicode: switch to uucode grapheme break to (mostly) match unicode spec Nov 24, 2025
@mitchellh
Contributor

Rebase again due to merge, thanks :)

@jacobsandlund
Contributor Author

> Rebase again due to merge, thanks :)

Done!

@mitchellh
Contributor

This looks awesome. I want to run a vtebench, cat a non-ASCII file, and also see what our wcwidth test results are after this before merging, if possible. I can help with this when I have time. 😄

@jacobsandlund
Contributor Author

> This looks awesome. I want to run a vtebench, cat a non-ASCII file, and also see what our wcwidth test results are after this before merging, if possible. I can help with this when I have time. 😄

Sounds good. I haven't gotten to run vtebench on it yet, but I will. I updated the description with the +grapheme-break benchmarks.

@mitchellh
Contributor

In this PR:

[screenshot: CleanShot 2025-11-26 at 13 32 11@2x]

@jacobsandlund
Contributor Author

@mitchellh I see the same thing. For reference for others, this is main:

[screenshot: CleanShot 2025-11-26 at 17 45 43@2x]

So, this PR drops the score from 106 to 98.

And then if I run ucs-detect on jacobsandlund#1, built on top of this PR, I see:

[screenshot: CleanShot 2025-11-26 at 17 51 19@2x]

Dropping the score again to 91.

I'll do some more investigation, but ucs-detect is based on python wcwidth, which is treating too many categories of characters as zero width: https://github.com/jquast/wcwidth/blob/915166f9453098a56e87a7fb69e697696cefe206/bin/update-tables.py#L149-L155

Mc (which isn't even present in that comment) is the "spacing combining mark" category, which I think should generally be treated as taking up space, so a character with such a mark should be wide. Here's an amp thread: https://ampcode.com/threads/T-93ae4196-0abf-43db-8f26-f1d0ea869dba
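
To make the category question concrete, here is a small sketch using Python's standard `unicodedata` module. It does not reproduce the wcwidth or uucode logic: the two category sets and the helper `zero_width_code_points` are assumptions made for illustration, showing only which code points in a cluster would be counted as zero width under a narrower or a broader policy.

```python
import unicodedata

# Assumed policies, for illustration only (not the actual wcwidth or uucode tables).
BASE_ZERO_WIDTH = {"Mn", "Me"}                        # nonspacing and enclosing marks
EXTENDED_ZERO_WIDTH = BASE_ZERO_WIDTH | {"Mc", "Cf"}  # also spacing marks and format chars


def zero_width_code_points(cluster: str, zero_width_categories: set[str]) -> list[str]:
    """List the code points in `cluster` whose general category is in the given set."""
    return [
        f"U+{ord(ch):04X}"
        for ch in cluster
        if unicodedata.category(ch) in zero_width_categories
    ]


khmer_e = "\u1780\u17c1"  # KA + VOWEL SIGN E (category Mc)
khmer_i = "\u1780\u17b7"  # KA + VOWEL SIGN I (category Mn)

for cluster in (khmer_e, khmer_i):
    print(
        [unicodedata.category(ch) for ch in cluster],
        zero_width_code_points(cluster, BASE_ZERO_WIDTH),
        zero_width_code_points(cluster, EXTENDED_ZERO_WIDTH),
    )
```

Only the first cluster changes between the two policies, because its vowel sign is a spacing mark (Mc) rather than a nonspacing one (Mn).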

Here you can see the jacobsandlund#1 result of printing a couple of those graphemes, and selecting it correctly groups the cluster, and it displays wide as it should:

[screenshots: CleanShot 2025-11-26 at 18 29 32@2x, CleanShot 2025-11-26 at 18 36 21@2x]

and here's the result on main:

[screenshots: CleanShot 2025-11-26 at 18 32 56@2x, CleanShot 2025-11-26 at 18 36 51@2x]

Just out of curiosity, I changed the uucode wcwidth calculation to treat Mc and Cf as 0 to match ucs-detect, on top of jacobsandlund#1, and I get this score:

[screenshot: CleanShot 2025-11-26 at 18 40 22@2x]

Interestingly, still down 1 point from main. But, I think we can't treat ucs-detect here as the source of truth. I could open a PR on ucs-detect/wcwidth for fixes that it needs, or even make a uucode one.

@mitchellh
Contributor

> Interestingly, still down 1 point from main. But, I think we can't treat ucs-detect here as the source of truth. I could open a PR on ucs-detect/wcwidth for fixes that it needs, or even make a uucode one.

I agree with this. We may also want to loop in @jquast, who can maybe provide some opinions. 😄

@pluiedev
Member

> Here you can see the jacobsandlund#1 result of printing a couple of those graphemes, and selecting it correctly groups the cluster, and it displays wide as it should:

This reminds me... this looks very similar to the overlapping characters seen in #5637 — maybe it wasn't a font shaping/layout problem but rather that we simply weren't assigning the correct cell widths for these characters?

@jacobsandlund
Contributor Author

jacobsandlund commented Dec 3, 2025

I'm still investigating this, but I'll share my findings thus far. I have a branch on top of jacobsandlund#1 that compares the cumulative advance.width for a grapheme cluster using CoreText (I'm on Mac) with the expected Ghostty or uucode width, and logs when there's a mismatch: jacobsandlund/ghostty@grapheme-width-changes...debug-width
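
One way such a comparison could work is sketched below. This is not the code from the debug-width branch: the rule for converting an advance into a cell count, the tolerance, and the names `rendered_cells` and `width_mismatch` are all assumptions made for illustration.

```python
import math


def rendered_cells(advances: list[float], cell_width: float, tolerance: float = 0.01) -> int:
    """Cells covered by the summed glyph advances (hypothetical rule).

    `tolerance` is in cell units and absorbs small rounding noise from shaping.
    """
    total = sum(advances)
    return max(1, math.ceil(total / cell_width - tolerance))


def width_mismatch(advances: list[float], expected_cells: int, cell_width: float) -> bool:
    """True when the shaped width disagrees with the expected cell count."""
    return rendered_cells(advances, cell_width) != expected_cells


# Hypothetical numbers: a 10-unit cell and a two-glyph cluster.
print(width_mismatch([9.5], expected_cells=1, cell_width=10.0))         # False
print(width_mismatch([10.0, 10.5], expected_cells=2, cell_width=10.0))  # True
```

A real check would also need to decide how to treat clusters wider than two cells, which this sketch simply reports as mismatches.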

I run ucs-detect.sh and tee those logs to a file, then I repeat but with Mc and Cf treated as zero width. That results in 133,865 lines in the former case and 144,257 in the latter.

Then I do some sed and sort -u and get basically just the failing graphemes. There are some repeats, because I'm also including x: <advance.width>, and that apparently can change in different shaping runs sometimes (I need to understand shaping more). That gives me 4952 failing graphemes for the first case (with a little double counting), and 5799 failing graphemes for Mc and Cf as zero width.
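
The double counting can be illustrated with a short sketch. It assumes each cleaned log line has the shape `<escapes> → <text>    x: <advance>`, as in the excerpt further down, with no diff markers in front; `parse_log_line` is a hypothetical helper, not part of the actual pipeline.

```python
def parse_log_line(line: str) -> tuple[str, float]:
    """Split '<escapes> → <text>    x: <advance>' into (escapes, advance)."""
    escapes, _, rest = line.partition(" → ")
    _, _, advance = rest.rpartition("x: ")
    return escapes.strip(), float(advance)


# Sample lines in the assumed format; two graphemes, one seen with two advances.
sample = [
    r"\u{1780} → ក    x: 20.769738256931305",
    r"\u{1780} → ក    x: 31.30658107995987",
    r"\u{1781} → ខ    x: 20.769738256931305",
    r"\u{1781} → ខ    x: 20.769738256931305",
]

unique_lines = sorted(set(sample))  # roughly what `sort -u` keeps
unique_graphemes = sorted({parse_log_line(line)[0] for line in sample})

print(len(unique_lines), len(unique_graphemes))  # 3 2
```

Deduplicating on the escape sequence alone drops the repeats caused by differing advance values, at the cost of hiding which advance was logged.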

Here's the diff as a gist, with additions being graphemes that are now wrong with the Mc and Cf treated as zero width: https://gist.github.com/jacobsandlund/03a9f052198b5cd5b28a0e5e803ebf80

And some lines, inline:

-\u{11341} → 𑍁    x: 15.263850999996066
 \u{11341}\u{0}\u{11302} → 𑍁    x: 31.586641117930412
 \u{11341}\u{0}\u{11303} → 𑍁    x: 25.883497582748532
-\u{11343} → 𑍃    x: 11.315520860254765
-\u{1134d} → 𑍍    x: 16.005108382552862
-\u{1134d} → 𑍍    x: 6.4292732160538435
+\u{11342} → 𑍂    x: 20.558546589687467
 \u{1703} → ᜃ    x: 28.2716427154541
 \u{1704} → ᜄ    x: 24.0481351146698
 \u{1704}\u{1714} → ᜄ᜔    x: 24.0481351146698
@@ -1044,9 +1157,12 @@
 \u{1711} → ᜑ    x: 30.13918008995056
 \u{1711}\u{1712} → ᜑᜒ    x: 30.139180089950557
 \u{1780} → ក    x: 20.769738256931305
+\u{1780} → ក    x: 31.30658107995987
 \u{1780}\u{17b7} → កិ    x: 20.769738256931305
 \u{1780}\u{17bb} → កុ    x: 20.769738256931305
 \u{1780}\u{17bb}\u{17c6} → កុំ    x: 20.769738256931305
+\u{1780}\u{17be} → កើ    x: 33.02894961833954
+\u{1780}\u{17c1} → កេ    x: 33.02894961833954
 \u{1780}\u{17c4} → កោ    x: 43.5657924413681
 \u{1780}\u{17c6} → កំ    x: 20.769738256931305
 \u{1780}\u{17cb} → ក់    x: 20.769738256931305
@@ -1058,11 +1174,17 @@
 \u{1780}\u{17d2}\u{178a}\u{17c5} → ក្ដៅ    x: 43.5657924413681
 \u{1780}\u{17d2}\u{179a}\u{17c4} → ក្រោ    x: 53.90000367164612
 \u{1781} → ខ    x: 20.769738256931305
+\u{1781} → ខ    x: 31.30658107995987
+\u{1781}\u{17b7} → ខិ    x: 20.769738256931305
 \u{1781}\u{17bb} → ខុ    x: 20.769738256931305
 \u{1781}\u{17c6} → ខំ    x: 20.769738256931305
 \u{1781}\u{17d2} → ខ្    x: 20.769738256931305
+\u{1781}\u{17d2} → ខ្    x: 31.30658107995987
 \u{1782} → គ    x: 20.769738256931305
+\u{1782} → គ    x: 31.30658107995987
+\u{1782}\u{17b6}\u{17c6} → គាំ    x: 31.30658107995987
 \u{1782}\u{17b7} → គិ    x: 20.769738256931305
+\u{1782}\u{17c1} → គេ    x: 33.02894961833954
 \u{1782}\u{17c4} → គោ    x: 43.5657924413681
 \u{1782}\u{17c6} → គំ    x: 20.769738256931305

I'll keep looking a little closer, but I'll also reach out and comment on jquast/wcwidth#155

@jacobsandlund
Contributor Author

Here's the result of my investigation, as a comment on the jquast/wcwidth#155 issue: jquast/wcwidth#155 (comment)

@mitchellh mitchellh merged commit 49b2b8d into ghostty-org:main Jan 20, 2026
@mitchellh mitchellh added this to the 1.3.0 milestone Jan 20, 2026
@mitchellh
Contributor

CI isn't running for some reason, but you've always been diligent so I'm going to trust you on this. Thank you.
