unicode: switch to uucode grapheme break to (mostly) match unicode spec by jacobsandlund · Pull Request #9680 · ghostty-org/ghostty

jacobsandlund · 2025-11-24T14:56:16Z

This PR builds on #9678 ~~so the diff from there is included here (it's not possible to stack PRs unless it's a PR against my own fork)--review that one first!~~

This PR updates the graphemeBreak calculation to use uucode's computeGraphemeBreakNoControl, which has tests in uucode that confirm it passes the GraphemeBreakTest.txt (minus some exceptions).

Note that the grapheme_break (and grapheme_break_no_control) property in uucode incorporates emoji_modifier and emoji_modifier_base, diverging from UAX #29 but matching UTS #51. See this comment in uucode for details.

The grapheme_break_no_control property and computeGraphemeBreakNoControl both assume control, cr, and lf have been filtered out, matching the current grapheme break logic in Ghostty.

This PR keeps the Precompute.data logic mostly equivalent, since the uucode precomputedGraphemeBreak lacks benchmarks in the uucode repository (it was benchmarked in the original PR adding uucode to Ghostty). Note, however, that because grapheme_break is one bit larger than grapheme_boundary_class, and the new BreakState is also one bit larger, the precomputed table grows by a factor of 8 (key width u10 -> u13), to 8KB.
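The factor of 8 can be checked with a little arithmetic. The sketch below is illustrative only: the individual field widths (4 -> 5 bits per class, 2 -> 3 bits for the state) are assumptions chosen to reproduce the u10 and u13 key widths, and one byte per entry is assumed to match the 1KB -> 8KB figures.

```python
# Illustrative arithmetic only. The per-field bit widths are assumptions
# picked to add up to the u10 and u13 key widths quoted in the description.

def table_entries(class_bits: int, state_bits: int) -> int:
    """Slots needed when the key packs (state, class1, class2) into bits."""
    key_bits = state_bits + 2 * class_bits
    return 1 << key_bits

old_entries = table_entries(class_bits=4, state_bits=2)  # 10-bit key
new_entries = table_entries(class_bits=5, state_bits=3)  # 13-bit key

# Two class fields and one state field each gain one bit: 3 extra bits.
growth = new_entries // old_entries

print(old_entries, new_entries, growth)  # 1024 8192 8
```

With one byte per entry, that is 1024 bytes before and 8192 bytes after.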

Benchmarks

I benchmarked the old main version versus this PR for +grapheme-break and surprisingly this PR is 2% faster (?). Looking at the assembly though, I'm thinking something else might be causing that. Once I get to the bottom of that I'll remove the below TODO and include the benchmark results here.

When seeing the speedup with data.txt and maybe a tiny speedup on English wiki, I was surprised given the 1KB -> 8KB tables. Here's what AI said when I asked it to inspect the assembly: https://ampcode.com/threads/T-979b1743-19e7-47c9-8074-9778b4b2a61e, and here's what it said when I asked it to predict the faster version: https://ampcode.com/threads/T-3291dcd3-7a21-4d24-a192-7b3f6e18cd31

It looks like two loads got reordered and that put the load that depended on stage1 -> stage2 -> stage3 second, "hiding memory latency". So that makes the new one faster when looking up the grapheme_break property. These gains go away with the Japanese and Arabic benchmarks, which spend more time processing utf8, and may even have more grapheme clusters too.
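For readers unfamiliar with staged property tables, here is a minimal sketch of a three-stage lookup. The layout (256-entry blocks, deduplicated blocks in stage2, unique values in stage3) and the names `build_stages`, `lookup`, and `toy_prop` are assumptions for illustration, not uucode's actual implementation. The point is that each load depends on the result of the previous one, so where that chain sits relative to other loads can affect how much latency is hidden.

```python
BLOCK = 256  # assumed block size for this sketch

def build_stages(prop, max_cp):
    """Build (stage1, stage2, stage3) for code points 0..max_cp."""
    stage1, stage2, stage3 = [], [], []
    block_ids, value_ids = {}, {}
    for start in range(0, max_cp + 1, BLOCK):
        block = []
        for cp in range(start, start + BLOCK):
            value = prop(cp)
            if value not in value_ids:
                value_ids[value] = len(stage3)
                stage3.append(value)
            block.append(value_ids[value])
        key = tuple(block)
        if key not in block_ids:
            # Only store blocks we have not seen before.
            block_ids[key] = len(stage2) // BLOCK
            stage2.extend(block)
        stage1.append(block_ids[key])
    return stage1, stage2, stage3

def lookup(stages, cp):
    """Three dependent loads: stage1 -> stage2 -> stage3."""
    stage1, stage2, stage3 = stages
    block = stage1[cp >> 8]
    index = stage2[block * BLOCK + (cp & 0xFF)]
    return stage3[index]

def toy_prop(cp):
    # Hypothetical property: only the combining diacritics block is "extend".
    return "extend" if 0x300 <= cp <= 0x36F else "other"

stages = build_stages(toy_prop, 0xFFFF)
print(lookup(stages, 0x301), lookup(stages, ord("a")))
```

In this toy example only two distinct blocks exist, so stage2 holds 512 entries instead of 65,536.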

with data.txt (200 MB ghostty-gen random utf8)

with English wiki dump

with Japanese wiki dump

with Arabic wiki dump

TODO:

Take a closer look at the assembly and understand why this PR (8 KB vs 1 KB table) is faster on my machine.
(edit: checking this off because it seems unnecessary) If this turns out to be unacceptably slower, one possibility is to switch to uucode's precomputedGraphemeBreak, which uses a dense 1445-byte table. It is indexed using multiplication instead of bitCast, though, and that cost did show up by a small amount in the initial benchmarks from "deps: Replace ziglyph with uucode" (#8757).
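To make the dense-versus-packed trade-off concrete, here is a rough sketch. The counts of 17 classes and 5 states are an inference from 1445 = 5 × 17 × 17, not something stated in the PR, and the field order in `packed_index` is likewise assumed.

```python
# Assumed counts: 5 * 17 * 17 == 1445, matching the dense table size quoted.
N_CLASSES = 17
N_STATES = 5

CLASS_BITS = (N_CLASSES - 1).bit_length()  # 5 bits hold values 0..16
STATE_BITS = (N_STATES - 1).bit_length()   # 3 bits hold values 0..4

def packed_index(state: int, gb1: int, gb2: int) -> int:
    """Bit-packed key (the bitCast-style approach): shifts and ORs only."""
    return (state << (2 * CLASS_BITS)) | (gb1 << CLASS_BITS) | gb2

def dense_index(state: int, gb1: int, gb2: int) -> int:
    """Dense key: no unused slots, but needs multiplications."""
    return (state * N_CLASSES + gb1) * N_CLASSES + gb2

packed_size = 1 << (STATE_BITS + 2 * CLASS_BITS)
dense_size = N_STATES * N_CLASSES * N_CLASSES

print(packed_size, dense_size)  # 8192 1445
```

The packed table leaves most of its 8192 slots unused because 17 and 5 are not powers of two; the dense table has no holes, at the cost of the multiplications in the index.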

AI was used in some of the uucode changes in #9678 (Amp--primarily for tests), but everything was carefully vetted and much of it done by hand. This PR was made without AI with the exception of consulting AI about whether the "Prepend + ASCII" scenario is common (hopefully it's right about that being uncommon).

mitchellh · 2025-11-24T16:20:21Z

Rebase again due to merge, thanks :)

jacobsandlund · 2025-11-24T16:45:21Z

> Rebase again due to merge, thanks :)

Done!

mitchellh · 2025-11-25T04:36:38Z

This looks awesome. I want to run a vtebench, cat a non-ASCII file, and also see what our wcwidth test results are after this before merging, if possible. I can help with this when I have time. 😄

jacobsandlund · 2025-11-26T13:56:57Z

> This looks awesome. I want to run a vtebench, cat a non-ASCII file, and also see what our wcwidth test results are after this before merging, if possible. I can help with this when I have time. 😄

Sounds good. I haven't gotten to run vtebench on it yet, but I will. I updated the description with the +grapheme-break benchmarks.

mitchellh · 2025-11-26T21:32:25Z

In this PR:

jacobsandlund · 2025-11-26T23:43:13Z

@mitchellh I see the same thing. For reference for others, this is main:

So, this PR drops the score from 106 to 98.

And then if I run ucs-detect on jacobsandlund#1, built on top of this PR, I see:

Dropping the score again to 91.

I'll do some more investigation, but ucs-detect is based on python wcwidth, which is treating too many categories of characters as zero width: https://github.com/jquast/wcwidth/blob/915166f9453098a56e87a7fb69e697696cefe206/bin/update-tables.py#L149-L155

Mc (which isn't even present in that comment) is the spacing combining mark category, which I think should generally be treated as taking up space, so a cluster containing one should be wide. Here's an amp thread: https://ampcode.com/threads/T-93ae4196-0abf-43db-8f26-f1d0ea869dba
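As a small illustration of why the treatment of Mc matters, the sketch below uses Python's standard `unicodedata` module to look up general categories and applies a deliberately naive width policy. The two category sets and the `cluster_width` helper are hypothetical and simplified; they are not the actual tables or algorithms of wcwidth, ucs-detect, uucode, or Ghostty, and East Asian width is ignored.

```python
import unicodedata

# Hypothetical policies for this sketch only.
MC_IS_SPACING = {"Mn", "Me"}            # only non-spacing/enclosing marks are zero width
MC_IS_ZERO = {"Mn", "Me", "Mc", "Cf"}   # also treat Mc and Cf as zero width

def cluster_width(cluster: str, zero_width: set) -> int:
    """Naive width: one cell per non-zero-width code point, capped at 2."""
    total = sum(
        0 if unicodedata.category(ch) in zero_width else 1
        for ch in cluster
    )
    return min(total, 2)

ka_e = "\u1780\u17c1"  # KHMER LETTER KA + KHMER VOWEL SIGN E (Mc)
ka_i = "\u1780\u17b7"  # KHMER LETTER KA + KHMER VOWEL SIGN I (Mn)

print(unicodedata.category("\u17c1"), unicodedata.category("\u17b7"))
print(cluster_width(ka_e, MC_IS_SPACING), cluster_width(ka_e, MC_IS_ZERO))
print(cluster_width(ka_i, MC_IS_SPACING), cluster_width(ka_i, MC_IS_ZERO))
```

Under the first policy the Mc vowel sign widens the cluster to two cells; under the second it stays at one. The Mn vowel sign gives one cell either way.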

Here you can see the jacobsandlund#1 result of printing a couple of those graphemes, and selecting it correctly groups the cluster, and it displays wide as it should:

and here's the result on main:

Just out of curiosity, I changed the uucode wcwidth calculation to treat Mc and Cf as 0 to match ucs-detect, on top of jacobsandlund#1, and I get this score:

Interestingly, still down 1 point from main. But, I think we can't treat ucs-detect here as the source of truth. I could open a PR on ucs-detect/wcwidth for fixes that it needs, or even make a uucode one.

mitchellh · 2025-11-26T23:45:11Z

> Interestingly, still down 1 point from main. But, I think we can't treat ucs-detect here as the source of truth. I could open a PR on ucs-detect/wcwidth for fixes that it needs, or even make a uucode one.

I agree with this, we may want to also loop in @jquast too who can maybe provide some opinions. 😄

pluiedev · 2025-11-27T03:40:16Z

> Here you can see the jacobsandlund#1 result of printing a couple of those graphemes, and selecting it correctly groups the cluster, and it displays wide as it should:

This reminds me... this looks very similar to the overlapping characters seen in #5637 — maybe it wasn't a font shaping/layout problem but rather that we simply weren't assigning the correct cell widths for these characters?

jacobsandlund · 2025-12-03T05:00:14Z

I'm still investigating this, but I'll share my findings thus far. I have a branch on top of jacobsandlund#1 that compares the cumulative advance.width for a grapheme cluster using CoreText (I'm on Mac) and the expected Ghostty or uucode width, and logs when there's a mismatch: jacobsandlund/ghostty@grapheme-width-changes...debug-width

I run ucs-detect.sh and tee those logs to a file, then I repeat but with Mc and Cf treated as zero width. That results in 133,865 lines in the former case and 144,257 in the latter.

Then I do some sed and sort -u and get basically just the failing graphemes. There are some repeats, because I'm also including x: <advance.width>, and that apparently can change in different shaping runs sometimes (I need to understand shaping more). That gives me 4952 failing graphemes for the first case (with a little double counting), and 5799 failing graphemes for Mc and Cf as zero width.
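The sed and sort -u step could be approximated in Python roughly as follows. This is a sketch, not the commands actually used: it assumes each log line has the `\u{...} → <grapheme>    x: <advance>` shape seen in the excerpt in this comment, and it deduplicates on the code point sequence alone so that differing x values do not cause double counting. The names `LINE_RE` and `failing_graphemes` are made up for the example.

```python
import re

# Matches lines like: \u{1780}\u{17c1} → កេ    x: 33.02894961833954
LINE_RE = re.compile(r"^((?:\\u\{[0-9a-fA-F]+\})+) → .* x: ([0-9.]+)$")

def failing_graphemes(log_lines):
    """Return the set of distinct code point sequences that were logged."""
    seen = set()
    for line in log_lines:
        match = LINE_RE.match(line.strip())
        if match:
            seen.add(match.group(1))
    return seen

# Sample lines copied from the excerpt; the split into two runs is illustrative.
baseline_log = [
    r"\u{1780} → ក    x: 20.769738256931305",
    r"\u{1780}\u{17c4} → កោ    x: 43.5657924413681",
]
mc_cf_zero_log = baseline_log + [
    r"\u{1780} → ក    x: 31.30658107995987",
    r"\u{1780}\u{17c1} → កេ    x: 33.02894961833954",
]

baseline = failing_graphemes(baseline_log)
mc_cf_zero = failing_graphemes(mc_cf_zero_log)
newly_failing = mc_cf_zero - baseline

print(len(baseline), len(mc_cf_zero), sorted(newly_failing))
```

Keying on the code point sequence instead of the whole line is a design choice for this sketch; keeping the x value in the key, as in the counts above, preserves shaping differences but counts some graphemes twice.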

Here's the diff as a gist, with additions being graphemes that are now wrong with the Mc and Cf treated as zero width: https://gist.github.com/jacobsandlund/03a9f052198b5cd5b28a0e5e803ebf80

And some lines, inline:

-\u{11341} → 𑍁    x: 15.263850999996066
 \u{11341}\u{0}\u{11302} → 𑍁    x: 31.586641117930412
 \u{11341}\u{0}\u{11303} → 𑍁    x: 25.883497582748532
-\u{11343} → 𑍃    x: 11.315520860254765
-\u{1134d} → 𑍍    x: 16.005108382552862
-\u{1134d} → 𑍍    x: 6.4292732160538435
+\u{11342} → 𑍂    x: 20.558546589687467
 \u{1703} → ᜃ    x: 28.2716427154541
 \u{1704} → ᜄ    x: 24.0481351146698
 \u{1704}\u{1714} → ᜄ᜔    x: 24.0481351146698
@@ -1044,9 +1157,12 @@
 \u{1711} → ᜑ    x: 30.13918008995056
 \u{1711}\u{1712} → ᜑᜒ    x: 30.139180089950557
 \u{1780} → ក    x: 20.769738256931305
+\u{1780} → ក    x: 31.30658107995987
 \u{1780}\u{17b7} → កិ    x: 20.769738256931305
 \u{1780}\u{17bb} → កុ    x: 20.769738256931305
 \u{1780}\u{17bb}\u{17c6} → កុំ    x: 20.769738256931305
+\u{1780}\u{17be} → កើ    x: 33.02894961833954
+\u{1780}\u{17c1} → កេ    x: 33.02894961833954
 \u{1780}\u{17c4} → កោ    x: 43.5657924413681
 \u{1780}\u{17c6} → កំ    x: 20.769738256931305
 \u{1780}\u{17cb} → ក់    x: 20.769738256931305
@@ -1058,11 +1174,17 @@
 \u{1780}\u{17d2}\u{178a}\u{17c5} → ក្ដៅ    x: 43.5657924413681
 \u{1780}\u{17d2}\u{179a}\u{17c4} → ក្រោ    x: 53.90000367164612
 \u{1781} → ខ    x: 20.769738256931305
+\u{1781} → ខ    x: 31.30658107995987
+\u{1781}\u{17b7} → ខិ    x: 20.769738256931305
 \u{1781}\u{17bb} → ខុ    x: 20.769738256931305
 \u{1781}\u{17c6} → ខំ    x: 20.769738256931305
 \u{1781}\u{17d2} → ខ្    x: 20.769738256931305
+\u{1781}\u{17d2} → ខ្    x: 31.30658107995987
 \u{1782} → គ    x: 20.769738256931305
+\u{1782} → គ    x: 31.30658107995987
+\u{1782}\u{17b6}\u{17c6} → គាំ    x: 31.30658107995987
 \u{1782}\u{17b7} → គិ    x: 20.769738256931305
+\u{1782}\u{17c1} → គេ    x: 33.02894961833954
 \u{1782}\u{17c4} → គោ    x: 43.5657924413681
 \u{1782}\u{17c6} → គំ    x: 20.769738256931305

I'll keep looking a little closer, but I'll also reach out and comment on jquast/wcwidth#155

jacobsandlund · 2025-12-09T06:24:35Z

Here's the result of my investigation, as a comment on the jquast/wcwidth#155 issue: jquast/wcwidth#155 (comment)

mitchellh · 2026-01-20T17:44:55Z

CI isn't running for some reason, but you've always been diligent so I'm going to trust you on this. Thank you.

jacobsandlund added 5 commits November 23, 2025 22:33

unicode: switch to uucode grapheme break to match unicode 16 spec

97aff07

Merge branch 'uucode-update' into grapheme-break

c3d8951

Merge branch 'uucode-update' into grapheme-break

b9ab2b8

Merge branch 'uucode-update' into grapheme-break

6aaa37c

Merge branch 'uucode-update' into grapheme-break

eb9f384

jacobsandlund requested review from a team as code owners November 24, 2025 14:56

jacobsandlund mentioned this pull request Nov 24, 2025

unicode: Handle wide grapheme clusters that start with narrow code point jacobsandlund/ghostty#1

Closed

jacobsandlund changed the title ~~unicode: switch to uucode grapheme break to (mostly) match unicode 16 spec~~ unicode: switch to uucode grapheme break to (mostly) match unicode spec Nov 24, 2025

Merge remote-tracking branch 'upstream/main' into grapheme-break

2f57524

Merge remote-tracking branch 'upstream/main' into grapheme-break

e52f0d2

Add back accidentally removed line

bc42c82

Merge remote-tracking branch 'upstream/main' into grapheme-break

42bdd7f

Merge branch 'main' into grapheme-break

f68abde

Merge remote-tracking branch 'upstream/main' into grapheme-break

c49d50e

jacobsandlund mentioned this pull request Dec 9, 2025

Should Combining characters of Category 'Mc' be width 1? jquast/wcwidth#155

Closed

Merge branch 'main' into grapheme-break

7bddbfe

mitchellh merged commit 49b2b8d into ghostty-org:main Jan 20, 2026

mitchellh added this to the 1.3.0 milestone Jan 20, 2026

jacobsandlund deleted the grapheme-break branch January 26, 2026 15:17

jacobsandlund mentioned this pull request Jan 27, 2026

terminal: change cell width when wider grapheme detected #10465

Open

mitchellh merged 12 commits into ghostty-org:main from jacobsandlund:grapheme-break
