Unicode modifiers break width calculations #8276

Open
lilyball opened this issue Sep 7, 2021 · 24 comments

@lilyball
Contributor

lilyball commented Sep 7, 2021

fish 3.3.1
Also reproduced on latest master (3.3.1-288-g139b74d8e)
macOS 11.5.2 (20G95)

macOS Terminal.app treats most emoji as 2 columns wide, but emoji created using Variation Selector-16 (U+FE0F) are still treated as 1 column wide. Unfortunately, fish counts them as 2 columns (in particular, it counts Variation Selector-16 itself as 1 column, with a comment saying this is equivalent to treating emoji as 2).

There are actually two issues here:

The first is that the handling of Variation Selector-16 assumes an emoji width of 2, even when $fish_emoji_width is set to 1 or when the guessed width is 1.

The second is that macOS Terminal.app does not treat emoji created by Variation Selector-16 as a width of 2 even though they visually render in 2 columns. It appears that Terminal.app simply treats Variation Selector-16 as having a width of zero:

> echo \uFE0Fx\uFE0Fx
xx

I do not know how other terminals handle this problem.

This issue is affecting the default output of starship when the status.pipestatus config flag is set to true, as it uses ✔️ in the output. This is causing fish to miscalculate the column to start input at, causing it to appear as though there are spurious extra spaces in between the prompt and the input position.

@faho
Member

faho commented Sep 7, 2021

There is no way for us to detect this, sorry.

The terminal is wrong. Pick a different character.

The first is that the handling of Variation Selector-16 assumes an emoji width of 2

This is correct, emoji are wide. "emoji_width" is a bit of a misnomer - it specifies if we should use the unicode 8 or 9 widths.

@faho faho closed this as completed Sep 7, 2021
@lilyball
Contributor Author

lilyball commented Sep 7, 2021

@faho This is a very offputting response. I detailed exactly what’s going on here, and you’re just saying “sorry, the terminal emulator is wrong”.

Even ignoring the terminal emulator’s handling of this codepoint as 1-wide instead of 2-wide, Fish is still wrong in that it treats the presence of variation selector-16 as meaning an emoji that is 2 columns wide, even when fish is otherwise configured to assume emoji are 1 column wide. So at the very least, Fish should be doing something like “variation selector-16 counts as emoji_width - 1”.

But beyond that, it should not be unreasonable to say that Fish should have a heuristic for determining whether variation selector-16 expands codepoints to columns for a given terminal. Fish has Terminal-based heuristics for “how wide do we think emoji are”, so this is not breaking new ground here.

I do think Terminal.app’s behavior here is buggy, but it’s also consistent with “the terminal emulator doesn’t know the specific details about unicode rendering, it just knows that emoji codepoints are 2 columns and others are 1” and relies on the OS to do the actual details of rendering the text. This simplistic model of emoji rendering is something I would not be at all surprised to find other terminal emulators reproducing. Really, any emulator that relies on the OS to do the actual details of rendering the glyphs instead of implementing it directly is one that I would expect to implement a simplified model like this.

@lilyball lilyball reopened this Sep 7, 2021
@lilyball
Contributor Author

lilyball commented Sep 7, 2021

The fact is, a very popular cross-shell prompt solution (starship), in configuration that is very nearly default (just requires enabling the opt-in status module), will print ✔️ sometimes (even without status.pipestatus; the current version will print that for its regular status after a job like false | true), and fish’s handling of this is to push the input rightwards by one column, as though starship printed an extra space.

So the net effect of saying “this is the Terminal’s bug, we won’t do anything about it” is to make fish feel broken for users of starship who turn on the status printing.

@faho
Member

faho commented Sep 7, 2021

even when fish is otherwise configured to assume emoji are 1 column wide

See my edit: "emoji_width" is a bit of a misnomer - it specifies if we should use the unicode 8 or 9 widths. In other words it only affects those emoji that were specified as narrow in unicode < 9 and wide after.

It's a compatibility hack for old terminals, not a configuration knob.

So the net effect of saying “this is the Terminal’s bug, we won’t do anything about it” is to make fish feel broken for users of starship who turn on the status printing.

and a buggy terminal. We can't, in general, work around all terminal bugs. Keeping a database of all codepoints that terminals misrender (and correctly detecting all those terminals) is infeasible. Sorry.

The only feasible solution I see is for starship to pick a different character.

@faho
Member

faho commented Sep 7, 2021

Just to be clear:

But beyond that, it should not be unreasonable to say that Fish should have a heuristic for determining whether variation selector-16 expands codepoints to columns for a given terminal.

It is. Because this is a much larger thing. Now we would need a per-terminal database of mistreated codepoints (and a good detection of those terminals and their versions, and the version it's fixed in, which is often impossible!). That's not in the same ballpark as “how wide do we think all emoji, as a category, are”.

@lilyball
Contributor Author

lilyball commented Sep 7, 2021

See my edit: "emoji_width" is a bit of a misnomer - it specifies if we should use the unicode 8 or 9 widths. In other words it only affects those emoji that were specified as narrow in unicode < 9 and wide after.

That's still a fair number of emoji.

Keeping a database of all codepoints that terminals misrender (and correctly detecting all those terminals) is infeasible. Sorry.

This is a wild mischaracterization of what I'm asking for. I have not at all suggested keeping a database of characters.

The fact is, right now Fish has code that explicitly says "Treat variation selector-16 as a width of 1 so that way we end up with an emoji width of 2". It does this no matter what the preceding character is. For Terminal.app, this will be wrong 100% of the time. For Terminal.app, the right answer is always "treat variation selector-16 as a width of 0". Heck, the "treat it as 1" screams "compatibility hack" because variation selector-16 is typically a zero-width character. It modifies the previous character, and for 99.9% of preceding characters, the modification has no effect.

Fish's behavior is also wrong across the board for all terminals when used on characters that typically default to emoji presentation but have a text form, such as U+26A1 (⚡). These characters typically show up as emoji, but can show up as text in some contexts (for example, the GitHub comment compose textarea). Adding Variation Selector-16 will force it to emoji presentation, but should not affect the width. Fish correctly identifies U+26A1 as having width 2, and Terminal.app even assigns it width 2 despite defaulting to text presentation. Adding Variation Selector-16 does not affect the width used by the terminal, just the glyph, and yet Fish thinks string length -V \u26A1\uFE0F is 3. This is strictly a bug.
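
The arithmetic behind that 3 can be sketched as follows. This is only an illustration in Python, not fish's actual code: `base_width` is a hypothetical helper that approximates a per-codepoint width from the East Asian Width property, and `vs16_width` is a made-up parameter standing in for whatever width the shell assigns to U+FE0F.

```python
import unicodedata

VS16 = "\ufe0f"  # Variation Selector-16


def base_width(ch: str) -> int:
    """Rough per-codepoint width: 2 for East Asian Wide/Fullwidth, else 1."""
    return 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1


def string_width(s: str, vs16_width: int) -> int:
    """Sum per-codepoint widths, counting each VS16 as `vs16_width` columns."""
    return sum(vs16_width if ch == VS16 else base_width(ch) for ch in s)


sample = "\u26a1" + VS16  # HIGH VOLTAGE SIGN followed by VS16

print(string_width(sample, vs16_width=1))  # VS16 counted as one column
print(string_width(sample, vs16_width=0))  # VS16 counted as zero columns
```

With U+26A1 classified as Wide, the first call adds 2 + 1 and the second adds 2 + 0, which is the one-column disagreement described above.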

Oh, and here's a fun fact: while writing this up, I downloaded iTerm2.app to test, and it has the exact same behavior as Terminal.app with regard to Variation Selector-16 widths (it differs in defaulting to emoji presentation for U+26A1 but otherwise has no actual differences in widths for the characters I'm testing).

I don't know about other terminal emulators, but both major terminal emulators on macOS agree: Variation Selector-16 always has width 0. Fish is strictly in the wrong here.

@faho
Member

faho commented Sep 7, 2021

This is a wild mischaracterization of what I'm asking for. I have not at all suggested keeping a database of characters.

If you are asking for us to work around specific terminals misrendering specific characters, that's tantamount to asking for a per-terminal quirks database.

If we can find a way to make this independent of a terminal, sure, it's not. If we can find a way to treat entire classes of characters differently, that's also much simpler. But we'd have to figure out which classes those are.

Heck, the "treat it as 1" screams "compatibility hack" because variation selector-16 is typically a zero-width character.

To be clear: It is. Yes. We should keep the context.

Fish's behavior is also wrong across the board for all terminals when used on characters that typically default to emoji presentation but have a text form, such as U+26A1 (⚡).

Fixing that would require, again, switching to wcswidth - that's #8275 (this is the issue with filing multiple connected bug reports at the same time - I prefer keeping them in one place and then deciding where it should be split up).

Adding Variation Selector-16 will force it to emoji presentation, but should not affect the width.

Where do you get that "should" from? My experience (and e.g. #5583) says otherwise, but I'm happy to be corrected on that. If we can assign a width of 0 on that it would fix the issue for now. (but if we actually need the context for what codepoint the VS applies to, that's #8275 again)

Adding Variation Selector-16 does not affect the width used by the terminal, just the glyph, and yet Fish thinks string length -V \u26A1\uFE0F is 3.

It has a width of 3 here in Windows Terminal. Without the VS it has a width of 2. Which means it's terminal-specific again.

I don't know about other terminal emulators, but both major terminal emulators on macOS agree: Variation Selector-16 always has width 0. Fish is strictly in the wrong here.

Or both terminals are wrong in the same way. Which would also happen if the bug is in the underlying text rendering - which is even less fixable because we don't even have a version of that (also why we can't handle font differences - we have no information about the font).

@lilyball
Contributor Author

lilyball commented Sep 7, 2021

Fixing that would require, again, switching to wcswidth

No it won’t, as it seems that fish’s behavior is wrong in all contexts, not just this one. In fact, in the context of “following anything other than a codepoint with both text and emoji presentation” it’s obviously wrong, as variation selector-16 does nothing in other contexts (and naturally has a zero width).

Where do you get that from? My experience (and e.g. #5583) says otherwise, but I'm happy to be corrected on that.

Manual testing in Terminal.app and a bit in iTerm2.

I took a look at #5583 and it’s a little difficult to figure out what it’s trying to say. The original asciinema demonstrates some input issues, but I don’t know if that’s the same emoji width calculation issue or not as I’ve been focused on non-interactive testing (e.g. fish’s idea of how wide a string printed to the terminal is) to avoid any potential confounding issues in interactive input handling. Skimming the conversation I see discussion of emoji ZWJ sequences, which are a separate issue and probably not solvable without cooperation from the terminal emulator. And there was a mention of 🛠 and 🐛 having different widths, which very well could be this bug, but I’m on my phone right now and can’t look up the details on these characters.

If we can assign a width of 0 on that it would fix the issue for now.

My current belief, based on Terminal.app and iTerm2, is that this is the correct solution.

It has a width of 3 here in Windows Terminal. Without the VS it has a width of 2. Which means it's terminal-specific again.

Windows Terminal thinks VS16 takes up a cell all by itself? Sounds like a terminal bug. It strictly modifies the previous character, it does not have an intrinsic width.

In fact, now I’m curious if it’s literally classified as a combining character. Again, I’d look it up but I’m on my phone.
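
One way to check the classification is Python's `unicodedata` module, which exposes the Unicode Character Database bundled with the interpreter. A minimal sketch; the variable names are only for illustration:

```python
import unicodedata

VS16 = "\ufe0f"

# General category; "Mn" stands for "Mark, nonspacing".
category = unicodedata.category(VS16)
# Canonical combining class; 0 means the character is not reordered.
combining_class = unicodedata.combining(VS16)
# Official character name from the database.
name = unicodedata.name(VS16)

print(category, combining_class, name)
```

The variation selectors are assigned general category Mn, the same category as most combining marks, although their canonical combining class is 0.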

@faho
Member

faho commented Sep 7, 2021

No it won’t, as it seems that fish’s behavior is wrong in all contexts

Like I said: It's not. Possibly on macOS, but I've seen multiple terminals, in multiple contexts, handle it differently.

And we've introduced this behavior because it fixed it in some cases, so it can't be "wrong in all contexts". It did fix problems.


Okay, so this looks more and more like we'd introduce one quirk "variation selector adds nothing". That can be done and handled, unlike "terminal won't combine it with these specific codepoints".

@lilyball
Contributor Author

lilyball commented Sep 7, 2021

Okay, so this looks more and more like we'd introduce one quirk "variation selector adds nothing". That can be done and handled, unlike "terminal won't combine it with these specific codepoints".

Honestly, it sounds like this should actually be "variation selector has non-zero width", as that sounds like the buggy behavior.

Incidentally, the VSCode integrated terminal also counts it as zero width.

@lilyball
Contributor Author

lilyball commented Sep 7, 2021

Same with Alacritty. U+FE0F is zero width.

Kitty has different behavior. It treats U+FE0F as zero width in most contexts, but it modifies text presentation characters to render as emoji with width 2 (but when applied to Emoji_Presentation characters it has zero width as those already have width 2 to begin with).

Which is to say, Kitty's behavior cannot be handled with wcwidth(), it requires wcswidth(), and also I'm inclined to file an issue against them about how this behavior diverges from other tested emulators and is harder to predict by CLI tools.
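
To illustrate why a per-character function is not enough for this, here is a rough Python sketch contrasting the two models. It is not Kitty's or fish's code: the two character sets are tiny hand-picked subsets chosen for the example, and the function names are hypothetical.

```python
VS16 = "\ufe0f"

# Illustrative subset only: codepoints that default to emoji presentation.
EMOJI_PRESENTATION = {"\u26a1"}  # HIGH VOLTAGE SIGN
# Illustrative subset only: text-presentation codepoints with an emoji form.
TEXT_WITH_EMOJI_FORM = {"\u26a0", "\u2714"}  # WARNING SIGN, HEAVY CHECK MARK


def per_char_width(ch):
    """wcwidth()-style: each codepoint is measured in isolation."""
    if ch == VS16:
        return 0
    return 2 if ch in EMOJI_PRESENTATION else 1


def contextual_width(s):
    """wcswidth()-style: VS16 widens a preceding text-presentation character."""
    total = 0
    prev = None
    for ch in s:
        if ch == VS16:
            if prev in TEXT_WITH_EMOJI_FORM:
                total += 1  # base was counted as 1; the emoji form takes 2
            prev = None  # assumption of this sketch: a repeated VS16 adds nothing
            continue
        total += per_char_width(ch)
        prev = ch
    return total


warning = "\u26a0" + VS16
print(sum(per_char_width(ch) for ch in warning))  # per-character model
print(contextual_width(warning))                  # context-aware model
```

The widening depends on which character precedes U+FE0F, so no fixed width for U+FE0F alone can reproduce the second result.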

@faho
Member

faho commented Sep 7, 2021

Honestly, it sounds like this should actually be "variation selector has non-zero width", as that sounds like the buggy behavior.

Honestly, unless we can point to some standard, I don't think we can claim either way. Because characters in "emoji presentation" being of width 2 even if the text version has width 1 makes sense.

You seem to believe that Apple is correct by default, and I really really cannot agree with that.

So: Whatever sounds nicer as a variable name in the code. If we can avoid e.g. a double-negative by turning it around? Let's do that. if (!disabled) is bad.

@lilyball
Contributor Author

lilyball commented Sep 7, 2021

The characters with Emoji_Presentation have width 2 even when rendered as text, in all terminals I've tested except Kitty.

In fact, for Kitty, adding VS15 to an Emoji_Presentation character changes it to width 1, which is not something Fish can handle (not without wcswidth()). No other terminal I've tested has this behavior.

You seem to believe that Apple is correct by default, and I really really cannot agree with that.

No, this is not about Apple being correct by default, and nothing I've said should lead you to that conclusion.

It's about how VS16 has an intrinsic width of zero. In typical text rendering contexts, if it follows a character with both emoji and text presentation, it forces the emoji presentation. This usually changes the width of that character, but additional VS16 characters tacked on still have width zero. And in terminal emulators, where predictable width is important and is typically calculated on a character-by-character basis, the most obvious behavior is to have VS16 have zero width in all contexts and to not modify the width of the preceding character.

Terminal.app, iTerm2, VSCode's integrated terminal, and Alacritty all seem to agree here. VS16 has no width and does not affect the width of the preceding character. Characters with Emoji_Presentation have a width of 2 even if they're rendering in text presentation. So far Kitty is the only terminal I've tested that disagrees, and it still thinks VS16 has zero width in most contexts, and it also introduces behavior for VS15 that Fish cannot possibly support without wcswidth().

You suggested behavior of Windows Terminal where U+FE0F always has width 1. This seems really broken. I would like to confirm this behavior though. My usual tests here have been with U+26A1 (which has Emoji_Presentation plus a text form) and U+26A0 (which has text presentation plus an emoji form), so I'm echoing various combinations of \u26A1, \u26A0, \uFE0F, and \uFE0E, including repeating the VS16s, putting them after ASCII characters, etc., and seeing how it affects the positioning of successive characters.
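
A test along these lines could be scripted roughly as below. This is a hedged sketch in Python rather than the exact commands used: `probe_strings` is a hypothetical helper, and the trailing `|` is just a marker whose column shows how far the terminal advanced.

```python
HIGH_VOLTAGE = "\u26a1"  # Emoji_Presentation, has a text form
WARNING = "\u26a0"       # text presentation, has an emoji form
VS16 = "\ufe0f"          # requests emoji presentation
VS15 = "\ufe0e"          # requests text presentation


def probe_strings():
    """Yield (label, string) pairs; each string ends with a '|' marker."""
    bases = [("U+26A1", HIGH_VOLTAGE), ("U+26A0", WARNING), ("x", "x")]
    suffixes = [("", ""), ("+VS16", VS16), ("+VS16+VS16", VS16 * 2), ("+VS15", VS15)]
    for base_label, base in bases:
        for suffix_label, suffix in suffixes:
            yield base_label + suffix_label, base + suffix + "|"


for label, text in probe_strings():
    # Compare the column where '|' lands from one line to the next.
    print(f"{text}  <- {label}")
```

Running it in each terminal and comparing where the `|` markers line up gives a quick picture of which combinations change the advance width.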

So: Whatever sounds nicer as a variable name in the code. If we can avoid e.g. a double-negative by turning it around? Let's do that. if (!disabled) is bad.

I don't care what we call it in code, I just care what it looks like when exposed to the user.

@lilyball
Contributor Author

lilyball commented Sep 7, 2021

Honestly, unless we can point to some standard, I don't think we can claim either way. Because characters in "emoji presentation" being of width 2 even if the text version has width 1 makes sense.

It's a consequence of the wcwidth() model, which seems to be the default for terminals, where each character width is calculated independently. Characters with Emoji_Presentation default to emoji in most contexts, and therefore are classified as emoji, and therefore have width 2 regardless of how they're actually rendered. And characters without Emoji_Presentation are classified as text and therefore have width 1, even if it's possible to modify them into emoji presentation. This all fits with the wcwidth() model.

@faho
Member

faho commented Sep 7, 2021

I don't care what we call it in code, I just care what it looks like when exposed to the user.

This would only barely be exposed to the user. It's like $fish_emoji_width, a variable you never want to have to touch. (And all your messing around here with $fish_emoji_width was in vain! You don't want to touch it, it was already correct before. If we had named it something less appealing, things would have been better!)

Ideally, this would not even exist!

Calling it "$fish_variation_selector_hack" would work. Or "$fish_vs16_widens"?

@lilyball
Contributor Author

lilyball commented Sep 7, 2021

Calling it "$fish_variation_selector_hack" would work. Or "$fish_vs16_widens"?

Sure, either one works. The default behavior should be to treat VS16 as having zero width though, barring any heuristics for detecting terminal behavior (e.g. testing $TERM_PROGRAM).
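
As a rough sketch of that default-plus-override logic (in Python for brevity, not fish's implementation): `fish_vs16_widens` is the name floated above and does not exist as a real variable, and the quirk list is deliberately left empty because nothing in this thread establishes a `TERM_PROGRAM` value that should go in it.

```python
def vs16_width(env):
    """Pick the width to assume for U+FE0F from an environment-like mapping."""
    override = env.get("fish_vs16_widens")  # hypothetical variable from this thread
    if override is not None:
        return 1 if override == "1" else 0

    # Hypothetical quirk list: TERM_PROGRAM values assumed to widen on VS16.
    widening_term_programs = set()
    if env.get("TERM_PROGRAM") in widening_term_programs:
        return 1

    return 0  # default: VS16 occupies no columns


print(vs16_width({"TERM_PROGRAM": "Apple_Terminal"}))
print(vs16_width({"fish_vs16_widens": "1"}))
```

An explicit user setting wins over any detection, and detection only ever opts a terminal into the non-zero behavior.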

Incidentally, I just tested LXTerminal (the default terminal on my Raspberry Pi) and it agrees with Terminal.app et al. So far Kitty is the only terminal I've tested with different behavior, which I just filed kovidgoyal/kitty#3998 for.

@mqudsi
Contributor

mqudsi commented Oct 24, 2021

I'm with @kovidgoyal on this: attempting to assign a width to a specific character is a fool's errand. Unicode codepoints in-and-of themselves don't have a width, only strings composed of those codepoints can be assigned a width. Anything other than that (including what we do here in fish) is just a hack w/ the intention of getting as many common inputs right and you shouldn't navel gaze at it too long for fear of falling into the abyss. Bike-shedding over the name of a variable isn't going to change the fact that the approach itself is fundamentally wrong but until there's some way for the shell + the terminal + the OS or text renderer to agree on the width of a string (taking into account not only its components but also the font's support for the desired glyph) the situation isn't going to change.

@zanchey zanchey added this to the fish-future milestone Jan 8, 2022
@rashil2000

This comment was marked as off-topic.

@faho

This comment was marked as off-topic.

@rashil2000

This comment was marked as off-topic.

@lilyball

This comment was marked as off-topic.

@12Me21

12Me21 commented May 24, 2022

I feel like most of these problems could be solved with an option to use wcwidth() for all characters.
That's the closest thing we have to a standard for character widths, and it's used by many terminals.

@faho faho changed the title from "Characters using Variation Selector-16 are always treated as width of 2 even when they should be 1" to "Unicode modifiers break width calculations" Aug 24, 2023
@dimaqq

dimaqq commented Aug 24, 2023

There's a standard that lists all possible/recognised unicode code point combinations, though at times they get kinda long, like this one:

https://www.emojiall.com/en/code/1F9DC-1F3FF-200D-2640-FE0F

Which makes me wonder what range a scalar wchar_t would need to cover to account for all these possibilities... Or if a scalar is simply the wrong choice.

The official list is here, yet it lacks the black cat (cat+zwj+black square):
https://unicode.org/Public/emoji/14.0/emoji-sequences.txt

@12Me21

12Me21 commented Sep 14, 2023

The best list is https://unicode.org/Public/emoji/15.0/emoji-test.txt
Anything marked "fully-qualified" there should be rendered as an emoji, and "minimally-qualified" entries are implementation-defined.
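
For anyone wanting to consume that file, a minimal Python sketch of a parser might look like this. The `SAMPLE` lines are written in the file's `codepoints ; status # comment` layout for illustration and are not copied verbatim from it; `parse_emoji_test` is a hypothetical helper name.

```python
SAMPLE = """\
# comment-only lines and blank lines are skipped

26A1 ; fully-qualified # high voltage
26A0 FE0F ; fully-qualified # warning
26A0 ; unqualified # warning
"""


def parse_emoji_test(text):
    """Yield (sequence, status) for each data line of emoji-test.txt-style text."""
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not data:
            continue  # blank line or comment-only line
        codepoints, status = (field.strip() for field in data.split(";", 1))
        sequence = "".join(chr(int(cp, 16)) for cp in codepoints.split())
        yield sequence, status


fully_qualified = {seq for seq, status in parse_emoji_test(SAMPLE)
                   if status == "fully-qualified"}
print(len(fully_qualified))
```

A set of fully-qualified sequences like this could then be matched against a string as whole units, instead of assigning a width to each codepoint separately.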

7 participants