Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unicode support broken (was This ⸻ unicode ⸻ character ⸻ makes ⸻ weird ⸻ things ⸻ happen) #2652

Closed
iirelu opened this issue Jan 4, 2016 · 85 comments
Assignees
Labels
bug Something that's not working as intended
Milestone

Comments

@iirelu
Copy link

iirelu commented Jan 4, 2016

U+2E3B ⸻ THREE-EM DASH

image

Steps to reproduce:

  1. Copy that unicode character
  2. Type "test" (or anything for that matter)
  3. Paste repeatedly

What happens: Some sort of exponential growth? Pressing ^C doesn't get rid of it.

Expected results: Not that.

I've tested this in every terminal emulator I have, and it shows up in all of them (even xterm). No other shell has this problem.

@faho
Copy link
Member

faho commented Jan 4, 2016

This seems similar to #2199, no?

@zanchey
Copy link
Member

zanchey commented Jan 4, 2016

Yes, and I think also #750.

@terlar
Copy link
Contributor

terlar commented Jan 4, 2016

I don't experience the same thing, but instead the color of the dash gets all weird, guess it is due to the overlapping. I don't see any repetition of other characters.

I am using termite on arch linux.

@ridiculousfish ridiculousfish added this to the fish-future milestone Jan 6, 2016
@ridiculousfish ridiculousfish self-assigned this Jan 6, 2016
@ridiculousfish
Copy link
Member

This failure mode is when fish's idea of where the cursor is gets out of sync with where it actually is. This can come about if fish thinks a character is single width but it's actually double width, or vice versa.

My font doesn't seem to have this character :(

@xiaq
Copy link
Contributor

xiaq commented Jan 25, 2016

There are two things you can do to contain the effect of the disparity between wcwidth and the actual terminal: 1) turn off autowrap whenever the editor is in control (echo '\033[?7l' to turn off, '\033[?7h' to turn on) and insert returns explicitly; 2) always re-render whole lines, from left to right, when the buffer changes. This way the line number is always correct and you minimize your dependency on the column number.

I hope the advice is useful. :)

@pickfire
Copy link
Contributor

@xiaq Hi, I didn't know you are back.

@terlar, what font are you using? @ridiculousfish and I too actually doesn't have that character.

@ghost
Copy link

ghost commented Jan 26, 2016

@pickfire You can find it in Symbola and Duolos SIL.

@pickfire
Copy link
Contributor

@jakwings I didn't know that those two fonts isn't in my package manager. Trying to install it manually.

@terlar
Copy link
Contributor

terlar commented Jan 26, 2016

Yes, I think this glyph comes from the symbola package. Not sure which one it fallbacks to. But I know I have symbola to cover these kind of things.

@krader1961 krader1961 added the bug Something that's not working as intended label Mar 25, 2016
@krader1961
Copy link
Contributor

I can't reproduce the problem using iTerm2 on OS X without a font installed that supports that character. So I see the single width undefined char glyph. Is it possible a recent change (such as the one I made to support the C locale better) has "fixed" this? Can anyone reproduce this problem using a fish built from git head? If so can you provide better directions for reproducing the problem than was in the original comment?

@krader1961
Copy link
Contributor

After installing Symbola (from http://www.fonts2u.com/symbola.font) I can reproduce this. It's pretty clear that fish is using the wrong char width. From https://en.wikipedia.org/wiki/Dash:

Less common are the two-em dash (⸺) and three-em dash (⸻), both added to Unicode with version 6.1 as U+2E3A and U+2E3B.

So our fish_wcwidth() implementation is out of date.

P.S., It isn't just that character which renders weirdly. The Symbola font is ugly --- at least as handled by OS X Terminal.app. The inter-char spacing is just awful; even for plain ASCII chars.

@floam
Copy link
Member

floam commented Jun 19, 2016

It shouldn't be necessary to actually select Symbola as your font. Just install it and it will be used for odds and ends (like ⸻) with a regular terminal font selected.

For me this font+character does the same thing as the double-width emojis in the just-closed #750 -- just a bigger difference in width with this example.

@krader1961
Copy link
Contributor

FWIW, fish_wcwidth() returns 1 and the OS X El Capitan (10.11.6) wcwidth() returns -1.

For me this font+character does the same thing as the double-width emojis in the just-closed #750

That's weird because on my system that emoji renders as a single-width char in both Terminal.app (where I explicitly made Symbola the primary font) and iTerm2 using Consolas. However, in both of those multiple instances of that char overlap each other. So it's pretty clear it should be treated as double-width but fish and my terminals are treating it as single-width.

@floam
Copy link
Member

floam commented Jun 19, 2016

Right, and for me the same thing happens with this character - both fish and my terminal seem to render it as width-is-1 (leaving the glyph overhanging a few characters to the right). I'm not sure it's something fish could fix (at least here) - beyond the fact that it should probably just not allow it, like zsh or like it gets filtered out when using mosh to ssh into a remote fish instance.

@krader1961
Copy link
Contributor

I'm not sure it's something fish could fix...

Clearly it's impossible for fish to know whether the fonts and terminal will treat this as a character with a width of one or two. The question is whether the Unicode standard mandates either width or is silent. If the former we should ensure we conform to the standard. If the latter then my recommendation is to treat it as single width as we currently do. I'm leery of filtering such characters because that means we have to distinguish their presence in interactive commands versus scripts.

@krader1961
Copy link
Contributor

BTW, the latest relevant standard that defines how chars should be treated can be found here:

ftp://www.unicode.org/Public/UCD/latest/ucd/EastAsianWidth.txt

It says that the two chars in question have a width property of "Na" which means "narrow" (i.e., single cell when displayed). See the section "5.11.1 Enumerated and Binary Properties" in

http://www.unicode.org/reports/tr44/

for what the symbols "N", "Na", "A", etc. mean.

So simply updating our fish_wcwidth() implementation to the latest Unicode standard would not fix this issue. Nonetheless, I recommend modifying our implementation to dynamically construct its width table from that file at build time. We could then layer any customizations on top of it we felt were necessary.

I think there needs to be a discussion whether fish should continue to have its own implementation of wcwidth(). Unicode is now a mature standard that still has warts. Is it really the best use of our time to track updates to that standard rather than relying on the maintainers of the platforms we run on to do so?

@zanchey
Copy link
Member

zanchey commented Jul 25, 2016

From e7273e1:

// Big hack to use our versions of wcswidth where we know them to be broken, which is
// EVERYWHERE (https://github.com/fish-shell/fish-shell/issues/2199)
#ifndef HAVE_BROKEN_WCWIDTH
#define HAVE_BROKEN_WCWIDTH 1
#endif

The problem was (is?) that the maintainers of the platforms we run on weren't tracking updates to the standard either!

@floam
Copy link
Member

floam commented Jul 25, 2016

On Linux, it looks like glibc is pretty good? Or at least not bad (latest)?

@krader1961
Copy link
Contributor

The problem was (is?) that the maintainers of the platforms we run on weren't tracking updates to the standard either!

True, but the Gnu library and distro maintainers have more incentive to do so than we do. Simply because changes they make to the core locale subsystem affect a lot of programs. Any changes we make only affect fish and make fish behave differently than every other program on the system. If we are going to continue maintaining our own implementation then it should be semi-automated in the manner I outlined above.

@adriaanzon
Copy link
Contributor

This also happened to me when I accidentally pasted a tab character.

@pickfire
Copy link
Contributor

@krader1961 @floam I don't think this is glibc issue, I tested this on void linux (musl), same thing happens. The 3-em dash seems to merge with other character.

I have also tested this with dash and bash, seems all of them have this problem too.

@faho faho changed the title This ⸻ unicode ⸻ character ⸻ makes ⸻ weird ⸻ things ⸻ happen Unicode support broken (was This ⸻ unicode ⸻ character ⸻ makes ⸻ weird ⸻ things ⸻ happen) Mar 10, 2017
@faho faho mentioned this issue Mar 10, 2017
@sebastiencs
Copy link

This happened to me after an upgrade from Fedora 24 to 25.
I had this character in my prompt and didn't have any problem before the upgrade

@iameli
Copy link

iameli commented Jul 13, 2017

I don't know anything about fish or Unicode and I've maybe written ten lines of C++ in my life but I just compiled with HAVE_BROKEN_WCWIDTH 0 and my Ubuntu 17.04/Terminator/symbola has been handling emoji great.

🤷

@ridiculousfish
Copy link
Member

I'll see if I can reproduce.

@spacekookie
Copy link

Something that confuses me is that I do include a unicode character in my prompt, just not the proper emoji one. But I don't know what the difference is. is listed as U+2764, but on the same page it also lists ":heart:" as the same character, the one that is causing the weird issues in the gif :thinking:

@ridiculousfish
Copy link
Member

❤️ is U+2764 followed by the "emoji variation selector" U+FE0F. I think the variation selector is the one whose width is being misreported.

@ridiculousfish
Copy link
Member

ridiculousfish commented Jun 24, 2018

I haven't been able to figure out how to get color emoji working in terminal under Arch. How did you do this? Edit: never mind, I've got it working and can reproduce

@ridiculousfish
Copy link
Member

relevant kovidgoyal/kitty#73

@ridiculousfish
Copy link
Member

Ok, finally got to the bottom of this. The problem is that when U+2764 is followed by the emoji variation selector, its width increases from 1 to 2. This is not something that wcwidth can express, and not something that fish anticipated. This is basically unfixable and I expect other command line tools like vim will have issues with this character sequence.

The best thing I can think to do is be less aggressive in screen.cpp about assuming the old contents; that is make heavier use of \r to jump to the beginning of the line, so that fish becomes more resilient to this class of problem.

@spacekookie
Copy link

Isn't that why wcwidth of U+FE0F should be 0 or -1 then? Is this a bug in the glibc implementation of wcwidth or a problem with the standard itself?

Any workaround on fish's behalf to make these issues less problematic would however be very much appreciated. And thank you for getting to the bottom of this ❤️

@albinekb
Copy link

I'd like to add that this wasn't a problem in earlier versions of fish, maybe something else changed? 🤔

@jsatk
Copy link

jsatk commented Jul 3, 2018

I've been digging into this problem and reading the various issues and PRs about the issue with emoji widths in Fish shell. But I'm still at a loss for how to tell fish shell to use my system width. Any help?

@faho
Copy link
Member

faho commented Jul 4, 2018

But I'm still at a loss for how to tell fish shell to use my system width.

@jsatk: This isn't completely done or anything, there's still a bunch of work needed.

What you currently want is to either wait for 3.0, which will include some work, or to install fish from git master.

That leaves you with two choices:

  • Either you build it with "-DINCLUDED_WCWIDTH=OFF" (if you use cmake, "--without-included-wcwidth" if using autotools), which will use your system wcwidth

  • You leave it on (the default), which will already give you a better wcwidth than 2.7 ships with, and allow you to use $fish_emoji_width to select if some characters that were "narrow" in unicode 8 but "wide" in unicode 9 should be thought of as narrow or wide.

Personally, I favor using the system wcwidth unless you are running fish on a server and connecting via ssh, in which case the displaying system will have a newer wcwidth than the one running fish.

However, there are issues with ambiguous width characters, and with the aforementioned FE0F (which is quite a hard problem!) which we haven't fully solved.

@jsatk
Copy link

jsatk commented Jul 6, 2018

Thanks @faho. I think I'll just live with the very minor annoyance of emoji widths being weird and with for Fish 3.0. I am on macOS and would prefer to just keep my fish up-to-date with homebrew. Thanks for shepherding this work.

@ridiculousfish
Copy link
Member

@spacekookie emoji widths seem to be working pretty well on fish master (which includes a new wcwidth per #5081) with kitty terminal. You'll want to set emoji width to 2:

`set -g fish_emoji_width 2`

Please give it a try and report back! Really appreciate your patience with this.

@albinekb
Copy link

albinekb commented Aug 14, 2018

I got emojis to work with fish, version 2.7.1-1337-g09541e95 by unchecking Use Unicode version 9 widths in iTerm2.app settings:
image

@ridiculousfish
Copy link
Member

I'm confident this has been addressed via the fish_emoji_width variable. Closing.

@albinekb
Copy link

@ridiculousfish so always when setting up fish we need to remember to set the variable to get emojis to work? Doesn't seem very "user friendly"..

@faho
Copy link
Member

faho commented Aug 20, 2018

so always when setting up fish we need to remember to set the variable to get emojis to work?

Maybe. That's the point of the variable existing - some terminals/systems need it set to 2 and some to 1. In particular, you'll have to set it to 2 if your terminal is using the Unicode 9 widths, and to 1 otherwise - which means even with e.g. iTerm2, it depends on that option you talked about! Which means we can't do terminal detection via $TERM and friends like we do for e.g. cursor shape or truecolor support.

Doesn't seem very "user friendly"..

Unfortunately, there's nothing fantastic we can do, simply because there is no way for us to communicate with the terminal to figure out what width it uses. The variable is very much an admission of that fact.

It might be better to swap the default around - currently if you don't set that variable, we're using the pre-9 widths. Maybe it might be better to use the post-9 ones, or maybe we should wait another year or so until operating systems have caught up?

@ridiculousfish
Copy link
Member

Yeah these variables are super-lame, but there's no other option. The best we can hope for is excellent defaults (can still improve here) and fallback configurability.

I would guess the variable has the correct default sense today but could easily be wrong. Importantly though it only affects emoji that are in Unicode 8 or earlier (surprisingly few!). Emoji introduced in Unicode 9+ are always assumed to render per Unicode 9 widths.

@terlar
Copy link
Contributor

terlar commented Aug 21, 2018

The best idea I can come up with would be to have an interactive function where the user is presented with a unicode character displayed with the two different settings and can select which one looks best, that would then automatically set and persisted as the fish_emoji_width. Would that be possible?

@zanchey
Copy link
Member

zanchey commented Aug 21, 2018

That sounds like a good enhancement - would you or anyone else be interested in producing something?

@faho
Copy link
Member

faho commented Aug 21, 2018

I would guess the variable has the correct default sense today but could easily be wrong.

So, glibc got updated in 2.26, which was released on 2017-08-02, so about a year ago. For Ubuntu, that means everything after 17.10 does Unicode 9.

Notably, that leaves Ubuntu 16.04 (which is still supported and the current basis for WSL) and current Debian Stable out. I suspect other distributions to work the same, and I don't know how this works on macOS.

So that's basically a worst-case answer. Large parts of our user base are using older distributions, especially on servers, so I can't be sure that switching defaults will make it work for more users.

Alternatively we could do version-detection of the C library, but that might be too confusing given that there's often two knobs to deal with (the terminal's and ours) and keep in sync.


The best idea I can come up with would be to have an interactive function where the user is presented with a unicode character displayed with the two different settings and can select which one looks best

When is this shown? Does the user explicitly execute it? Is it a first-launch "wizard"?

Still, this might be the best idea we've seen so far. Just a simple

❤
aa

Is the heart as wide as 1 or 2 "a"?

(with the heart symbol being the last \U escape I used - no idea if that's affected by this)

@z3ntu
Copy link
Contributor

z3ntu commented Aug 21, 2018

When is this shown? Does the user explicitly execute it? Is it a first-launch "wizard"?

Maybe show it the first couple of times under the Welcome to fish, the friendly interactive shell message and after it was executed / displayed 10 times(?), hide it?

@ridiculousfish
Copy link
Member

To be sure, emoji width is not up to the C library, but instead the terminal. (I wouldn't be surprised if it were font dependent too.)

I really like the idea of a "compatibility quiz" function! We could use it to detect support for 256 colors, 24-bit color, settable titles, and emoji widths.

@YaLTeR
Copy link

YaLTeR commented Aug 26, 2018

Here's another idea for a quiz function:

👉👉👉👉                            1   2
Please select the following answer: ^

With width = 1 the ^ points at "2" and with width = 2 it points at "1".
image
image

@floam
Copy link
Member

floam commented Dec 18, 2018

Instead of quizzing the user, I think we could quiz the terminal by using an escape to get the cursor position before and after printing a spooky character.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Apr 17, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug Something that's not working as intended
Projects
None yet
Development

No branches or pull requests