Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. Weโ€™ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fish mis-identifies rightmost point of prompt with utf8 characters #2199

Closed
nyarly opened this issue Jul 13, 2015 · 43 comments
Closed

Fish mis-identifies rightmost point of prompt with utf8 characters #2199

nyarly opened this issue Jul 13, 2015 · 43 comments
Labels
bug Something that's not working as intended
Milestone

Comments

@nyarly
Copy link
Contributor

nyarly commented Jul 13, 2015

If I start entering a command:

  0๐Ÿ˜‰ judson@dijkstra /home/judson โฎ€ ls <tab>

and start moving through the pager:

  0๐Ÿ˜‰ judson@dijkstra /home/judson โฎ€ l1.png 
1.png                                      francis_portrait.gif                  qrcode-jlinfo.png 

where what I'd expect would be:

  0๐Ÿ˜‰ judson@dijkstra /home/judson โฎ€ ls 1.png 
1.png                                      francis_portrait.gif                  qrcode-jlinfo.png 

Hitting ^c removes the last few characters of the prompt.

I've experimented with my fish_prompt - if I replace the ๐Ÿ˜‰ and โฎ€ with simple ASCII characters, all behaves as expected.

I'd tried instead removing all terminal coloring from the prompt - helps not at all, and in the absence of UTF8 characters, works fine (I'm using set_color everywhere.)

Using one of the git_status prompts is worse, since they use checkmarks etc.

Relevant environment:

I'm on Gentoo Linux, using urxvt.

LANG=en_US.utf-8
LC_CTYPE=en_US.UTF-8
TERM=rxvt-unicode-256color

I've done a quick tryout of a couple of other terminals and get similarly broken results. Weirdest is st, which doesn't display the characters but gets really confused about the correct insertion point.

@ridiculousfish
Copy link
Member

Is the emoji double width? We may need to teach wcwidth about emoji.

@ridiculousfish ridiculousfish added this to the fish-future milestone Jul 13, 2015
@zanchey
Copy link
Member

zanchey commented Jul 14, 2015

I don't have those characters in any of my fonts! Certainly some UTF-8 characters are okay - โœ” and โœš work fine in my git prompt.

@ridiculousfish
Copy link
Member

The happy face emoji is definitely double width. I'm not sure what the other one is. So it looks like wcwidth() needs to be updated.

@pickfire
Copy link
Contributor

What font are you using? It have rare characters that I seldom found in terminal.

@nyarly
Copy link
Contributor Author

nyarly commented Jul 14, 2015

I'm using a monospace font and both the happy face and the triangle are single width in the font. (I edited a local copy of monofur for my own use for the happy face - the triangle is coming from Deja Vu Mono) I'll admit that I'm somewhat beyond my depth re the width of characters etc. That said, the prompt is all me, so I should be able to dig out e.g. codepoints if that'll help.

In the meantime, is there some way I can determine what characters would be safe to use? I suspect the branch symbol I'm using in my git status is "wide" - it makes the offset worse.

@ridiculousfish
Copy link
Member

There's this comment:

/* The following two functions define the column width of an ISO 10646
 * character as follows:
 *
 *    - The null character (U+0000) has a column width of 0.
 *
 *    - Other C0/C1 control characters and DEL will lead to a return
 *      value of -1.
 *
 *    - Non-spacing and enclosing combining characters (general
 *      category code Mn or Me in the Unicode database) have a
 *      column width of 0.
 *
 *    - SOFT HYPHEN (U+00AD) has a column width of 1.
 *
 *    - Other format characters (general category code Cf in the Unicode
 *      database) and ZERO WIDTH SPACE (U+200B) have a column width of 0.
 *
 *    - Hangul Jamo medial vowels and final consonants (U+1160-U+11FF)
 *      have a column width of 0.
 *
 *    - Spacing characters in the East Asian Wide (W) or East Asian
 *      Full-width (F) category as defined in Unicode Technical
 *      Report #11 have a column width of 2.
 *
 *    - All remaining characters (including all printable
 *      ISO 8859-1 and WGL4 characters, Unicode control characters,
 *      etc.) have a column width of 1.
 *
 * This implementation assumes that wchar_t characters are encoded
 * in ISO 10646.
 */

That probably answers your question in far more detail than you want :)

@pickfire
Copy link
Contributor

In the meantime, is there some way I can determine what characters would be safe to use? I suspect the branch symbol I'm using in my git status is "wide" - it makes the offset worse.

Do you mean ๎‚ ? In My Humble Opinion, it is safe as long as you have the patched font but I think you should look at @ridiculousfish's comment.

I personally uses CJK double-width font and I did not find any issue with them when using fish on st with the recent patch.

@nyarly
Copy link
Contributor Author

nyarly commented Jul 14, 2015

@pickfire - do you mean a recent patch to st? I tried st-0.5 direct from suckless out for a second but specifically couldn't get these characters to render and dropped it. For what it's worth, the positioning is worse in st than urxvt. At least urxvt is consistent; in st, cycles of deleting and re-completing put the cursor in different places. ๐Ÿ˜ฑ

Comparing directory-specific behavior of my prompt provides two more data points:

  0๐Ÿ˜‰ judson@dijkstra /home/judson โฎ€ l|s
  0๐Ÿ˜‰ judson@dijkstra โ€ฆ4/MindSwarms/web โญ  video_processing_int โฎ€| ls

It's not conclusive, but the number of > 7bit characters in the prompt == the offset from the correct position of the cursor (synthesized here by |)

Per charmap and http://www.unicode.org/versions/Unicode7.0.0/ch02.pdf
๐Ÿ˜‰ is in Symbol, so I think general category "S"
โ€ฆ is in "Punctuation", so general category "P"
โญ  is "not assigned"
โฎ€ is "not assigned"

So: none are in C0/C1, Mn, Me, Cf, medial vowels or CJK. The latter two, charmap reports as "non printable" but according to the above comment I think they should all be width 1.

Here's a thing: I'm on Gentoo Linux, which for better or worse means that fish and the libs it depends on can have different e.g. configure settings. (For instance, I just tried recompiling ncurses with the tinfo use variable set, which also didn't help...) Are there versions and configurations of fish's deps that might impact how it calculates where to move the cursor and draw?

@nyarly
Copy link
Contributor Author

nyarly commented Jul 15, 2015

Okay, looking deeper: my build sets HAS_WCWIDTH, so it seems like I'm using glibc's wcwidth() function. Which leads to the next question: I have glibc-2.19-r1 installed. I'd think that would be current enough, but who can say for sure. Maybe I'll have a few minutes later tonight to build a wcwidth tester and see how it goes.

@nyarly
Copy link
Contributor Author

nyarly commented Jul 15, 2015

I'm plainly out of my depth. After 30 minutes of hacking I've got:

#define _XOPEN_SOURCE
  #include <stdio.h>
  #include <wchar.h>
  #include <string.h>
  #include <errno.h>

  int main(int argc, char **argv){
    int i, j, n;
    mbstate_t *ps;
    wchar_t wchar;

    for(j = 0; j < argc; j++){
      memset(&ps, 0, sizeof(ps));
      printf("%s len %d\n", argv[j], strlen(argv[j]));                                                                                                  
      n = 1;
      for(i = 0; i < 200 && n > 0;){
        n = mbrtowc(&wchar, argv[j] + i, 8, ps);
        i = i + n;
        printf("%d - %d/%d\n", wchar, n, wcwidth(wchar));
        if(n == -1 && errno == EILSEQ){
          printf("Invalid multibyte sequence\n");
        }
        if(n < 0){
          printf("Errno: %d\n", errno);
        }
      }
    }
  }

Which gets me:

> ./a.out โ€ฆ2/ xyz
./a.out len 7
46 - 1/1
47 - 1/1
97 - 1/1
46 - 1/1
111 - 1/1
117 - 1/1
116 - 1/1
0 - 0/0
โ€ฆ2/ len 5
0 - -1/0
Invalid multibyte sequence
Errno: 84
xyz len 3
120 - 1/1
121 - 1/1
122 - 1/1
0 - 0/0

I've tried switching my LC_CTYPE around to no avail. The whole point here was to try to see what widths the various characters are reported with, but I seem to have some fundamental misunderstanding of processing wide characters in C.

@nyarly
Copy link
Contributor Author

nyarly commented Jul 15, 2015

One thing that came of this experiment: fish gets even more confused when commands have multibyte characters in them.

@ridiculousfish
Copy link
Member

I tried looping through every value and could not find any that returned a wcwidth above 1, in the en_US.UTF-8 locale, even though there are plenty of double-width forms. I thought I understood this stuff but maybe not? Anyways it's tempting to either "augment" or drop entirely the system wcwidth on Linux in favor of our own. We already do on OS X.

@nyarly
Copy link
Contributor Author

nyarly commented Jul 16, 2015

The upshot there would be forcing HAVE_WCWIDTH 0 (or undefined) or BROKEN_WCWIDTH, right? That might be doable.

@zanchey
Copy link
Member

zanchey commented Jul 23, 2015

How does this look since #2217 was merged?

@ghost
Copy link

ghost commented Oct 10, 2015

Please also check this: https://github.com/JuliaLang/utf8proc

@faho
Copy link
Member

faho commented May 26, 2016

@nyarly: Any improvement here with 2.3.0?

@joehillen
Copy link

I wasn't having this issue until 2.3.0. Now I am.

@floam
Copy link
Member

floam commented Jun 14, 2016

Does #3143 fix it?

@joehillen
Copy link

@floam Yes! ๐Ÿ’ฏ

Strange that those changes are related. I just happened to have โžค as my prompt char.

@floam
Copy link
Member

floam commented Jun 14, 2016

Cool!

I am closing this as a duplicate of #3124 - we'll follow it there.

@krader1961 krader1961 added bug Something that's not working as intended and removed needs more info labels Jun 15, 2016
krader1961 added a commit that referenced this issue Jun 15, 2016
Minor cleanup related to issue #2199.
@krader1961
Copy link
Contributor

Remaining places that call wcwidth() rather than fish_wcwidth() have been fixed by commit d70be18.

@floam
Copy link
Member

floam commented Jun 15, 2016

Handling of e.g. some color emojis does seem wrong. Do we simply need to update our rules for newer unicode specs?

@zanchey zanchey modified the milestones: next-2.x, 2.3.0 Jun 15, 2016
@krader1961
Copy link
Contributor

Do we simply need to update our rules for newer unicode specs?

Maybe. I checked the source of our wcwidth implementation and it hasn't been modified since we copied it. Googling "markus kuhn wcwidth" didn't turn up any documents saying it was out of date. I don't think we need to spend more effort on this until someone reports a problem that isn't due to my screwup in the 2.3.0 release setting the locale when a local locale var goes out of scope.

@floam
Copy link
Member

floam commented Jun 15, 2016

Googling "markus kuhn wcwidth" didn't turn up any documents saying it was out of date.

Google "Unicode 7" or crib off https://github.com/jquast/wcwidth/ - we're several revisions out of date I think?

I think it'd serve us well to follow what they figured out in the above github project.

@floam
Copy link
Member

floam commented Jun 15, 2016

CharWidths.txt

@krader1961
Copy link
Contributor

I looked at the changes in https://github.com/jquast/wcwidth/ and they appear to be refinements to how combining characters are handled. An edge case for us that probably only affects Cygwin users using characters outside the BMP. I'm also quite leery of using random sources of "the truth" such as the CharWidths.txt link you added.

If you're passionate about this please open a new issue and take ownership of it. Personally I think we should stop using our own implementation of wcwidth. If someone is using a distro that has a broken implementation they should be demanding the distro maintainer to fix their implementation. Unicode is no longer bleeding edge technology. Having our own implementation might have made sense in 2007. It doesn't make any sense now.

@ghost
Copy link

ghost commented Jun 15, 2016

This issue is not fixed at all. I cannot understand why this issue is eagerly closed for another issue just because nobody keeps complaining and for someone's indifference. It is just lucky that those emojis do not cause layout problems on some terminals that some people are using. An update to the hard-coded width table may cause problems again, for the terminal may use another table.

If someone is using a distro that has a broken implementation they should be demanding the distro maintainer to fix their implementation.

I don't see any hope for such an infamous bug about wcwidth.

Unicode is no longer bleeding edge technology.

But it is a monster, hence the ICU project. Do you know I mentioned another similar project utf8proc before?

The solution mentioned by @floam in #2484 is quite flexible, although fish needs to update the width table once the users change the font settings of their terminals. I think that gives a good direction.

(EDIT: fix typos)

@krader1961
Copy link
Contributor

@jakwings, I don't know what you expect us to do. We should be able to rely on the unicode implementation provided by a given distro at this point in time. As you said in your last update to issue #2484, which is still open,

it is impossible to solve the spacing problem perfectly.

I'm happy to review any changes you, or anyone else, submits to improve how fish handles the problematic characters. Since issue #2484 is still open I don't understand what you're complaining about.

@krader1961
Copy link
Contributor

Also, this issue was not "eagerly closed". It's been open for 11 months. And the original problem was fixed seven days after the issue was opened. That the fix may not solve all unicode character width problems isn't relevant to this issue.

@ghost
Copy link

ghost commented Jun 15, 2016

Sorry, in #2484, I was just replying to floam for that multibyte issue. Now I've appended the name floam to it.

We should be able to rely on the unicode implementation provided by a given distro at this point in time.

I don't think so, even @ridiculousfish (he added the code for mk_wcwidth).

And the original problem was fixed seven days after the issue was opened. That the fix may not solve all unicode character width problems isn't relevant to this issue.

Really? I am suspicious.

@ghost
Copy link

ghost commented Jun 15, 2016

Ok I remember the fix submitted by the reporter. I've no complaint for closing this issue now. But wcwidth is definitely a nasty thing.

@krader1961
Copy link
Contributor

The issue of character cell width reported by wcwidth() is definitely "a nasty thing". It is not the responsibility of fish to get every single case correct if the libraries we depend on get it wrong. The only reason for fish to have its own private implementation is in the very early days when we could not reasonably expect the platforms we run on to have a correct implementation.

Given that various terminals, including pseudo terminals like tmux, have their own private wcwidth implementations it is impossible for fish to do the right thing in every single circumstance.

I'm more than happy to review and merge any changes to make fish friendlier to people using characters not in the ASCII or European char sets. If you're someone who cares about this issue and has expertise on the topic I encourage you to create pull-requests for review or open issues with enough information to allow someone else to make the necessary changes.

@ghost
Copy link

ghost commented Jun 15, 2016

Not decided yet, the solution from floam may not be universal. Need investigation.

@llllvvuu
Copy link

Forgive me for my ignorance, but might there a way to have the right prompt positioned correctly without getting wcwidth right? zsh somehow does this (I've managed to trip up their wcwidth but the rprompt is still positioned correctly). Is there an obvious reason that whatever they're doing is infeasible here, or would I have a fair shot of figuring it out?

@faho
Copy link
Member

faho commented Sep 20, 2017

zsh somehow does this (I've managed to trip up their wcwidth but the rprompt is still positioned correctly)

Did you trip it up in the right prompt? Was the total width wrong (i.e. you didn't have two wrong characters, one over- and one undercounted)?

Here's what fish does to write the right prompt:

s_move(scr, &output, (int)(screen_width - right_prompt_width), (int)i);
s_set_color(scr, &output, 0xffffffff);
s_write_str(&output, right_prompt);

i.e. it moves the cursor to the starting position of the right prompt (which is screen_width - right_prompt_width), sets the color (which by definition has a width of 0 since no character shows up on screen) and then writes the prompt.

This depends on three things:

  • The color sequence needs to be one that the terminal either recognizes or throws away

  • The screen-width needs to be correct.

  • The right_prompt_width needs to be correct.

Now, there might be other ways of doing it (not that I can come up with one), but that's essentially what zsh does as well (at least that's what an asciinema recording tells me).

However, note that this bug is closed. I'd direct you to #2652 instead. Or open a new one and explain your actual problem there in detail.

@tsujigiri
Copy link

FWIW, I had an issue with color emoji in the left prompt, which messed things up, but turned out to be an issue with tmux. Making tmux use libutf8proc version >2 solved that one for me. The right prompt is still not aligned with the right border, though, so I searched here for leads. Maybe this helps.

@JoshCheek
Copy link

JoshCheek commented Jun 20, 2018

Heeeeeey, everybody!!

So yeah, I hit this issue, in iTerm2 on a Mac. Eventually figured out that iTerm has an option under "Text" to "Use Unicode version 9 widths", which is now checked by default. After unchecking, the problem was fixed. I imagine this sort of thing will keep coming up until everyone standardizes on the same widths, but at least I understand why it was wonky now and will have a sense of where to go looking next time it hits me.

Oh, and for more context, read iTerm's tooltip, which I captured in the screenshot.

screen shot 2018-06-19 at 9 49 25 pm

@ridiculousfish
Copy link
Member

ridiculousfish commented Jun 23, 2018

We will never standardize on the same widths ๐Ÿ˜ข

fish 3.0 has the fish_emoji_width variable which is effectively that same setting (Unicode 9 widths). So you can either tweak the iTerm2 setting or tweak fish, depending on whether you prefer 1 or 2 cell emoji. This should probably be a FAQ.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Apr 17, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug Something that's not working as intended
Projects
None yet
Development

Successfully merging a pull request may close this issue.