Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

2-width characters (CJK languages) break command suggestion and completion #5729

Closed
goranmoomin opened this issue Mar 9, 2019 · 16 comments
Closed
Labels
bug Something that's not working as intended
Milestone

Comments

@goranmoomin
Copy link

$ fish --version
fish, version 3.0.2

$ echo $version
3.0.2

Fish shell doesn't calculate CJK characters as two characters, which makes CJK characters mangle the shell output(for example, autosuggestions make characters erase the prompt, completions are not correctly displayed... etc).

As (almost) all CJK mono fonts display them as two characters, fish shell should calculate the locations with the assumption that CJK characters have a width of 2.

Below is an asciinema record of what's happening. (It should be viewed in macOS, as Windows or Linux cannot display HFS style NFD filenames properly.)

asciicast

@faho
Copy link
Member

faho commented Mar 11, 2019

Fish shell doesn't calculate CJK characters as two characters

I'm quite sure fish does!

We use https://github.com/ridiculousfish/widecharwidth to compute the widths, which gets them by parsing the relevant standard document (EastAsianWidth.txt).

The question is if your terminal does (as always, the major issue about character width in a terminal is that both the terminal and the client application have to agree), or if those characters are more complicated - a bunch of combiners?

Would it be possible for you to copy that output, paste it into a file and upload it here?

@goranmoomin
Copy link
Author

Well, as I could replicate the same problem in two Terminal emulators, Terminal.app and Alacritty, I had no idea that fish was already doing that!

For the files, here's them: it's from Terminal.app. It's captured output is going to a directory full of Korean named directories, and pressed tab to trigger completions.

fish-error.txt

@goranmoomin
Copy link
Author

Installed iTerm 2, and could also replicate the error.

fish-error-from-iterm2.txt

@faho
Copy link
Member

faho commented Mar 12, 2019

Okay, so one of the offending filenames is 국어. Which in my terminal displays as

`ᄀ ᅮ ᆨ ᄋ ᅥ' (without the spaces, I've added those so the browser doesn't compose them).

The components have, in our wcwidth, a width of 2, 1, 1, 2, 1, respectively. That's a total of 6, which is also how it displays in my terminal - uncomposed.

The system wcwidth, on the other hand, assigns ᄀ and ᄋ a width of 2 and the others 0, which works out to 4, the correct width for the combined version.

@ridiculousfish: Any ideas for how this is supposed to work?

@goranmoomin
Copy link
Author

@faho No, that’s not how it’s supposed to work. The offending file name is shown 국어, not ㄱㅜㄱㅇㅓ, but it looks like it’s being decomposed in your terminal because of Unicode normalization problems. (It always happens :-()

Sent with GitHawk

@goranmoomin
Copy link
Author

@faho It is correctly displayed (as 4 width characters), I will post a screencast ASAP.

Sent with GitHawk

@faho
Copy link
Member

faho commented Mar 12, 2019

The offending file name is shown 국어, not ㄱㅜㄱㅇㅓ, but it looks like it’s being decomposed in your terminal because of Unicode normalization problems.

I am aware. The problem is that we basically can't deal with these differences in presentation. There is no communication between the terminal and the application, so both need to come up with the width independently, and the typical thing is to come up with the width for each codepoint independently. Which in this case works if the codepoint is only ever used in the same way - either only in the composed form, or only in the uncomposed one.

So we need to pick a "side" here.

It is correctly displayed (as 4 width characters), I will post a screencast ASAP.

Not necessary.

@faho
Copy link
Member

faho commented Mar 12, 2019

Okay, so let's consult the definitive standards document, which is... a .txt file: https://www.unicode.org/Public/UNIDATA/EastAsianWidth.txt.

It says:

1160..11FF;N # Lo [160] HANGUL JUNGSEONG FILLER..HANGUL JONGSEONG SSANGNIEUN

That means the codepoints from U+1160 to U+11FF, the Hangul Jamo medial vowels and final consonants, have a "N"arrow width (i.e. 1), and are classified as "Letter, other" (Lo).

For reference, "ᅮ" is U+116E HANGUL JUNGSEONG U, so it's in that block.

So this seems like a broken definition.

Unless there's some bit I'm missing that supercedes that (and @ridiculousfish also missed it when writing widecharwidth) or I'm just misreading it (which is quite likely), it seems like we might have to break the standard, though that breaks it in terminals where it currently works. However, considering that the current way is wrong, and that korean speakers, presumably, would pick a terminal that is right, that seems worth it.

@goranmoomin
Copy link
Author

@faho No, you don’t have to deal that problem because Korean is always composed. It’s not something like we have a composed form and decomposed form, it’s always composed(and could be handled with the wcwidth function, since it should already handle as 0 width). Actually, I’m pretty sure most of fish is already handling this well. It’s the auto suggestion and completion alignment part which is buggy.

Sent with GitHawk

@faho
Copy link
Member

faho commented Mar 12, 2019

Which means, the definition isn’t broken(and I’m using a terminal that displays it properly too)

No, the standard is indeed broken. Or at the very least for the usecases in a terminal it's basically unusable.

There is no way for us to tell if the terminal will display it as composed or not, and the codepoint by itself doesn't make much sense (as far as I can tell, anyway).

Okay, I'm going to paper over this now, by treating U+1160 to U+11FF as combiners. That means we refer to your system's width table, which will most likely give their width as 0, so the total will add up to the correct width. If it doesn't give a width of 0, it most likely doesn't combine them either.

If they present in a different width standing by themselves, that's not something we can currently express. I refer to #5583 on that.

faho added a commit to faho/fish-shell that referenced this issue Mar 12, 2019
Hangul uses three codepoints to combine to one glyph. The first has a
width of 2 (like the final glyph), but the second and third were
assigned a width of 1, which seems to match EastAsianWidth.txt:

> 1160..11FF;N # Lo [160] HANGUL JUNGSEONG FILLER..HANGUL JONGSEONG SSANGNIEUN

Instead, we override that and treat the middle and end codepoint as combiners,
always, because there's no way to figure out what the terminal will
think and that's the way it's supposed to work.

If they stand by themselves or in another combination, they'll indeed
show up with a width of 1 so we'll get it wrong, but that's less
likely and not expressible with wcwidth().

Fixes fish-shell#5729.
@faho faho closed this as completed in b318ab1 Mar 12, 2019
@faho faho added bug Something that's not working as intended and removed needs more info labels Mar 12, 2019
@faho faho added this to the fish 3.1.0 milestone Mar 12, 2019
@goranmoomin
Copy link
Author

@faho I tried to compile in your fork, and it's not fixed :-(...
A screencast is below...

Video

@goranmoomin
Copy link
Author

goranmoomin commented Mar 12, 2019

@faho Also,

If they stand by themselves or in another combination, they'll indeed
show up with a width of 1 so we'll get it wrong

FYI, is that if they stand by themselves, they are shown as a non composite , while if they are inside a composition like , then they can be input in two ways: a whole code-point that represents or two code points that represents and a composite- (So it's not that, because that character is a non-composite).

@faho
Copy link
Member

faho commented Mar 13, 2019

I tried to compile in your fork, and it's not fixed :

Okay, so that means one of two things:

  • there's another problem here

or

  • your system returns != 0 for these but effectively shows them as 0, so deferring to the system's wcwidth here is still wrong

Please compile and run the following:

#include <locale.h>
#include <initializer_list>
#include <stdio.h>
#include <wchar.h>

int main(int argc, char** argv) {
    // This initializes the locale according to the environment variables.
    // That means if the environment isn't set up for unicode, this whole excercise is pointless.
    setlocale(LC_ALL, "");
    wchar_t chars[] = {
                        L'a',
                        L'',
                        L'',
                        L'',
                        L'',
                        L'',
                        L'b',
                        L''
    };

    for (wchar_t c: chars) {
        int w = wcwidth(c);
        // Print check/cross mark on the left, because it can't be after the character.
        // The format is:
        // [CHECK] [SYSW] = [MKW] : [CHAR]
        wprintf(L"%3d  : '%lc'\n", w, c);
    }
}

@faho faho reopened this Mar 13, 2019
@goranmoomin
Copy link
Author

@faho Okay, it looks weird.
First, the results are

  1  : 'a'
  2  : 'ᄀ'
  1  : 'ᅮ'
  1  : 'ᆨ'
  2  : 'ᄋ'
  1  : 'ᅥ'
  1  : 'b'
  1  : '╭'

I tested in 3 terminal emulators (Terminal.app, Alacritty, iTerm2).
iTerm2 displayed using the right width (compositional characters are displayed as one character). However the other two(Terminal.app and Alacritty) displays compositional characters as zero width.

Below are the images of the screenshot.

Alacritty:
unicode-test-alacritty

iTerm2:
unicode-test-iterm2

Terminal.app:
unicode-test-terminal

(BTW, 1. Alacritty has super-bad CJK support; so I'm pretty sure iTerm2 is handling this situation best 2. So I checked if iTerm2 works(again), but the same output is generated (with other terminal emulators))

@faho
Copy link
Member

faho commented Mar 13, 2019

Okay, so your system wcwidth calls these 1 width, so it defaults to giving you the standalone width.

Since wcwidth literally can't take context into account, it has to give you one of these, and changing to an API that can is a bigger thing and tracked in #5583.

So let's just return 0 for these instead, since the usual way they show up is in the composed form.

@faho faho closed this as completed in 05b9c07 Mar 13, 2019
@goranmoomin
Copy link
Author

@faho Okay, I confirmed that it works on all terminal emulators! Thanks :-)

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Apr 17, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug Something that's not working as intended
Projects
None yet
Development

No branches or pull requests

2 participants