New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
2-width characters (CJK languages) break command suggestion and completion #5729
Comments
I'm quite sure fish does! We use https://github.com/ridiculousfish/widecharwidth to compute the widths, which gets them by parsing the relevant standard document (EastAsianWidth.txt). The question is if your terminal does (as always, the major issue about character width in a terminal is that both the terminal and the client application have to agree), or if those characters are more complicated - a bunch of combiners? Would it be possible for you to copy that output, paste it into a file and upload it here? |
Well, as I could replicate the same problem in two Terminal emulators, For the files, here's them: it's from |
Installed iTerm 2, and could also replicate the error. |
Okay, so one of the offending filenames is `ᄀ ᅮ ᆨ ᄋ ᅥ' (without the spaces, I've added those so the browser doesn't compose them). The components have, in our wcwidth, a width of 2, 1, 1, 2, 1, respectively. That's a total of 6, which is also how it displays in my terminal - uncomposed. The system wcwidth, on the other hand, assigns ᄀ and ᄋ a width of 2 and the others 0, which works out to 4, the correct width for the combined version. @ridiculousfish: Any ideas for how this is supposed to work? |
I am aware. The problem is that we basically can't deal with these differences in presentation. There is no communication between the terminal and the application, so both need to come up with the width independently, and the typical thing is to come up with the width for each codepoint independently. Which in this case works if the codepoint is only ever used in the same way - either only in the composed form, or only in the uncomposed one. So we need to pick a "side" here.
Not necessary. |
Okay, so let's consult the definitive standards document, which is... a .txt file: https://www.unicode.org/Public/UNIDATA/EastAsianWidth.txt. It says:
That means the codepoints from U+1160 to U+11FF, the Hangul Jamo medial vowels and final consonants, have a "N"arrow width (i.e. 1), and are classified as "Letter, other" (Lo). For reference, "ᅮ" is U+116E HANGUL JUNGSEONG U, so it's in that block. So this seems like a broken definition. Unless there's some bit I'm missing that supercedes that (and @ridiculousfish also missed it when writing widecharwidth) or I'm just misreading it (which is quite likely), it seems like we might have to break the standard, though that breaks it in terminals where it currently works. However, considering that the current way is wrong, and that korean speakers, presumably, would pick a terminal that is right, that seems worth it. |
@faho No, you don’t have to deal that problem because Korean is always composed. It’s not something like we have a composed form and decomposed form, it’s always composed(and could be handled with the wcwidth function, since it should already handle Sent with GitHawk |
No, the standard is indeed broken. Or at the very least for the usecases in a terminal it's basically unusable. There is no way for us to tell if the terminal will display it as composed or not, and the codepoint by itself doesn't make much sense (as far as I can tell, anyway). Okay, I'm going to paper over this now, by treating U+1160 to U+11FF as combiners. That means we refer to your system's width table, which will most likely give their width as 0, so the total will add up to the correct width. If it doesn't give a width of 0, it most likely doesn't combine them either. If they present in a different width standing by themselves, that's not something we can currently express. I refer to #5583 on that. |
Hangul uses three codepoints to combine to one glyph. The first has a width of 2 (like the final glyph), but the second and third were assigned a width of 1, which seems to match EastAsianWidth.txt: > 1160..11FF;N # Lo [160] HANGUL JUNGSEONG FILLER..HANGUL JONGSEONG SSANGNIEUN Instead, we override that and treat the middle and end codepoint as combiners, always, because there's no way to figure out what the terminal will think and that's the way it's supposed to work. If they stand by themselves or in another combination, they'll indeed show up with a width of 1 so we'll get it wrong, but that's less likely and not expressible with wcwidth(). Fixes fish-shell#5729.
@faho Also,
FYI, is that if they stand by themselves, they are shown as a non composite |
Okay, so that means one of two things:
or
Please compile and run the following: #include <locale.h>
#include <initializer_list>
#include <stdio.h>
#include <wchar.h>
int main(int argc, char** argv) {
// This initializes the locale according to the environment variables.
// That means if the environment isn't set up for unicode, this whole excercise is pointless.
setlocale(LC_ALL, "");
wchar_t chars[] = {
L'a',
L'ᄀ',
L'ᅮ',
L'ᆨ',
L'ᄋ',
L'ᅥ',
L'b',
L'╭'
};
for (wchar_t c: chars) {
int w = wcwidth(c);
// Print check/cross mark on the left, because it can't be after the character.
// The format is:
// [CHECK] [SYSW] = [MKW] : [CHAR]
wprintf(L"%3d : '%lc'\n", w, c);
}
} |
@faho Okay, it looks weird.
I tested in 3 terminal emulators ( Below are the images of the screenshot. (BTW, 1. Alacritty has super-bad CJK support; so I'm pretty sure iTerm2 is handling this situation best 2. So I checked if iTerm2 works(again), but the same output is generated (with other terminal emulators)) |
Okay, so your system wcwidth calls these 1 width, so it defaults to giving you the standalone width. Since wcwidth literally can't take context into account, it has to give you one of these, and changing to an API that can is a bigger thing and tracked in #5583. So let's just return 0 for these instead, since the usual way they show up is in the composed form. |
@faho Okay, I confirmed that it works on all terminal emulators! Thanks :-) |
Fish shell doesn't calculate CJK characters as two characters, which makes CJK characters mangle the shell output(for example, autosuggestions make characters erase the prompt, completions are not correctly displayed... etc).
As (almost) all CJK mono fonts display them as two characters, fish shell should calculate the locations with the assumption that CJK characters have a width of 2.
Below is an asciinema record of what's happening. (It should be viewed in macOS, as Windows or Linux cannot display HFS style NFD filenames properly.)
The text was updated successfully, but these errors were encountered: