Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Another implementation for utf8 support. #187

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

yhirose
Copy link

@yhirose yhirose commented Apr 19, 2020

Since I saw #186, I would like to post my implementation for utf8 support, too. It supports Emoji and CJK characters which have wide width as well. When the Unicode standard gets updated, the only thing we need is to update tables in `encodings/utf8.c'. We can also easily add another encodings by implementing by providing the following functions:

typedef size_t (linenoisePrevCharLen)(const char *buf, size_t buf_len, size_t pos, size_t *col_len);
typedef size_t (linenoiseNextCharLen)(const char *buf, size_t buf_len, size_t pos, size_t *col_len);
typedef size_t (linenoiseReadCode)(int fd, char *buf, size_t buf_len, int* c);

You can see more detail from my comments in #25.

I don't have any intention to promote my implementation at all with this pull request. But hope this implementation helps you when designing the utf8 support in linenoise in addition to @ManzoniGiuseppe's excellent implementation.

Thank you!

@ManzoniGiuseppe
Copy link

Nah, mine was just a temporary fix for basic utf-8 support until someone wrote something more complete (which apparently was already written long ago). Anyway, I tried yours out, and I found two things. It seems I can't create issues in yours, so, I'll put them here:

  • The wideCharTable seems bugged to me. Things like °, ¡, €, ¤, ×, ¹, á, ¶ are shown as single column in my xterm, but your impl considers them as two columns large. Maybe it's my xterm to be bugged, but consider the following: Your impl considers the lower case á to be two columns wide, but its uppercase Á is a single column... same strange thing for the pairs ǎ/Ǎ, à/À, while ã/Ã, ä/Ä, ȧ/Ȧ are all single column.
  • You put this codepoint in your example, and you said to support unicode 12.1, but the Zero Width Joiner (U+200D, added in unicode 1.1.5) is not supported width-wise. I mean, it's neither a zero width nor a joiner.

@yhirose yhirose force-pushed the utf8-support branch 3 times, most recently from 56859ae to 72bf4bb Compare April 20, 2020 16:20
@yhirose
Copy link
Author

yhirose commented Apr 20, 2020

@ManzoniGiuseppe, thanks for the thorough checking!

The wideCharTable seems bugged to me. Things like °, ¡, €, ¤, ×, ¹, á, ¶ are shown as single column in my xterm, but your impl considers them as two columns large. Maybe it's my xterm to be bugged, but consider the following: Your impl considers the lower case á to be two columns wide, but its uppercase Á is a single column... same strange thing for the pairs ǎ/Ǎ, à/À, while ã/Ã, ä/Ä, ȧ/Ȧ are all single column.

You are right. My the wideCharTable generator (generate_wide_char_table.rb) treated 'Ambiguous' (A) entries as Wide chars. The new generator code now excludes them from wideCharTable. Please check with the latest code.

The reason why I considered Ambiguous entries as Wide chars is based on the following information from UAX #11 'East Asian Width' document for the Ambiguous characters:

In a broad sense, wide characters include W, F, and A (when in East Asian context), and narrow characters include N, Na, H, and A (when not in East Asian context).

But it seems like there are only a few cases where the ambiguous entries should be treated as wide chars. So I think this change is ok, and hope it fixes the problems that you reported.

You put this codepoint in your example, and you said to support unicode 12.1, but the Zero Width Joiner (U+200D, added in unicode 1.1.5) is not supported width-wise. I mean, it's neither a zero width nor a joiner.

In the EastAsianWIdth.txt in Unicode 13.0 Database has the following entry:

200B..200F;N     # Cf     [5] ZERO WIDTH SPACE..RIGHT-TO-LEFT MARK

U+200D is declared as 'Narrow' (N). So I am not sure if U+200D should be treated as a wide character at this point...

Anyway I didn't make any decision by myself as to whether characters should be wide or narrow, rather the Unicode East Asian Width database does all. The benefit of this approach is very easy to catch up to the latest Unicode standard. (Just download UnicodeData.txt and EastAsianWidth.txt from Unicode.org, and re-generate combiningCharTable and WideCharTable with the ruby scripts.)

But if there are cases that this auto-generated-table-driven way cannot handle, it seems like such cases should be handled manually...

I included tools directory where you can see the latest Unicode 13.0 UnicodeData.txt and EastAsianWidth.txt and ruby scripts to generate the C++ tables. Hope it helps you to understand my code better. :)

@craigbarnes
Copy link

craigbarnes commented Apr 28, 2020

U+200D is declared as 'Narrow' (N). So I am not sure if U+200D should be treated as a wide character at this point...

N means neutral, not narrow. U+200D is a default ignorable codepoint, so in the context of a terminal it should be zero width.

@yhirose
Copy link
Author

yhirose commented Apr 28, 2020

@craigbarnes, thanks for finding my incorrect understanding for N. :)

@craigbarnes
Copy link

craigbarnes commented Apr 28, 2020

It's the same difference really. It's the default ignorable part that matters. The most straight forward way to support those codepoints is to parse DerivedCoreProperties.txt and make all Default_Ignorable_Code_Point entries have a width of 0.

@antirez
Copy link
Owner

antirez commented Apr 28, 2020

Hey folks, thanks for your efforts. What do you think the best support is so far among the ones contributed? I would like to spend some time to review.

@yhirose
Copy link
Author

yhirose commented May 1, 2020

@antirez, thank you for your interest in adding UTF-8 support.

I don't say that my implementation is the best though, you might find something helpful in it when supporting UTF-8 in linenoise.

Here is the diff between the latest master and my branch:
yhirose@72bf4bb#diff-72af1757e65272a44fc4aa3b479e2f3b

Most changes in 'linenoise.c' are basically for calculation of correct column position for multi-bytes characters. (Size of a character could now be multi-bytes, and it also could be a 'wide' char which consumes 2 column widths like Emoji and CJK characters.)

The actual UTF-8 specific code is separated in `encodings/utf8.h and c'. It implements the following interfaces which are used in 'linenoise.c'.

typedef size_t (linenoisePrevCharLen)(const char *buf, size_t buf_len, size_t pos, size_t *col_len);
typedef size_t (linenoiseNextCharLen)(const char *buf, size_t buf_len, size_t pos, size_t *col_len);
typedef size_t (linenoiseReadCode)(int fd, char *buf, size_t buf_len, int* c);

So if you want to support another encoding, what we need to do is to implement the above interfaces for the encoding and integrate them to linenoise by calling linenoiseSetEncodingFunctions function in 'linenoise.h'.

Hope it helps when you review the code. Please let me know if you have any questions.
Thank you!

@ManzoniGiuseppe
Copy link

Of the two I know of, this has an obviously better utf-8 support then mine.
If anything I think a function to detect if utf-8 is supported would be nice and easy. For example mine internally (didn't want to change the API) checks it by printing a U+1EE4 and detecting of how many positions the cursor was moved. Many other encodings like all ISO 8859,Windows-1252, and Mac OS Roman would print it as 3 valid chars instead of 1. I think it'd be useful for many applications as they may not know which encoding their terminal is using.

@antirez
Copy link
Owner

antirez commented Jul 3, 2020

Ok I took some time to read both the implementations. They are both nice in different ways. But my feeling is that this implementation is too complex for the linenoise design idea of minimality, and @ManzoniGiuseppe implementation, which is more in the right direction, is too conservative in the abstractions, so it ends making a lot of math with buffer positions. I want to attempt a more minimal implementation, and I'll post it as a PR as well. Let's see if there is a possible, really minimal approach. Thanks.

@yhirose
Copy link
Author

yhirose commented Jul 3, 2020

@antirez, sounds good to me!

@dzwdz
Copy link

dzwdz commented Aug 9, 2023

In case anyone cares, I've updated this branch to work with the multiplexing API / resolved the current merge conflicts. yhirose#1

@yhirose
Copy link
Author

yhirose commented Aug 29, 2023

I have rebased the utf8-support branch on yhirose/linenoise with the latest master branch, and I confirmed that the utf8-support branch works with Japanese and Emoji.

apainintheneck added a commit to apainintheneck/crystal-linenoise that referenced this pull request Feb 27, 2024
This is the first batch of changes from an upstream PR that provides
UTF8 support. As a side effect, it means that we now will have the
correct support for escape sequences in single and multiline mode.

We previously merged in a smaller change which provided support
for single line mode with escape sequences by fixing the cursor
position but this is more comprehensive.

The next commit will add the UTF8 library.

Author: https://github.com/yhirose
PR: antirez/linenoise#187
apainintheneck added a commit to apainintheneck/crystal-linenoise that referenced this pull request Feb 27, 2024
This is the second half of the PR that I'm merging in from upstream.
It adds UTF8 13.0 support by including an extra library to get the
char width of those characters (the visual width that is). During
testing things seem to work quite well. I'm impressed.

Author: https://github.com/yhirose
PR: antirez/linenoise#187
@yhirose
Copy link
Author

yhirose commented Mar 27, 2024

Rebased the utf8-support branch on yhirose/linenoise with the latest master branch.

@zlaazlaa
Copy link

What a nice PR!, is that means we can input Chinese and delete Chinese correctly? In master branch I even need to press Backspace for three times to delete Chinese!

@yhirose
Copy link
Author

yhirose commented Aug 19, 2024

Yes, you can. Here is my branch which includes the UTF-8 support. Hope you enjoy it!
https://github.com/yhirose/linenoise/tree/utf8-support

@zlaazlaa
Copy link

Yes, you can. Here is my branch which includes the UTF-8 support. Hope you enjoy it! https://github.com/yhirose/linenoise/tree/utf8-support

Thanks for your reply, I'll try it.👍

@yhirose
Copy link
Author

yhirose commented Oct 11, 2024

Catch up to Unicode 16.0.0.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants