-
Notifications
You must be signed in to change notification settings - Fork 665
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Another implementation for utf8 support. #187
base: master
Are you sure you want to change the base?
Conversation
Nah, mine was just a temporary fix for basic utf-8 support until someone wrote something more complete (which apparently was already written long ago). Anyway, I tried yours out, and I found two things. It seems I can't create issues in yours, so, I'll put them here:
|
56859ae
to
72bf4bb
Compare
@ManzoniGiuseppe, thanks for the thorough checking!
You are right. My the wideCharTable generator (generate_wide_char_table.rb) treated 'Ambiguous' (A) entries as Wide chars. The new generator code now excludes them from wideCharTable. Please check with the latest code. The reason why I considered Ambiguous entries as Wide chars is based on the following information from UAX #11 'East Asian Width' document for the Ambiguous characters:
But it seems like there are only a few cases where the ambiguous entries should be treated as wide chars. So I think this change is ok, and hope it fixes the problems that you reported.
In the EastAsianWIdth.txt in Unicode 13.0 Database has the following entry:
U+200D is declared as 'Narrow' (N). So I am not sure if U+200D should be treated as a wide character at this point... Anyway I didn't make any decision by myself as to whether characters should be wide or narrow, rather the Unicode East Asian Width database does all. The benefit of this approach is very easy to catch up to the latest Unicode standard. (Just download UnicodeData.txt and EastAsianWidth.txt from Unicode.org, and re-generate But if there are cases that this auto-generated-table-driven way cannot handle, it seems like such cases should be handled manually... I included |
|
@craigbarnes, thanks for finding my incorrect understanding for |
It's the same difference really. It's the default ignorable part that matters. The most straight forward way to support those codepoints is to parse |
Hey folks, thanks for your efforts. What do you think the best support is so far among the ones contributed? I would like to spend some time to review. |
@antirez, thank you for your interest in adding UTF-8 support. I don't say that my implementation is the best though, you might find something helpful in it when supporting UTF-8 in linenoise. Here is the diff between the latest master and my branch: Most changes in 'linenoise.c' are basically for calculation of correct column position for multi-bytes characters. (Size of a character could now be multi-bytes, and it also could be a 'wide' char which consumes 2 column widths like Emoji and CJK characters.) The actual UTF-8 specific code is separated in `encodings/utf8.h and c'. It implements the following interfaces which are used in 'linenoise.c'. typedef size_t (linenoisePrevCharLen)(const char *buf, size_t buf_len, size_t pos, size_t *col_len);
typedef size_t (linenoiseNextCharLen)(const char *buf, size_t buf_len, size_t pos, size_t *col_len);
typedef size_t (linenoiseReadCode)(int fd, char *buf, size_t buf_len, int* c); So if you want to support another encoding, what we need to do is to implement the above interfaces for the encoding and integrate them to linenoise by calling Hope it helps when you review the code. Please let me know if you have any questions. |
Of the two I know of, this has an obviously better utf-8 support then mine. |
Ok I took some time to read both the implementations. They are both nice in different ways. But my feeling is that this implementation is too complex for the linenoise design idea of minimality, and @ManzoniGiuseppe implementation, which is more in the right direction, is too conservative in the abstractions, so it ends making a lot of math with buffer positions. I want to attempt a more minimal implementation, and I'll post it as a PR as well. Let's see if there is a possible, really minimal approach. Thanks. |
@antirez, sounds good to me! |
8d05ca3
to
8139f7a
Compare
In case anyone cares, I've updated this branch to work with the multiplexing API / resolved the current merge conflicts. yhirose#1 |
I have rebased the utf8-support branch on yhirose/linenoise with the latest master branch, and I confirmed that the utf8-support branch works with Japanese and Emoji. |
This is the first batch of changes from an upstream PR that provides UTF8 support. As a side effect, it means that we now will have the correct support for escape sequences in single and multiline mode. We previously merged in a smaller change which provided support for single line mode with escape sequences by fixing the cursor position but this is more comprehensive. The next commit will add the UTF8 library. Author: https://github.com/yhirose PR: antirez/linenoise#187
This is the second half of the PR that I'm merging in from upstream. It adds UTF8 13.0 support by including an extra library to get the char width of those characters (the visual width that is). During testing things seem to work quite well. I'm impressed. Author: https://github.com/yhirose PR: antirez/linenoise#187
Rebased the utf8-support branch on yhirose/linenoise with the latest master branch. |
What a nice PR!, is that means we can input Chinese and delete Chinese correctly? In master branch I even need to press Backspace for three times to delete Chinese! |
Yes, you can. Here is my branch which includes the UTF-8 support. Hope you enjoy it! |
Thanks for your reply, I'll try it.👍 |
Catch up to Unicode 16.0.0. |
Since I saw #186, I would like to post my implementation for utf8 support, too. It supports Emoji and CJK characters which have wide width as well. When the Unicode standard gets updated, the only thing we need is to update tables in `encodings/utf8.c'. We can also easily add another encodings by implementing by providing the following functions:
You can see more detail from my comments in #25.
I don't have any intention to promote my implementation at all with this pull request. But hope this implementation helps you when designing the utf8 support in linenoise in addition to @ManzoniGiuseppe's excellent implementation.
Thank you!