Add support to access unicode bytes or alternative `ord` formats #851

rautesamtr · 2019-08-24T11:09:48Z

Hi,
it would be nice to be able to either directly access byte values of unicode runes or have ord support utf-8 encoding.

e.g. ord currently outputs utf-32

ord 🐈€
▶ 0x1f408
▶ 0x20ac

How about adding an option to output UTF-8? e.g.

ord &encoding=utf8 🐈€
▶ [0xf0 0x9f 0x90 0x88]
▶ [0xe2 0x82 0xac]

or

ord &encoding=utf8 🐈€
▶ 0xf09f9088
▶ 0xe282ac
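
For reference, the difference between the two outputs can be sketched in Go, the language Elvish is written in. This is only an illustration of code points versus UTF-8 bytes; the helper names `codepoints`, `utf8Bytes` and `hexList` are made up for this sketch and are not part of Elvish.

```go
package main

import (
	"fmt"
	"strings"
)

// codepoints returns the Unicode code point of each character in s,
// which corresponds to what ord prints in the first example.
func codepoints(s string) []rune {
	return []rune(s)
}

// utf8Bytes returns, for each character in s, the bytes of its UTF-8
// encoding. Assumption: s is valid UTF-8; an invalid byte would be
// replaced by the encoding of U+FFFD.
func utf8Bytes(s string) [][]byte {
	var out [][]byte
	for _, r := range s {
		out = append(out, []byte(string(r)))
	}
	return out
}

// hexList formats bytes as space-separated hex values such as "0xe2 0x82 0xac".
func hexList(bs []byte) string {
	parts := make([]string, len(bs))
	for i, b := range bs {
		parts[i] = fmt.Sprintf("0x%02x", b)
	}
	return strings.Join(parts, " ")
}

func main() {
	for _, r := range codepoints("🐈€") {
		fmt.Printf("%#x\n", r)
	}
	for _, bs := range utf8Bytes("🐈€") {
		fmt.Printf("[%s]\n", hexList(bs))
	}
}
```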


krader1961 · 2020-04-19T01:08:21Z

I don't think there should be an &encoding option as that implies that other encodings, such as UTF-16, might be supported. Worse, it implies non-Unicode encodings such as GB2312 might be supported by transcoding the value. It's a Unicode world which means UTF-32 and UTF-8 are the only sensible choices; especially in the context of Go and thus Elvish. So, instead, implement a &utf8 bool option. Also, the chr command should also support a &utf8 option so composition of the functions is possible.


krader1961 · 2020-04-20T01:28:14Z

I opted for a &bytes option rather than &utf8 because the string might not be valid Unicode.
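
A small Go sketch of why "bytes" is the safer name: a string can hold bytes that are not well-formed UTF-8, and those bytes are still accessible even though decoding replaces them. The `describe` helper is hypothetical and only illustrates the point; it is not the code from the commit.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// describe reports whether s is valid UTF-8 and returns its raw bytes.
func describe(s string) (valid bool, raw []byte) {
	return utf8.ValidString(s), []byte(s)
}

func main() {
	s := "a\xffb" // 0xff never appears in well-formed UTF-8
	valid, raw := describe(s)
	fmt.Println(valid, raw)

	// Decoding as code points loses the original byte: the invalid
	// byte is reported as the replacement character U+FFFD.
	for _, r := range s {
		fmt.Printf("%U\n", r)
	}
}
```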

xiaq · 2020-04-26T16:20:36Z

These functionalities sound useful, but implementing them as options of the chr and ord commands is stretching the analogy with their Python counterparts too far. Better to add them as separate functions:

str:to-utf8-bytes converts a string to a series of numbers, each one representing a byte in the UTF-8 encoding.
str:from-utf8-bytes converts a series of numbers to a string, treating each one as a byte.

It's also worthwhile to give chr and ord more self-documenting names: chr -> str:from-codepoints and ord -> str:to-codepoints, and deprecate the builtin chr and ord.
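
To make the proposed split concrete, here is a rough Go sketch of the four conversions and how each pair composes. The function names mirror the proposed Elvish names but are hypothetical Go helpers, not the eventual implementation, and a real version would probably also need to reject invalid code points and invalid byte sequences.

```go
package main

import "fmt"

// toCodepoints mirrors the proposed str:to-codepoints (today's ord).
func toCodepoints(s string) []rune { return []rune(s) }

// fromCodepoints mirrors the proposed str:from-codepoints (today's chr).
func fromCodepoints(cps []rune) string { return string(cps) }

// toUTF8Bytes mirrors the proposed str:to-utf8-bytes.
func toUTF8Bytes(s string) []byte { return []byte(s) }

// fromUTF8Bytes mirrors the proposed str:from-utf8-bytes.
func fromUTF8Bytes(bs []byte) string { return string(bs) }

func main() {
	s := "🐈€"
	// Each pair round-trips, so the functions compose.
	fmt.Println(fromCodepoints(toCodepoints(s)) == s)
	fmt.Println(fromUTF8Bytes(toUTF8Bytes(s)) == s)
	// Two code points, but seven bytes (4 + 3) in UTF-8.
	fmt.Println(len(toCodepoints(s)), len(toUTF8Bytes(s)))
}
```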

krader1961 · 2020-04-27T03:11:49Z

I can see the point of moving the builtin chr and ord commands to the str: namespace under more descriptive names and deprecating the existing builtins. But it's not clear to me that introducing companion commands that work on UTF-8 byte sequences is preferable to adding a &utf8 option to the proposed str:... commands. This is a niche feature that does not deserve distinct commands. Better to modify the behavior of the two related commands via an option, IMHO.

BTW, I can't see anything in the Elvish documentation that makes it crystal clear that text strings are assumed to be Unicode. Are these commands expected to work if the locale is something like GB2312?

xiaq · 2020-05-06T21:38:13Z

@krader1961 niche maybe, but converting a string from and to codepoints and bytes are pretty fundamental operations that can be building blocks for higher level functions.

Go treats UTF-8 specially; here is a good introduction. Elvish doesn't work well with other encodings.
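
A short Go sketch of what "treats UTF-8 specially" means in practice: strings are byte sequences, `len` counts bytes, and `range` decodes UTF-8 as it goes. The last part assumes that the character 中 is encoded as the bytes 0xD6 0xD0 in GB2312; those bytes are not valid UTF-8, which hints at why other encodings are awkward. The helpers `runeOffsets` and `isUTF8` are hypothetical names for this sketch.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// runeOffsets returns the byte offset at which each decoded character starts.
func runeOffsets(s string) []int {
	var offs []int
	for i := range s {
		offs = append(offs, i)
	}
	return offs
}

// isUTF8 reports whether s consists entirely of well-formed UTF-8.
func isUTF8(s string) bool {
	return utf8.ValidString(s)
}

func main() {
	s := "a€🐈"
	// len counts bytes; RuneCountInString counts decoded characters.
	fmt.Println(len(s), utf8.RuneCountInString(s))
	// range steps by character, so the offsets skip over multi-byte encodings.
	fmt.Println(runeOffsets(s))

	gb := "\xd6\xd0" // assumed GB2312 bytes for 中
	fmt.Println(isUTF8(gb))
}
```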

krader1961 · 2020-05-08T04:53:59Z

I read the Go article you linked to long ago, albeit before I had contributed any Go language changes to any project, private or public. It seems like this is a "here be dragons" situation that would benefit from some clarification in the documentation.

Move the builtin string functions ord and chr to the str module and rename them to str:to-codepoints and str:from-codepoints respectively, as suggested in elves#851.

Add from-utf8-bytes and to-utf8-bytes functions to the str module. These functions differ from their *-codepoints counterparts in that they handle UTF-8 bytes instead of whole codepoints. Closes elves#851

xiaq added the C:Stdlib label Dec 31, 2019

krader1961 added a commit to krader1961/elvish that referenced this issue Apr 20, 2020

Implement &bytes option for chr and ord commands

2a7a1f2

Resolves elves#851

xiaq closed this as completed in 1124c10 Aug 13, 2020
