Add support to access unicode bytes or alternative ord formats #851

Closed
rautesamtr opened this issue Aug 24, 2019 · 6 comments

Comments

@rautesamtr
Contributor

Hi,
It would be nice to be able to either access the byte values of Unicode runes directly or have ord support UTF-8 encoding.

For example, ord currently outputs UTF-32 code points:

ord 🐈€
▶ 0x1f408
▶ 0x20ac

How about adding an option to output UTF-8? For example:

ord &encoding=utf8 🐈€
▶ [0xf0 0x9f 0x90 0x88]
▶ [0xe2 0x82 0xac]

or

ord &encoding=utf8 🐈€
▶ 0xf09f9088
▶ 0xe282ac
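
For readers who want to check the numbers, the difference between the two outputs can be reproduced in Go, the language Elvish is written in. This is only a rough sketch: the helper names codepoints and utf8Bytes are made up for illustration and are not part of Elvish.

```go
package main

import "fmt"

// codepoints returns the Unicode code point of each rune in s,
// which corresponds to what ord prints in the examples above.
// (Hypothetical helper, not an Elvish function.)
func codepoints(s string) []rune {
	return []rune(s)
}

// utf8Bytes returns, for each rune in s, the bytes of its UTF-8 encoding.
// Invalid input bytes are decoded as U+FFFD before being re-encoded.
// (Hypothetical helper, not an Elvish function.)
func utf8Bytes(s string) [][]byte {
	var out [][]byte
	for _, r := range s {
		out = append(out, []byte(string(r)))
	}
	return out
}

func main() {
	s := "🐈€"
	for _, r := range codepoints(s) {
		fmt.Printf("%#x\n", r) // one code point per line
	}
	for _, b := range utf8Bytes(s) {
		fmt.Printf("% #x\n", b) // the UTF-8 bytes of one rune per line
	}
}
```
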
@xiaq xiaq added the C:Stdlib label Dec 31, 2019
@krader1961
Contributor

krader1961 commented Apr 19, 2020

I don't think there should be an &encoding option, as that implies that other encodings, such as UTF-16, might be supported. Worse, it implies that non-Unicode encodings such as GB2312 might be supported by transcoding the value. It's a Unicode world, which means UTF-32 and UTF-8 are the only sensible choices, especially in the context of Go and thus Elvish. So, instead, implement a &utf8 bool option. The chr command should also support a &utf8 option so that the two functions can be composed.

krader1961 added a commit to krader1961/elvish that referenced this issue Apr 20, 2020
@krader1961
Contributor

I opted for a &bytes option rather than &utf8 because the string might not be valid Unicode.
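
To illustrate the concern about invalid Unicode, here is a small Go sketch. It assumes Elvish strings behave like Go strings, which can hold arbitrary bytes; the helper names rawBytes and decodeRunes are hypothetical.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// rawBytes returns the bytes of s exactly as stored, valid UTF-8 or not.
// (Hypothetical helper for illustration.)
func rawBytes(s string) []byte {
	return []byte(s)
}

// decodeRunes decodes s as UTF-8. Each invalid byte becomes U+FFFD,
// so the original byte value is lost.
// (Hypothetical helper for illustration.)
func decodeRunes(s string) []rune {
	return []rune(s)
}

func main() {
	s := "a\xffb" // 0xff never appears in valid UTF-8

	fmt.Println(utf8.ValidString(s)) // prints false
	fmt.Printf("% #x\n", rawBytes(s))
	fmt.Printf("%#x\n", decodeRunes(s))
}
```

A byte-oriented view keeps the 0xff, while a code-point view can only report the replacement character, which is one reason to describe the output as bytes rather than as UTF-8.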

@xiaq
Member

xiaq commented Apr 26, 2020

These functionalities sound useful, but implementing them as options of the chr and ord commands is stretching the analogy with their Python counterparts too far. Better to add them as separate functions:

  • str:to-utf8-bytes converts a string to a series of numbers, each one representing a byte in the UTF-8 encoding.
  • str:from-utf8-bytes converts a series of numbers to a string, treating each one as a byte in the UTF-8 encoding.
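
The behavior described for these two functions can be sketched in Go roughly as follows. This is not the Elvish implementation: the names toUTF8Bytes and fromUTF8Bytes are hypothetical, and rejecting numbers outside 0 to 255 is an assumption about how out-of-range input might be handled.

```go
package main

import "fmt"

// toUTF8Bytes returns every byte of s as a number.
// (Hypothetical stand-in for the proposed str:to-utf8-bytes.)
func toUTF8Bytes(s string) []int {
	out := make([]int, 0, len(s))
	for i := 0; i < len(s); i++ {
		out = append(out, int(s[i]))
	}
	return out
}

// fromUTF8Bytes builds a string from numbers, treating each as one byte.
// Rejecting values outside 0..255 is an assumption made for this sketch.
// (Hypothetical stand-in for the proposed str:from-utf8-bytes.)
func fromUTF8Bytes(nums []int) (string, error) {
	buf := make([]byte, 0, len(nums))
	for _, n := range nums {
		if n < 0 || n > 0xff {
			return "", fmt.Errorf("byte value out of range: %d", n)
		}
		buf = append(buf, byte(n))
	}
	return string(buf), nil
}

func main() {
	nums := toUTF8Bytes("€")
	fmt.Println(nums) // [226 130 172]

	s, err := fromUTF8Bytes(nums)
	fmt.Println(s, err) // the round trip restores the original string
}
```
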

It's also worthwhile to give chr and ord more self-documenting names: chr -> str:from-codepoints and ord -> str:to-codepoints, and deprecate the builtin chr and ord.
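
For comparison with the byte-oriented pair, the code-point direction (what chr does today) might look like this in Go. The name fromCodepoints and the decision to reject surrogates and out-of-range values are assumptions made for this sketch, not a description of Elvish's actual behavior.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// fromCodepoints builds a string from Unicode code points.
// Values that are negative, above U+10FFFF, or in the surrogate
// range are rejected; that policy is an assumption of this sketch.
// (Hypothetical stand-in for the proposed str:from-codepoints.)
func fromCodepoints(cps []int) (string, error) {
	var sb strings.Builder
	for _, cp := range cps {
		if cp < 0 || cp > utf8.MaxRune || !utf8.ValidRune(rune(cp)) {
			return "", fmt.Errorf("invalid code point: %#x", cp)
		}
		sb.WriteRune(rune(cp))
	}
	return sb.String(), nil
}

func main() {
	s, err := fromCodepoints([]int{0x1f408, 0x20ac})
	fmt.Println(s, err)

	_, err = fromCodepoints([]int{0xd800}) // a lone surrogate
	fmt.Println(err)
}
```
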

@krader1961
Contributor

I can see the point of moving the builtin chr and ord commands to the str: namespace under more descriptive names and deprecating the existing builtins. But it's not clear to me that introducing companion commands that work on UTF-8 byte sequences is preferable to adding a &utf8 option to the proposed str:... commands. This is a niche feature that does not deserve distinct commands. Better to modify the behavior of the two related commands via an option, IMHO.

BTW, I can't see anything in the Elvish documentation that makes it crystal clear that text strings are assumed to be Unicode. Are these commands expected to work if the locale is something like GB2312?

@xiaq
Member

xiaq commented May 6, 2020

@krader1961 Niche, maybe, but converting a string to and from codepoints and bytes is a pretty fundamental operation that can serve as a building block for higher-level functions.

Go treats UTF-8 specially; here is a good introduction. Elvish doesn't work well with other encodings.
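
As a brief illustration of that special treatment: a Go string is a sequence of bytes, but ranging over it decodes UTF-8 one rune at a time. The helper names below are hypothetical and exist only for this sketch.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// runeOffsets returns the byte offset at which each rune of s starts.
// Ranging over a string decodes it as UTF-8.
// (Hypothetical helper for illustration.)
func runeOffsets(s string) []int {
	var offs []int
	for i := range s {
		offs = append(offs, i)
	}
	return offs
}

// runeCount returns the number of runes in s, as opposed to len(s),
// which counts bytes. (Hypothetical helper for illustration.)
func runeCount(s string) int {
	return utf8.RuneCountInString(s)
}

func main() {
	s := "🐈€"
	fmt.Println(len(s))         // 7: length in bytes (4 + 3)
	fmt.Println(runeCount(s))   // 2: length in runes
	fmt.Println(runeOffsets(s)) // [0 4]
}
```
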

@krader1961
Contributor

I read the Go article you linked to long ago, albeit before I had contributed any Go language changes to any project, private or public. It seems like this is a "here be dragons" situation that would benefit from some clarification in the documentation.

rautesamtr added a commit to rautesamtr/elvish that referenced this issue Jul 23, 2020
Move builtin string functions ord and chr to the str module and rename them
to str:to-codepoints and str:from-codepoints respectively as suggested
in elves#851.
rautesamtr added a commit to rautesamtr/elvish that referenced this issue Jul 23, 2020
Add from-utf8-bytes and to-utf8-bytes functions to the str module. These
functions differ from their *-codepoints counterparts in that they handle UTF-8 bytes
instead of whole codepoints. Closes elves#851
@xiaq xiaq closed this as completed in 1124c10 Aug 13, 2020