How to handle characters outside the BMP #387

Closed
ianbollinger opened this Issue Sep 2, 2015 · 7 comments

ianbollinger commented Sep 2, 2015

Manipulating characters outside the Unicode BMP will produce a crash on the REPL (on Fedora but not Mac OS X):

> String.toList (String.fromList [ '😣' ])

...or...

> Char.fromCode (Char.toCode '😣')

results in

elm-repl: fd:7: hGetContents: invalid argument (invalid byte sequence)

It seems that either:

  1. The Elm compiler should issue a syntax error if a character literal falls outside of the BMP.
  2. The Elm core library should be made to work with characters outside the BMP (harder, since JavaScript doesn't work well with them either.)

Member

evancz commented Sep 2, 2015

edit: I missed the point about JS doing a bad job as well. Revised this comment heavily!

Is this an Elm problem or an elm-repl problem? From the description, it sounds like maybe this even belongs on the elm-repl repo and we need to handle Unicode better there. Or is this something that happens in core in typical usage in a browser as well?

There have been some "my system does not default to utf8" kind of issues going around, but it's typically an issue where elm-make does not force a certain encoding in the last platform release.

Contributor

ianbollinger commented Sep 2, 2015

Well, I'd say it's an Elm problem, because the underlying issue is that you can create Chars that are four bytes wide (instead of two bytes wide) and then do weird things with them. For instance, '😣' is code point 0x1F623, but Char.toCode '😣' /= 0x1F623. Thus fromCode and toCode aren't strictly inverses. Another example is String.toList, which will split a 4-byte code point down the middle. Finally, List.length ['😣'] /= String.length (String.fromList ['😣']).

Java "solves" the problem by disallowing characters outside the BMP (which take 4 bytes in UTF-16) in character literals. I mention Java specifically because it and JavaScript both use UTF-16 with 2-byte code units (to their detriment).
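
The mismatches described above follow from JavaScript strings being sequences of UTF-16 code units. A minimal sketch in plain JavaScript of how U+1F623 is stored and why code-unit-based functions disagree with the code point (the helper name `toSurrogatePair` is hypothetical, introduced only for illustration):

```javascript
// Hypothetical helper: split a code point above U+FFFF into its UTF-16 surrogate pair.
function toSurrogatePair(codePoint) {
    const offset = codePoint - 0x10000;      // 20 bits remain
    const high = 0xD800 + (offset >> 10);    // top 10 bits -> high surrogate
    const low = 0xDC00 + (offset & 0x3FF);   // bottom 10 bits -> low surrogate
    return [high, low];
}

const pair = toSurrogatePair(0x1F623);                // [0xD83D, 0xDE23]
const face = String.fromCharCode(pair[0], pair[1]);   // the same string as "😣"

console.log(face.length);            // 2: two code units, one code point
console.log(face.charCodeAt(0));     // 55357 (0xD83D), not 0x1F623
console.log(face.split("").length);  // 2: split("") cuts the pair in half
```

Anything that counts or indexes by code unit, as `length`, `charCodeAt` and `split("")` do, will see this character as two items.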

@evancz evancz referenced this issue Sep 23, 2016

Closed

String and Char #725

Member

evancz commented Sep 23, 2016

Okay, going to continue tracking in #725 so we can consider all the string oddities in one place. Thanks for reporting this!

martinmodrak commented Nov 23, 2016

I think the core should give up on Unicode code points/characters and treat strings as sequences of 16-bit words. The thing is that the mapping between Unicode code points and characters as perceived by the user is neither total, nor injective, nor surjective, and thus determining character boundaries in Unicode strings is difficult and expensive.

Further, having the correct character boundaries is of little practical importance. For almost all tasks in web development, I simply pass strings around. Further, all common operations (search for substring, search for a BMP character etc.) work without any special consideration for non-BMP characters. In particular, parsing XML, JSON, etc. works just fine.

The implication for the case above is that '😣' would be an invalid character constant, while "😣" would be a valid 2-char string. If anyone needs fancy character-boundaries-aware text processing and is willing to pay the performance penalty, it should be dealt with in a specialized UTF package.

Sourcing a bit from http://utf8everywhere.org/ :-)
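
As a rough illustration of the argument above (plain JavaScript; the variable names are arbitrary), code-unit-level operations such as substring search and JSON parsing are unaffected by non-BMP characters, while even code points do not line up with user-perceived characters:

```javascript
const text = "status: 😣 done";

// Substring search works on code units and needs no special handling.
const at = text.indexOf("done");
console.log(text.slice(at));                  // "done"

// JSON round-trips the string unchanged.
const roundTripped = JSON.parse(JSON.stringify({ msg: text })).msg;
console.log(roundTripped === text);           // true

// "e" followed by U+0301 (combining acute accent) renders as one character
// but is two code points, so code points are not perceived characters either.
const accented = "e\u0301";
console.log(Array.from(accented).length);     // 2
```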

feihong commented Apr 2, 2017

In JavaScript, there is an easy way to split strings containing non-BMP characters, but it is not yet widely supported. It's true that "😀😁😂".split("") will unfortunately return an array containing 6 elements. However, for browsers that support the spread operator, [..."😀😁😂"] always returns ["😀", "😁", "😂"]. According to MDN, browser support for this use of the spread syntax is overall pretty good, with IE Mobile and Opera being notable exceptions.

There is an alternate, but related, method, which is to define a function like this:

function toList(str) {
    // for...of iterates by code point, so surrogate pairs stay together.
    const result = [];
    for (const c of str) { result.push(c); }
    return result;
}

Unfortunately, the for...of syntax is only a little better supported, and will fail on IE Mobile and Opera Mobile.

So a possible solution for fixing String.toList is to simply wait for browsers to catch up to the relevant parts of the ES2015 spec, and then use [...str] instead of str.split("").
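
A small side-by-side sketch of the behaviors described above. `Array.from` is included as an assumption on my part: it is also ES2015 and uses the same string iterator, so it should behave like the spread form without needing the spread syntax.

```javascript
const faces = "😀😁😂";

const byCodeUnit = faces.split("");     // splits between surrogate halves
const bySpread = [...faces];            // string iterator walks code points
const byArrayFrom = Array.from(faces);  // same iterator, no spread syntax

console.log(byCodeUnit.length);   // 6
console.log(bySpread.length);     // 3
console.log(byArrayFrom.length);  // 3
```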

feihong commented Apr 2, 2017

I also looked into the possibility of implementing Char.toCode and Char.fromCode using String.prototype.codePointAt() and String.fromCodePoint(). The merit of doing that is that the two round-trip: String.fromCodePoint("😀".codePointAt(0)) === "😀". However, at the time of writing, this is rather poorly supported, since not even Chrome on Android has these functions.
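
To make the round-trip property concrete, here is a short sketch contrasting the code-unit API with the ES2015 code-point API (variable names are arbitrary):

```javascript
const grin = "😀"; // U+1F600

// Code-unit API: charCodeAt sees only the first surrogate, so the round trip fails.
const unitCode = grin.charCodeAt(0);
console.log(String.fromCharCode(unitCode) === grin);    // false

// Code-point API (ES2015): the round trip succeeds.
const pointCode = grin.codePointAt(0);
console.log(pointCode.toString(16));                    // "1f600"
console.log(String.fromCodePoint(pointCode) === grin);  // true
```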

I should note that there are polyfills:

Member

evancz commented Apr 3, 2017

The benefit of my fixes, like in elm-lang@bf16ca8, is that they have great browser support. No need for any new stuff really. Just gotta learn a bunch of stuff about UTF-16.
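
For readers wondering what an approach with broad browser support might look like, here is a hedged sketch that relies only on ES5 features (`charCodeAt`, `slice`). It is not the code from the referenced commit; the function name `toCodePointList` is hypothetical, and the handling of lone surrogates (kept as single items) is an assumption.

```javascript
// Hypothetical ES5-only splitter; NOT the code from the referenced commit.
function toCodePointList(str) {
    var result = [];
    var i = 0;
    while (i < str.length) {
        var unit = str.charCodeAt(i);
        // A high surrogate followed by a low surrogate forms one code point.
        var isHigh = unit >= 0xD800 && unit <= 0xDBFF;
        var next = i + 1 < str.length ? str.charCodeAt(i + 1) : 0;
        var isPair = isHigh && next >= 0xDC00 && next <= 0xDFFF;
        var width = isPair ? 2 : 1;   // lone surrogates are kept as single items
        result.push(str.slice(i, i + width));
        i += width;
    }
    return result;
}

console.log(toCodePointList("a😀b").length); // 3
```

The only knowledge required is the two surrogate ranges, 0xD800 to 0xDBFF for the first half and 0xDC00 to 0xDFFF for the second.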

@elm elm locked and limited conversation to collaborators Apr 3, 2017
