How to handle characters outside the BMP #387
evancz
Sep 2, 2015
Member
edit: I missed the point about JS doing a bad job as well. Revised this comment heavily!
Is this an Elm problem or an elm-repl problem? From the description, it sounds like maybe this even belongs on the elm-repl repo and we need to be doing Unicode stuff better there. Or is this something that happens in core in typical usage in a browser as well?
There have been some "my system does not default to utf8" kind of issues going around, but it's typically an issue where elm-make does not force a certain encoding in the last platform release.
ianbollinger
Sep 2, 2015
Contributor
Well, I'd say it's an Elm problem because the underlying problem is that you can create Chars that are four bytes wide (instead of two bytes wide) and then do weird things with them. For instance, Char.toCode '😣' /= 0x1F623. Thus fromCode and toCode aren't strictly inverses. Another example is String.toList, which will split a 4-byte code point down the middle. Finally, List.length ['😣'] /= String.length (String.fromList ['😣']).
Java "solves" the problem by disallowing characters outside the BMP (4 bytes) in character literals. I mention Java specifically because it and JavaScript both use 2-byte encodings (to their detriment).
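The mismatches described above trace back to JavaScript strings being sequences of 16-bit UTF-16 code units. The following is a minimal sketch in plain JavaScript, not Elm's actual runtime code; the variable names are illustrative only.

```javascript
const face = "😣"; // U+1F623, outside the BMP

// length counts 16-bit code units, not code points.
const unitCount = face.length;

// charCodeAt only ever sees one half of the surrogate pair.
const firstUnit = face.charCodeAt(0);  // high surrogate
const secondUnit = face.charCodeAt(1); // low surrogate

// Splitting by code unit cuts the pair down the middle.
const halves = face.split("");

console.log(unitCount, firstUnit.toString(16), secondUnit.toString(16), halves.length);
```

Neither code unit equals 0x1F623, which is why a code-unit based toCode cannot be the inverse of fromCode for such characters.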
evancz
Sep 23, 2016
Member
Okay, going to continue tracking in #725 so we can consider all the string oddities in one place. Thanks for reporting this!
evancz closed this Sep 23, 2016
martinmodrak
Nov 23, 2016
I think the core should give up on Unicode code points/characters and treat strings as sequences of 16-bit words. The thing is that the mapping between Unicode code points and characters as perceived by the user is neither total, nor injective, nor surjective, and thus determining character boundaries in Unicode strings is difficult and expensive.
Further, having the correct character boundaries is of little practical importance. For almost all tasks in web development, I simply pass strings around. Moreover, all common operations (search for a substring, search for a BMP character, etc.) work without any special consideration for non-BMP characters. In particular, parsing XML, JSON, etc. works just fine.
The implication for the case above is that '
Sourcing a bit from http://utf8everywhere.org/ :-)
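As a rough illustration of the claims above (a sketch in plain JavaScript with made-up sample data, not anything from Elm's core): code-unit based operations such as substring search, splitting on a BMP delimiter, and a JSON round trip leave surrogate pairs intact, while even a code-point count does not match what a user perceives as one character.

```javascript
const text = "name=😣;mood=ok"; // hypothetical sample string

// Substring search works on code units and still finds the right spot.
const moodIndex = text.indexOf("mood");

// Splitting on a BMP delimiter never lands inside a surrogate pair,
// because surrogate code units (0xD800-0xDFFF) are not BMP characters.
const fields = text.split(";");

// A JSON round trip preserves the non-BMP character.
const roundTripped = JSON.parse(JSON.stringify({ value: text })).value;

// Code points are not user-perceived characters either:
// "e" followed by U+0301 (combining acute accent) renders as a single "é".
const accented = "e\u0301";
const codePointCount = Array.from(accented).length;

console.log(moodIndex, fields, roundTripped === text, codePointCount);
```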
added a commit that referenced this issue Mar 25, 2017
feihong
Apr 2, 2017
In JavaScript, there is an easy way to split strings containing non-BMP characters, but it is not yet widely supported. It's true that "😀😁😂".split("") will unfortunately return an array containing 6 elements. However, for browsers that support the spread operator, [..."😀😁😂"] always returns ["😀", "😁", "😂"]. According to MDN, browser support for this use of the spread syntax is overall pretty good, with IE Mobile and Opera being notable exceptions.
There is an alternate, but related, method, which is to define a function like this:

    function toList(str) {
      let result = [];
      // for...of iterates a string by code point, so surrogate pairs stay whole
      for (let c of str) { result.push(c); }
      return result;
    }

Unfortunately, the for...of syntax is only a little better supported, and will fail on IE Mobile and Opera Mobile.
So a possible solution for fixing String.toList is to simply wait for browsers to catch up to the relevant parts of the ES2015 spec, and then use [...str] instead of str.split("").
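The difference between the two approaches can be checked directly. This is only a sketch; `Array.from` is not mentioned in the comment above and is included as a third ES2015 option that uses the same string iterator, so it carries the same browser-support caveat.

```javascript
const faces = "😀😁😂";

// split("") works per UTF-16 code unit, producing lone surrogates.
const byCodeUnit = faces.split("");

// Spread uses the string iterator, which walks code points.
const bySpread = [...faces];

// Array.from relies on the same iterator (assumption: ES2015 environment).
const byArrayFrom = Array.from(faces);

console.log(byCodeUnit.length, bySpread.length, byArrayFrom.length);
```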
feihong
Apr 2, 2017
I also looked into the possibility of implementing Char.fromCode using String.fromCodePoint() (and, correspondingly, Char.toCode using String.prototype.codePointAt()). The merit of doing that is that the two round-trip correctly: String.fromCodePoint("😀".codePointAt(0)) === "😀". However, at the time of writing, this is rather poorly supported, since not even Chrome on Android has these functions.
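To make the round trip concrete, here is a small sketch contrasting the code-point functions with their older code-unit counterparts. It assumes an environment that provides the ES2015 `codePointAt` and `String.fromCodePoint`; the variable names are illustrative.

```javascript
const grin = "😀"; // U+1F600

// charCodeAt returns only the high surrogate of the pair.
const viaCharCode = grin.charCodeAt(0);

// codePointAt combines the surrogate pair into the full code point.
const viaCodePoint = grin.codePointAt(0);

// fromCodePoint rebuilds the original character.
const rebuilt = String.fromCodePoint(viaCodePoint);

// fromCharCode truncates its argument to 16 bits, so it cannot.
const truncated = String.fromCharCode(viaCodePoint);

console.log(viaCharCode.toString(16), viaCodePoint.toString(16), rebuilt === grin, truncated === grin);
```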
I should note that there are polyfills:
evancz
Apr 3, 2017
Member
The benefit of my fixes, like in elm-lang@bf16ca8, is that they have great browser support. No need for any new stuff really. Just gotta learn a bunch of stuff about UTF-16.
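For readers unfamiliar with the UTF-16 details mentioned above, here is a rough sketch of handling surrogate pairs using only `charCodeAt` and `String.fromCharCode`, both of which are long supported. It is not the code from the referenced commit; the function names `toCode` and `fromCode` are hypothetical and chosen only to mirror the Elm functions discussed in this thread.

```javascript
// Hypothetical helper: read the code point at the start of a string,
// combining a high/low surrogate pair when one is present.
function toCode(str) {
  var high = str.charCodeAt(0);
  if (high >= 0xD800 && high <= 0xDBFF && str.length > 1) {
    var low = str.charCodeAt(1);
    if (low >= 0xDC00 && low <= 0xDFFF) {
      return (high - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000;
    }
  }
  return high;
}

// Hypothetical helper: build a one-character string from a code point,
// emitting a surrogate pair for anything outside the BMP.
function fromCode(code) {
  if (code < 0x10000) {
    return String.fromCharCode(code);
  }
  var offset = code - 0x10000;
  return String.fromCharCode(
    0xD800 + Math.floor(offset / 0x400),
    0xDC00 + (offset % 0x400)
  );
}

console.log(toCode("😣").toString(16), fromCode(0x1F623));
```

The sketch does no validation of out-of-range input; it only shows the surrogate arithmetic.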
ianbollinger commented Sep 2, 2015
Manipulating characters outside the Unicode BMP will produce a crash on the REPL (on Fedora but not Mac OS X):
...or...
results in
It seems that either: