Fix String.length for chars outside of BMP #626
eoftedal
referenced this pull request
May 26, 2016
Closed
String.reverse doesn't handle unicode characters properly #625
eoftedal
commented
May 26, 2016
Partial fix of: https://github.com/elm-lang/core/issues/625
Hurtak
May 27, 2016
Array.from is not well supported by browsers at the moment (https://developer.mozilla.org/en/docs/Web/JavaScript/Reference/Global_Objects/Array/from).
Does Elm have a polyfill for that?
If not, maybe we could use Array.prototype.slice.call(str).length
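For comparison, here is a small sketch of the three ways of counting mentioned so far. One caveat worth flagging: `Array.prototype.slice.call(str)` reads the string index by index, that is, by UTF-16 code unit, so it avoids the `Array.from` dependency but does not appear to fix the count for characters outside the BMP. The sample string is only an illustration.

```javascript
// Three ways to "count characters" in a string with astral code points.
const sample = "😃😃😃"; // each emoji is one code point but two UTF-16 code units

// Built-in length: counts UTF-16 code units.
const codeUnits = sample.length;

// Array.from uses the string iterator, which walks code points (ES2015+).
const viaArrayFrom = Array.from(sample).length;

// slice.call treats the string as array-like and copies it index by index,
// so each surrogate half becomes its own element.
const viaSlice = Array.prototype.slice.call(sample).length;

console.log(codeUnits, viaArrayFrom, viaSlice); // 6 3 6
```

So the fallback would need some surrogate-aware counting of its own to give 3 here.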
ianmackenzie
Sep 6, 2016
For what it's worth, I recently had elm-community/string-extra#6 merged in to add toCodePoints and fromCodePoints, so in a future release of elm-community/string-extra you should be able to use those (although it will of course be much slower than a native solution).
evancz
changed the title from
Properly count 4-byte unicode (astral characters)
to
Fix String.length for chars outside of BMP
Mar 25, 2017
evancz
Mar 26, 2017
Member
This implementation is not viable because of browser support. The fastest "correct" implementation is probably something like this:
    function length(str)
    {
        var len = str.length;
        var realLength = len; // start from the UTF-16 code unit count
        var i = 0;
        while (i < len)
        {
            var word = str.charCodeAt(i);
            // a high surrogate starts a two-unit pair: count it once
            if (0xD800 <= word && word <= 0xDBFF)
            {
                i += 2;
                realLength--;
                continue;
            }
            i++;
        }
        return realLength;
    }

This is O(n) in the length of the string though.
In normal circumstances, a language could keep two numbers with each string. One for the length in bytes (JS tracks the equivalent, the number of UTF-16 code units) and one for the length in characters (which JS does not do). When you append, the language runtime can just add both numbers. That way it is O(1) to look them up with an overhead of 32 extra bits, which is not too bad.
To add this info in JS means (1) wrapping all strings in objects with the extra info or (2) adding a field to all string objects and making sure to update them. So our practical choices are an O(n) length or an O(1) length with slower everything else. I think both of these cases are bad enough that the current behavior is "better" overall given the information I have now.
If people have specific scenarios that are causing trouble, I would be interested to hear about that. That said, I expect it'll only be viable to make length work properly when we have our own string representation someday.
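To illustrate option (1), here is a minimal sketch of a wrapped string that carries a cached character count. The names `countCodePoints`, `wrap`, `append` and `length` are hypothetical and chosen for illustration only; this is not how the Elm runtime represents strings. It uses ES2015 string iteration for brevity.

```javascript
// Hypothetical wrapper: a JS string plus a cached code point count.
// None of these names are Elm runtime APIs.

// O(n) scan, done once when a raw JS string is first wrapped.
function countCodePoints(str) {
  let count = 0;
  for (const _ of str) { // string iteration yields code points, not code units
    count++;
  }
  return count;
}

function wrap(str) {
  return { value: str, codePoints: countCodePoints(str) };
}

// Appending adds the cached counts, so no rescan is needed.
function append(a, b) {
  return { value: a.value + b.value, codePoints: a.codePoints + b.codePoints };
}

// O(1) lookup of the real length.
function length(s) {
  return s.codePoints;
}

const greeting = append(wrap("hi "), wrap("😃😃"));
console.log(length(greeting), greeting.value.length); // 5 7
```

The cost shows up everywhere else: every string operation has to allocate and maintain the wrapper object, and adding counts on append is only right for well-formed strings (a lone high surrogate at the end of one string followed by a lone low surrogate at the start of the next would be counted twice).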
evancz
closed this
Mar 26, 2017
evancz
Mar 26, 2017
Member
With some recent commits, foldl and foldr both support characters outside of the BMP, so it will be possible to get the real length with an expression like this.
    String.foldl (\_ n -> n + 1) 0 "😃😃😃"

This has O(n) performance, but it will give 3. This should be out with 0.19.
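For readers more at home on the JavaScript side, the same idea can be sketched as a left fold that groups surrogate pairs before calling the step function. This is an illustrative sketch only, not the actual elm-lang/core implementation; the `foldl` name and argument order simply mirror the Elm expression.

```javascript
// Hypothetical left fold over characters, treating a surrogate pair as one.
// Illustration only; not the code from elm-lang/core.
function foldl(step, initial, str) {
  let acc = initial;
  let i = 0;
  while (i < str.length) {
    const word = str.charCodeAt(i);
    const next = str.charCodeAt(i + 1); // NaN past the end, so the checks below fail
    // A high surrogate followed by a low surrogate forms one character.
    const isPair =
      0xD800 <= word && word <= 0xDBFF &&
      0xDC00 <= next && next <= 0xDFFF;
    const size = isPair ? 2 : 1;
    acc = step(str.slice(i, i + size), acc);
    i += size;
  }
  return acc;
}

// Same shape as the Elm expression: ignore the character, bump the counter.
const realLength = foldl((_char, n) => n + 1, 0, "😃😃😃");
console.log(realLength); // 3
```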
jvoigtlaender
Mar 26, 2017
Contributor
But shouldn't it at least be mentioned in #725 that this String.length correctness issue exists?