New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix String.length for chars outside of BMP #626

Closed
wants to merge 1 commit into
base: master
from

Conversation

Projects
None yet
5 participants
@eoftedal

eoftedal commented May 26, 2016

No description provided.

@eoftedal

This comment has been minimized.

Show comment
Hide comment
@Hurtak

This comment has been minimized.

Show comment
Hide comment
@Hurtak

Hurtak May 27, 2016

Array.from is not well supported by the browsers at the moment (https://developer.mozilla.org/en/docs/Web/JavaScript/Reference/Global_Objects/Array/from)

Does Elm have polyfill for that?
If not maybe we could use Array.prototype.slice.call(str).length

Hurtak commented May 27, 2016

Array.from is not well supported by the browsers at the moment (https://developer.mozilla.org/en/docs/Web/JavaScript/Reference/Global_Objects/Array/from)

Does Elm have polyfill for that?
If not maybe we could use Array.prototype.slice.call(str).length

@ianmackenzie

This comment has been minimized.

Show comment
Hide comment
@ianmackenzie

ianmackenzie Sep 6, 2016

For what it's worth, I recently had elm-community/string-extra#6 merged in to add toCodePoints and fromCodePoints, so in a future release of elm-community/string-extra you should be able to use those (although it will of course be much slower than a native solution).

ianmackenzie commented Sep 6, 2016

For what it's worth, I recently had elm-community/string-extra#6 merged in to add toCodePoints and fromCodePoints, so in a future release of elm-community/string-extra you should be able to use those (although it will of course be much slower than a native solution).

@evancz evancz changed the title from Properly count 4-byte unicode (astral characters) to Fix String.length for chars outside of BMP Mar 25, 2017

@evancz

This comment has been minimized.

Show comment
Hide comment
@evancz

evancz Mar 26, 2017

Member

This implementation is not viable because of browser support. The fastest "correct" implementation is probably something like this:

function length(str)
{
    var len = str.length;
    var realLength = len;
    var i = 0;
    while (i < len)
    {
        var word = str.charCodeAt(i);
        if (0xD800 <= word && word <= 0xDBFF)
        {
            i += 2;
            realLength--;
            continue;
        }
        i++;
    }
    return realLength;
}

This is O(n) in the length of the string though.

In normal circumstances, a language could keep two numbers with each string. One for the length in bytes (which JS does) and one for the length in characters (which JS does not do). When you append, the language runtime can just add both numbers. That way it is O(1) to look them up with an overhead of 32-bits extra, which is not too bad.

To add this info in JS means (1) wrapping all strings in objects with the extra info or (2) adding a field to all string objects and making sure to update them. So our practical choices are an O(n) length or an O(1) length with slower everything else. I think both of these cases are bad enough that the current behavior is "better" overall given the information I have now.

If people have specific scenarios that are causing trouble, I would be interested to hear about that. That said, I expect it'll only be viable to make length work properly when we have our own string representation someday.

Member

evancz commented Mar 26, 2017

This implementation is not viable because of browser support. The fastest "correct" implementation is probably something like this:

function length(str)
{
    var len = str.length;
    var realLength = len;
    var i = 0;
    while (i < len)
    {
        var word = str.charCodeAt(i);
        if (0xD800 <= word && word <= 0xDBFF)
        {
            i += 2;
            realLength--;
            continue;
        }
        i++;
    }
    return realLength;
}

This is O(n) in the length of the string though.

In normal circumstances, a language could keep two numbers with each string. One for the length in bytes (which JS does) and one for the length in characters (which JS does not do). When you append, the language runtime can just add both numbers. That way it is O(1) to look them up with an overhead of 32-bits extra, which is not too bad.

To add this info in JS means (1) wrapping all strings in objects with the extra info or (2) adding a field to all string objects and making sure to update them. So our practical choices are an O(n) length or an O(1) length with slower everything else. I think both of these cases are bad enough that the current behavior is "better" overall given the information I have now.

If people have specific scenarios that are causing trouble, I would be interested to hear about that. That said, I expect it'll only be viable to make length work properly when we have our own string representation someday.

@evancz evancz closed this Mar 26, 2017

@evancz

This comment has been minimized.

Show comment
Hide comment
@evancz

evancz Mar 26, 2017

Member

With some recent commits, foldl and foldr both support characters outside of the BMP, so it will be possible to get the real length with an expression like this.

String.foldl (\_ n -> n + 1) 0 "😃😃😃"

This has the O(n) performance, but it will give 3. This should be out with 0.19.

Member

evancz commented Mar 26, 2017

With some recent commits, foldl and foldr both support characters outside of the BMP, so it will be possible to get the real length with an expression like this.

String.foldl (\_ n -> n + 1) 0 "😃😃😃"

This has the O(n) performance, but it will give 3. This should be out with 0.19.

@jvoigtlaender

This comment has been minimized.

Show comment
Hide comment
@jvoigtlaender

jvoigtlaender Mar 26, 2017

Contributor

But shouldn't it at least be mentioned in #725 that this String.length correctness issue exists?

Contributor

jvoigtlaender commented Mar 26, 2017

But shouldn't it at least be mentioned in #725 that this String.length correctness issue exists?

@evancz evancz referenced this pull request Mar 27, 2017

Closed

String and Char #725

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment