String#single_byte_optimizable? should be public API #12022

Open
straight-shoota opened this issue Apr 22, 2022 · 4 comments

Comments

@straight-shoota
Member

#single_byte_optimizable? is a private helper method that returns true when the size and bytesize of a string are identical. It is similar to #ascii_only? but does not guarantee that every byte is a valid ASCII code point (ASCII characters use only 7 bits, so a byte with the high bit set can pass this check without being ASCII).
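
To make the difference concrete, here is a small sketch. It builds a one-byte string from the byte 0xFF, which is neither ASCII nor valid UTF-8; that input is an illustrative choice and not taken from the issue. It relies on Crystal counting an invalid byte as a single character, so the size comparison and #ascii_only? should disagree for it.

```crystal
# A single byte with the high bit set: not ASCII and not valid UTF-8.
invalid = String.new(Bytes[0xFF_u8])

# The comparison that single_byte_optimizable? performs.
puts invalid.size == invalid.bytesize

# ascii_only? additionally requires every byte to be below 0x80.
puts invalid.ascii_only?

# For a plain ASCII string both checks agree.
ascii = "hello"
puts ascii.size == ascii.bytesize
puts ascii.ascii_only?
```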

This method is commonly used to optimize algorithms that iterate over a string's contents and can run more efficiently when they do not have to account for multibyte characters.
Such algorithms are not limited to the stdlib; they can be found in user code as well. So I think it would be helpful to expose this method in the public API to make it usable elsewhere. Otherwise, shard authors tend to fall back on the less efficient #ascii_only? (unless #12020 gets implemented, but even then the semantics of #single_byte_optimizable? might be preferable).

@asterite
Member

Now might be a good point to consider embedding this information in the string itself. I forgot that ascii_only? is not just checking for @bytesize == size 😞

That said, what prevents users from checking string.bytesize == string.size? Nothing. I don't think we should expose single_byte_optimizable? (there's no such thing in Ruby either)
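
As a rough sketch of what that check looks like in user code, the helper below uses string.bytesize == string.size to pick a byte-indexed fast path. The name nth_char is hypothetical and not part of the stdlib, and mapping non-ASCII bytes to Char::REPLACEMENT is an assumption about how such a fast path should treat invalid data.

```crystal
# Hypothetical user-side helper, not part of the stdlib.
def nth_char(str : String, index : Int32) : Char
  if str.bytesize == str.size
    # Every character occupies exactly one byte, so the n-th character
    # is the n-th byte. byte_at raises IndexError when out of bounds.
    byte = str.byte_at(index)
    # Bytes with the high bit set are not ASCII; treat them as invalid.
    byte < 0x80 ? byte.unsafe_chr : Char::REPLACEMENT
  else
    # Multibyte content: fall back to regular character indexing.
    str[index]
  end
end

puts nth_char("hello", 1)
puts nth_char("héllo", 1)
```

Note that String#size itself has to scan the string the first time it is called, so the fast path only pays off when the size is already known or the string is indexed repeatedly.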

@jhass
Member

jhass commented Apr 24, 2022

I feel like exposing this explicitly would also further proliferate the dual use of String as a byte array in addition to a character array. I would still much prefer making Bytes the most convenient and efficient type for those use cases.
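
For illustration only, byte-oriented work can be written against Bytes directly instead of going through String. The helper name count_byte and the sample data are hypothetical.

```crystal
# Hypothetical helper: counts how often a byte value occurs in binary data.
def count_byte(data : Bytes, byte : UInt8) : Int32
  data.count(byte)
end

# Arbitrary binary data that is not valid UTF-8.
data = Bytes[0x68_u8, 0xFF_u8, 0x69_u8, 0xFF_u8]
puts count_byte(data, 0xFF_u8)

# A String's bytes are also available as a read-only Bytes view.
puts count_byte("a b c".to_slice, 0x20_u8)
```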

@straight-shoota
Member Author

@jhass That's a good point. Currently, though, it's not only about supporting invalid UTF-8 data: ascii_only? is simply inefficient.

@jhass
Member

jhass commented Apr 24, 2022

In my ideal world we would return to validating that input is valid UTF-8 on String creation; then ascii_only? could indeed just be size == bytesize. A path toward that could be tracking whether the string is valid UTF-8 with a flag (let's call it a dirty flag for now). Then there could be a happy path of return size == bytesize unless dirty.
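
A minimal sketch of that idea, written as a wrapper because String has no such flag today. The type FlaggedString and its dirty? getter are hypothetical names, and using valid_encoding? at construction time is an assumption about where the flag would be set.

```crystal
# Hypothetical wrapper; String itself does not track validity.
struct FlaggedString
  getter value : String
  getter? dirty : Bool

  def initialize(@value : String)
    # Validate once at creation and remember the result.
    @dirty = !@value.valid_encoding?
  end

  def ascii_only? : Bool
    # Happy path: in valid UTF-8 every non-ASCII character takes at
    # least two bytes, so equal sizes imply ASCII-only content.
    return @value.size == @value.bytesize unless dirty?
    # Dirty strings fall back to the existing byte-by-byte check.
    @value.ascii_only?
  end
end

puts FlaggedString.new("abc").ascii_only?
puts FlaggedString.new(String.new(Bytes[0xFF_u8])).dirty?
```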
