`String#single_byte_optimizable?` should be public API #12022

straight-shoota · 2022-04-22T10:06:46Z

#single_byte_optimizable? is a private helper method that returns true when the size and bytesize of a string are identical. It is similar to #ascii_only? but does not guarantee that the byte values are valid code points (ASCII characters use only 7 bits).

This method is commonly used to optimize algorithms that iterate over a string's contents and can be more efficient when they do not have to account for multibyte characters.
Such algorithms are not limited to the stdlib and can be found in user code as well. So I think it would be helpful to expose this method in the public API to make it usable elsewhere. Otherwise, shard authors tend to fall back on the less efficient #ascii_only? (unless #12020 gets implemented, but even then the semantics of #single_byte_optimizable? might be preferable).
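
A minimal sketch of the kind of fast path described above, written against the public `size` and `bytesize` methods only. The helper name `char_at` is hypothetical and not part of the stdlib; mapping bytes of 0x80 and above to `Char::REPLACEMENT` is an assumption about how invalid bytes should be surfaced.

```crystal
# Hypothetical helper, not part of the stdlib.
# Returns the character at `index`, reading the byte directly when every
# character in the string occupies exactly one byte.
def char_at(string : String, index : Int32) : Char
  if string.bytesize == string.size
    # Fast path: character index and byte index coincide.
    byte = string.byte_at(index)
    # A single byte is not necessarily ASCII: values >= 0x80 are invalid
    # UTF-8 on their own, so they are mapped to the replacement character.
    byte < 0x80 ? byte.unsafe_chr : Char::REPLACEMENT
  else
    # Slow path: let String decode multibyte characters.
    string[index]
  end
end

char_at("hello", 1) # takes the fast path
char_at("héllo", 1) # takes the slow path
```

The check itself is the same comparison the private method performs; the byte range test is what separates it from `#ascii_only?`.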


asterite · 2022-04-22T11:31:02Z

Now might be a good time to consider embedding this information in the string itself. I forgot that ascii_only? is not just checking for @bytesize == size 😞

That said, what prevents users from checking string.bytesize == string.size? Nothing. I don't think we should expose single_byte_optimizable? (there's no such thing in Ruby either)

jhass · 2022-04-24T08:09:41Z

I feel like exposing this explicitly would also further proliferate the dual use of String as a byte array in addition to a character array. I would still highly prefer making Bytes the most convenient and efficient option for those use cases.

straight-shoota · 2022-04-24T09:01:18Z

@jhass That's a good point. Currently, though, it's not just about supporting invalid UTF-8 data: ascii_only? is simply inefficient.

jhass · 2022-04-24T09:22:13Z

In my ideal world we would return to validating that input is valid UTF-8 on String creation; then ascii_only? could indeed just be size == bytesize. A path toward that could be tracking whether the string is valid UTF-8 with a flag, let's call it a dirty flag for now. Then there could be a happy path of `return size == bytesize unless dirty`.
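
A rough sketch of how such a dirty flag could behave, modelled as a wrapper type because String itself has no such flag. `TrackedString`, `dirty?`, and `validated` are hypothetical names; only `valid_encoding?`, `size`, `bytesize`, and `to_slice` are existing String methods.

```crystal
# Hypothetical wrapper; neither the type nor the flag exists in the stdlib.
struct TrackedString
  getter string : String
  getter? dirty : Bool

  # `dirty` stays true until the contents are known to be valid UTF-8.
  def initialize(@string : String, @dirty : Bool = true)
  end

  # Validates once at creation so that later checks can trust the flag.
  def self.validated(string : String) : TrackedString
    new(string, dirty: !string.valid_encoding?)
  end

  def ascii_only? : Bool
    # Happy path: in valid UTF-8, equal sizes imply every character is ASCII.
    return @string.size == @string.bytesize unless dirty?
    # Fallback when validity is unknown: inspect every byte.
    @string.to_slice.all? { |byte| byte < 0x80 }
  end
end

TrackedString.validated("abc").ascii_only?   # uses the happy path
TrackedString.new("abc").ascii_only?         # falls back to the byte scan
```

The trade-off in this sketch is that validation cost moves to creation time, which is the price of making the later check a plain comparison.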

straight-shoota added kind:feature topic:stdlib:text labels Apr 22, 2022
