Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move to_<num_type> from String to Slice #6389

Open
bew opened this issue Jul 15, 2018 · 23 comments
Open

Move to_<num_type> from String to Slice #6389

bew opened this issue Jul 15, 2018 · 23 comments

Comments

@bew
Copy link
Contributor

bew commented Jul 15, 2018

While working on a terminfo parsing library, where I manipulate Slice of bytes (Bytes type), I needed to convert digits to an actual number.
I realized that all the to_<num_type> conversion methods (like to_i64, to_u8, to_f, ..) are in String, and I needed to allocate a String just to convert the bytes to a number.

I think thoses methods should be available in Slice too. The number conversion methods don't need anything from String as they directly work with a pointer and a size. They can almost directly be copy/pasted to Slice!

I have a working branch with this change, I can make a PR when it's accepted!

Note: the only problem I see here is that those methods will now exists for all kind of Slice, not only Slice(UInt8).. But maybe it's ok? or maybe we can add a macro check inside the to_<num_type> methods that raise when T != UInt8 ?
Another possibility would be to make Bytes a new type (not just an alias) that inherits from Slice(UInt8) so it's easier to add methods that works specifically on bytes.

WDYT?

@jhass
Copy link
Member

jhass commented Jul 15, 2018

Crystal strings are always UTF-8 encoded, so when it comes to number conversion they work in ASCII so to say. You cannot assume that any Slice(UInt8) is a valid ASCII encoded sequence or that the "right" conversion between "a number" and a "byte sequence" is interpreting that byte sequence as ASCII, there are many more ways to encode a number to a byte sequence.

We should rather try to find an allocation free way to turn a Bytes into a String assuming the Bytes is valid ASCII or UTF-8 already. Or we could support ASCII number conversion as part of the IO::ByteFormat interface I guess. As a last resort I would extract the number conversion methods to static methods that work with a Bytes, but remain on String or a new util class whose name makes it clear that it's about string based number conversion.

@RX14
Copy link
Contributor

RX14 commented Jul 15, 2018

I agree that String.unsafe_from_slice is probably a better solution.

Unfortunately, it can't be implemented because strings are layed out differently from every other class. Strings don't have a pointer to their data, they have their data directly after their class in memory, so it's impossible to turn a Bytes into a String without allocating.

I think that String.to_i(slice, ...) is the best interface thats workable.

@asterite
Copy link
Member

I wouldn't mind having to_i, etc., directly in Slice.

@bew
Copy link
Contributor Author

bew commented Jul 23, 2018

Having to_i on Slice is only logic for Slice(UInt8), right @asterite ? What do you think should be done for a Slice(Person) ?

@asterite
Copy link
Member

asterite commented Jul 23, 2018

Give an error. We already have methods on slice that only work for byte slices. And byte slices are the most common slice type (in fact, but I would happily change Slice(T) to a non generic Bytes type and remove the Slice type from the std, or maybe keep the two but return Bytes instead of Slice(UInt8) in all cases).

@bew
Copy link
Contributor Author

bew commented Jul 23, 2018

Ok, that's what I thought too. And I was also thinking about a Bytes type which inherits from Slice(UInt8) and would have Bytes-specific methods like to_* methods.

@asterite
Copy link
Member

Can't inherit from struct. They would be two different types with a few similar methods.

@bew
Copy link
Contributor Author

bew commented Jul 23, 2018

oh right, then maybe make it as a wrapper like:

struct Bytes
  def initialize(@raw : Slice(UInt8))
  end

  forward_missing_to @raw

  def to_i(...)
    # ...
  end
end

@asterite
Copy link
Member

A little copying is better than a little dependency. I won't get tired of saying that.

@RX14
Copy link
Contributor

RX14 commented Jul 23, 2018

just put it on string

@bew
Copy link
Contributor Author

bew commented Aug 10, 2018

@asterite by "a little copying" you mean the 550 lines of Slice ?

@asterite
Copy link
Member

It's bit less than that, because there's hexdump and others in Slice that should only belong to Bytes.

Maybe adding such methods to Slice but only having them compile for a slide of bytes isn't that bad, though.

@straight-shoota
Copy link
Member

I guess this could be a nice use case for extension methods as described in #6438

@RX14
Copy link
Contributor

RX14 commented Aug 11, 2018

it's easy enough to throw a {% raise unless T == UInt8 %} in the aforementioned methods and you'll hardly be able to tell the difference

@bew
Copy link
Contributor Author

bew commented Sep 3, 2018

@jhass what do you mean in this paragraph:

You cannot assume that any Slice(UInt8) is a valid ASCII encoded sequence or that the "right" conversion between "a number" and a "byte sequence" is interpreting that byte sequence as ASCII, there are many more ways to encode a number to a byte sequence.

By there are many more ways to encode a number to a byte sequence ? What other ways? (I'm probably just not seeing it)

Note that in this case to_<num-type> methods works with raw bytes directly, interpreting each byte as an ASCII char.

@jhass
Copy link
Member

jhass commented Sep 10, 2018

@bew A random example would be BCD, the main point is that it would be confusing for to_i to be to_i_assuming_this_is_really_ascii_here.

@straight-shoota
Copy link
Member

I would very much like to make basic string processing methods available for plain byte slices.
This would certainly help for some low level implementations to avoid unnecessary string allocations.

Maybe an option for this could be the introduction of a special type (perhaps StringBuffer?). It would wrap a Bytes instance and provide additional string related functionality.1
This type clearly dedicates it to hold string contents. It makes string related methods available.
Another benefit is that it can be used in other places as a replacement for String with string-like semantics. For example if a StringBuffer and String with the same contents compare equal, it can be used for hash lookups on string keys, without allocating the search string (ref #13844 (comment)).

Example:

record StringBuffer, data : Bytes

class String
  def ==(other : StringBuffer)
    to_slice == other.data
  end
end

hash = {"foo" => "bar"}
hash[StringBuffer.new("xfoox".to_slice[1..-2])] # => "bar"

Footnotes

  1. If it was possible to inherit from Slice, it would be a subtype of Slice(UInt8).

@straight-shoota
Copy link
Member

A relevant consideration when working with arbitrary byte slices interpreted as string content is that there's no guarantee for a trailing null byte.
The payload of String always ends with a null byte. Many algorithms implicitly rely on that when reading through a string. For example, a loop consuming whitespace will automatically break on a null byte. There's no need for explicit bounds checking.

With arbitrary byte slices, we cannot be sure that there's a trailing null byte. And in most practical use cases, there won't be any.
So algorithms working on a byte slice must employ explicit bounds checking.

@HertzDevil
Copy link
Contributor

Any attempt to expose these operations at the instance level would lead to the same conclusions as https://forum.crystal-lang.org/t/more-slices-less-alloc/4332/6, even if the type is only used internally, so I think they should only be available as class methods.

@straight-shoota
Copy link
Member

I don't follow what difference class methods would make in this context compared to a typed instance with instance methods?

@HertzDevil
Copy link
Contributor

Having StringBuffer#to_i instead of StringBuffer.to_i(Bytes) means having the possibility of storing the StringBuffer as part of another object, hence prolonging the original string's lifetime, which we should avoid (especially after working with related memory issues recently).

@straight-shoota
Copy link
Member

Isn't it the same with storing Bytes that is passed to .to_i? I don't see any difference whether the type that can potentially be stored is StringBuffer or Bytes. The result would be the same.

I wouldn't worry about this too much though either way. We're talking about a very low level instrument intended for optimizing heavily used algorithms. It should be fine to require some careful handling in this context.

@HertzDevil
Copy link
Contributor

HertzDevil commented Sep 27, 2023

IMO it is the intent that matters. If only StringBuffer.to_i exists, it is more likely that stdlib consumers will call it like to_i(str.to_slice[...]). On the other hand, having Bytes#to_i or StringBuffer#to_i necessitates the creation of a new object, and people will come back to their code thinking they could refactor things by moving the Bytes or StringBuffer into an object, without fully realizing its consequences on the GC.

Of course, if the answer to that is "people are equally likely to move entire original Strings into objects", then even the class methods should not be public API. The entire standard library is supposed to give a very clear message that Strings are not intended for this subslice-sharing capacity. I have no objections if StringBuffer is exclusive to stdlib internals.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants