Move to_<num_type> from String to Slice #6389

bew · 2018-07-15T03:28:01Z

While working on a terminfo parsing library, where I manipulate Slice of bytes (Bytes type), I needed to convert digits to an actual number.
I realized that all the to_<num_type> conversion methods (like to_i64, to_u8, to_f, ..) are in String, and I needed to allocate a String just to convert the bytes to a number.

I think thoses methods should be available in Slice too. The number conversion methods don't need anything from String as they directly work with a pointer and a size. They can almost directly be copy/pasted to Slice!

I have a working branch with this change, I can make a PR when it's accepted!

Note: the only problem I see here is that those methods will now exists for all kind of Slice, not only Slice(UInt8).. But maybe it's ok? or maybe we can add a macro check inside the to_<num_type> methods that raise when T != UInt8 ?
Another possibility would be to make Bytes a new type (not just an alias) that inherits from Slice(UInt8) so it's easier to add methods that works specifically on bytes.

WDYT?

The text was updated successfully, but these errors were encountered:

jhass · 2018-07-15T08:27:12Z

Crystal strings are always UTF-8 encoded, so when it comes to number conversion they work in ASCII so to say. You cannot assume that any Slice(UInt8) is a valid ASCII encoded sequence or that the "right" conversion between "a number" and a "byte sequence" is interpreting that byte sequence as ASCII, there are many more ways to encode a number to a byte sequence.

We should rather try to find an allocation free way to turn a Bytes into a String assuming the Bytes is valid ASCII or UTF-8 already. Or we could support ASCII number conversion as part of the IO::ByteFormat interface I guess. As a last resort I would extract the number conversion methods to static methods that work with a Bytes, but remain on String or a new util class whose name makes it clear that it's about string based number conversion.

RX14 · 2018-07-15T10:40:40Z

I agree that String.unsafe_from_slice is probably a better solution.

Unfortunately, it can't be implemented because strings are layed out differently from every other class. Strings don't have a pointer to their data, they have their data directly after their class in memory, so it's impossible to turn a Bytes into a String without allocating.

I think that String.to_i(slice, ...) is the best interface thats workable.

asterite · 2018-07-22T22:24:57Z

I wouldn't mind having to_i, etc., directly in Slice.

bew · 2018-07-23T00:58:45Z

Having to_i on Slice is only logic for Slice(UInt8), right @asterite ? What do you think should be done for a Slice(Person) ?

asterite · 2018-07-23T01:35:57Z

Give an error. We already have methods on slice that only work for byte slices. And byte slices are the most common slice type (in fact, but I would happily change Slice(T) to a non generic Bytes type and remove the Slice type from the std, or maybe keep the two but return Bytes instead of Slice(UInt8) in all cases).

bew · 2018-07-23T02:46:42Z

Ok, that's what I thought too. And I was also thinking about a Bytes type which inherits from Slice(UInt8) and would have Bytes-specific methods like to_* methods.

asterite · 2018-07-23T02:50:48Z

Can't inherit from struct. They would be two different types with a few similar methods.

bew · 2018-07-23T02:55:25Z

oh right, then maybe make it as a wrapper like:

struct Bytes
  def initialize(@raw : Slice(UInt8))
  end

  forward_missing_to @raw

  def to_i(...)
    # ...
  end
end

asterite · 2018-07-23T02:56:19Z

A little copying is better than a little dependency. I won't get tired of saying that.

RX14 · 2018-07-23T09:57:52Z

just put it on string

bew · 2018-08-10T06:40:09Z

@asterite by "a little copying" you mean the 550 lines of Slice ?

asterite · 2018-08-10T13:21:05Z

It's bit less than that, because there's hexdump and others in Slice that should only belong to Bytes.

Maybe adding such methods to Slice but only having them compile for a slide of bytes isn't that bad, though.

straight-shoota · 2018-08-10T13:32:19Z

I guess this could be a nice use case for extension methods as described in #6438

RX14 · 2018-08-11T15:43:22Z

it's easy enough to throw a {% raise unless T == UInt8 %} in the aforementioned methods and you'll hardly be able to tell the difference

bew · 2018-09-03T23:19:53Z

@jhass what do you mean in this paragraph:

You cannot assume that any Slice(UInt8) is a valid ASCII encoded sequence or that the "right" conversion between "a number" and a "byte sequence" is interpreting that byte sequence as ASCII, there are many more ways to encode a number to a byte sequence.

By there are many more ways to encode a number to a byte sequence ? What other ways? (I'm probably just not seeing it)

Note that in this case to_<num-type> methods works with raw bytes directly, interpreting each byte as an ASCII char.

jhass · 2018-09-10T13:58:13Z

@bew A random example would be BCD, the main point is that it would be confusing for to_i to be to_i_assuming_this_is_really_ascii_here.

straight-shoota · 2023-09-27T12:06:42Z

I would very much like to make basic string processing methods available for plain byte slices.
This would certainly help for some low level implementations to avoid unnecessary string allocations.

Maybe an option for this could be the introduction of a special type (perhaps StringBuffer?). It would wrap a Bytes instance and provide additional string related functionality.¹
This type clearly dedicates it to hold string contents. It makes string related methods available.
Another benefit is that it can be used in other places as a replacement for String with string-like semantics. For example if a StringBuffer and String with the same contents compare equal, it can be used for hash lookups on string keys, without allocating the search string (ref #13844 (comment)).

Example:

record StringBuffer, data : Bytes

class String
  def ==(other : StringBuffer)
    to_slice == other.data
  end
end

hash = {"foo" => "bar"}
hash[StringBuffer.new("xfoox".to_slice[1..-2])] # => "bar"

If it was possible to inherit from Slice, it would be a subtype of Slice(UInt8). ↩

straight-shoota · 2023-09-27T12:12:34Z

A relevant consideration when working with arbitrary byte slices interpreted as string content is that there's no guarantee for a trailing null byte.
The payload of String always ends with a null byte. Many algorithms implicitly rely on that when reading through a string. For example, a loop consuming whitespace will automatically break on a null byte. There's no need for explicit bounds checking.

With arbitrary byte slices, we cannot be sure that there's a trailing null byte. And in most practical use cases, there won't be any.
So algorithms working on a byte slice must employ explicit bounds checking.

HertzDevil · 2023-09-27T12:30:41Z

Any attempt to expose these operations at the instance level would lead to the same conclusions as https://forum.crystal-lang.org/t/more-slices-less-alloc/4332/6, even if the type is only used internally, so I think they should only be available as class methods.

straight-shoota · 2023-09-27T12:42:17Z

I don't follow what difference class methods would make in this context compared to a typed instance with instance methods?

HertzDevil · 2023-09-27T12:45:51Z

Having StringBuffer#to_i instead of StringBuffer.to_i(Bytes) means having the possibility of storing the StringBuffer as part of another object, hence prolonging the original string's lifetime, which we should avoid (especially after working with related memory issues recently).

straight-shoota · 2023-09-27T13:08:49Z

Isn't it the same with storing Bytes that is passed to .to_i? I don't see any difference whether the type that can potentially be stored is StringBuffer or Bytes. The result would be the same.

I wouldn't worry about this too much though either way. We're talking about a very low level instrument intended for optimizing heavily used algorithms. It should be fine to require some careful handling in this context.

HertzDevil · 2023-09-27T13:30:19Z

IMO it is the intent that matters. If only StringBuffer.to_i exists, it is more likely that stdlib consumers will call it like to_i(str.to_slice[...]). On the other hand, having Bytes#to_i or StringBuffer#to_i necessitates the creation of a new object, and people will come back to their code thinking they could refactor things by moving the Bytes or StringBuffer into an object, without fully realizing its consequences on the GC.

Of course, if the answer to that is "people are equally likely to move entire original Strings into objects", then even the class methods should not be public API. The entire standard library is supposed to give a very clear message that Strings are not intended for this subslice-sharing capacity. I have no objections if StringBuffer is exclusive to stdlib internals.

jhass added kind:feature topic:stdlib status:discussion labels Jul 15, 2018

bew mentioned this issue Jul 17, 2018

(WIP) Optimize String#gsub with block #6397

Closed

straight-shoota mentioned this issue Jan 8, 2022

Remove string allocation from GC_set_warn_proc #11729

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Move to_<num_type> from String to Slice #6389

Move to_<num_type> from String to Slice #6389

bew commented Jul 15, 2018 •

edited

jhass commented Jul 15, 2018 •

edited

RX14 commented Jul 15, 2018 •

edited

asterite commented Jul 22, 2018

bew commented Jul 23, 2018

asterite commented Jul 23, 2018 •

edited

bew commented Jul 23, 2018

asterite commented Jul 23, 2018

bew commented Jul 23, 2018

asterite commented Jul 23, 2018

RX14 commented Jul 23, 2018

bew commented Aug 10, 2018

asterite commented Aug 10, 2018

straight-shoota commented Aug 10, 2018

RX14 commented Aug 11, 2018

bew commented Sep 3, 2018 •

edited

jhass commented Sep 10, 2018 •

edited

straight-shoota commented Sep 27, 2023

straight-shoota commented Sep 27, 2023

HertzDevil commented Sep 27, 2023

straight-shoota commented Sep 27, 2023

HertzDevil commented Sep 27, 2023

straight-shoota commented Sep 27, 2023

HertzDevil commented Sep 27, 2023 •

edited

Move to_<num_type> from String to Slice #6389

Move to_<num_type> from String to Slice #6389

Comments

bew commented Jul 15, 2018 • edited

jhass commented Jul 15, 2018 • edited

RX14 commented Jul 15, 2018 • edited

asterite commented Jul 22, 2018

bew commented Jul 23, 2018

asterite commented Jul 23, 2018 • edited

bew commented Jul 23, 2018

asterite commented Jul 23, 2018

bew commented Jul 23, 2018

asterite commented Jul 23, 2018

RX14 commented Jul 23, 2018

bew commented Aug 10, 2018

asterite commented Aug 10, 2018

straight-shoota commented Aug 10, 2018

RX14 commented Aug 11, 2018

bew commented Sep 3, 2018 • edited

jhass commented Sep 10, 2018 • edited

straight-shoota commented Sep 27, 2023

Footnotes

straight-shoota commented Sep 27, 2023

HertzDevil commented Sep 27, 2023

straight-shoota commented Sep 27, 2023

HertzDevil commented Sep 27, 2023

straight-shoota commented Sep 27, 2023

HertzDevil commented Sep 27, 2023 • edited

bew commented Jul 15, 2018 •

edited

jhass commented Jul 15, 2018 •

edited

RX14 commented Jul 15, 2018 •

edited

asterite commented Jul 23, 2018 •

edited

bew commented Sep 3, 2018 •

edited

jhass commented Sep 10, 2018 •

edited

HertzDevil commented Sep 27, 2023 •

edited