Literals for Slice(UInt8) #2886

Open
asterite opened this issue Jun 21, 2016 · 33 comments

Comments

@asterite
Member

We need a way to express binary data embedded in the data section of the program. We can do this right now for strings, but there's no way to create a non-UTF8 string with a string literal.

There are several ways we can fix this:

  1. We can add back the \x... escape to string literals, to add a byte with a specific hexadecimal value. Right now strings can hold non-UTF-8 data and only raise when they are used as UTF-8 data (for example, when iterating them), so it's strange that they can hold non-UTF-8 data but can't be created with a literal. From there, one could take a slice. This will also solve #2565 (Remove macro methods from the language), because inspecting a string with non-valid codepoints will output \x... for those values.
  2. We add a literal for something like a Slice(UInt8). It could just be Slice(UInt8), but these are not read-only. Or maybe they can be read-only and crash the program when written to. One shouldn't write to them, the same way one doesn't get a slice from a string literal and write to it. There was the idea of introducing const [...] for this, with which we could create static data for any kind of integer value.
  3. Other options...?

This doesn't have a big priority right now, but I'm leaving it here so there's a place to discuss this.
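For illustration, option 1 can be sketched in current Crystal, where the \x escape has since been added back to string literals (as noted later in this thread). The variable names are illustrative; the sketch only shows that such a string stores the raw bytes while failing UTF-8 validation.

```crystal
# A string literal carrying two bytes that can never occur in valid UTF-8.
raw = "\xFF\xFEabc"

puts raw.bytesize        # number of bytes stored, not characters
puts raw.valid_encoding? # false, because 0xFF and 0xFE are not valid UTF-8

# Taking a slice exposes the stored bytes unchanged.
bytes = raw.to_slice
puts bytes == Bytes[0xFF, 0xFE, 0x61, 0x62, 0x63]
```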

@jhass
Member

jhass commented Jun 21, 2016

I would tend towards 2 with something like #2791 (comment) as the preferred alternative. Either way we need to make sure to not run into issues similar to #2485.

@ysbaddaden
Contributor

Same here: Slice(UInt8) is the de-facto type for binary data whereas String may only contain UTF-8 data. I don't think it's a good idea to push the idea that it's okay to put arbitrary bytes into a String.

inspecting a string with non-valid codepoints will output \x... for those values

But I like that.

@asterite
Member Author

One issue we found the other day is that we needed to do a POST in the http client with binary data. We made it work by simply creating a String with that data and then invoking HTTP::Client.post. I think I like that, it's pretty convenient. Otherwise we'd need to add overloads or restrictions for Slice(UInt8), and HTTP::Request will have the body as String | Slice(UInt8), etc.

To compare with other statically typed languages, Go's strings are also just byte chunks that can hold arbitrary bytes, but can also be treated as UTF-8 strings when needed: https://blog.golang.org/strings

Java's String class is supposed to be UTF-16, but can hold arbitrary bytes as well.

@jhass
Member

jhass commented Jun 21, 2016

I very much like that String is supposed to handle valid UTF-8 data and operations on that, and nothing else. I would hate to lose that property and would rather have convenience APIs added to other interfaces for handling Slice(UInt8). In the HTTP example, binary data would need some form of content encoding to valid ASCII values anyway. Detecting the need for that encoding automatically upon receiving a Slice(UInt8) vs a String actually seems easier than always second-guessing whether a String needs it or not.

@bcardiff
Member

I am with @jhass here. I would keep String as valid UTF-8.

I would rather add overloads in the http client to send/receive blobs.
And I definitely want to be able to embed binary resources (Slice or StaticArray and then a convenient api to wrap it)

@mperham
Contributor

mperham commented Jun 21, 2016

Would it be helpful or more performant to have Base64.decode(str, io) : Nil so I can decode the asset and stream it out with the response?

@asterite
Member Author

@mperham Good idea, an overload that writes directly to an IO is missing. Should be easy to add.
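A minimal sketch of the streaming overload discussed here, assuming a Crystal version whose standard library provides Base64.decode(data, io). The sample bytes and the IO::Memory target are illustrative stand-ins for an embedded asset and an HTTP response.

```crystal
require "base64"

# Illustrative payload: the first four bytes of a PNG file.
original = Bytes[0x89, 0x50, 0x4E, 0x47]
encoded = Base64.strict_encode(original)

# Decode straight into an IO instead of allocating an intermediate slice.
io = IO::Memory.new
Base64.decode(encoded, io)

puts io.to_slice == original
```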

@mperham
Contributor

mperham commented Jun 21, 2016

Just a side note, I'm trying to write the Slice(UInt8) out to the Kemal response:

    def self.serve(filename, resp)
      resp.status_code = 200
      resp.write Base64.decode(WEB_ASSETS[filename])
    end

I verified that the Slice size is exactly the same size as the file on disk but the response only has about half the expected bytes. Anyone know why the server response is not writing the entire Slice to the client?

@asterite
Member Author

@mperham we'd probably need concrete code that reproduces this to check whether something is going wrong. I tried creating a slice of 5000~50000 bytes and it works well.

@mperham
Contributor

mperham commented Jun 21, 2016

Looks like the problem is related to me not setting the content-type header. The browser prints out the PNG contents as text/html but serves it correctly when I set it to "application/octet-stream".
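The fix described above might look roughly like the following sketch. The helper name serve_bytes and its parameters are hypothetical; only status_code=, content_type= and write on HTTP::Server::Response come from the standard library.

```crystal
require "http/server"

# Hypothetical helper: writes binary data with an explicit content type so the
# client does not try to interpret the bytes as text.
def serve_bytes(resp : HTTP::Server::Response, data : Bytes, content_type : String)
  resp.status_code = 200
  resp.content_type = content_type
  resp.write(data)
end
```

In the snippet from the earlier comment, this would be called with the decoded asset and a type such as "application/octet-stream" or "image/png".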

@ozra
Contributor

ozra commented Jun 22, 2016

Just throwing thoughts into the mix here: How about a literal that generates a View(UInt8), which would be a read-only type derived from Slice(UInt8)? If it's known at compile time that an area is unwritable, we should get help at compile time, avoiding a crash where possible.

@david50407
Contributor

How about letting users create their own literal types (maybe in a %data{ ... } format, where data can be any word for each type, and {} can also be [] or ()), like C++11 does?

Then we could create custom literals for Slice(UInt8), StaticArray(UInt8), or any other types we want (maybe using macros to define them at compile time?).

@maxpowa
Contributor

maxpowa commented Dec 15, 2016

Any progress on this? My use case is writing bytes to an IO, as one might do when sending low-level packets over the wire. io.write_bytes(0x00000000, IO::ByteFormat::BigEndian) doesn't provide 4 empty bytes as one might expect, but rather outputs a single empty byte.

@JacobUb
Contributor

JacobUb commented Dec 16, 2016

@maxpowa
It works for me 😕

io = IO::Memory.new
io.write UInt8.slice(1, 1, 1, 1, 1, 1, 1, 1)
io.rewind
io.write_bytes(0x00000000, IO::ByteFormat::BigEndian)
io.to_slice # Bytes[0, 0, 0, 0, 1, 1, 1, 1]

@maxpowa
Contributor

maxpowa commented Dec 16, 2016

Yep, never mind, it is indeed working... I must have done something wrong when I was testing. Thanks @exilor

@oprypin
Member

oprypin commented Jul 10, 2017

We can add back the \x... escape to string literals

This has been implemented in cd8296b, by the way.

I think it's a really bad idea to allow broken string literals in the language's core syntax. I noticed that some people are already doing hideous things with it, without really understanding the situation...
This should only be possible through an unsafe operation.

The alternative solution is the way to go. Bytes literals should definitely be a thing.

And "\xff" syntax should give an explicit error like "strings are for UTF-8 encoded text, not for arbitrary bytes".

Side note: in Python "\x**" means "\u{**}", but they do have bytes literals where it means what you'd expect: b'\xff'

@bararchy
Contributor

@oprypin but sometimes people need to do hideous things for hideous causes :)
this feature is important, it's heavily relied on in fuzzers and exploit development (yes FFS using Crystal ! :) )
https://www.offensive-security.com/metasploit-unleashed/shell/

    def exploit
      connected = connect_login
      nopes = "\x90"*(payload_space-payload.encoded.length) # to be fixed with make_nops()
      sjump = "\xEB\xF9\x90\x90"     # Jmp Back
      njump = "\xE9\xDD\xD7\xFF\xFF" # And Back Again Baby  ;)
      evil = nopes + payload.encoded + njump + sjump + [target.ret].pack("A3")
      print_status("Sending payload")
      sploit = '0002 LIST () "/' + evil + '" "PWNED"' + "\r\n"
      sock.put(sploit)
      handler
      disconnect
    end

etc....

@RX14
Contributor

RX14 commented Jul 11, 2017

It's still easy enough to construct a string with invalid data, I just don't think it should be part of the syntax.

@oprypin
Member

oprypin commented Jul 11, 2017

@bararchy, thanks for a good demonstration of the point I was making... All of these should have been Bytes

@oprypin
Member

oprypin commented Sep 26, 2017

I forgot that this issue existed and just started writing a new one.
Anyway... I'm just still appalled that there's a literal for invalid strings.

So, ping

Putting bytes literals in read-only data is a must-have, and so if the literal produces a writable Slice(UInt8), that's a problem. Or it used to be, not anymore! Now we even have read-only slices.
So there are really no blockers now.

@asterite
Member Author

asterite commented Sep 26, 2017

Right now this is solved because one can use a String for this, because a String can now have arbitrary bytes.

I know it's not the most elegant solution, but for now it works. We can postpone a real solution for this for later.

@oprypin
Member

oprypin commented Sep 26, 2017

pls

@asterite
Member Author

asterite commented Sep 26, 2017

What if we add:

b"some content"

For now that would be equivalent to:

"some content".to_slice

and of course you can use \xAB for specific byte values.

We could also have:

b'x'

to be the same as 'x'.ord.to_u8 and not have it compile if it doesn't fit in an UInt8, so that would be a byte literal.

I think Rust uses the same notation.
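As a rough illustration of the equivalences named in this proposal, the following sketch spells out what the two literals would stand for using today's syntax. The b"..." and b'...' forms themselves are only proposed and are not used here; the variable names are illustrative.

```crystal
# What b"some content" would be equivalent to under this proposal.
content = "some content".to_slice
puts content.size       # 12
puts content.read_only? # true

# What b'x' would be equivalent to.
x = 'x'.ord.to_u8
puts x # 120

# A character outside the byte range fails the conversion at run time today;
# the proposed literal would be rejected at compile time instead.
begin
  '€'.ord.to_u8
rescue OverflowError
  puts "does not fit in a UInt8"
end
```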

@oprypin
Member

oprypin commented Sep 26, 2017

My suggestion that I started to write:


It would be a literal that does not allow \u escapes, and allows only ASCII characters, supplemented by the \xff syntax for arbitrary bytes. The literal would produce Slice(UInt8).

I propose the syntax b"foo\x12fsdfg", like in Python and Rust.

Side note, Bytes[] macro probably should be rewritten to produce a literal.

I would also suggest removing the hexadecimal notation from strings. Obviously, to replace the use case, the bytes literal would need to store the data in the read-only data section. I don't know whether that means that the size of the slice would need to be moved there as well, like it is with strings.


"some content".to_slice is impossible to do if hexadecimal escapes are removed from strings, which is the main problem I have

@asterite
Member Author

Oh, with "some content".to_slice I meant it would be equivalent to that. We could probably type b"hello" as a read-only Slice(UInt8) and put that in the ROM section of the program.

For that we'll probably need Slice to be part of the known types for the compiler, and have @pointer, @size and @read_only laid out accordingly in memory.

But for now I'd leave the ability to have \x.. escape sequences in a String. Later we can remove them, but we'll have to make sure that there's no way to create strings that are not valid in UTF-8. Maybe that will slow down everything a bit, but, well, correct code is better than fast code.

@oprypin
Member

oprypin commented Sep 26, 2017

@pointer, @size and @read_only could be directly followed by the data itself, with @pointer being equal to its own address + offset.

I don't think it's that important to prevent strings that are not valid UTF-8. The only way to create them is String.new(bytes or pointer), just raise the awareness. The problem is that people see nothing wrong with '\x' string literals and then intentionally seek out a way to recreate such strings "programmatically".

@RX14
Contributor

RX14 commented Oct 6, 2017

@asterite We wouldn't need to know about Slice's internal layout, unlike String, because we can simply define that Slice needs to have a constructor taking a pointer and a size (which we already have). We only need to know String's layout because we put the data contiguously. With Slice we don't need to do that. And it's probably not worth doing it since it's a struct and LLVM will optimize since both the constructor arguments are literals.

@oprypin
Member

oprypin commented Oct 6, 2017

@RX14, are you sure you understand the part about putting this in read-only data section?

@RX14
Contributor

RX14 commented Oct 6, 2017

@oprypin yes.... you pass a pointer to the data in the RO section to the slice constructor. The slice instance itself has to live on the stack anyway, so can't be in ROdata.
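The pointer-plus-size construction described above can be sketched with the existing Slice constructor. As an assumption for illustration, the bytes of a string literal stand in for the data a bytes literal would place in the read-only section; how the compiler would actually emit that data is not shown.

```crystal
# Stand-in for compiler-emitted constant data: the bytes of a string literal.
literal = "hello"

# Build a read-only slice from a pointer and a size.
bytes = Slice.new(literal.to_unsafe, literal.bytesize, read_only: true)
puts bytes == Bytes[104, 101, 108, 108, 111]

# Writes are rejected by the slice itself before any memory is touched.
begin
  bytes[0] = 0_u8
rescue ex
  puts ex.message
end
```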

@oprypin
Member

oprypin commented May 28, 2018

@RX14 Please reopen this

@straight-shoota
Member

Why was this even closed and all those other issues which are most definitely not fixed?

@HertzDevil
Contributor

I would go one step further and use a completely new syntax similar to Elixir's bitstrings, rather than simply borrowing the one for string literals:

<<0x12>>             # => <<0x12>>
<<0x21>>             # => "!"
<<0xCF, 0x83>>       # => "σ"
"\xCF\x83"           # => "σ"
<<0x12, 0xCF, 0x83>> # => <<0x12, 0xCF, 0x83>>
"\x12\xCF\x83"       # => <<0x12, 0xCF, 0x83>>
<<0x12, "σ">>        # => <<0x12, 0xCF, 0x83>>

(Every double-quoted string literal in Elixir denotes a bitstring. Single-quoted ones produce charlists.)

An attractive feature about them is they can handle multibyte sequences:

<<0x12345678::32>>        # => <<18, 52, 86, 120>>
<<0x12345678::32-little>> # => <<120, 86, 52, 18>>

<<1.0::little>>    # => <<0, 0, 0, 0, 0, 0, 240, 63>>
<<1.0::32-little>> # => <<0, 0, 128, 63>>

<<0xCF83::16>>        # => "σ"
<<0x83CF::16-little>> # => "σ"
<<"σ"::utf8>>         # => "σ"
<<"σ"::utf16-little>> # => <<195, 3>>
<<0x03C3::utf8>>      # => "σ"
<<0x03C3::utf16-big>> # => <<3, 195>>

It emphasizes the fact that byte arrays are a more general concept than string-like byte sequences.
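For comparison, the multibyte encodings above can be reproduced in Crystal today at run time with IO#write_bytes. This is only a sketch of the byte layouts, not of the proposed literal syntax; the helper name encode is hypothetical.

```crystal
# Hypothetical helper: encodes one number with the given byte order.
def encode(value, format : IO::ByteFormat) : Bytes
  io = IO::Memory.new
  io.write_bytes(value, format)
  io.to_slice
end

# 32-bit integer: most significant byte first, then least significant first.
puts encode(0x12345678_u32, IO::ByteFormat::BigEndian) == Bytes[18, 52, 86, 120]
puts encode(0x12345678_u32, IO::ByteFormat::LittleEndian) == Bytes[120, 86, 52, 18]

# Floating point values in 64-bit and 32-bit little-endian form.
puts encode(1.0_f64, IO::ByteFormat::LittleEndian) == Bytes[0, 0, 0, 0, 0, 0, 240, 63]
puts encode(1.0_f32, IO::ByteFormat::LittleEndian) == Bytes[0, 0, 128, 63]
```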

It is important that both the Bytes itself and the data it refers to are stored in read-only memory; the Slice constructor that accepts a pointer is unsafe, so the data must be encapsulated behind a read-only Bytes, with no other way to access it.

If we have an extremely fast String#valid_encoding?, say even faster than #each_char(&), then the performance penalties should be very minimal. So as a starter I think we should incorporate one of the algorithms in #11873. (In fact, the standard library has never used that method since its introduction.)

@philipp-kempgen

Just to throw yet another idea in:
Ruby has the .b method for strings.
https://docs.ruby-lang.org/en/3.2/String.html#method-i-b
Maybe "bytestring\x00\x01".b could be treated as a byteslice literal in Crystal?
(I prefer to say "a ByteSlice" rather than "a Bytes".)

Ruby also has ?… for character literals (or rather single-character strings), even supporting control characters.
https://docs.ruby-lang.org/en/3.2/syntax/literals_rdoc.html#label-Strings

?\C-g == ?\a  # => true

Then again, b"…" and b'…' or Elixir bitstrings are probably better, if they could maybe use b(…) or b[…] or %b(…) instead of <<…>>, provided they let you write things like:

b( "filemagic", 0x01, 0x02, '\a', '\C-g' )
