Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

read/read_fully/read_partial/unbuffered_read behavior inconsistent and design unclear #1276

Closed
technorama opened this issue Aug 25, 2015 · 20 comments

Comments

@technorama
Copy link
Contributor

Do IO#read methods wait for all requested bytes or return with any available data? The current implementation varies in behavior depending the number of bytes requested. What is the intended behavior?

read(slice, count) # loops until count bytes are read ONLY if slice < BUFFER_SIZE, otherwise returns BUFFERSIZE bytes
read(slice) # loops until slice.count bytes are read ONLY if slice < BUFFER_SIZE, otherwise returns BUFFERSIZE bytes
read() # loops until EOF
read(count) # blocks until count bytes are read (not sure how)
read_fully(slice) loops until read returns slice.count bytes. (double loop)

Completely missing is read_partial functionality - a method that returns any available bytes <= the requested size.

@asterite
Copy link
Member

We'll probably remove read(slice, count) and use read(slice) instead, and it will read at most slice.length bytes and return the amount read. That would be the same as read_partial, right?

For the others:

  • read: reads everything
  • read(count): reads at most count bytes, looping until the count is reached or until EOF
  • read_fully(slice): tries to read slice.length bytes and raises EOFError if it can't

But it's true that we need to better design these methods.

@technorama
Copy link
Contributor Author

There is an inconsistency with the above proposal which would be confusing for anyone trying to learn the API.

  • read: everything
  • read(count): everything
  • read(slice): NOT everything

I propose that read always reads the full amount requested

  • read: everything
  • read(count): everything
  • read(slice): everything

And read_partial returns whatever data is available

  • read_partial: first available data
  • read_partial(count): first available data
  • read_partial(slice): first available data

read_fully can be retired.

@Kilobyte22
Copy link

Let me annotate that read_partial should block if there are 0 bytes available.

@asterite
Copy link
Member

asterite commented Sep 4, 2015

We were thinking something along this line. These two methods, the ones that every IO must implement, read partially:

- read(slice : Slice(UInt8)) : Int32
- read(slice : Slice(Char)) : Int32

Every other read-related method reads fully, and this is the consistent part:

- read_fully(slice : Slice(UInt8)) : Nil
- read_fully(slice : Slice(Char)) : Nil
- read_to_end : Slice(UInt8)
- read_chars_to_end : Slice(Char)
- read_string_to_end : String # the current read()
- read(count) : Slice(UInt8)
- read_chars(count) : Slice(Char)
- read_string(count) : String # the current read(count)

The convention is that read, without suffix, means bytes. Then if there's an argument for overloading, like Slice(UInt8) or Slice(Char), we can use that. Otherwise, we use chars for Slice(Char) and string for String.

We will also remove read_nonblock.

gets will still be around, as reading a line only makes sense for strings.

@asterite
Copy link
Member

asterite commented Sep 8, 2015

@technorama Any opinion on the above methods, types and names?

@technorama
Copy link
Contributor Author

I am confused about the need for Slice(Char) vs Slice(UInt8) vs String. There seem to be too many types but maybe I am ignorant on the intended usage.

Please correct me where I'm wrong.

  • String is for utf8 strings
  • Slice(Char) is for ?
  • Slice(Uint8) is for binary/ascii, data, or passing to C functions.

@technorama
Copy link
Contributor Author

read_nonblock can be useful with multiprocessing.

One example is nginx and forking. The loop is similiar to:

# in each process
loop do
  multiprocess_lock do
    # multiple sockets are accepted by same process until failure to avoid context switch time.
    while sock = accept_nonblock
      add_to_event_queue sock
    end
  end

Currently it can only be done with forking. Similar things can be done in the future with websocket or other servers/services when multithreading is available.

@technorama
Copy link
Contributor Author

That's a case for accept_nonblock, perhaps not read_nonblock. Maybe there isn't one.

@asterite
Copy link
Member

asterite commented Sep 9, 2015

The methods that accept Slice(Char) would work similar to Slice(UInt8): they would fill the slice with characters. We based this functionality from C#'s TextReader. Although now that I think of it, I don't know how to implement that, because read(slice : Slice(Char)) will probably invoke read(slice : Slice(UInt8)) and if you don't read the necessary bytes to "finish" a char I don't know how to handle that.

But read_chars : Slice(Char) can be implemented fine, and you'd get a slice, instead of a String.

@waj noted that in Go there's no read_nonblock so we might want to try removing it and later see if it's needed at all.

@asterite
Copy link
Member

We finally decided to do it like this:

# partial read
- read(slice : Slice(UInt8)) : Int32
- read(slice : Slice(Char)) : Int32

# full read
- read_fully(slice : Slice(UInt8)) : Nil
- read_fully(slice : Slice(Char)) : Nil
- read_to_end : String
- read(count) : Slice(UInt8)
- read_chars(count) : Slice(Char)
- read_string(count) : String

@technorama
Copy link
Contributor Author

There's still a discrepancy in behavior and naming. read should be partial or full for all types of arguments.

 # partial read
- read(slice : Slice(UInt8)) : Int32
- read(slice : Slice(Char)) : Int32

# full read
- read(count) : Slice(UInt8)

Here's an example of what can go wrong. Expect lots of confusion even when not refactoring.

# first iteration
while slice = io.read(FRAME_SIZE)
end
# optimized version reusing the same buffer.
slice = Bytes.new(FRAME_SIZE)
while io.read(slice) > 0
  # BREAKS HORRIBLY
end

@technorama
Copy link
Contributor Author

You could make read(count) a partial read and add read_fully(count).

read_to_end as a Slice(UInt8) is useful in a variety of protocols.

# HTTP read without chunking or a known content-length
read_headers(io)
bytes = io.read_to_end # contains file data
io.close
# FTP
TCPSocket.open(host, data_port) do
  bytes = io.read_to_end
end 
# Unix programs
Process.run("mysqlbinlog $filename") do |pr|
  process_binlog(pr.output.read_to_end)
end

@waj
Copy link
Member

waj commented Sep 15, 2015

I think I agree with the read overload that gets a fixed amount of bytes. We can name it read_bytes(count : Int32) : Slice(UInt8) instead.

Regarding methods to get the full binary content into memory, we'd like to discourage that and that's why we removed those overloads from the list. Reading as string is also questionable but sometimes convenient.

@asterite
Copy link
Member

I'm closing this, the only partial behaviour if that of read(slice), read_fully(slice) reads fully, and then we don't have any other overloads for read.

@jhass
Copy link
Member

jhass commented Jul 23, 2016

This is still an issue basically, it's still inconsistent, read_nonblock is still there, broken and has no replacement.

Because of https://github.com/crystal-lang/crystal/blob/master/src/io/file_descriptor.cr#L225 it's not possible to request up to n bytes and return immediately. In other words there's no way to check whether there's data to read. Not even IO.select is usable to check for it, since the data may have been consumed by IO::Buffered already.

So we do lack an interface that

  • returns immediately in any case.
  • returns immediately if there's no pending data in any buffer, kernel or userspace.
  • tries to return as much data as possible but chooses to return less than requested instead of blocking the caller in any way.
  • does this independently of the type of the IO or its underlying file descriptor, and independently of any configuration/flags set on said file descriptor.

Alternatively we lack an interface that returns the number of bytes that the next read call can request without being blocked in any way (again independently of the type of the IO or its underlying file descriptor, and independently of any configuration/flags set on said file descriptor).

@asterite
Copy link
Member

I honestly don't know a lot about these subjects, but if all our IOs are non-blocking by default, why do we need a read_nonblock? If the IO is not available, give the control to another fiber until this fiber is done. In what case do you need to read what's avaialable, without switching fibers?

@asterite
Copy link
Member

Maybe @waj could comment more on this...

@jhass
Copy link
Member

jhass commented Jul 23, 2016

It quickly becomes important in TUI applications when reading keypresses, you may want to redraw continuously while no key is pressed, so not block on waiting for the next keypress, but also have precise control over when to read the next keypress.

@asterite
Copy link
Member

Maybe it's about configuring something in IO::FileDescriptor to return the read bytes instead of switching to another fiber? The check should be done here:

https://github.com/crystal-lang/crystal/blob/master/src/io/file_descriptor.cr#L216-L234

Instead of calling wait_readable we would return.

How can we test this, if we want to support such feature?

@jhass
Copy link
Member

jhass commented Jul 23, 2016

Reading from an empty pipe should work I guess.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants