BYOB ("Bring Your Own Buffer") read interface #160
Interesting, thanks for providing this! In fact, we always know the expected size of the buffer from the known file size, current location and requested read length. Therefore, I would imagine an optional output-buffer argument would work.

As far as libhdfs3 is concerned, yes, ContinuumIO/libhdfs3-downstream is the correct place. You will see that there has not been a whole lot of development, and there remains confusion over how to handle all the security possibilities. If you submit a PR there which improves performance, I'll be sure to merge and release it.
I created a simple self-contained benchmark: https://github.com/LiberTEM/LiberTEM/blob/master/benchmarks/hdfs/bench_buffering.py On my laptop, this results in:
Or, with the mentioned libhdfs3 tweaks:
(Of course, reads are served from the fs cache.) So, in perspective, the reallocation costs are not nearly as bad as the buffering and copying.

As to the downside: to communicate short reads (i.e. at the end of the file), the method needs to return the number of bytes read, and not the buffer itself. Thinking about it, there is an alternative: it could return a sliced memory view of the 'valid' part of the buffer, but that may not be obvious for the user, who could just use the whole buffer they passed in... So, new proposal: add a method `readinto` that returns the number of bytes read. People who need the performance would need to use `readinto` and slice off the valid part themselves (see the sketch below).
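A minimal sketch of that count-returning convention, using a plain `io.BytesIO` as a stand-in for an HDFS file object:

```python
import io

# Count-returning convention, as in io.RawIOBase.readinto: the caller
# slices the buffer down to the number of bytes actually read.
buf = bytearray(4096)
f = io.BytesIO(b"short file")   # stand-in for an HDFS file object
n = f.readinto(buf)             # number of bytes read; may be < len(buf)
valid = memoryview(buf)[:n]     # only the first n bytes are meaningful
assert bytes(valid) == b"short file"
```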
Yes, agree that `readinto` (no underscore!) is exactly the right method name: "Read up to len(b) bytes into b", where `b` is a bytearray or memoryview.

I propose `read(length=None, buffer=None)`, where

- `buffer=None` (default) produces `bytes` objects. Note that, from what you say, it is faster to create a bytearray of size `(self.size - self.loc) if length in [None, -1] else length`, fill it, and convert to `bytes` when returning
- `buffer=True`: create the bytearray as above, and return it without conversion to `bytes`
- `buffer=b`: same as `readinto`, except raise an exception if `len(b)` doesn't fit the data (see the sketch below).
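To make the three modes concrete, here is a self-contained toy model (my own sketch against an in-memory `io.BytesIO`, not hdfs3 code):

```python
import io

class BufferModeFile(io.BytesIO):
    """Toy in-memory file modelling the proposed read(length, buffer) modes."""

    def read(self, length=None, buffer=None):
        # expected size from total size, current location and requested length
        n = (len(self.getbuffer()) - self.tell()) if length in (None, -1) else length
        if buffer is None or buffer is True:
            out = bytearray(n)
        else:
            out = buffer
            if memoryview(out).nbytes < n:
                raise IOError('buffer too small')
        got = self.readinto(memoryview(out)[:n])
        if buffer is None:        # default: one final copy to immutable bytes
            return bytes(out[:got])
        if buffer is True:        # skip the bytes() conversion
            return out
        return got                # caller-supplied buffer: return byte count

f = BufferModeFile(b'hello world')
assert f.read(5) == b'hello'                            # mode 1: bytes
f.seek(0)
assert f.read(5, buffer=True) == bytearray(b'hello')    # mode 2: bytearray
f.seek(0)
buf = bytearray(5)
assert f.read(5, buffer=buf) == 5 and buf == bytearray(b'hello')  # mode 3
```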
Small note on naming: tornado uses `read_into`, socket uses `recv_into`
Hah. I suppose we are best off keeping to the builtin file standard.
So this is what I came up with:

```python
# Methods intended for hdfs3's HDFile class; requires `import ctypes`
# and `import hdfs3.core` at module level.

def readinto(self, length, out):
    """
    Read up to ``length`` bytes from the file into the ``out`` buffer,
    which can be of any type that implements the buffer protocol
    (examples: bytearray, memoryview, numpy array, ...).

    Returns the number of bytes read.
    """
    _lib = hdfs3.core._lib
    if not _lib.hdfsFileIsOpenForRead(self._handle):
        raise IOError('File not in read mode')
    bufsize = length
    bufpos = 0
    # convert from buffer protocol to a ctypes-compatible type
    out = memoryview(out)
    buf_for_ctypes = (ctypes.c_byte * out.nbytes).from_buffer(out)
    while length:
        bufp = ctypes.byref(buf_for_ctypes, bufpos)
        ret = _lib.hdfsRead(
            self._fs, self._handle, bufp, ctypes.c_int32(bufsize - bufpos))
        if ret == 0:  # EOF
            break
        if ret > 0:
            length -= ret
            bufpos += ret
        else:
            raise IOError('Read file %s failed: %d' % (self.path, -ret))
    return bufpos

@property
def size(self):
    if self._size is None:
        self._size = self.info()['size']
    return self._size

def read(self, length=None, buffer_=None):
    return_buffer = True
    max_read = self.size - self.tell()
    read_length = max_read if length in [None, -1] else length
    if buffer_ is None:
        return_buffer = False
    if buffer_ is None or buffer_ is False:
        buffer_ = memoryview(bytearray(read_length))
    else:
        buffer_ = memoryview(buffer_)
        if buffer_.nbytes < read_length:
            raise IOError('buffer too small (%d < %d)'
                          % (buffer_.nbytes, read_length))
    bytes_read = self.readinto(length=read_length, out=buffer_)
    if bytes_read < buffer_.nbytes:
        buffer_ = buffer_[:bytes_read]
    if return_buffer:
        return buffer_
    return buffer_.tobytes()
```

Some notes:
Let me know if I should start a pull request for this, if that is easier to review.
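For illustration, the motivating zero-copy use of this would look roughly as follows (my sketch; it assumes an `HDFile` patched with the `readinto` above, and the cluster address and path are placeholders):

```python
import numpy as np
import hdfs3

hdfs = hdfs3.HDFileSystem(host='namenode', port=8020)   # placeholder address

# Preallocate the target once; readinto fills the array's own memory,
# with no intermediate bytes objects or joins.
frame = np.empty(4096 * 4096, dtype=np.float32)

with hdfs.open('/data/frames.raw', 'rb') as f:          # placeholder path
    n = f.readinto(frame.nbytes, memoryview(frame).cast('B'))
    assert n == frame.nbytes   # a short read here would mean early EOF
```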
Yes, I think this is definitely along the right lines. My thoughts: some tests covering the different options would be nice.
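A sketch of what such tests might look like (pytest-style; the `hdfs` fixture name and file contents are my assumptions):

```python
import pytest

@pytest.fixture
def f(hdfs):  # assumes an `hdfs` fixture yielding a connected HDFileSystem
    with hdfs.open('/tmp/test_byob', 'wb') as out:
        out.write(b'0123456789')
    with hdfs.open('/tmp/test_byob', 'rb') as fobj:
        yield fobj

def test_read_default_returns_bytes(f):
    assert f.read(4) == b'0123'

def test_readinto_fills_user_buffer(f):
    buf = bytearray(10)
    assert f.readinto(10, buf) == 10
    assert buf == bytearray(b'0123456789')

def test_readinto_short_read_at_eof(f):
    f.seek(8)
    buf = bytearray(10)
    assert f.readinto(10, buf) == 2   # only two bytes remain

def test_buffer_too_small_raises(f):
    with pytest.raises(IOError):
        f.read(10, buffer_=bytearray(4))
```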
I think I addressed all your notes in #162. I liked the Docker container for testing; it worked well, at least on my work laptop. My personal laptop didn't really enjoy the stress testing :)
Currently, `HDFile.read(...)` involves both allocation and copying of buffers. When reading locally, with read short-circuiting enabled, this can become a bottleneck. In outline, this is `HDFile.read` annotated with the allocations and copies:
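(A paraphrase of the pattern, not the verbatim hdfs3 source; `_lib` is the libhdfs3 ctypes handle as in the snippets in this thread:)

```python
import ctypes

def read(self, length):
    # Chunked read: each iteration allocates a fresh ctypes buffer
    # (allocation), copies its contents out as a bytes chunk (copy),
    # and the chunks are joined at the end (another allocation + copy).
    buffers = []
    while length:
        bufsize = min(2 ** 16, length)
        buf = ctypes.create_string_buffer(bufsize)           # allocation
        ret = _lib.hdfsRead(self._fs, self._handle, buf,
                            ctypes.c_int32(bufsize))
        if ret <= 0:
            break
        buffers.append(buf.raw[:ret])                        # copy
        length -= ret
    return b''.join(buffers)                                 # allocation + copy
```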
I suggest adding the possibility to specify the output buffer, without doing any additional copying/buffering. A prototype implementation of this, in one of my tests, speeds up reading large binary data by a factor of about 4.
The final interface should probably be something like `read(self, length=None, out=None)`, to support both modes of operation and not pollute the API namespace, and there should be some range checks to prevent overflows. Thoughts?
By the way, there are still copies happening inside of `libhdfs3` which, when patched out/worked around, give another nice speedup (in short: setting `input.localread.default.buffersize=1` and patching out checksumming → reads go directly into the buffer given by the user). Where is the `libhdfs3` development happening these days? Is ContinuumIO/libhdfs3-downstream the right place to work on this?
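For reference, a hedged sketch of how such libhdfs3 parameters can be passed from hdfs3 via its `pars` argument (the host/port are placeholders, and the exact effect of the setting depends on the libhdfs3 build in use):

```python
import hdfs3

# Extra libhdfs3 configuration can be supplied via `pars`; here we set the
# local-read buffer size mentioned above. Whether checksumming can be
# bypassed this way depends on the libhdfs3 version.
hdfs = hdfs3.HDFileSystem(
    host='namenode',   # placeholder address
    port=8020,
    pars={'input.localread.default.buffersize': '1'},
)
```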