ARROW-2816: [Python] Make NativeFile BufferedIOBase-compliant #2253
alendit wants to merge 2 commits into apache:master
9b48f28 to aef1647
I'm not sure why the CI build is failing...
Super weird error:
Oh, it was an unescaped backslash in the
Codecov Report
@@            Coverage Diff             @@
##           master    #2253      +/-   ##
==========================================
+ Coverage   84.29%   86.65%   +2.35%
==========================================
  Files         292      236      -56
  Lines       44559    41907    -2652
==========================================
- Hits        37562    36314    -1248
+ Misses       6966     5593    -1373
+ Partials       31        0      -31
python/pyarrow/io.pxi
Outdated
There was a problem hiding this comment.
I would say PyBUF_CONTIG, to mandate writability.
python/pyarrow/io.pxi
Outdated
There was a problem hiding this comment.
It's simpler to just say buf = <uint8_t*> py_buffer.buf (assuming that works).
python/pyarrow/io.pxi
Outdated
There was a problem hiding this comment.
You need to call PyBuffer_Release(&py_buffer) at the end of the function, otherwise there will be a leak.
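The three review points above (insist on a writable buffer, work on a flat byte view, always release the buffer) have a rough pure-Python analogue in memoryview. The sketch below only illustrates that pattern; the helper name readinto_sketch is hypothetical and this is not the Cython code from io.pxi.

```python
import io


def readinto_sketch(src, target):
    """Hypothetical helper: copy bytes from src into a writable buffer."""
    # The with-block releases both views on exit, even if an error is
    # raised; this plays the role PyBuffer_Release plays in C/Cython.
    with memoryview(target) as view, view.cast("B") as flat:
        if flat.readonly:
            raise TypeError("readinto() requires a writable buffer")
        data = src.read(flat.nbytes)
        flat[:len(data)] = data
        return len(data)


buf = bytearray(4)
nread = readinto_sketch(io.BytesIO(b"abcdef"), buf)

try:
    readinto_sketch(io.BytesIO(b"abcdef"), b"read-only target")
    rejected_readonly = False
except TypeError:
    rejected_readonly = True
```

Here bytes stands in for any read-only exporter; a bytearray, array.array or writable NumPy array would be accepted.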
python/pyarrow/io.pxi
Outdated
There was a problem hiding this comment.
So if I'm calling readline on a 2GB file, this will allocate a 2GB bytestring? That doesn't sound like a very good idea.
python/pyarrow/io.pxi
Outdated
There was a problem hiding this comment.
At some point you should check whether bytes_read is 0.
python/pyarrow/tests/test_io.py
Outdated
There was a problem hiding this comment.
I don't think this test is necessary.
python/pyarrow/tests/test_io.py
Outdated
There was a problem hiding this comment.
You can merge this test and the previous one.
python/pyarrow/tests/test_io.py
Outdated
There was a problem hiding this comment.
It's dubious that this should be RawIOBase, since a BytesIO is buffered. OTOH this could be considered a best-effort registration.
There was a problem hiding this comment.
Is it specified somewhere? There is BufferedIOBase, but the doc strings on (Raw)IOBase don't seem to prescribe implementation details.
There was a problem hiding this comment.
The best reference is the io module docs: https://docs.python.org/3/library/io.html#class-hierarchy
In short, RawIOBase is the ABC for raw I/O, which has different semantics from buffered I/O. One important difference is that raw I/O can return "short reads", e.g. read(5) may return less than 5 bytes and it doesn't mean that EOF was reached.
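A small illustration of the "short read" difference, using only the standard io module. ShortReadRaw is a made-up class for this example (it is not part of pyarrow); it deliberately hands out at most two bytes per call.

```python
import io


class ShortReadRaw(io.RawIOBase):
    """Hypothetical raw stream that returns at most `chunk` bytes per read."""

    def __init__(self, data, chunk=2):
        self._data = data
        self._pos = 0
        self._chunk = chunk

    def readable(self):
        return True

    def readinto(self, b):
        # Copy at most `chunk` bytes, even if the caller asked for more.
        n = min(len(b), self._chunk, len(self._data) - self._pos)
        b[:n] = self._data[self._pos:self._pos + n]
        self._pos += n
        return n


# Raw semantics: read(5) may come back short without EOF being reached.
raw_result = ShortReadRaw(b"hello world").read(5)

# Buffered semantics: the wrapper keeps calling the raw stream until the
# request is satisfied or EOF is hit.
buffered_result = io.BufferedReader(ShortReadRaw(b"hello world")).read(5)
```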
There was a problem hiding this comment.
Ok, I see what you mean. What do you think is the best course here: keep RawIOBase as the ABC and test with, say, FileIO, or switch to IOBase as the ABC, since PythonFile can wrap any file-like object?
There was a problem hiding this comment.
We should choose between RawIOBase and BufferedIOBase. I think it should depend mainly on the "short reads" question.
There was a problem hiding this comment.
Do I understand correctly that "short reads" without EOF can only occur when we are reading from a TTY (i.e. an interactive file)? Right now we always return false for isatty, so the user should expect a short read to mean EOF. This might be wrong if someone wraps a TTY in PythonFile, but this use case should be rare enough that a warning in a comment would be sufficient.
So, I think RawIOBase describes the current use case better.
There was a problem hiding this comment.
At the system level, short reads can occur when reading from any file. When doing disk I/O, they are rather rare (but can occur if e.g. the system call is interrupted by a signal). They are quite common from a socket or any other character device.
However, our FileRead routine loops until the entire read is satisfied, so our IO classes should satisfy as buffered IO.
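The FileRead routine itself is not shown in this thread. As a rough sketch of the "loop until the entire read is satisfied" idea, the same shape in Python might look like this; Trickle and read_fully are hypothetical names used only for illustration.

```python
class Trickle:
    """Hypothetical raw-style stream: read() hands out at most 3 bytes."""

    def __init__(self, data):
        self._data = data
        self._pos = 0

    def read(self, nbytes):
        chunk = self._data[self._pos:self._pos + min(nbytes, 3)]
        self._pos += len(chunk)
        return chunk


def read_fully(stream, nbytes):
    """Keep reading until nbytes were collected or the stream hits EOF."""
    chunks = []
    remaining = nbytes
    while remaining > 0:
        chunk = stream.read(remaining)
        if not chunk:  # zero bytes read: real EOF, stop looping
            break
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


got = read_fully(Trickle(b"0123456789"), 8)
tail = read_fully(Trickle(b"abc"), 8)
```

With such a loop in place, a result shorter than requested can only mean EOF, which is the buffered I/O contract.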
There was a problem hiding this comment.
I changed it to BufferedIOBase.
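For readers unfamiliar with ABC registration, here is a minimal sketch of what "registering" a class with BufferedIOBase does and does not do. FakeNativeFile is a hypothetical stand-in, not the pyarrow class.

```python
import io


class FakeNativeFile:
    """Hypothetical stand-in for a file class implemented outside of io."""

    def __init__(self, data):
        self._buf = io.BytesIO(data)

    def read(self, nbytes=-1):
        return self._buf.read(nbytes)


# Registration makes isinstance()/issubclass() checks succeed...
io.BufferedIOBase.register(FakeNativeFile)

f = FakeNativeFile(b"abc")
is_buffered = isinstance(f, io.BufferedIOBase)

# ...but it does not add any methods: whatever the class does not define
# itself (readline here) is still missing.
has_readline = hasattr(f, "readline")
```

This is why the choice of ABC is a statement about semantics (short reads or not) rather than something that changes behaviour by itself.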
python/pyarrow/tests/test_io.py
Outdated
There was a problem hiding this comment.
Instead of using hasattr, how about just iterating on the file?
There was a problem hiding this comment.
I wanted to test the failure mentioned in the JIRA ticket explicitly, i.e. have the exact same check as pandas does. Iterating over the file is tested in test_python_file_iterable, too.
There was a problem hiding this comment.
If the file is iterable, then by definition it will have a __iter__, so this specific test isn't needed.
(also, Pandas testing for IO by checking for iterability is quite broken IMHO, but that's Pandas' problem)
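To make the point concrete with a standard-library object (io.BytesIO, used here purely as an example): iterating a file object yields its lines, and an attribute check for `__iter__` adds nothing once iteration works.

```python
import io

f = io.BytesIO(b"a,b\n1,2\n")

# The attribute check discussed above...
has_iter = hasattr(f, "__iter__")

# ...is implied by simply iterating, which yields one line per step.
lines = list(f)
```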
python/pyarrow/tests/test_io.py
Outdated
There was a problem hiding this comment.
It strikes me that you added a lot of tests for PythonFile but none for NativeFile (which is the focus of this PR). Did you forget them?
There was a problem hiding this comment.
NativeFile is kind of an abstract base and cannot be used directly (it doesn't set rd_file/wr_file, for example). PythonFile is a close descendant which just specifies a custom initializer. Looking at the existing tests, none seem to test NativeFile directly.
There was a problem hiding this comment.
Hmm... wow, you're right, I had forgotten about that.
There was a problem hiding this comment.
Yes, the purpose of NativeFile is to provide the abstract interfaces to say "I have a C++ object implementing one of the Arrow C++ IO interfaces" so that C++ code can acquire the internals of a Python-created file resource and use it without interacting with the GIL, or having to pass through PyBytes object in some cases
Hi Antoine, thanks for looking into this so quickly! I made some adjustments and comments.
python/pyarrow/io.pxi
Outdated
There was a problem hiding this comment.
You could use the Buffer object to acquire the buffer interface
python/pyarrow/io.pxi
Outdated
There was a problem hiding this comment.
Maybe use Read here? This will prevent additional seeks on the underlying file handle
There was a problem hiding this comment.
Is there a reason why Readable::Read(int64_t, shared_ptr<Buffer>) is called ReadB in Cython? In other places we use method overloading.
There was a problem hiding this comment.
Hmm, no idea. Can you try fixing it?
There was a problem hiding this comment.
There's apparently a cython bug which prevents proper overload resolution with multiple layers of inheritance. I put a comment explaining the name change.
python/pyarrow/io.pxi
Outdated
There was a problem hiding this comment.
We might leave a bookmark to move this implementation into libarrow
2f1d94d to 70dc959
python/pyarrow/io.pxi
Outdated
There was a problem hiding this comment.
Would be nice to move all the cdefs in the cdef: block above.
python/pyarrow/io.pxi
Outdated
There was a problem hiding this comment.
I don't think this is useful. bytes_to_read == 0 is the real EOF condition.
python/pyarrow/io.pxi
Outdated
There was a problem hiding this comment.
Ok... I really don't like that we're reinventing our own suboptimal readline implementation, just because Pandas somewhat requires it. @wesm do you think there's a way to please Pandas without doing this? Perhaps raise NotImplementedError?
There was a problem hiding this comment.
Yeah, I don't like it too much either. Why don't we delete this implementation for now, raise NotImplementedError, and open a JIRA about adding a more optimal implementation within libarrow? @alendit is this OK with you? I know you already spent a bunch of effort on this
We can leave it, too. I will defer to @pitrou's judgment on this since he's more familiar with CPython internals
python/pyarrow/io.pxi
Outdated
There was a problem hiding this comment.
We're not using ReadAt anymore, are we? The comment needs updating.
python/pyarrow/io.pxi
Outdated
There was a problem hiding this comment.
A default 4kB buffer for reading a single line is definitely not the right size... Also I'm not sure it makes sense to destroy and reallocate a new buffer every time.
There was a problem hiding this comment.
Also there's no need to read that much if an explicit size was passed.
python/pyarrow/io.pxi
Outdated
There was a problem hiding this comment.
So we're reading again data that we've already read during the loop. Depending on the kind of file, this may be a system call or some other costly operation.
There was a problem hiding this comment.
Ideally we would read piecewise into the bytes object and resize it at the end...
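As a rough Python sketch of the concerns raised in this part of the review (bounded chunk size, honouring an explicit size, stopping on a zero-byte read, and not re-reading data), one possible shape is shown below. readline_chunked and its chunk_size default are hypothetical; this is neither the implementation under review nor a proposal for libarrow, and it assumes a seekable stream.

```python
import io


def readline_chunked(f, size=-1, chunk_size=64):
    """Hypothetical readline over a seekable binary stream."""
    line = bytearray()
    while size < 0 or len(line) < size:
        # Never ask for more than the caller's explicit size limit.
        want = chunk_size if size < 0 else min(chunk_size, size - len(line))
        chunk = f.read(want)
        if not chunk:  # zero bytes read: EOF
            break
        idx = chunk.find(b"\n")
        if idx >= 0:
            line += chunk[:idx + 1]
            # Hand back the bytes read past the newline instead of
            # reading the line a second time.
            f.seek(idx + 1 - len(chunk), io.SEEK_CUR)
            break
        line += chunk
    return bytes(line)


src = io.BytesIO(b"ab\ncd\nef")
first = readline_chunked(src, chunk_size=4)
second = readline_chunked(src, chunk_size=4)
third = readline_chunked(src, chunk_size=4)
at_eof = readline_chunked(src, chunk_size=4)
limited = readline_chunked(io.BytesIO(b"abcdef\n"), size=3)
```

The pieces are accumulated in a bytearray and converted once at the end, so memory use grows with the line length rather than with the file size.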
python/pyarrow/io.pxi
Outdated
There was a problem hiding this comment.
Hmm... do we need the Native hack?
wesm left a comment
There was a problem hiding this comment.
I'm ambivalent about the readline issue. I will defer to @pitrou and am happy to see this merged whenever he is satisfied. I think that pandas will be appeased if it appears to be the right kind of object; the methods don't necessarily need to be implemented
Would be good to get this one into 0.10.0
Got swamped today, but I'll look into it tomorrow. I'm ok with removing the readline implementation for now, since libarrow is the better place for it.
OK, great. I'll be working some this weekend so should have no problem getting this in
Minor cleanup
PEP8 adjustments
Fix flake8 warning
Escape backslash
Refactor readline
Adjust tests
Fix constant syntax
Remove ducktyping test
Switch abc to BufferedIOBase
Set mutable_data_ for mutable python buffers
Use arrow pybuffer for readinto
Deal with a too long line
Seek before reading
Add note about moving readline to libarrow
Use ReadB instead of ReadAt
Add a comment explaining name change in cython
WIP update readline algo
Vector based readline
Add unsupported operation instead of readline
70dc959 to 21e8c94
OK, I made readline raise an unsupported-operation error, and made PythonFile delegate readline to the underlying handle. There are two small fixes in there, too: the first is the signature of the read method and the second is setting the mutable_data pointer for mutable buffers.
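The arrangement described in this comment can be pictured with a small pure-Python sketch. BaseFile and WrappedFile are hypothetical names standing in for the base class and the Python-file wrapper; the real classes are written in Cython and are not reproduced here.

```python
import io


class BaseFile:
    """Hypothetical base: readline is declared but not supported."""

    def readline(self, size=-1):
        raise io.UnsupportedOperation("readline")


class WrappedFile(BaseFile):
    """Hypothetical wrapper around an existing Python file object."""

    def __init__(self, handle):
        self.handle = handle

    def readline(self, size=-1):
        # Delegate to the wrapped handle, which already knows how to
        # split lines efficiently.
        return self.handle.readline(size)


line = WrappedFile(io.BytesIO(b"a\nb\n")).readline()

try:
    BaseFile().readline()
    base_raised = False
except io.UnsupportedOperation:
    base_raised = True
```

io.UnsupportedOperation derives from both OSError and ValueError, so callers that already catch either of those keep working.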
Going to review again.
Seconded, thank you!
See JIRA.
This PR implements some of the missing methods from BufferedIOBase and registers NativeFile (and so all its descendants) with the BufferedIOBase ABC.