ARROW-2816: [Python] Make NativeFile BufferedIOBase-compliant by alendit · Pull Request #2253 · apache/arrow

alendit · 2018-07-11T12:38:05Z

This PR implements some of the missing methods from BufferedIOBase and registers NativeFile (and so all its descendants) with the BufferedIOBase ABC.
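As a rough illustration of what registering with an ABC means (the class below is a hypothetical stand-in, not pyarrow's NativeFile): `register` makes `isinstance` checks pass but does not add any methods, which is why the missing methods have to be implemented separately.

```python
import io


class HypotheticalNativeFile:
    """Stand-in for a file-like class defined outside the io hierarchy."""

    def __init__(self, data):
        self._data = data
        self._pos = 0

    def read(self, nbytes=-1):
        # Read up to nbytes bytes, or everything that is left if nbytes < 0.
        if nbytes is None or nbytes < 0:
            nbytes = len(self._data) - self._pos
        chunk = self._data[self._pos:self._pos + nbytes]
        self._pos += len(chunk)
        return chunk


# Register the class as a "virtual subclass" of the ABC.
io.BufferedIOBase.register(HypotheticalNativeFile)

f = HypotheticalNativeFile(b"abc")
print(isinstance(f, io.BufferedIOBase))  # True
print(hasattr(f, "readline"))            # False: register() adds no methods
```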

alendit · 2018-07-11T14:26:59Z

I'm not sure why the CI build is failing...

wesm · 2018-07-11T16:28:40Z

Super weird error:

byte-compiling /home/travis/build/apache/arrow/pyarrow-test-3.6/lib/python3.6/site-packages/pyarrow/types.py to types.cpython-36.pyc
running install_egg_info
Copying pyarrow.egg-info to /home/travis/build/apache/arrow/pyarrow-test-3.6/lib/python3.6/site-packages/pyarrow-0.1.dev55+gb701ad3-py3.6.egg-info
running install_scripts
Installing plasma_store script to /home/travis/build/apache/arrow/pyarrow-test-3.6/bin
writing list of installed files to 'record.text'

Warning, treated as error:
docstring of pyarrow.BufferOutputStream.readline:6:Block quote ends without a blank line; unexpected unindent.

travis_time:end:01410426:start=1531317164436301464,finish=1531317673168573260,duration=508732271796
The command "$TRAVIS_BUILD_DIR/ci/travis_script_python.sh 3.6" exited with 2.

alendit · 2018-07-11T16:41:37Z

Oh, it was an unescaped backslash in the readline docstring of NativeFile, which confused Sphinx.
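The actual docstring is not shown in this thread, so the following is only a sketch of the general effect, using two hypothetical functions: in a normal string literal `\n` becomes a real line break inside the docstring, while a raw docstring keeps the backslash.

```python
def readline_plain(size=None):
    """Read until '\n' is found."""  # '\n' turns into an actual newline


def readline_raw(size=None):
    r"""Read until '\n' is found."""  # raw string: backslash is preserved


print(repr(readline_plain.__doc__))
print(repr(readline_raw.__doc__))
```

A stray line break inside the docstring changes the indentation structure that the documentation tool sees, which is consistent with the "unexpected unindent" warning in the log.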

codecov-io · 2018-07-11T17:27:52Z

Codecov Report

Merging #2253 into master will increase coverage by 2.35%.
The diff coverage is 70%.

@@            Coverage Diff             @@
##           master    #2253      +/-   ##
==========================================
+ Coverage   84.29%   86.65%   +2.35%     
==========================================
  Files         292      236      -56     
  Lines       44559    41907    -2652     
==========================================
- Hits        37562    36314    -1248     
+ Misses       6966     5593    -1373     
+ Partials       31        0      -31

Impacted Files	Coverage Δ
python/pyarrow/tests/test_io.py	`99.12% <100%> (+0.06%)`	⬆️
cpp/src/arrow/python/common.cc	`90.62% <100%> (+0.3%)`	⬆️
python/pyarrow/io.pxi	`60.56% <42.55%> (-1.63%)`	⬇️
cpp/src/arrow/util/thread-pool-test.cc	`98.91% <0%> (-0.55%)`	⬇️
rust/src/error.rs
go/arrow/array/booleanbuilder.go
go/arrow/internal/testing/tools/bits.go
rust/src/buffer.rs
go/arrow/memory/memory_sse4_amd64.go
go/arrow/math/int64_amd64.go
... and 50 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7d2fbeb...81ecf22.

pitrou · 2018-07-12T07:53:09Z

python/pyarrow/io.pxi

I would say PyBUF_CONTIG, to mandate writability.

pitrou · 2018-07-12T07:56:18Z

python/pyarrow/io.pxi

It's simpler to just say buf = <uint8_t*> py_buffer.buf (assuming that works).

pitrou · 2018-07-12T07:56:50Z

python/pyarrow/io.pxi

You need to call PyBuffer_Release(&py_buffer) at the end of the function, otherwise there will be a leak.
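The code under review is Cython, but the same acquire/check/release pattern can be sketched at the Python level with `memoryview`. The function name below is hypothetical and it assumes a byte-oriented target such as `bytearray`.

```python
import io


def readinto_sketch(source, target):
    """Copy bytes from a file-like `source` into the writable buffer `target`."""
    # Entering the with-block acquires the buffer; leaving it releases the
    # buffer again (the Python-level analogue of PyBuffer_Release).
    with memoryview(target) as view:
        if view.readonly:
            raise TypeError("target buffer must be writable")
        data = source.read(view.nbytes)
        view[:len(data)] = data
        return len(data)


buf = bytearray(4)
n = readinto_sketch(io.BytesIO(b"abcdef"), buf)
print(n, bytes(buf))  # 4 b'abcd'

# The buffer was released, so the bytearray can be resized again.
buf.extend(b"!")
```

If the view were never released, resizing the `bytearray` would fail with `BufferError` for as long as the view stayed alive.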

pitrou · 2018-07-12T07:59:01Z

python/pyarrow/io.pxi

So if I'm calling readline on a 2GB file, this will allocate a 2GB bytestring? That doesn't sound like a very good idea.

pitrou · 2018-07-12T07:59:44Z

python/pyarrow/io.pxi

At some point you should check whether bytes_read is 0.

pitrou · 2018-07-12T08:02:34Z

python/pyarrow/tests/test_io.py

I don't think this test is necessary.

pitrou · 2018-07-12T08:03:17Z

python/pyarrow/tests/test_io.py

You can merge this test and the previous one.

pitrou · 2018-07-12T08:04:48Z

python/pyarrow/tests/test_io.py

It's dubious that this should be RawIOBase, since a BytesIO is buffered. OTOH this could be considered a best-effort registration.

Is it specified somewhere? There is BufferedIOBase, but the doc strings on (Raw)IOBase don't seem to prescribe implementation details.

The best reference is the io module docs: https://docs.python.org/3/library/io.html#class-hierarchy

In short, RawIOBase is the ABC for raw I/O, which has different semantics from buffered I/O. One important difference is that raw I/O can return "short reads", e.g. read(5) may return less than 5 bytes and it doesn't mean that EOF was reached.
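For reference, the standard library's own classes fall on either side of this split; the checks below use only the `io` module and `os.devnull`.

```python
import io
import os

# In-memory bytes streams are buffered; FileIO is the raw file class.
print(issubclass(io.BytesIO, io.BufferedIOBase))  # True
print(issubclass(io.BytesIO, io.RawIOBase))       # False
print(issubclass(io.FileIO, io.RawIOBase))        # True

# open() returns a raw object only when buffering is disabled.
with open(os.devnull, "rb", buffering=0) as raw:
    raw_is_raw = isinstance(raw, io.RawIOBase)
with open(os.devnull, "rb") as buffered:
    buffered_is_buffered = isinstance(buffered, io.BufferedIOBase)
print(raw_is_raw, buffered_is_buffered)  # True True
```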

Ok, I see what you mean. What do you think is the best course here: keep RawIOBase as ABC and test with, say, FileIO or switch to IOBase as ABC, since PythonFile can wrap any file-like object.

We should choose between either RawIOBase and BufferedIOBase. I think it should depend mainly on the "short reads" question.

Do I understand correctly that "short reads" without EOF can only occur when we are reading from a TTY (i.e. an interactive file)? Right now we always return false for isatty, so the user should expect a short read to mean EOF. This might be wrong if someone wraps a TTY in PythonFile, but this use case should be rare enough that a warning in a comment would be sufficient.

So, I think RawIOBase describes the current use case better.

At the system level, short reads can occur when reading from any file. When doing disk I/O, they are rather rare (but can occur if e.g. the system call is interrupted by a signal). They are quite common from a socket or any other character device.

However, our FileRead routine loops until the entire read is satisfied, so our IO classes should qualify as buffered IO.
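A minimal sketch of that looping idea in Python (this is not the Arrow FileRead code; `ShortReader` and `read_fully` are hypothetical names used only to simulate short reads):

```python
import io


class ShortReader(io.RawIOBase):
    """Hypothetical raw stream that returns at most 2 bytes per read call."""

    def __init__(self, data):
        super().__init__()
        self._inner = io.BytesIO(data)

    def readable(self):
        return True

    def readinto(self, b):
        data = self._inner.read(min(2, len(b)))
        b[:len(data)] = data
        return len(data)


def read_fully(raw, nbytes):
    """Keep calling read() until nbytes were collected or EOF is reached."""
    chunks = []
    remaining = nbytes
    while remaining > 0:
        chunk = raw.read(remaining)
        if not chunk:  # zero bytes read: treat as EOF
            break
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


print(ShortReader(b"hello world").read(5))        # b'he' (a short read)
print(read_fully(ShortReader(b"hello world"), 5))  # b'hello'
```

With such a loop in place, a result shorter than requested can only mean EOF, which is the buffered-IO contract.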

I changed it to BufferedIOBase.

pitrou · 2018-07-12T08:05:13Z

python/pyarrow/tests/test_io.py

Instead of using hasattr, how about just iterating on the file?

I wanted to test the failure mentioned in the JIRA ticket explicitly, i.e. have the exact same check as pandas does. Iterating over the file is tested in test_python_file_iterable, too.

If the file is iterable, then by definition it will have a __iter__, so this specific test isn't needed.

(also, Pandas testing for IO by checking for iterability is quite broken IMHO, but that's Pandas' problem)
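To illustrate the point with a standard-library object (`io.BytesIO` here, not a pyarrow file): a test that iterates over the file already implies the `hasattr` check.

```python
import io

f = io.BytesIO(b"a\nb\n")

# Iteration yields lines and can only work if __iter__ exists.
lines = list(f)
print(lines)                    # [b'a\n', b'b\n']
print(hasattr(f, "__iter__"))   # True
```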

I removed the test.

pitrou · 2018-07-12T08:07:14Z

python/pyarrow/tests/test_io.py

It strikes me that you added a lot of tests for PythonFile but none for NativeFile (which is the focus of this PR). Did you forget them?

NativeFile is kind of an abstract base and cannot be used directly (it doesn't set rd_file/wr_file, for example). PythonFile is a close descendant which just specifies a custom initializer. Looking at the existing tests, none seem to test NativeFile directly.

Hmm... wow, you're right, I had forgotten about that.

Yes, the purpose of NativeFile is to provide the abstract interfaces to say "I have a C++ object implementing one of the Arrow C++ IO interfaces" so that C++ code can acquire the internals of a Python-created file resource and use it without interacting with the GIL, or having to pass through PyBytes object in some cases

alendit · 2018-07-12T11:20:34Z

Hi Antoine,

thanks for looking into this so quickly! I made some adjustments and comments.

wesm · 2018-07-12T15:59:39Z

python/pyarrow/io.pxi

You could use the Buffer object to acquire the buffer interface

wesm · 2018-07-12T16:01:13Z

python/pyarrow/io.pxi

Maybe use Read here? This will prevent additional seeks on the underlying file handle

Is there a reason why Readable::Read(int64_t, shared_ptr<Buffer>) is called ReadB in Cython? In other places we use method overloading.

Hmm, no idea. Can you try fixing it?

There's apparently a cython bug which prevents proper overload resolution with multiple layers of inheritance. I put a comment explaining the name change.

wesm · 2018-07-12T16:05:02Z

python/pyarrow/io.pxi

We might leave a bookmark to move this implementation into libarrow

wesm · 2018-07-12T16:07:43Z

python/pyarrow/tests/test_io.py

This will fail lint checks

pitrou · 2018-07-16T16:23:24Z

python/pyarrow/io.pxi

Would be nice to move all the cdefs in the cdef: block above.

pitrou · 2018-07-16T16:41:18Z

python/pyarrow/io.pxi

I don't think this is useful. bytes_to_read == 0 is the real EOF condition.

pitrou · 2018-07-16T16:44:54Z

python/pyarrow/io.pxi

Ok... I really don't like that we're reinventing our own suboptimal readline implementation, just because Pandas somewhat requires it. @wesm do you think there's a way to please Pandas without doing this? Perhaps raise NotImplementedError?

Yeah, I don't like it too much either. Why don't we delete this implementation for now, raise NotImplementedError, and open a JIRA about adding a more optimal implementation within libarrow? @alendit is this OK with you? I know you already spent a bunch of effort on this.

We can leave it, too. I will defer to @pitrou's judgment on this since he's more familiar with CPython internals.

pitrou · 2018-07-16T16:47:49Z

python/pyarrow/io.pxi

We're not using ReadAt anymore, are we? The comment needs updating.

pitrou · 2018-07-16T16:49:40Z

python/pyarrow/io.pxi

A default 4kB buffer for reading a single line is definitely not the right size... Also I'm not sure it makes sense to destroy and reallocate a new buffer every time.

Also there's no need to read that much if an explicit size was passed.

pitrou · 2018-07-16T16:51:17Z

python/pyarrow/io.pxi

So we're reading again data that we've already read during the loop. Depending on the kind of file, this may be a system call or some other costly operation.

Ideally we would read piecewise into the bytes object and resize it at the end...
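A rough Python-level sketch of a piecewise readline follows. It is not the implementation from this PR; the function name and `chunk_size` default are assumptions, and it requires a seekable file. It stops on a zero-byte read, never reads more than an explicit `size`, and seeks back over bytes read past the newline instead of reading them a second time.

```python
import io


def readline_sketch(f, size=-1, chunk_size=64):
    """Read one line from a seekable binary file without reading it whole."""
    if size is None:
        size = -1
    line = bytearray()
    while size < 0 or len(line) < size:
        # Never request more than the caller asked for.
        want = chunk_size if size < 0 else min(chunk_size, size - len(line))
        chunk = f.read(want)
        if not chunk:  # zero bytes read: EOF
            break
        newline = chunk.find(b"\n")
        if newline >= 0:
            line += chunk[:newline + 1]
            # Give back the bytes that were read past the newline.
            f.seek(newline + 1 - len(chunk), io.SEEK_CUR)
            break
        line += chunk
    return bytes(line)


f = io.BytesIO(b"first line\nsecond\n")
print(readline_sketch(f, chunk_size=4))  # b'first line\n'
print(readline_sketch(f, chunk_size=4))  # b'second\n'
print(readline_sketch(f, chunk_size=4))  # b''
```

The seek is still an extra operation on the underlying handle, so a version with an internal read-ahead buffer would avoid it at the cost of more state.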

pitrou · 2018-07-16T16:51:52Z

python/pyarrow/io.pxi

Hmm... do we need the Native hack?

wesm

I'm ambivalent about the readline issue. I will defer to @pitrou and am happy to see this merged whenever he is satisfied. I think that pandas will be appeased if the object appears to be the right kind; the methods don't necessarily need to be implemented.

wesm · 2018-07-20T00:12:45Z

Would be good to get this one into 0.10.0

alendit · 2018-07-20T14:05:28Z

Got swamped today, but I'll look into it tomorrow. I'm ok with removing the readline implementation for now, since libarrow is the better place for it.

wesm · 2018-07-20T14:32:37Z

OK, great. I'll be working some this weekend so should have no problem getting this in

Minor cleanup
PEP8 adjustments
Fix flake8 warning
Escape backslash
Refactor readline
Adjust tests
Fix constant syntax
Remove ducktyping test
Switch abc to BufferedIOBase
Set mutable_data_ for mutable python buffers
Use arrow pybuffer for readinto
Deal with a too long line
Seek before reading
Add note about moving readline to libarrow
Use ReadB instead of ReadAt
Add a comment explaining name change in cython
WIP update readline algo
Vector based readline
Add unsupported operation instead of readline

alendit · 2018-07-21T19:18:38Z

OK, I made readline raise an unsupported-operation error, and made PythonFile delegate readline to the underlying handle. There are two small fixes in there, too: the first is the signature of the read method, and the second is setting the mutable_data pointer for mutable buffers.
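The shape of that change can be sketched roughly as follows. Both classes are hypothetical stand-ins, not the pyarrow code, and the use of `io.UnsupportedOperation` as the exception type is an assumption based on the wording above.

```python
import io


class BaseFileSketch:
    """Stand-in for the base class: readline is declared but unsupported."""

    def readline(self, size=None):
        raise io.UnsupportedOperation("readline")


class WrapperFileSketch(BaseFileSketch):
    """Stand-in for a wrapper around an existing Python file object."""

    def __init__(self, handle):
        self.handle = handle

    def readline(self, size=None):
        # Delegate to the wrapped object, which has its own readline.
        return self.handle.readline(-1 if size is None else size)


wrapped = WrapperFileSketch(io.BytesIO(b"a\nb\n"))
print(wrapped.readline())  # b'a\n'

try:
    BaseFileSketch().readline()
except io.UnsupportedOperation as exc:
    print("unsupported:", exc)
```

`io.UnsupportedOperation` derives from both `OSError` and `ValueError`, so callers that already catch either of those keep working.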

pitrou · 2018-07-23T10:18:11Z

Going to review again.

pitrou

Thank you very much @alendit. This looks good to me now.

wesm · 2018-07-23T17:48:59Z

Seconded, thank you!

alendit force-pushed the nativefile-rawiobase branch from 9b48f28 to aef1647 Compare July 11, 2018 13:30

pitrou requested changes Jul 12, 2018

wesm reviewed Jul 12, 2018

alendit force-pushed the nativefile-rawiobase branch 2 times, most recently from 2f1d94d to 70dc959 Compare July 16, 2018 10:36

pitrou reviewed Jul 16, 2018

wesm reviewed Jul 17, 2018

alendit force-pushed the nativefile-rawiobase branch from 70dc959 to 21e8c94 Compare July 21, 2018 12:50

Cleanup

81ecf22

pitrou approved these changes Jul 23, 2018

pitrou changed the title from "ARROW-2816: [Python] Make NativeFile RawIOBase-compliant" to "ARROW-2816: [Python] Make NativeFile BufferedIOBase-compliant" Jul 23, 2018

pitrou closed this in 355ff08 Jul 23, 2018
