Skip to content

ARROW-2816: [Python] Make NativeFile BufferedIOBase-compliant#2253

Closed
alendit wants to merge 2 commits intoapache:masterfrom
alendit:nativefile-rawiobase
Closed

ARROW-2816: [Python] Make NativeFile BufferedIOBase-compliant#2253
alendit wants to merge 2 commits intoapache:masterfrom
alendit:nativefile-rawiobase

Conversation

@alendit
Copy link
Copy Markdown

@alendit alendit commented Jul 11, 2018

See JIRA.

This PR adds implements some of the missing methods from BufferedIOBase and registers NativeFile (and so all it's descendants) with BufferedIOBase ABC.

@alendit alendit force-pushed the nativefile-rawiobase branch from 9b48f28 to aef1647 Compare July 11, 2018 13:30
@alendit
Copy link
Copy Markdown
Author

alendit commented Jul 11, 2018

I'm not sure why the CI build is failing...

@wesm
Copy link
Copy Markdown
Member

wesm commented Jul 11, 2018

Super weird error:

byte-compiling /home/travis/build/apache/arrow/pyarrow-test-3.6/lib/python3.6/site-packages/pyarrow/types.py to types.cpython-36.pyc
running install_egg_info
Copying pyarrow.egg-info to /home/travis/build/apache/arrow/pyarrow-test-3.6/lib/python3.6/site-packages/pyarrow-0.1.dev55+gb701ad3-py3.6.egg-info
running install_scripts
Installing plasma_store script to /home/travis/build/apache/arrow/pyarrow-test-3.6/bin
writing list of installed files to 'record.text'

�[31;01mWarning, treated as error:�[39;49;00m
docstring of pyarrow.BufferOutputStream.readline:6:Block quote ends without a blank line; unexpected unindent.

travis_time:end:01410426:start=1531317164436301464,finish=1531317673168573260,duration=508732271796
�[0K
�[31;1mThe command "$TRAVIS_BUILD_DIR/ci/travis_script_python.sh 3.6" exited with 2.�[0m

@alendit
Copy link
Copy Markdown
Author

alendit commented Jul 11, 2018

Oh, it was an unescaped backslash in the readline docstring of NativeFile which confused sphinx.

@codecov-io
Copy link
Copy Markdown

codecov-io commented Jul 11, 2018

Codecov Report

Merging #2253 into master will increase coverage by 2.35%.
The diff coverage is 70%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #2253      +/-   ##
==========================================
+ Coverage   84.29%   86.65%   +2.35%     
==========================================
  Files         292      236      -56     
  Lines       44559    41907    -2652     
==========================================
- Hits        37562    36314    -1248     
+ Misses       6966     5593    -1373     
+ Partials       31        0      -31
Impacted Files Coverage Δ
python/pyarrow/tests/test_io.py 99.12% <100%> (+0.06%) ⬆️
cpp/src/arrow/python/common.cc 90.62% <100%> (+0.3%) ⬆️
python/pyarrow/io.pxi 60.56% <42.55%> (-1.63%) ⬇️
cpp/src/arrow/util/thread-pool-test.cc 98.91% <0%> (-0.55%) ⬇️
rust/src/error.rs
go/arrow/array/booleanbuilder.go
go/arrow/internal/testing/tools/bits.go
rust/src/buffer.rs
go/arrow/memory/memory_sse4_amd64.go
go/arrow/math/int64_amd64.go
... and 50 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7d2fbeb...81ecf22. Read the comment docs.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would say PyBUF_CONTIG, to mandate writability.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's simpler to just say buf = <uint8_t*> py_buffer.buf (assuming that works).

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You need to call PyBuffer_Release(&py_buffer) at the end of the function, otherwise there will be a leak.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So if I'm calling readline on a 2GB file, this will allocate a 2GB bytestring? That doesn't sound like a very good idea.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At some point you should check whether bytes_read is 0.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this test is necessary.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can merge this test and the previous one.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's dubious that this should be RawIOBase, since a BytesIO is buffered. OTOH this could be considered a best-effort registration.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it specified somewhere? There is BufferedIOBase, but the doc strings on (Raw)IOBase don't seem to prescribe implementation details.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The best reference is the io module docs: https://docs.python.org/3/library/io.html#class-hierarchy

In short, RawIOBase is the ABC for raw I/O, which has different semantics from buffered I/O. One important difference is that raw I/O can return "short reads", e.g. read(5) may return less than 5 bytes and it doesn't mean that EOF was reached.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I see what you mean. What do you think is the best course here: keep RawIOBase as ABC and test with, say, FileIO or switch to IOBase as ABC, since PythonFile can wrap any file-like object.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should choose between either RawIOBase and BufferedIOBase. I think it should depend mainly on the "short reads" question.

Copy link
Copy Markdown
Author

@alendit alendit Jul 12, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do I understand it correctly, that "short reads" without EOF can only occur, when we reading from a TTY (i.e. interactive file)? Right now we always return false for isatty, so the user should expect a short read to mean EOF. This might be wrong, if someone wraps a TTY in PythonFile, but this use case should be rare enough that a warning in a comment would be sufficient.

So, I think RawIOBase describes the current use case better.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the system level, short reads can occur when reading from any file. When doing disk I/O, they are rather rare (but can occur if e.g. the system call is interrupted by a signal). They are quite common from a socket or any other character device.

However, our FileRead routine loops until the entire read is satisfied, so our IO classes should satisfy as buffered IO.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed it to BufferedIOBase.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of using hasattr, how about just iterating on the file?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to test the failure mentioned in the JIRA ticket explicitly, i.e. have the exact same check as pandas does. The iterating over file is tested in test_python_file_iterable, too.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the file is iterable, then by definition it will have a __iter__, so this specific test isn't needed.

(also, Pandas testing for IO by checking for iterability is quite broken IMHO, but that's Pandas' problem)

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed the test.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It strikes me that you added a lot of tests for PythonFile but none for NativeFile (which is the focus of this PR). Did you forget them?

Copy link
Copy Markdown
Author

@alendit alendit Jul 12, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NativeFile ist kind of an abstract base and cannot used directly (it doesn't set rd_file/wr_file, for example). PythonFile is a close descendant which just specifies a custom initializer. Looking at the existing tests, none seem to test NativeFile directly.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm... wow, you're right, I had forgotten about that.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the purpose of NativeFile is to provide the abstract interfaces to say "I have a C++ object implementing one of the Arrow C++ IO interfaces" so that C++ code can acquire the internals of a Python-created file resource and use it without interacting with the GIL, or having to pass through PyBytes object in some cases

@alendit
Copy link
Copy Markdown
Author

alendit commented Jul 12, 2018

Hi Antoine,

thanks for looking into this so quickly! I made some adjustments and comments.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could use the Buffer object to acquire the buffer interface

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe use Read here? This will prevent additional seeks on the underlying file handle

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ist there a reason why Readable::Read(int64_t, shared_ptr<Buffer>) ist called ReadB in Cython? In other places we use method overloading.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, no idea. Can you try fixing it?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's apparently a cython bug which prevents proper overload resolution with multiple layers of inheritance. I put a comment explaining the name change.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might leave a bookmark to move this implementation into libarrow

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will fail lint checks

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the purpose of NativeFile is to provide the abstract interfaces to say "I have a C++ object implementing one of the Arrow C++ IO interfaces" so that C++ code can acquire the internals of a Python-created file resource and use it without interacting with the GIL, or having to pass through PyBytes object in some cases

@alendit alendit force-pushed the nativefile-rawiobase branch 2 times, most recently from 2f1d94d to 70dc959 Compare July 16, 2018 10:36
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be nice to move all the cdefs in the cdef: block above.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is useful. bytes_to_read == 0 is the real EOF condition.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok... I really don't like that we're reinventing our own suboptimal readline implementation, just because Pandas somewhat requires it. @wesm do you think there's a way to please Pandas without doing this? Perhaps raise NotImplementedError?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I don't like it too much either. Why don't we delete this implementation for now, raise NotImplementedError, and open a JIRA about adding a more optimal implementation within libarrow. @alendit is this OK with you? I know you already spent a bunch of effort on this

We can leave it, too. I will defer to @pitrou judgment on this since he's more familiar with CPython internals

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're not using ReadAt anymore, are we? The comment needs updating.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A default 4kB buffer for reading a single line is definitely not the right size... Also I'm not sure it makes sense to destroy and reallocate a new buffer every time.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also there's no need to read that much if an explicit size was passed.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we're reading again data that we've already read during the loop. Depending on the kind of file, this may be a system call or some other costly operation.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally we would ready piecewise into the bytes object and resize it at the end...

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm... do we need the Native hack?

Copy link
Copy Markdown
Member

@wesm wesm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm ambivalent about the readline issue. I will defer to @pitrou and happy to see this merged whenever he is satisfied. I think that pandas will be appeased if it appears to be the right kind of object, but they don't need to be implemented necessarily

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I don't like it too much either. Why don't we delete this implementation for now, raise NotImplementedError, and open a JIRA about adding a more optimal implementation within libarrow. @alendit is this OK with you? I know you already spent a bunch of effort on this

We can leave it, too. I will defer to @pitrou judgment on this since he's more familiar with CPython internals

@wesm
Copy link
Copy Markdown
Member

wesm commented Jul 20, 2018

Would be good to get this one into 0.10.0

@alendit
Copy link
Copy Markdown
Author

alendit commented Jul 20, 2018

Got swamped today, but I'll look into it tomorrow. I'm ok with removing the readline implementation for now, since libarrow is the better place for it.

@wesm
Copy link
Copy Markdown
Member

wesm commented Jul 20, 2018

OK, great. I'll be working some this weekend so should have no problem getting this in

Minor cleanup

PEP8 adjustments

Fix flake8 warning

Escape backslash

Refactor readline

Adjust tests

Fix constant syntax

Remove ducktyping test

Switch abc to BufferedIOBase

Set mutable_data_ for mutable python buffers

Use arrow pybuffer for readinto

Deal with a too long line

Seek before reading

Add note about moving readline to libarrow

Use ReadB instead of ReadAt

Add a comment explaining name change in cython

WIP update readline algo

Vector based readline

Add unsupported operation instead of readline
@alendit alendit force-pushed the nativefile-rawiobase branch from 70dc959 to 21e8c94 Compare July 21, 2018 12:50
@alendit
Copy link
Copy Markdown
Author

alendit commented Jul 21, 2018

OK, I made readline raise unsupported operation, and made PythonFile delegate readline to the underlying handle. There two small fixes in there, too. First is the signature or the read method and second ist setting of mutable_data pointer for mutable buffers.

@pitrou
Copy link
Copy Markdown
Member

pitrou commented Jul 23, 2018

Going to review again.

Copy link
Copy Markdown
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much @alendit. This looks good to me now.

@pitrou pitrou changed the title ARROW-2816: [Python] Make NativeFile RawIOBase-compliant ARROW-2816: [Python] Make NativeFile BufferedIOBase-compliant Jul 23, 2018
@pitrou pitrou closed this in 355ff08 Jul 23, 2018
@wesm
Copy link
Copy Markdown
Member

wesm commented Jul 23, 2018

Seconded, thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants