[Python] Appending to streamable table file format doesn't seem to work #18955

Open
asfimport opened this issue May 14, 2018 · 6 comments

@asfimport

As far as I can tell, appending to a streaming file format isn’t currently supported. Is that right?

RecordBatchStreamWriter always writes the schema up front, and it doesn’t look like a schema is expected mid-file. Assuming I’m doing this append test correctly, this is the error I hit when I try to read the file back into Python:

Traceback (most recent call last):
  File "/home/ra7293/rba_arrow_mmap.py", line 9, in
    table = reader.read_all()
  File "ipc.pxi", line 302, in pyarrow.lib._RecordBatchReader.read_all
  File "error.pxi", line 79, in pyarrow.lib.check_status
pyarrow.lib.ArrowIOError: Message not expected type: record batch, was: 1

This reader script works fine if I write once / don’t append.

Seeing as the IO interfaces support append, streaming should support it as well. If for whatever reason this can't be supported, RecordBatchStreamWriter should throw when it is configured with an OutputStream that is attempting to append.
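
A note on the error text: in the MessageHeader union of Arrow's Message.fbs, Schema is 1, DictionaryBatch is 2 and RecordBatch is 3, so "was: 1" means the reader met a schema message where it expected a record batch. That fits the suspicion that the appended stream's second schema is what trips the reader. The following is a toy model only, not Arrow's reader: `write_stream`, `read_all` and `StreamError` are hypothetical names, messages are plain tuples, and dictionary batches and end-of-stream markers are left out.

```python
from enum import IntEnum


class MessageType(IntEnum):
    # Values mirror the MessageHeader union in Arrow's Message.fbs.
    SCHEMA = 1
    DICTIONARY_BATCH = 2  # listed for numbering only; unused in this toy
    RECORD_BATCH = 3


class StreamError(Exception):
    """Hypothetical stand-in for the reader's IO error."""


def write_stream(batches):
    """Toy writer: one schema message up front, then one message per batch."""
    return [(MessageType.SCHEMA, None)] + [
        (MessageType.RECORD_BATCH, batch) for batch in batches
    ]


def read_all(messages):
    """Toy reader: expects the schema first and only record batches after it."""
    if not messages or messages[0][0] != MessageType.SCHEMA:
        raise StreamError("stream does not start with a schema message")
    batches = []
    for kind, payload in messages[1:]:
        if kind != MessageType.RECORD_BATCH:
            raise StreamError(
                f"Message not expected type: record batch, was: {int(kind)}"
            )
        batches.append(payload)
    return batches


first = write_stream(["batch-0", "batch-1"])
second = write_stream(["batch-2"])

# A single stream reads back without trouble.
print(read_all(first))

# Appending a second stream puts a schema message in the middle.
try:
    read_all(first + second)
except StreamError as exc:
    print(exc)
```

In this model an appending writer would have to skip the schema message when the target already holds a stream, or the reader would have to tolerate a repeated, identical schema.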

Reporter: Rob Ambalu / @robambalu

Note: This issue was originally created as ARROW-2579. Please see the migration documentation for further details.

@asfimport

Antoine Pitrou / @pitrou:
Thanks for reporting. Is it possible for you to post a complete piece of code reproducing the problem?

@asfimport

Rob Ambalu / @robambalu:
Unfortunately, the repro is intertwined with a lot of existing code and wrapper code that I have. If you really need it, I'll put together a standalone version, but it'll take a bit of time.

@asfimport

Antoine Pitrou / @pitrou:
It would be better with a standalone reproducer. Otherwise we don't know whether it depends on using specific datatypes or features.

@asfimport

Rob Ambalu / @robambalu:
So I was just about to write a simple repro, but I realized I can't, because the append-mode patch for FileOutputStream hasn't been merged yet.

You need a streamer with an "append" option to repro the issue. As far as I can see, there's only FileOutputStream (broken) and Hadoop, which I'm not sure how to bootstrap. I hit this issue in my code because I wrote my own MMapStreamer that supports append.

I've actually worked around this at this point by creating new .idx extension files whenever I find an existing file and want to append (the reader then concatenates them all).
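
For readers landing here, the segment-file workaround described above might look roughly like the sketch below. It is not the reporter's code: the `<name>.<n>.idx` naming scheme and all function names are assumptions made for illustration, and the stream writer and reader are passed in as callbacks so the sketch runs with the standard library alone.

```python
import tempfile
from pathlib import Path


def _segment(base, n):
    # Hypothetical naming scheme: table.arrows -> table.arrows.1.idx, ...
    return base.with_name(f"{base.name}.{n}.idx")


def next_segment_path(base):
    """Return `base` if it is unused, else the first free segment path."""
    if not base.exists():
        return base
    n = 1
    while _segment(base, n).exists():
        n += 1
    return _segment(base, n)


def segment_paths(base):
    """List the base file followed by its segments, in write order."""
    paths = [base] if base.exists() else []
    n = 1
    while _segment(base, n).exists():
        paths.append(_segment(base, n))
        n += 1
    return paths


def append_stream(base, write_stream):
    """Write one complete stream into a fresh file instead of appending."""
    target = next_segment_path(base)
    with open(target, "xb") as sink:  # "x" refuses to overwrite a file
        write_stream(sink)
    return target


def read_all_segments(base, read_stream):
    """Read every segment with `read_stream` and return the results in order."""
    results = []
    for path in segment_paths(base):
        with open(path, "rb") as source:
            results.append(read_stream(source))
    return results


# Demo with opaque byte payloads standing in for complete streams.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp) / "table.arrows"
    append_stream(base, lambda sink: sink.write(b"first"))
    append_stream(base, lambda sink: sink.write(b"second"))
    print(read_all_segments(base, lambda source: source.read()))
```

With pyarrow, the callbacks would presumably wrap `pyarrow.ipc.new_stream` and `pyarrow.ipc.open_stream`, and the per-segment tables could then be joined with `pyarrow.concat_tables`. Each segment stays a complete, self-describing stream, so the reader never sees a schema mid-file.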

@asfimport

Wes McKinney / @wesm:
I think the appendable output stream is available now. Any chance of a repro for this?

@asfimport

Tim Cooijmans:
I found this issue through a Google search. At this time, it's not clear how to do appends from Python as FileOutputStream does not seem to be exposed by pyarrow.

(A naive use of pyarrow's OSFile fails because it rejects the "ab" file mode, expecting either read or write, and a naive use of the cat utility results in only the most recent data being readable from the file.)
