
Error writing record batches to IPC streaming format #25984

Closed
asfimport opened this issue Sep 10, 2020 · 2 comments

Writing record batches to the Arrow IPC streaming format with on-the-fly compression generally raises errors of one kind or another when the stream is read back.

Please find attached the code producing each of the errors below. I can't reproduce the problem with smaller batch sizes, so it probably has to do with the size of each record batch. It does not seem specific to pyarrow, since I see a similar issue with the C GLib API.
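
For illustration, here is a minimal sketch of the write/read pattern that triggers these errors. The file name, compression codec, batch contents, and batch count below are placeholders for this sketch, not the contents of the attached scripts.

import pyarrow as pa

# Hypothetical stand-ins for the constants used in the attached scripts.
FILE = "batches.arrows.gz"
COMPRESSION_TYPE = "gzip"

# A batch large enough that part of it stays buffered in the compressor.
batch = pa.RecordBatch.from_pydict({"x": list(range(1_000_000))})

sink = pa.output_stream(FILE, compression=COMPRESSION_TYPE)
writer = pa.RecordBatchStreamWriter(sink, batch.schema)
for _ in range(5):
    writer.write_batch(batch)
# Neither writer.close() nor sink.close() is called, so the compressed
# stream is never finalized.

source = pa.input_stream(FILE, compression=COMPRESSION_TYPE)
reader = pa.ipc.open_stream(source)
batches = list(reader)  # fails with one of the errors shown below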

Error case 1:

~/py376/lib/python3.7/site-packages/pyarrow/ipc.pxi in pyarrow.lib._CRecordBatchReader.read_next_batch()

~/py376/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()

OSError: Truncated compressed stream

Error case 2:

~/py376/lib/python3.7/site-packages/pyarrow/ipc.pxi in pyarrow.lib._RecordBatchStreamReader._open()

~/py376/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()

~/py376/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowInvalid: Tried reading schema message, was null or length 0

Environment:
pyarrow - version 1.0.1
python - version 3.7.6
operating system - CentOS Linux release 7.8.2003 (Core)
Reporter: Ishan

Original Issue Attachments:

Note: This issue was originally created as ARROW-9958. Please see the migration documentation for further details.


Kouhei Sutou / @kou:
You need to ensure that writers and output streams are closed.

For example1.py:

sink = pa.output_stream(FILE, COMPRESSION_TYPE)
writer = pa.RecordBatchStreamWriter(sink, batch.schema)
for _ in range(5):
    writer.write_batch(batch)
writer.close()  # flush buffered IPC data and write the end-of-stream marker
sink.close()    # finalize the compressed output stream

For example2.py:

sink = pa.output_stream(FILE, COMPRESSION_TYPE)
writer = pa.RecordBatchStreamWriter(sink, batch.schema)
for _ in range(5):
    writer.write_batch(batch)
writer.close()  # flush buffered IPC data and write the end-of-stream marker
sink.close()    # finalize the compressed output stream
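
Beyond the explicit close() calls in the reply above, a sketch of the same fix using context managers may be useful: pyarrow's NativeFile and RecordBatchStreamWriter both support the with statement, which closes them automatically when the block exits.

with pa.output_stream(FILE, COMPRESSION_TYPE) as sink:
    with pa.RecordBatchStreamWriter(sink, batch.schema) as writer:
        for _ in range(5):
            writer.write_batch(batch)
# The writer is closed first, then the compressed sink, in that order.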


Ishan:
Thank you. With the C GLib API as well, I did close the writer but missed closing the sink.
