[Python] zlib deflate exception when writing Parquet file #19831

Closed
asfimport opened this issue Oct 15, 2018 · 12 comments

@asfimport

The below Python code throws an exception in 0.11.0, but not in 0.10.0.

I was able to reproduce the issue on Amazon Linux, CentOS 7, and Ubuntu 16.04, but not on Windows 7.

The Amazon and CentOS machines are both running zlib 1.2.7, and the Ubuntu machine is using 1.2.8.

Tested with CPython 3.6 in all cases.

import io
import pyarrow
from pyarrow import parquet

tbl = pyarrow.Table.from_arrays([pyarrow.array(['abc', 'def'])], ['some_col'])

f = io.BytesIO()
parquet.write_table(tbl, f, compression='gzip')

Following is the exception:

Traceback (most recent call last):
  File "test_pyarrow.py", line 8, in <module>
    parquet.write_table(tbl, f, compression='gzip')
  File "/home/adam/anaconda3/lib/python3.6/site-packages/pyarrow/parquet.py", line 1125, in write_table
    writer.write_table(table, row_group_size=row_group_size)
  File "/home/adam/anaconda3/lib/python3.6/site-packages/pyarrow/parquet.py", line 376, in write_table
    self.writer.write_table(table, row_group_size=row_group_size)
  File "pyarrow/_parquet.pyx", line 934, in pyarrow._parquet.ParquetWriter.write_table
  File "pyarrow/error.pxi", line 83, in pyarrow.lib.check_status
pyarrow.lib.ArrowIOError: Arrow error: IOError: zlib deflate failed, output buffer too small

Environment: Amazon Linux, CentOS 7, Ubuntu 16.04, zlib 1.2.7/1.2.8, CPython 3.6.
Reporter: Adam Machanic
Assignee: Antoine Pitrou / @pitrou

Externally tracked issue: #2756

PRs and other links:

Note: This issue was originally created as ARROW-3514. Please see the migration documentation for further details.

@asfimport

Antoine Pitrou / @pitrou:

> The below Python code throws an exception in 0.11.0, but not in 0.10.0.

Please, when you report an exception, can you paste the exception and traceback?

@asfimport

Adam Machanic:
Sorry about that! All set.

@asfimport

Antoine Pitrou / @pitrou:
Thanks. For the record, I can't reproduce here. Also, there don't seem to be any significant changes in zlib compression between 0.10.0 and 0.11.0, so I'm a bit surprised.

How did you install Arrow and Parquet exactly?

@asfimport

Antoine Pitrou / @pitrou:
Ah, I can reproduce using the pip wheel for Arrow 0.11.0 (but not when building from source). It seems that the pip wheel may be linked with an old zlib ("pyarrow/libz-a147dcb0.so.1.2.3"). @xhochy
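One way to check which zlib an install bundles is to look for `libz*` files inside the installed package directory and read the version suffix from the file name. The following is a minimal sketch and not part of the original report: the helper name `bundled_zlib_versions` is hypothetical, and it assumes the bundled library keeps a name of the form quoted above (`libz-<hash>.so.<version>`).

```python
import re
from pathlib import Path

# Matches names such as "libz-a147dcb0.so.1.2.3" or "libz.so.1.2.11".
# The naming scheme is an assumption based on the file name quoted above.
_LIBZ_RE = re.compile(r"^libz(?:-[0-9a-f]+)?\.so\.(\d+(?:\.\d+)*)$")


def bundled_zlib_versions(package_dir):
    """Return {relative_path: version} for libz files found under package_dir."""
    root = Path(package_dir)
    found = {}
    for path in sorted(root.rglob("libz*")):
        match = _LIBZ_RE.match(path.name)
        if match:
            found[str(path.relative_to(root))] = match.group(1)
    return found


if __name__ == "__main__":
    import importlib.util

    # Locate the installed pyarrow package without importing it.
    spec = importlib.util.find_spec("pyarrow")
    if spec is None or not spec.submodule_search_locations:
        print("pyarrow is not installed")
    else:
        for location in spec.submodule_search_locations:
            print(location, bundled_zlib_versions(location))
```

An empty result does not prove that zlib is recent; it may only mean that zlib was linked statically or is loaded from the system.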

@asfimport

Antoine Pitrou / @pitrou:
According to https://www.zlib.net/ChangeLog.txt, the following issue was fixed in zlib 1.2.3.1. I'm not sure it's the culprit.

Fix compressBound(), was low for some pathological cases
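To illustrate why the bound matters, here is a small sketch using Python's standard `zlib` module. It shows that a tiny input grows when compressed, so an output buffer must be sized from a worst-case bound rather than from the input length. The `compress_bound` helper is a hypothetical Python port of the formula I believe recent zlib sources use for `compressBound()`; treat it as an assumption. It uses the zlib container, not the gzip container from the report, and says nothing about what Arrow itself does.

```python
import zlib


def compress_bound(source_len):
    """Python port of the bound computed by compressBound() in recent zlib.

    Assumption: this mirrors the formula in recent zlib sources; older
    releases may have used a different (lower) formula.
    """
    return (source_len + (source_len >> 12) + (source_len >> 14)
            + (source_len >> 25) + 13)


data = b"abc"
compressed = zlib.compress(data)

# Tiny inputs grow when compressed because of header and checksum
# overhead, so an output buffer sized to len(data) would be too small.
print(len(data), len(compressed), compress_bound(len(data)))
```

If a library underestimates this bound, a caller that allocates exactly the bound can hit an "output buffer too small" error on short or incompressible inputs.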

@asfimport

Wes McKinney / @wesm:
We probably will need to vendor a newer zlib in the wheels.

This is a pretty problematic issue. I am not sure we can wait until the next Arrow release to fix this.

@asfimport

Antoine Pitrou / @pitrou:
I downloaded a 0.10.0 wheel and it does not seem to bundle a shared zlib. So perhaps zlib was built statically back then (automatically getting whichever version we were building against). Switching to dynamic linking was at my suggestion IIRC, though I didn't know that manylinux1 would automatically bundle an old version within the wheel...
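The same check can be done on a downloaded wheel without installing it, since a wheel is a zip archive. This is a rough sketch under stated assumptions: `shared_libs_in_wheel` is a hypothetical helper, and it only matches on file names, so it cannot detect a statically linked zlib.

```python
import zipfile
from pathlib import PurePosixPath


def shared_libs_in_wheel(wheel_path, prefix="libz"):
    """List archive members that look like a bundled shared library.

    A member matches when its file name starts with `prefix` and
    contains ".so" (a name-based heuristic, not a binary inspection).
    """
    with zipfile.ZipFile(wheel_path) as wheel:
        return [
            name
            for name in wheel.namelist()
            if PurePosixPath(name).name.startswith(prefix)
            and ".so" in PurePosixPath(name).name
        ]
```

Called on two wheels of different versions, an empty list for one and a `libz-...so...` entry for the other would be consistent with the static-versus-dynamic difference described above.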

@asfimport

Wes McKinney / @wesm:
Yeah. Should manylinux1 not be bundling zlib at all? In any case, we should not let this issue linger for long: either we post fixed p1 wheels or we do a 0.11.1 Arrow release.

@asfimport

Antoine Pitrou / @pitrou:
I've posted a workaround PR that fixes the issue on the posted test case.

I think not shipping zlib at all should be reasonable, but I don't think we want to try it on a bugfix release. Also, it seems the zlib bundling may be done by the "auditwheel" utility.

@asfimport

Uwe Korn / @xhochy:
auditwheel automatically vendors all libraries that need to be shipped in the wheel because they are not part of the base system defined by the manylinux1 specification. We should definitely build a newer version of zlib, but we should still bundle it in the wheel.

@asfimport

Krisztian Szucs / @kszucs:
Issue resolved by pull request 2771
#2771

@asfimport

Krisztian Szucs / @kszucs:
@xhochy I've merged the fix. What's the verdict about the packaging?

@asfimport added this to the 0.11.1 milestone on Jan 11, 2023