"zlib deflate failed, output buffer too small" error when writing a Parquet + GZIP file into S3 using the S3 table function #63319
Comments
OK, I've got a reproducer. If we have two fields with long strings it works, but with a single UInt64 column and enough rows it fails; with fewer rows it works again. A sketch of these cases is given below.
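The reproducer queries themselves were not captured on this page; what follows is a minimal sketch of the shapes described above, assuming a hypothetical bucket, placeholder credentials, and the `output_format_parquet_compression_method` setting (row counts are illustrative):

```sql
-- Two long String columns: works.
INSERT INTO FUNCTION s3('https://my-bucket.s3.amazonaws.com/strings.parquet',
                        'ACCESS_KEY', 'SECRET_KEY', 'Parquet')
SELECT randomPrintableASCII(1000) AS s1, randomPrintableASCII(1000) AS s2
FROM numbers(100000)
SETTINGS output_format_parquet_compression_method = 'gzip';

-- A single UInt64 column with enough rows: fails with
-- "zlib deflate failed, output buffer too small"; with fewer rows it works.
INSERT INTO FUNCTION s3('https://my-bucket.s3.amazonaws.com/uint64.parquet',
                        'ACCESS_KEY', 'SECRET_KEY', 'Parquet')
SELECT number AS n
FROM numbers(1000000)
SETTINGS output_format_parquet_compression_method = 'gzip';
```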
It doesn't seem exclusive to UInt64 either; at some column block size the compression breaks. For example, the same thing happens with Int8: it works with fewer rows, but inserting 560k rows fails. So there seems to be a threshold at which it breaks. If we enable the custom encoder it works even when inserting 1M rows. A sketch of these cases follows below.
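Again, the exact queries are not shown above; this is a sketch of the Int8 case under the same assumptions (placeholder bucket and credentials), with the custom-encoder variant enabled via `output_format_parquet_use_custom_encoder`:

```sql
-- Int8 column, 560k rows: fails with the zlib error when the Arrow-based writer is used.
INSERT INTO FUNCTION s3('https://my-bucket.s3.amazonaws.com/int8.parquet',
                        'ACCESS_KEY', 'SECRET_KEY', 'Parquet')
SELECT toInt8(number % 100) AS x
FROM numbers(560000)
SETTINGS output_format_parquet_compression_method = 'gzip';

-- With the custom encoder enabled it works even for 1M rows.
INSERT INTO FUNCTION s3('https://my-bucket.s3.amazonaws.com/int8.parquet',
                        'ACCESS_KEY', 'SECRET_KEY', 'Parquet')
SELECT toInt8(number % 100) AS x
FROM numbers(1000000)
SETTINGS output_format_parquet_compression_method = 'gzip',
         output_format_parquet_use_custom_encoder = 1;
```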
@tb-fjurado, since the issue you're facing seems to be exclusively related to Arrow, have you tried enabling the custom encoder?
Yes, I tried enabling it and it worked as expected. I see that #63210 has already been merged, so I understand that the custom encoder is the way forward and we don't want to invest more time in Arrow, right? Also, does anybody know from which ClickHouse version we could consider the custom encoder stable, even if it was not set as default? I saw a comment from Aug 2023 saying it should be good enough to enable, and I haven't found many (if any) further changes to the encoder in the PRs since then. Just to know whether we can enable it right away in our current ClickHouse deployments or we need to upgrade. Thanks!
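For reference, a minimal sketch of enabling the custom encoder for a whole session rather than per query, assuming the setting is named `output_format_parquet_use_custom_encoder`:

```sql
-- Applies to all subsequent queries in this session; it could also be placed
-- in a settings profile so every Parquet export uses the custom encoder.
SET output_format_parquet_use_custom_encoder = 1;
```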
Yes, the last significant fix was #52951 (August 2023), so 23.10+ should be good. And yes, it seems to be a problem with zlib-ng. Maybe I'll investigate more.
Some of our customers have gotten the error "zlib deflate failed, output buffer too small" while trying to write a GZIP-compressed Parquet file to S3 using the S3 table function.
Multiple runs of the same query still produce the error.
Does it reproduce on the most recent release?
Yes, tested on 24.2.1 and 24.4.1 (latest Docker container available) and it still happens.
How to reproduce
Running the query produces the error shown above. It seems this doesn't happen with all types and depends on the row size; see this comment for more tests. Also, it doesn't happen when using the custom encoder, so it may somehow be tied to Arrow? A sketch of the failing query shape is included below.
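The report's original query and output are not reproduced here; under the same assumptions as the sketches above (placeholder bucket, credentials, and row count), the failing export is roughly of this shape:

```sql
INSERT INTO FUNCTION s3('https://customer-bucket.s3.amazonaws.com/export.parquet',
                        'ACCESS_KEY', 'SECRET_KEY', 'Parquet')
SELECT number AS id
FROM numbers(1000000)
SETTINGS output_format_parquet_compression_method = 'gzip';
-- Fails with: "zlib deflate failed, output buffer too small"
```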
Expected behavior
Writing Parquet compressed with GZIP using the S3 table function works.
Error message and/or stacktrace
Additional context
The error seems to be coming from zlib and there is very little information about it on the internet. Checking the code, it seems that zlib makes a wrong estimate of the output buffer size to allocate? The only thing I've been able to find is this bug report from Apache Arrow, where the problem seems to be caused by an older version of zlib. Maybe it's a matter of upgrading?