
zlib deflate failed, output buffer too small error when writing a Parquet + GZIP file into S3 using the S3 table function #63319

Open
tb-fjurado opened this issue May 3, 2024 · 5 comments · May be fixed by #64489
Labels
bug Confirmed user-visible misbehaviour in official release

Comments

tb-fjurado commented May 3, 2024

Some of our customers have gotten the following error while trying to write a GZIP-compressed Parquet file to S3 using the S3 table function:

Code: 1002. DB::Exception: Error while writing a table: IOError: zlib deflate failed, output buffer too small. () (version 24.2.1.2248 (official build))

Multiple runs of the same query still produce the error.

Does it reproduce on the most recent release?

Yes, tested in 24.2.1 and 24.4.1 (latest Docker container available) and it still happens.

How to reproduce

INSERT INTO FUNCTION s3('https://test-sinks.s3.eu-west-3.amazonaws.com/rep_deflate/myfile.parquet', 'REDACTED', 'REDACTED', 'Parquet')
SETTINGS output_format_parquet_compression_method = 'gzip'
SELECT *
FROM generateRandom('a UInt64', 1, 1024, 2)
LIMIT 100000

This will produce the following output:

Received exception from server (version 24.2.1):
Code: 1002. DB::Exception: Received from localhost:9000. DB::Exception: Error while writing a table: IOError: zlib deflate failed, output buffer too small. ()

It seems that it doesn't happen with all types and depends on the row size; see this comment for more tests. Also, it doesn't happen when using the custom encoder, so it may somehow be tied to Arrow?

Expected behavior

Writing Parquet compressed with GZIP using the S3 table function works.

Error message and/or stacktrace

exception:                             Code: 1002. DB::Exception: Error while writing a table: IOError: zlib deflate failed, output buffer too small. () (version 24.2.1.2248 (official build))
stack_trace:                           0. ./build_docker/./src/Common/Exception.cpp:96: DB::Exception::Exception(DB::Exception::MessageMasked&&, int, bool) @ 0x000000000cf5565b
1. DB::Exception::Exception<String>(int, FormatStringHelperImpl<std::type_identity<String>::type>, String&&) @ 0x00000000078c7743
2. ./build_docker/./src/Processors/Formats/Impl/ParquetBlockOutputFormat.cpp:0: DB::ParquetBlockOutputFormat::writeRowGroup(std::vector<DB::Chunk, std::allocator<DB::Chunk>>) @ 0x00000000131c8b53
3. ./build_docker/./src/Processors/Formats/Impl/ParquetBlockOutputFormat.cpp:0: DB::ParquetBlockOutputFormat::consume(DB::Chunk) @ 0x00000000131c4d88
4. ./contrib/llvm-project/libcxx/include/__memory/shared_ptr.h:701: DB::IOutputFormat::write(DB::Block const&) @ 0x000000001303ebbf
5. ./build_docker/./src/Storages/StorageS3.cpp:0: DB::StorageS3Sink::consume(DB::Chunk) @ 0x0000000012685b57
6. ./contrib/llvm-project/libcxx/include/__memory/shared_ptr.h:701: DB::PartitionedSink::consume(DB::Chunk) @ 0x000000001236e7df
7. ./contrib/llvm-project/libcxx/include/__memory/shared_ptr.h:701: DB::SinkToStorage::onConsume(DB::Chunk) @ 0x000000001336d802
8. ./contrib/llvm-project/libcxx/include/__memory/shared_ptr.h:701: void std::__function::__policy_invoker<void ()>::__call_impl<std::__function::__default_alloc_func<DB::ExceptionKeepingTransform::work()::$_1, void ()>>(std::__function::__policy_storage const*) @ 0x000000001329dd2b
9. ./contrib/llvm-project/libcxx/include/__functional/function.h:848: ? @ 0x000000001329da3c
10. ./contrib/llvm-project/libcxx/include/__functional/function.h:818: ? @ 0x000000001329d113
11. ./build_docker/./src/Processors/Executors/ExecutionThreadContext.cpp:0: DB::ExecutionThreadContext::executeTask() @ 0x0000000013031afa
12. ./build_docker/./src/Processors/Executors/PipelineExecutor.cpp:273: DB::PipelineExecutor::executeStepImpl(unsigned long, std::atomic<bool>*) @ 0x0000000013028550
13. ./contrib/llvm-project/libcxx/include/__memory/shared_ptr.h:833: void std::__function::__policy_invoker<void ()>::__call_impl<std::__function::__default_alloc_func<DB::PipelineExecutor::spawnThreads()::$_0, void ()>>(std::__function::__policy_storage const*) @ 0x0000000013029638
14. ./base/base/../base/wide_integer_impl.h:810: ThreadPoolImpl<ThreadFromGlobalPoolImpl<false>>::worker(std::__list_iterator<ThreadFromGlobalPoolImpl<false>, void*>) @ 0x000000000cff84e1
15. ./build_docker/./src/Common/ThreadPool.cpp:0: void std::__function::__policy_invoker<void ()>::__call_impl<std::__function::__default_alloc_func<ThreadFromGlobalPoolImpl<false>::ThreadFromGlobalPoolImpl<void ThreadPoolImpl<ThreadFromGlobalPoolImpl<false>>::scheduleImpl<void>(std::function<void ()>, Priority, std::optional<unsigned long>, bool)::'lambda0'()>(void&&)::'lambda'(), void ()>>(std::__function::__policy_storage const*) @ 0x000000000cffbd1a
16. ./base/base/../base/wide_integer_impl.h:810: void* std::__thread_proxy[abi:v15000]<std::tuple<std::unique_ptr<std::__thread_struct, std::default_delete<std::__thread_struct>>, void ThreadPoolImpl<std::thread>::scheduleImpl<void>(std::function<void ()>, Priority, std::optional<unsigned long>, bool)::'lambda0'()>>(void*) @ 0x000000000cffab1e
17. ? @ 0x000071b496e94ac3
18. ? @ 0x000071b496f26850

Additional context

The error seems to be coming from zlib, and there is very little information about it on the internet. Checking the code, it seems that zlib makes a wrong estimate of the buffer size to allocate? The only thing I've been able to find is this bug report from Apache Arrow, where the problem seems to be caused by an older version of zlib. Maybe it's a matter of upgrading?
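
For what it's worth, the failure mode can be checked in isolation against plain zlib/zlib-ng: size the output buffer with deflateBound() and see whether a single-shot deflate(Z_FINISH) fits into it. Below is a minimal standalone sketch of that check (an illustration only, not code from the ClickHouse or Arrow sources; it assumes a gzip wrapper like the Parquet GZIP codec uses):

#include <zlib.h>
#include <cstdio>
#include <cstdlib>
#include <vector>

int main()
{
    // Roughly incompressible input; 800000 bytes is about one 100k-row UInt64 column chunk.
    const size_t src_len = 800000;
    std::vector<unsigned char> src(src_len);
    for (auto & b : src)
        b = static_cast<unsigned char>(std::rand());

    z_stream strm{};
    // gzip wrapper (windowBits 15 + 16), as a Parquet GZIP codec would use.
    deflateInit2(&strm, Z_DEFAULT_COMPRESSION, Z_DEFLATED, 15 + 16, 8, Z_DEFAULT_STRATEGY);

    // The writer sizes its output buffer from deflateBound() (Arrow adds a small margin on top).
    const uLong bound = deflateBound(&strm, static_cast<uLong>(src.size()));
    std::vector<unsigned char> dst(bound);

    strm.next_in = src.data();
    strm.avail_in = static_cast<uInt>(src.size());
    strm.next_out = dst.data();
    strm.avail_out = static_cast<uInt>(dst.size());

    // If the bound is honest, a single-shot deflate(Z_FINISH) must return Z_STREAM_END.
    // Getting Z_OK back here means the output buffer was too small, i.e. the error above.
    const int rc = deflate(&strm, Z_FINISH);
    std::printf("bound=%lu written=%lu rc=%d (Z_STREAM_END=%d)\n",
                bound, strm.total_out, rc, Z_STREAM_END);

    deflateEnd(&strm);
    return 0;
}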

tb-fjurado added the potential bug label (To be reviewed by developers and confirmed/rejected) on May 3, 2024
tb-fjurado commented May 3, 2024

OK, I've got a reproducer:

INSERT INTO FUNCTION s3('https://test-sinks.s3.eu-west-3.amazonaws.com/rep_deflate/myfile.parquet', 'REDACTED', 'REDACTED', 'Parquet')
SETTINGS output_format_parquet_compression_method = 'gzip'
SELECT *
FROM generateRandom('a UInt64, d UInt64', 1, 64, 2)
LIMIT 100000

Query id: b40c068e-10f5-4f77-b598-0cbfffb37943


Elapsed: 0.031 sec.

Received exception from server (version 24.2.1):
Code: 1002. DB::Exception: Received from localhost:9000. DB::Exception: Error while writing a table: IOError: zlib deflate failed, output buffer too small. ()

If we have two fields with long strings it works:

INSERT INTO FUNCTION s3('https://test-sinks.s3.eu-west-3.amazonaws.com/rep_deflate/myfile.parquet', 'REDACTED', 'REDACTED', 'Parquet')
SETTINGS output_format_parquet_compression_method = 'gzip'
SELECT *
FROM generateRandom('a String, d String', 1, 1024, 2)
LIMIT 100000

Query id: 5244e390-3f49-4ce6-9081-c25af8644146

Ok.

0 rows in set. Elapsed: 3.304 sec. Processed 100.19 thousand rows, 104.38 MB (30.32 thousand rows/s., 31.59 MB/s.)
Peak memory usage: 258.04 MiB.

If we have a single UInt64 column with enough rows it fails:

INSERT INTO FUNCTION s3('https://test-sinks.s3.eu-west-3.amazonaws.com/rep_deflate/myfile.parquet', 'REDACTED', 'REDACTED', 'Parquet')
SETTINGS output_format_parquet_compression_method = 'gzip'
SELECT *
FROM generateRandom('a UInt64', 1, 1024, 2)
LIMIT 100000

Query id: 581138a6-f3e2-4f38-b114-8e2e45dfda34


Elapsed: 0.031 sec.

Received exception from server (version 24.2.1):
Code: 1002. DB::Exception: Received from localhost:9000. DB::Exception: Error while writing a table: IOError: zlib deflate failed, output buffer too small. ()

But if we use fewer rows it works:

INSERT INTO FUNCTION s3('https://test-sinks.s3.eu-west-3.amazonaws.com/rep_deflate/myfile.parquet', 'REDACTED', 'REDACTED', 'Parquet')
SETTINGS output_format_parquet_compression_method = 'gzip'
SELECT *
FROM generateRandom('a UInt64', 1, 1024, 2)
LIMIT 10000

Query id: 5059d82c-f15a-4a19-8123-f7c9cd5388af

Ok.

0 rows in set. Elapsed: 0.155 sec. Processed 10.00 thousand rows, 80.00 KB (64.49 thousand rows/s., 515.95 KB/s.)
Peak memory usage: 16.86 KiB.

tb-fjurado commented May 3, 2024

It doesn't seem exclusive to UInt64 either; it seems the compression breaks at some column block size. For example, here's a case with Int8. This works:

INSERT INTO FUNCTION s3('https://test-sinks.s3.eu-west-3.amazonaws.com/rep_deflate/myfile.parquet', 'REDACTED', 'REDACTED', 'Parquet')
SETTINGS output_format_parquet_compression_method = 'gzip'
SELECT *
FROM generateRandom('a Int8', 1, 1024, 2)
LIMIT 550000

Query id: df0dee41-9aa5-467f-8db1-58ff16b25ee1

Ok.

But if we insert 560k rows it fails:

INSERT INTO FUNCTION s3('https://test-sinks.s3.eu-west-3.amazonaws.com/rep_deflate/myfile.parquet', 'REDACTED', 'REDACTED', 'Parquet')
SETTINGS output_format_parquet_compression_method = 'gzip'
SELECT *
FROM generateRandom('a Int8', 1, 1024, 2)
LIMIT 560000

Query id: 22150098-bd2a-4fac-a5bf-f4cb04b1c977


Elapsed: 0.025 sec.

Received exception from server (version 24.2.1):
Code: 1002. DB::Exception: Received from localhost:9000. DB::Exception: Error while writing a table: IOError: zlib deflate failed, output buffer too small. ()

So it seems there's some threshold at which it breaks. If we enable the custom encoder, it works even when inserting 1M rows:

INSERT INTO FUNCTION s3('https://test-sinks.s3.eu-west-3.amazonaws.com/rep_deflate/myfile.parquet', 'REDACTED', 'REDACTED', 'Parquet')
SETTINGS output_format_parquet_compression_method = 'gzip', output_format_parquet_use_custom_encoder = 1
SELECT *
FROM generateRandom('a Int8', 1, 1024, 2)
LIMIT 1000000

Query id: f1d40025-ed20-4508-87bc-11f2a18903d6

Ok.

Algunenano added the bug label (Confirmed user-visible misbehaviour in official release) and removed the potential bug label (To be reviewed by developers and confirmed/rejected) on May 3, 2024
jrdi commented May 3, 2024

The error seems to be coming from zlib, and there is very little information about it on the internet. Checking the code, it seems that zlib makes a wrong estimate of the buffer size to allocate? The only thing I've been able to find is this bug report from apache/arrow#2756, where the problem seems to be caused by an older version of zlib (see apache/arrow#2756 (comment)). Maybe it's a matter of upgrading?

@tb-fjurado since the issue you're facing seems to be exclusively related to Arrow, have you tried enabling output_format_parquet_use_custom_encoder? By the comments and changes here, it looks like it is stable enough already.

tb-fjurado commented May 4, 2024

Yes, I tried enabling it and it worked as expected. I see that #63210 has already been merged, so I understand that the custom encoder is the way forward and we don't want to invest more time in Arrow, right?

Also, does anybody know from which CH version we could consider the custom encoder stable, even if it is not set as default? I saw this comment here from Aug 2023 saying it should be good enough to enable, and I haven't found many more (if any) changes to the encoder in the PRs since. Just to know whether we can enable it right away in our current CH deployments or we need to upgrade.

Thanks!

al13n321 commented:

Also, does anybody know from which CH version we could consider the custom encoder stable, even if it is not set as default? I saw this comment (#53130 (comment)) from Aug 2023 saying it should be good enough to enable, and I haven't found many more (if any) changes to the encoder in the PRs since. Just to know whether we can enable it right away in our current CH deployments or we need to upgrade.

Yes, the last significant fix was #52951 (August 2023), so 23.10+ should be good.


Yes, it seems to be a problem with zlib-ng. Apparently:

  1. deflateBound(sourceLen = 800000) (in zlib) returns 800268.
  2. MaxCompressedLen() (in arrow compression_zlib.cc) adds 12 to it with comment: "ARROW-3514: return a more pessimistic estimate to account for bugs in old zlib versions."
  3. deflate() (in zlib) ends up trying to write 38 bytes more than that (or at least strm->state->pending = 38; I didn't follow the code carefully, so this interpretation may be incorrect).
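
In code terms, the sizing path described above boils down to roughly the following (a hedged sketch of the pattern only, not the actual Arrow compression_zlib.cc source; the numbers in the comments are the ones observed above):

#include <zlib.h>
#include <cstdint>

// Sketch of how the output buffer for one compressed chunk ends up being sized.
int64_t max_compressed_len_sketch(z_stream * strm, int64_t input_len)
{
    // Step 1: deflateBound(sourceLen = 800000) returned 800268 in zlib-ng.
    const auto bound = static_cast<int64_t>(deflateBound(strm, static_cast<uLong>(input_len)));

    // Step 2: add a pessimistic 12-byte margin (ARROW-3514) for bugs in old zlib versions.
    // Step 3: deflate() nevertheless needs ~38 bytes beyond deflateBound(), so even the
    // padded buffer is too small and the codec reports "output buffer too small".
    return bound + 12;
}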

Maybe I'll investigate more.
