Out-of-memory error when performing partitioned copy to S3 #11817

Open

jankramer opened this issue Apr 24, 2024 · 1 comment

jankramer commented Apr 24, 2024

What happens?

When performing a COPY to S3 using hive partitioning, memory usage is higher than expected.

For example, copying a table of two int64 columns, split into 30 partitions of 1,000 rows each, to S3 already fails with the memory limit set to 2 GiB, even though the entire table easily fits in memory. Copying to local disk, or copying to a single file in S3, both work fine. The format does not seem to matter: both Parquet and CSV exhibit the same issue.

Platforms tested (all with DuckDB CLI v0.10.2):

  • Linux x86-64
  • Linux aarch64
  • macOS aarch64

To Reproduce

SET memory_limit = '2GiB';

-- Settings below do not seem to make a difference, but are included to maximize reproducibility
SET threads = 1;
SET s3_uploader_thread_limit = 1;
SET preserve_insertion_order = false;

-- Create a table with 30 partitions of 1000 records each
CREATE TABLE test AS SELECT UNNEST(RANGE(30000)) x, x//1000 AS y;

COPY test TO 's3://<bucket>/path' (FORMAT PARQUET, PARTITION_BY (y));
-- Out of Memory Error: could not allocate block of size 76.5 MiB (1.9 GiB/2.0 GiB used)
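
For comparison, here is a rough sketch of the two variants mentioned above that do work under the same 2 GiB limit (the local path is a placeholder, and the S3 target reuses the same placeholder bucket):

-- Partitioned copy to local disk: completes fine under the same memory limit
COPY test TO '/tmp/test_partitioned' (FORMAT PARQUET, PARTITION_BY (y));

-- Single (non-partitioned) file on S3: also completes fine
COPY test TO 's3://<bucket>/path/test.parquet' (FORMAT PARQUET);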

OS:

Linux x86-64

DuckDB Version:

0.10.2

DuckDB Client:

CLI

Full Name:

Jan Kramer

Affiliation:

N/A

What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.

I have tested with a stable release

Did you include all relevant data sets for reproducing the issue?

Yes

Did you include all code required to reproduce the issue?

  • Yes, I have

Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?

  • Yes, I have
@jankramer (Author)

Small update: the issue can be reproduced in the test suite by increasing the number of partitions the following test generates, e.g. by changing i%2 to i%10: https://github.com/duckdb/duckdb/blob/v0.10.2/test/sql/copy/s3/hive_partitioned_write_s3.test_slow#L38.
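
For reference, a rough sketch of that change, assuming the linked test builds its partition column with i%2 (the table and column names below are illustrative, not copied from the test file):

-- Original (roughly): i % 2 yields only 2 partitions
-- CREATE TABLE partitioned_table AS SELECT i % 2 AS part_col, i AS value_col FROM range(0, 10000) tbl(i);

-- Modified: i % 10 yields 10 partitions, which is enough to hit the out-of-memory error
CREATE TABLE partitioned_table AS SELECT i % 10 AS part_col, i AS value_col FROM range(0, 10000) tbl(i);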
