Out-of-memory error when performing partitioned copy to S3 #11817

Open

jankramer opened this issue Apr 24, 2024 · 1 comment

jankramer commented Apr 24, 2024

What happens?

When performing a COPY to S3 using hive partitioning, memory usage is higher than expected.

For example, copying a table of two int64 columns, split into 30 partitions of 1,000 rows each, to S3 already fails with the memory limit set to 2 GiB, even though the entire table easily fits in memory. Copying to local disk, or copying to a single file in S3, both work fine. The format does not seem to matter: both Parquet and CSV exhibit the same issue.

Platforms tested (all with DuckDB CLI v0.10.2):

  • Linux x86-64
  • Linux aarch64
  • macOS aarch64

To Reproduce

SET memory_limit = '2GiB';

-- Settings below do not seem to make a difference, but are included to maximize reproducibility
SET threads = 1;
SET s3_uploader_thread_limit = 1;
SET preserve_insertion_order = false;

-- Create a table with 30 partitions of 1000 records each
CREATE TABLE test AS SELECT UNNEST(RANGE(30000)) x, x//1000 AS y;

COPY test TO 's3://<bucket>/path' (FORMAT PARQUET, PARTITION_BY (y));
-- Out of Memory Error: could not allocate block of size 76.5 MiB (1.9 GiB/2.0 GiB used)
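
For comparison, here is a rough sketch of the two variants mentioned above that do work under the same 2 GiB limit (the local path is a placeholder, and the S3 target reuses the same placeholder bucket):

-- Partitioned copy to local disk: completes fine under the same memory limit
COPY test TO '/tmp/test_partitioned' (FORMAT PARQUET, PARTITION_BY (y));

-- Single (non-partitioned) file on S3: also completes fine
COPY test TO 's3://<bucket>/path/test.parquet' (FORMAT PARQUET);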

OS:

Linux x86-64

DuckDB Version:

0.10.2

DuckDB Client:

CLI

Full Name:

Jan Kramer

Affiliation:

N/A

What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.

I have tested with a stable release

Did you include all relevant data sets for reproducing the issue?

Yes

Did you include all code required to reproduce the issue?

  • Yes, I have

Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?

  • Yes, I have
@jankramer (Author)

Small update: the issue can be reproduced in the test suite by increasing the number of partitions the following test generates, e.g. by changing i%2 to i%10: https://github.com/duckdb/duckdb/blob/v0.10.2/test/sql/copy/s3/hive_partitioned_write_s3.test_slow#L38.
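
For reference, a rough sketch of that change, assuming the linked test builds its partition column with i%2 (the table and column names below are illustrative, not copied from the test file):

-- Original (roughly): i % 2 yields only 2 partitions
-- CREATE TABLE partitioned_table AS SELECT i % 2 AS part_col, i AS value_col FROM range(0, 10000) tbl(i);

-- Modified: i % 10 yields 10 partitions, which is enough to hit the out-of-memory error
CREATE TABLE partitioned_table AS SELECT i % 10 AS part_col, i AS value_col FROM range(0, 10000) tbl(i);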
