"IO Error: Could not create directory" when writing hive partitioned parquet files #9360

killerfurbel · 2023-10-16T11:37:54Z

What happens?

When writing hive partitioned parquet Files, you often get an IO Error, stating that one of the partitioned folders cannot be created:

v0.9.2-dev23 6eeb6826bb
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
D copy (select * from 'C:\temp\Flights-1m.parquet')
> to 'C:\temp\flights' (FORMAT 'PARQUET', CODEC 'ZSTD', PARTITION_BY(FL_DATE), OVERWRITE_OR_IGNORE 1);
Error: IO Error: Could not create directory: 'C:\temp\flights\FL_DATE=2006-01-09'
D

When executing the command multiple times, the folder is usually different (e.g. first time is FL_DATE=2006-01-09, next time is FL_DATE=2006-01-04, kind of random...).

However, it does not happen when the source file is read via the httpfs module (could be a timing issue?).

No issue:

copy (select * from 'https://www.tablab.app/sample-datasets/flights-1m.parquet')
to 'C:\temp\flights' (FORMAT 'PARQUET', CODEC 'ZSTD', PARTITION_BY(FL_DATE), OVERWRITE_OR_IGNORE 1);

Fails ~90% of the time if you download the .parquet File first and then execute:

copy (select * from 'C:\temp\Flights-1m.parquet')
to 'C:\temp\flights' (FORMAT 'PARQUET', CODEC 'ZSTD', PARTITION_BY(FL_DATE), OVERWRITE_OR_IGNORE 1);

I also noticed that the error does not occur, if I set the threads to 1:

SET threads TO 1;

To Reproduce

Download https://www.tablab.app/sample-datasets/flights-1m.parquet to C:\temp\ folder

Execute the following SQL with DuckDB:

copy (select * from 'C:\temp\Flights-1m.parquet')
to 'C:\temp\flights' (FORMAT 'PARQUET', CODEC 'ZSTD', PARTITION_BY(FL_DATE), OVERWRITE_OR_IGNORE 1);

You get errors like Error: IO Error: Could not create directory: 'C:\temp\flights\FL_DATE=2006-01-09' most of the time (be sure to delete C:\temp\flights\ before the next try)

OS:

Windows x64

DuckDB Version:

v0.9.2-dev23 6eeb682

DuckDB Client:

CLI

Full Name:

Fabian Krüger

Affiliation:

IBIS Prof. Thome AG

Have you tried this on the latest `main` branch?

I have tested with a main build

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

Yes, I have

The text was updated successfully, but these errors were encountered:

bucweat · 2023-10-20T07:22:51Z

Also seeing this issue. v0.9.2-dev83 739da94.

I've attached example.txt that can be run to illustrate the issue.

Download example.txt and rename to example.sql (why doesn't github support .sql files?).
Copy the example.sql file to a folder.
Open a windows cmd prompt in that folder.
Run the following:

duckdb.exe test.duckdb < example.sql

The first time you run it will take some time to build the customer table. After that it should run faster.

The original file sets PRAGMA threads=1; and everything should work as expected (as mentioned above). The folders and files are all created correctly and count from the partitioned files matches the source table.

Edit example.sql and set threads to a higher value. I've included output for threads = 1,2,4,6. With threads > 1, you'll see errors like Error: near line 45: IO Error: Could not create directory: 'customer\h=3' and the counts of partitioned data will not match the table count.

example.txt
1_threads.txt
2_threads.txt
4_threads.txt
6_threads.txt

killerfurbel · 2023-10-20T07:46:51Z

I just verified your example and can also see the same errors:

┌────────────────────────────┐
│ current_setting('threads') │
│           int64            │
├────────────────────────────┤
│                          6 │
└────────────────────────────┘
...
first time copy to empty partition customer folder (expect error)
Error: near line 45: IO Error: Could not create directory: 'customer\h=3'

Since you mentioned the OVERWRITE_OR_IGNORE Parameter: I found it useful to add data to an existing partitioned dataset. If the new data is distinct from the existing data (only adding new partitions, not touching the existing ones), it works well for adding.

However, I've never tried it for updating/replacing data. As far as I've seen, in the folders there is a datafile created per thread. If the paramter would be replacing the files, this would lead to duplicate data, if the first run had a higher number of threads than the following runs? (i.e. Creating the dataset with 8 threads creates 8 data files; adding with 4 threads will replace the first 4 data files, but leaves the last 4 data files untouched?)

bucweat · 2023-10-20T08:24:54Z

I just verified your example and can also see the same errors:

┌────────────────────────────┐
│ current_setting('threads') │
│           int64            │
├────────────────────────────┤
│                          6 │
└────────────────────────────┘
...
first time copy to empty partition customer folder (expect error)
Error: near line 45: IO Error: Could not create directory: 'customer\h=3'

Kewl :-) It's not just the create directory error though...I think there are more errors behind the scenes. The reported folder typically ends up getting created...others not reported do not, and some files are not created which leads to the reduced count when getting count from partitioned data.

Since you mentioned the OVERWRITE_OR_IGNORE Parameter: I found it useful to add data to an existing partitioned dataset. If the new data is distinct from the existing data (only adding new partitions, not touching the existing ones), it works well for adding.

Just getting started with/learning partitioning with DuckDB...I've used it before with Apache Drill it made a big difference with big data sets. But I think DuckDB is a good replacement for Drill for local processing as long as you have memory and disk space...anyway yeah been trying to figure out how the OVERWRITE_OR_IGNORE black box is working by poking it...

However, I've never tried it for updating/replacing data. As far as I've seen, in the folders there is a datafile created per thread. If the paramter would be replacing the files, this would lead to duplicate data, if the first run had a higher number of threads than the following runs? (i.e. Creating the dataset with 8 threads creates 8 data files; adding with 4 threads will replace the first 4 data files, but leaves the last 4 data files untouched?)

Yeah seeing the per thread files, but a fair amount of them are empty...maybe not a surprise given my test table isn't huge. In some cases I don't always get files for all thread (e.g. with threads=6 I don't always have 6 files per folder). And yeah I think it is just allowing process to overwrite any existing file but not getting rid of extra files. So switching from higher to lower threads value is probably not optimal...

Thanks for trying my script out. At some point I'll probably try to debug this myself...if the DuckDB folks don't beat me to it first...

…ries to fix race condition on Windows in partitioned write

Fix #9360, fix #9466: grab a lock before creating directories to fix race condition on Windows in partitioned write

killerfurbel added the needs triage label Oct 16, 2023

killerfurbel mentioned this issue Oct 20, 2023

copy (select 2000+(v//12)y,m,v,j from orders) TO 'orders_ym' (FORMAT PARQUET, PARTITION_BY (y,m)); missing some data #9404

Closed

1 task

szarnyasg added the under review label Oct 23, 2023

duckdblabs-bot removed the needs triage label Oct 23, 2023

Mytherin added a commit to Mytherin/duckdb that referenced this issue Oct 25, 2023

Fix duckdb#9360, fix duckdb#9466: grab a lock before creating directo…

65103b0

…ries to fix race condition on Windows in partitioned write

Mytherin mentioned this issue Oct 25, 2023

Fix #9360, fix #9466: grab a lock before creating directories to fix race condition on Windows in partitioned write #9473

Merged

Mytherin closed this as completed in #9473 Oct 30, 2023

Mytherin added a commit that referenced this issue Oct 30, 2023

Merge pull request #9473 from Mytherin/issue9360

182b824

Fix #9360, fix #9466: grab a lock before creating directories to fix race condition on Windows in partitioned write

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

"IO Error: Could not create directory" when writing hive partitioned parquet files #9360

"IO Error: Could not create directory" when writing hive partitioned parquet files #9360

killerfurbel commented Oct 16, 2023 •

edited

bucweat commented Oct 20, 2023 •

edited

killerfurbel commented Oct 20, 2023

bucweat commented Oct 20, 2023

"IO Error: Could not create directory" when writing hive partitioned parquet files #9360

"IO Error: Could not create directory" when writing hive partitioned parquet files #9360

Comments

killerfurbel commented Oct 16, 2023 • edited

What happens?

To Reproduce

OS:

DuckDB Version:

DuckDB Client:

Full Name:

Affiliation:

Have you tried this on the latest main branch?

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

bucweat commented Oct 20, 2023 • edited

killerfurbel commented Oct 20, 2023

bucweat commented Oct 20, 2023

killerfurbel commented Oct 16, 2023 •

edited

Have you tried this on the latest `main` branch?

bucweat commented Oct 20, 2023 •

edited