Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

"IO Error: Could not create directory" when writing hive partitioned parquet files #9360

Closed
1 task done
killerfurbel opened this issue Oct 16, 2023 · 3 comments · Fixed by #9473
Closed
1 task done

Comments

@killerfurbel
Copy link

killerfurbel commented Oct 16, 2023

What happens?

When writing hive partitioned parquet Files, you often get an IO Error, stating that one of the partitioned folders cannot be created:

v0.9.2-dev23 6eeb6826bb
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
D copy (select * from 'C:\temp\Flights-1m.parquet')
> to 'C:\temp\flights' (FORMAT 'PARQUET', CODEC 'ZSTD', PARTITION_BY(FL_DATE), OVERWRITE_OR_IGNORE 1);
Error: IO Error: Could not create directory: 'C:\temp\flights\FL_DATE=2006-01-09'
D

When executing the command multiple times, the folder is usually different (e.g. first time is FL_DATE=2006-01-09, next time is FL_DATE=2006-01-04, kind of random...).

However, it does not happen when the source file is read via the httpfs module (could be a timing issue?).

No issue:

copy (select * from 'https://www.tablab.app/sample-datasets/flights-1m.parquet')
to 'C:\temp\flights' (FORMAT 'PARQUET', CODEC 'ZSTD', PARTITION_BY(FL_DATE), OVERWRITE_OR_IGNORE 1);

Fails ~90% of the time if you download the .parquet File first and then execute:

copy (select * from 'C:\temp\Flights-1m.parquet')
to 'C:\temp\flights' (FORMAT 'PARQUET', CODEC 'ZSTD', PARTITION_BY(FL_DATE), OVERWRITE_OR_IGNORE 1);

I also noticed that the error does not occur, if I set the threads to 1:

SET threads TO 1;

To Reproduce

  • Download https://www.tablab.app/sample-datasets/flights-1m.parquet to C:\temp\ folder
  • Execute the following SQL with DuckDB:
    copy (select * from 'C:\temp\Flights-1m.parquet')
    to 'C:\temp\flights' (FORMAT 'PARQUET', CODEC 'ZSTD', PARTITION_BY(FL_DATE), OVERWRITE_OR_IGNORE 1);
  • You get errors like Error: IO Error: Could not create directory: 'C:\temp\flights\FL_DATE=2006-01-09' most of the time (be sure to delete C:\temp\flights\ before the next try)

OS:

Windows x64

DuckDB Version:

v0.9.2-dev23 6eeb682

DuckDB Client:

CLI

Full Name:

Fabian Krüger

Affiliation:

IBIS Prof. Thome AG

Have you tried this on the latest main branch?

I have tested with a main build

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

  • Yes, I have
@bucweat
Copy link
Contributor

bucweat commented Oct 20, 2023

Also seeing this issue. v0.9.2-dev83 739da94.

I've attached example.txt that can be run to illustrate the issue.

  1. Download example.txt and rename to example.sql (why doesn't github support .sql files?).
  2. Copy the example.sql file to a folder.
  3. Open a windows cmd prompt in that folder.
  4. Run the following:
duckdb.exe test.duckdb < example.sql

The first time you run it will take some time to build the customer table. After that it should run faster.

The original file sets PRAGMA threads=1; and everything should work as expected (as mentioned above). The folders and files are all created correctly and count from the partitioned files matches the source table.

Edit example.sql and set threads to a higher value. I've included output for threads = 1,2,4,6. With threads > 1, you'll see errors like Error: near line 45: IO Error: Could not create directory: 'customer\h=3' and the counts of partitioned data will not match the table count.

example.txt
1_threads.txt
2_threads.txt
4_threads.txt
6_threads.txt

@killerfurbel
Copy link
Author

I just verified your example and can also see the same errors:

┌────────────────────────────┐
│ current_setting('threads') │
│           int64            │
├────────────────────────────┤
│                          6 │
└────────────────────────────┘
...
first time copy to empty partition customer folder (expect error)
Error: near line 45: IO Error: Could not create directory: 'customer\h=3'

Since you mentioned the OVERWRITE_OR_IGNORE Parameter: I found it useful to add data to an existing partitioned dataset. If the new data is distinct from the existing data (only adding new partitions, not touching the existing ones), it works well for adding.

However, I've never tried it for updating/replacing data. As far as I've seen, in the folders there is a datafile created per thread. If the paramter would be replacing the files, this would lead to duplicate data, if the first run had a higher number of threads than the following runs? (i.e. Creating the dataset with 8 threads creates 8 data files; adding with 4 threads will replace the first 4 data files, but leaves the last 4 data files untouched?)

@bucweat
Copy link
Contributor

bucweat commented Oct 20, 2023

I just verified your example and can also see the same errors:

┌────────────────────────────┐
│ current_setting('threads') │
│           int64            │
├────────────────────────────┤
│                          6 │
└────────────────────────────┘
...
first time copy to empty partition customer folder (expect error)
Error: near line 45: IO Error: Could not create directory: 'customer\h=3'

Kewl :-) It's not just the create directory error though...I think there are more errors behind the scenes. The reported folder typically ends up getting created...others not reported do not, and some files are not created which leads to the reduced count when getting count from partitioned data.

Since you mentioned the OVERWRITE_OR_IGNORE Parameter: I found it useful to add data to an existing partitioned dataset. If the new data is distinct from the existing data (only adding new partitions, not touching the existing ones), it works well for adding.

Just getting started with/learning partitioning with DuckDB...I've used it before with Apache Drill it made a big difference with big data sets. But I think DuckDB is a good replacement for Drill for local processing as long as you have memory and disk space...anyway yeah been trying to figure out how the OVERWRITE_OR_IGNORE black box is working by poking it...

However, I've never tried it for updating/replacing data. As far as I've seen, in the folders there is a datafile created per thread. If the paramter would be replacing the files, this would lead to duplicate data, if the first run had a higher number of threads than the following runs? (i.e. Creating the dataset with 8 threads creates 8 data files; adding with 4 threads will replace the first 4 data files, but leaves the last 4 data files untouched?)

Yeah seeing the per thread files, but a fair amount of them are empty...maybe not a surprise given my test table isn't huge. In some cases I don't always get files for all thread (e.g. with threads=6 I don't always have 6 files per folder). And yeah I think it is just allowing process to overwrite any existing file but not getting rid of extra files. So switching from higher to lower threads value is probably not optimal...

Thanks for trying my script out. At some point I'll probably try to debug this myself...if the DuckDB folks don't beat me to it first...

Mytherin added a commit to Mytherin/duckdb that referenced this issue Oct 25, 2023
…ries to fix race condition on Windows in partitioned write
Mytherin added a commit that referenced this issue Oct 30, 2023
Fix #9360, fix #9466: grab a lock before creating directories to fix race condition on Windows in partitioned write
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants