"IO Error: Could not create directory" when writing hive partitioned parquet files #9360
Comments
Also seeing this issue. v0.9.2-dev83 739da94. I've attached example.txt: the first time you run it, it will take some time to build the customer table; after that it should run faster.
I just verified your example and can also see the same errors:
Since you mentioned the OVERWRITE_OR_IGNORE parameter: I found it useful for adding data to an existing partitioned dataset. If the new data is distinct from the existing data (only adding new partitions, not touching the existing ones), it works well. However, I've never tried it for updating/replacing data. As far as I've seen, one data file is created per thread in each folder. If the parameter replaces files, wouldn't this lead to duplicate data whenever the first run used more threads than a following run? (i.e. creating the dataset with 8 threads creates 8 data files; adding with 4 threads will replace the first 4 data files but leave the last 4 untouched?)
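The duplicate-data concern above can be sketched outside of DuckDB. This is a hypothetical illustration of the per-thread file layout, not DuckDB's actual writer code; the file names and the `write_run` helper are made up:

```python
import os
import tempfile

# Hypothetical sketch (not DuckDB internals): each writer thread emits its
# own data file into the partition folder. A second run that overwrites
# files by thread index, but uses fewer threads, leaves stale files behind.
folder = tempfile.mkdtemp()

def write_run(n_threads, tag):
    # One file per "thread", named by thread index, as observed above.
    for i in range(n_threads):
        with open(os.path.join(folder, f"data_{i}.parquet"), "w") as f:
            f.write(tag)

write_run(8, "run1")  # first run: 8 per-thread files
write_run(4, "run2")  # second run: only data_0..data_3 get replaced

files = sorted(os.listdir(folder))
stale = [name for name in files
         if open(os.path.join(folder, name)).read() == "run1"]
# 8 files remain in the folder, and 4 of them still carry first-run data,
# so a scan over the partition would mix old and new rows.
```

Under that assumption, a scan of the folder after the second run would indeed return duplicated/outdated data exactly in the 8-threads-then-4-threads scenario described above.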
Kewl :-) It's not just the create-directory error, though... I think there are more errors behind the scenes. The reported folder typically ends up getting created; other folders that are not reported do not, and some files are not created, which leads to the reduced count when counting rows from the partitioned data.
Just getting started with/learning partitioning with DuckDB... I've used it before with Apache Drill, where it made a big difference with big data sets. I think DuckDB is a good replacement for Drill for local processing, as long as you have the memory and disk space... anyway, yeah, I've been trying to figure out how the OVERWRITE_OR_IGNORE black box works by poking at it...
Yeah, seeing the per-thread files, but a fair number of them are empty... maybe not a surprise given my test table isn't huge. In some cases I don't get files for all threads (e.g. with threads=6 I don't always have 6 files per folder). And yeah, I think it just allows the process to overwrite any existing file without getting rid of extra files, so switching from a higher to a lower threads value is probably not optimal... Thanks for trying my script out. At some point I'll probably try to debug this myself... if the DuckDB folks don't beat me to it first...
…ries to fix race condition on Windows in partitioned write
What happens?
When writing Hive-partitioned Parquet files, you often get an IO Error stating that one of the partition folders cannot be created:
When executing the command multiple times, the failing folder is usually different (e.g. FL_DATE=2006-01-09 the first time, FL_DATE=2006-01-04 the next; it seems random). However, it does not happen when the source file is read via the httpfs module (could be a timing issue?).
No issue:
Fails ~90% of the time if you download the .parquet file first and then execute:
I also noticed that the error does not occur if I set the threads to 1:
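The threads=1 observation fits a create-directory race. Below is a minimal Python sketch of the suspected failure mode (a hypothetical model using only the standard library; DuckDB's writer is C++, and this only illustrates why a non-idempotent mkdir fails under concurrency):

```python
import os
import tempfile

# Model of the suspected race (not DuckDB code): two writer threads both
# decide the partition folder is missing and both try to create it. The
# loser gets an error analogous to "IO Error: Could not create directory".
base = tempfile.mkdtemp()
partition = os.path.join(base, "FL_DATE=2006-01-09")

os.mkdir(partition)        # "thread 1" creates the folder first
try:
    os.mkdir(partition)    # "thread 2" lost the race
    raced = False
except FileExistsError:
    raced = True

# With a single thread there is never a second creator, so no error --
# matching the threads=1 observation. The usual fix is to make creation
# idempotent (create-if-not-exists):
os.makedirs(partition, exist_ok=True)  # succeeds even though it exists
```

This would also explain why the failing partition folder differs between runs: whichever directory two threads happen to reach at the same time is the one that errors.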
To Reproduce
Using the C:\temp\ folder, you get
Error: IO Error: Could not create directory: 'C:\temp\flights\FL_DATE=2006-01-09'
most of the time (be sure to delete C:\temp\flights\ before the next try).
OS:
Windows x64
DuckDB Version:
v0.9.2-dev23 6eeb682
DuckDB Client:
CLI
Full Name:
Fabian Krüger
Affiliation:
IBIS Prof. Thome AG
Have you tried this on the latest main branch?
I have tested with a main build
Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?