dataframe.to_parquet method always appends data when partitioned #5873
@rjzamora do you know what the expected behavior of … is here?
Interesting - I'm not 100% sure of the correct behavior here. According to the docstring, we should be writing the dataset "from scratch", which probably means we should clear the original dataset completely before the write. The reason the data is not overwritten for partitioned datasets is that we use …
That's my concern too... It seems like there are several edge cases that can cause issues (different partition values on the same column, partitioning on different columns, etc.). Not to mention that the write could fail, in which case we presumably don't want to remove the "old" dataset. Perhaps this isn't worth supporting? @LudwikBielczynski for now, clearing the directory prior to writing is probably the best workaround.
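The workaround suggested above (clear the output directory, then write) can be sketched as a small helper. The helper name and `write_fn` callback are illustrative, not part of the dask API:

```python
import os
import shutil

def write_from_scratch(path, write_fn):
    """Clear any existing dataset directory, then write the new dataset.

    Guarantees overwrite semantics for partitioned writes, where
    to_parquet would otherwise leave stale part files behind.
    """
    if os.path.isdir(path):
        shutil.rmtree(path)  # drop the old dataset entirely
    # write_fn would be something like:
    #   lambda p: ddf.to_parquet(p, partition_on=["key"])
    write_fn(path)
```

Note that this deletes the old data before knowing whether the new write will succeed, which is exactly the failure-mode concern raised above; writing to a temporary directory and renaming on success would be a safer variant.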
Hi @TomAugspurger and @rjzamora,
Agreed that it's surprising. I think we should either attempt to support this or raise when writing a partitioned dataset to a non-empty directory.
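The raise-on-non-empty alternative could look like the following pre-write check. This is a hypothetical helper sketched for discussion, not an existing dask function:

```python
import os

def check_target_empty(path):
    """Raise if the target dataset directory already contains files.

    A guard like this, run before a partitioned to_parquet write with
    append=False, would surface the stale-data problem instead of
    silently mixing old and new part files.
    """
    if os.path.isdir(path) and os.listdir(path):
        raise FileExistsError(
            f"Refusing to write partitioned dataset to non-empty "
            f"directory {path!r}; clear it first or pass append=True."
        )
```

As noted in the next comments, even this minimal check requires inspecting the contents of existing directories, which has its own cost.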
I think raising a warning is safer than having dask delete data on pyarrow's behalf.
Actually, raising an error would require checking the contents of existing directories, which probably isn't desirable. I did see, though, that there is a relevant comment in `dask/dataframe/io/parquet/arrow.py` (lines 205 to 212 at commit c05204a).
The arrow issue mentioned there is now fixed and released, so this should be a good time to change that back.
I think this has been resolved now. |
If the partition_on argument is given, the dataset is always appended to, even when append=False. Here is an example:
The expected behaviour would be to recreate the dataset. When partition_on=None it behaves properly.