
dataframe.to_parquet method always appends data when partitioned #5873

Closed
LudwikBielczynski opened this issue Feb 7, 2020 · 8 comments

@LudwikBielczynski

LudwikBielczynski commented Feb 7, 2020

If the partition_on argument is given, the dataset is always appended to, even when append=False. Here is an example:

from datetime import timedelta
from pathlib import Path

import dask.dataframe as dd
import pandas as pd

# Mock up some data
report_date = pd.to_datetime('2019-01-01').date()
observations_nr = 2
dtas = range(0, observations_nr)
rds = [report_date - timedelta(days=days) for days in dtas]
data_to_export = pd.DataFrame({
    'report_date': rds,
    'dta': dtas,
    'stay_date': [report_date] * observations_nr,
    })

data_to_export_dask = dd.from_pandas(data_to_export, npartitions=1)

path = Path('trials')
file_name = 'appending.parquet'

# Act: write the dataset twice with append=False; the second write should replace the first
for nr in range(0, 2):
    data_to_export_dask.to_parquet(path / file_name,
                                   engine='pyarrow',
                                   compression='snappy',
                                   partition_on=['report_date'],
                                   append=False,
                                   write_index=False
                                  )

data_read = dd.read_parquet(path / file_name, engine='pyarrow').compute()

assert data_read.shape == (2, 3)

The expected behaviour would be to recreate the dataset from scratch. When partition_on=None it behaves properly.

@TomAugspurger
Member

@rjzamora do you know what the expected behavior of append=False is here? Should it clear the directory before writing, or at least the files that it's touching while writing?

@rjzamora
Member

rjzamora commented Feb 7, 2020

Interesting - I'm not 100% sure of the correct behavior here. According to the docstring, we should be writing the dataset "from scratch", which probably means we should clear the original dataset completely before the write.

The reason the data is not overwritten for partitioned datasets is that we use pyarrow.parquet.write_to_dataset, rather than pyarrow.parquet.write_table (which is used otherwise). In dask, we do not explicitly delete any data before the write if append=False; pyarrow just happens to overwrite data when an existing path is passed to write_table. So, I suspect there could also be strange behavior if you overwrite a non-partitioned dataset with fewer partitions. In that case, "_metadata" may say there are N files, but the directory will actually contain more than N... I'll have to check whether this last scenario is actually a problem.
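
The difference can be seen with pyarrow alone. Here is a minimal sketch (not dask's code path, just the two pyarrow functions called back to back) showing that write_table overwrites a single file, while write_to_dataset keeps adding uniquely named files under the partition directories:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.Table.from_pandas(pd.DataFrame({'report_date': ['2019-01-01'], 'dta': [0]}))

# write_table targets a single file path, so a second call simply overwrites it.
pq.write_table(table, 'single_file.parquet')
pq.write_table(table, 'single_file.parquet')  # still one row on disk

# write_to_dataset writes a new uniquely named file inside each partition directory,
# so a second call adds data instead of replacing it.
pq.write_to_dataset(table, 'partitioned_dataset', partition_cols=['report_date'])
pq.write_to_dataset(table, 'partitioned_dataset', partition_cols=['report_date'])  # now two files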

@TomAugspurger
Member

So, I suspect there could also be strange behavior if you overwrite a non-partitioned dataset with fewer partitions

That's my concern too... It seems like there are several edge cases that can cause issues (different partition values on the same column, partitioning on different columns, etc.). Not to mention that the write could fail, in which case we presumably don't want to remove the "old" dataset. Perhaps this isn't worth supporting?

@LudwikBielczynski for now, clearing the directory prior to writing is probably the best workaround.
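
A minimal sketch of that workaround, reusing data_to_export_dask and the path from the reproducer above, and assuming the target directory is safe to delete wholesale:

import shutil
from pathlib import Path

target = Path('trials') / 'appending.parquet'

# Remove any previously written dataset before a non-append write.
if target.exists():
    shutil.rmtree(target)

data_to_export_dask.to_parquet(target,
                               engine='pyarrow',
                               compression='snappy',
                               partition_on=['report_date'],
                               append=False,
                               write_index=False)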

@LudwikBielczynski
Author

Hi @TomAugspurger and @rjzamora,
First, thanks for your quick answers and analysis, and for your work on dask. As for your suggestion, I have already implemented a fail-safe that clears the path before writing in non-append mode. I thought this scenario might be surprising for other users as well.

@TomAugspurger
Member

Agreed that it's surprising. I think we should either attempt to support this or raise when writing a partitioned dataset to a non-empty directory.
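
For illustration only, a hypothetical guard along those lines (local filesystem only; the function name and placement are made up, not dask's actual code) could look like:

import os

def check_partitioned_write_target(path, append):
    # Refuse a fresh (append=False) partitioned write into a directory that already has files.
    if not append and os.path.isdir(path) and os.listdir(path):
        raise ValueError(
            f"Directory {path!r} is not empty; refusing to write a partitioned dataset "
            "with append=False. Clear the directory first or pass append=True."
        )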

@jsignell
Member

I think raising a warning is safer than having dask delete data on pyarrow's behalf.

@jsignell
Member

Actually, raising an error would require checking the contents of existing directories, which probably isn't desirable. I did notice, though, that there is a comment about our write_to_dataset helper being a temporary fix:

""" Write table to a partitioned dataset with pyarrow.
Logic copied from pyarrow.parquet.
(arrow/python/pyarrow/parquet.py::write_to_dataset)
TODO: Remove this in favor of pyarrow's `write_to_dataset`
once ARROW-8244 is addressed.
"""

The Arrow issue mentioned there is now fixed and released, so this should be a good time to change that back.

@jsignell
Member

I think this has been resolved now.
