
dataframe.to_parquet method always appends data when partitioned #5873

Closed
LudwikBielczynski opened this issue Feb 7, 2020 · 8 comments

@LudwikBielczynski

LudwikBielczynski commented Feb 7, 2020

If the partition_on argument is given, the dataset is always appended to, even when append=False. Here is an example:

from datetime import timedelta
from pathlib import Path

import dask.dataframe as dd
import pandas as pd

# Mock up some data
report_date = pd.to_datetime('2019-01-01').date()
observations_nr = 2
dtas = range(0, observations_nr)
rds = [report_date - timedelta(days=days) for days in dtas]
data_to_export = pd.DataFrame({
    'report_date': rds,
    'dta': dtas,
    'stay_date': [report_date] * observations_nr,
    })

data_to_export_dask = dd.from_pandas(data_to_export, npartitions=1)

path = Path('trials')
file_name = 'appending.parquet'

# Act: write the dataset twice with append=False; the second write should replace the first
for nr in range(0, 2):
    data_to_export_dask.to_parquet(path / file_name,
                                   engine='pyarrow',
                                   compression='snappy',
                                   partition_on=['report_date'],
                                   append=False,
                                   write_index=False
                                  )

data_read = dd.read_parquet(path / file_name, engine='pyarrow').compute()

assert data_read.shape == (2, 3)

The expected behaviour would be to recreate the dataset from scratch. When partition_on=None it behaves properly.

@TomAugspurger
Member

@rjzamora do you know what the expected behavior of append=False is here? Should it clear the directory before writing, or at least the files that it's touching while writing?

@rjzamora
Member

rjzamora commented Feb 7, 2020

Interesting - I'm not 100% sure of the correct behavior here. According to the docstring, we should be writing the dataset "from scratch", which probably means we should clear the original dataset completely before the write.

The reason the data is not overwritten for partitioned datasets is that we use pyarrow.parquet.write_to_dataset, rather than pyarrow.parquet.write_table (which is used otherwise). In dask, we do not explicitly delete any data before the write if append=False; pyarrow just happens to overwrite data when an existing path is passed to write_table. So, I suspect there could also be strange behavior if you overwrite a non-partitioned dataset with fewer partitions. In that case, "_metadata" may say there are N files, but the directory will actually contain more than N... I'll have to check whether this last scenario is actually a problem.
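
The difference can be seen with pyarrow alone. Here is a minimal sketch (not dask's code path, just the two pyarrow functions called back to back) showing that write_table overwrites a single file, while write_to_dataset keeps adding uniquely named files under the partition directories:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.Table.from_pandas(pd.DataFrame({'report_date': ['2019-01-01'], 'dta': [0]}))

# write_table targets a single file path, so a second call simply overwrites it.
pq.write_table(table, 'single_file.parquet')
pq.write_table(table, 'single_file.parquet')  # still one row on disk

# write_to_dataset writes a new uniquely named file inside each partition directory,
# so a second call adds data instead of replacing it.
pq.write_to_dataset(table, 'partitioned_dataset', partition_cols=['report_date'])
pq.write_to_dataset(table, 'partitioned_dataset', partition_cols=['report_date'])  # now two files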

@TomAugspurger
Member

So, I suspect there could also be strange behavior if you overwrite a non-partitioned dataset with fewer partitions

That's my concern too... It seems like there are several edge cases that can cause issues (different partition values on the same column, partitioning on different columns, etc.). Not to mention that the write could fail, in which case we presumably don't want to remove the "old" dataset. Perhaps this isn't worth supporting?

@LudwikBielczynski for now, clearing the directory prior to writing is probably the best workaround.
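
A minimal sketch of that workaround, reusing data_to_export_dask and the path from the reproducer above, and assuming the target directory is safe to delete wholesale:

import shutil
from pathlib import Path

target = Path('trials') / 'appending.parquet'

# Remove any previously written dataset before a non-append write.
if target.exists():
    shutil.rmtree(target)

data_to_export_dask.to_parquet(target,
                               engine='pyarrow',
                               compression='snappy',
                               partition_on=['report_date'],
                               append=False,
                               write_index=False)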

@LudwikBielczynski
Author

Hi @TomAugspurger and @rjzamora,
First, thanks for your quick answers and analysis, and for your work on dask. As for your suggestion, I have already implemented a fail-safe that clears the path before writing in non-append mode. I thought this scenario might be surprising for other users as well.

@TomAugspurger
Member

Agreed that it's surprising. I think we should either attempt to support this or raise when writing a partitioned dataset to a non-empty directory.
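
For illustration only, a hypothetical guard along those lines (local filesystem only; the function name and placement are made up, not dask's actual code) could look like:

import os

def check_partitioned_write_target(path, append):
    # Refuse a fresh (append=False) partitioned write into a directory that already has files.
    if not append and os.path.isdir(path) and os.listdir(path):
        raise ValueError(
            f"Directory {path!r} is not empty; refusing to write a partitioned dataset "
            "with append=False. Clear the directory first or pass append=True."
        )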

@jsignell
Member

I think raising a warning is safer than having dask delete data on pyarrow's behalf.

@jsignell
Member

Actually, raising an error would require checking the contents of existing directories, which probably isn't desirable. I did notice, though, that there is a comment about our write_to_dataset helper being a temporary fix:

""" Write table to a partitioned dataset with pyarrow.
Logic copied from pyarrow.parquet.
(arrow/python/pyarrow/parquet.py::write_to_dataset)
TODO: Remove this in favor of pyarrow's `write_to_dataset`
once ARROW-8244 is addressed.
"""

The Arrow issue mentioned there is now fixed and released, so this should be a good time to change that back.

@jsignell
Member

I think this has been resolved now.
