read_parquet fails for non-string column names #5000

smangham · 2019-06-25T17:09:40Z

Dask doesn't like saving and loading dataframes with numeric column names to parquet. Possibly related to #4922?

Objective:
I'm trying to save a dataframe with numeric column names to disk, then load it.

Environment:
Latest versions of Dask, Pandas and fastparquet; pyarrow 0.12.0.

MVE:

import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({0:[0, 1, 2], 1:['a', 'b', 'c']})
ddf = dd.from_pandas(pdf, npartitions=1)

ddf.to_parquet('bug.parquet', engine='pyarrow')
ddf2 = dd.read_parquet('bug.parquet', engine='pyarrow')
ddf2.compute()

Expected vs actual behaviour:

I expect the dataframe to be saved and loaded.

Instead, using engine=pyarrow fails on the compute line, with:

KeyError: "None of [Index(['0', '1'], dtype='object')] are in the [columns]"

It looks like to_parquet has converted the column names to strings before writing them to the metadata, but not in the actual data, so it can't find those columns when it tries to read the data proper.

Using engine=fastparquet instead fails on the to_parquet line, raising:

TypeError: Column name must be a string. Got column 0 of type int

That at least makes it explicit that non-string column names are not supported.

Is this an issue with how Dask is wrapping pyarrow, or do I need to raise it on the pyarrow GitHub?
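In the meantime, one possible workaround that follows from the two errors above is to convert the column labels to strings before writing. The helper below is only a sketch: `stringify_column_names` is a hypothetical name, not part of Dask, pandas or pyarrow, and it works on any iterable of labels so it can be tried without either engine installed. It refuses to proceed if the conversion would make two labels collide (for example `0` and `'0'`).

```python
def stringify_column_names(names):
    """Return the labels converted to str, refusing to create duplicates.

    Hypothetical helper for illustration only; not a Dask or pandas API.
    """
    converted = [str(name) for name in names]
    # Labels such as 0 and '0' would collapse into one name after conversion.
    if len(set(converted)) != len(converted):
        raise ValueError(
            "converting labels to str would create duplicate column names"
        )
    return converted


# The labels from the MVE above: two integer column names.
new_names = stringify_column_names([0, 1])
```

With the MVE above, assigning the result back (`pdf.columns = stringify_column_names(pdf.columns)`) before calling `dd.from_pandas` should give the frame string column names, at the cost of the labels coming back as strings after a round trip.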


TomAugspurger · 2019-06-25T18:21:57Z

Does parquet support non-string column names? Pandas explicitly requires that they be strings.

In [34]: ddf.compute().to_parquet('foo.parquet', engine='pyarrow')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-34-3dd0aaf84b16> in <module>
----> 1 ddf.compute().to_parquet('foo.parquet', engine='pyarrow')

~/Envs/dask-dev/lib/python3.7/site-packages/pandas/core/frame.py in to_parquet(self, fname, engine, compression, index, partition_cols, **kwargs)
   2201         to_parquet(self, fname, engine,
   2202                    compression=compression, index=index,
-> 2203                    partition_cols=partition_cols, **kwargs)
   2204
   2205     @Substitution(header='Whether to print column labels, default True')

~/Envs/dask-dev/lib/python3.7/site-packages/pandas/io/parquet.py in to_parquet(df, path, engine, compression, index, partition_cols, **kwargs)
    250     impl = get_engine(engine)
    251     return impl.write(df, path, compression=compression, index=index,
--> 252                       partition_cols=partition_cols, **kwargs)
    253
    254

~/Envs/dask-dev/lib/python3.7/site-packages/pandas/io/parquet.py in write(self, df, path, compression, coerce_timestamps, index, partition_cols, **kwargs)
    104               coerce_timestamps='ms', index=None, partition_cols=None,
    105               **kwargs):
--> 106         self.validate_dataframe(df)
    107         path, _, _, _ = get_filepath_or_buffer(path, mode='wb')
    108

~/Envs/dask-dev/lib/python3.7/site-packages/pandas/io/parquet.py in validate_dataframe(df)
     56         # must have value column names (strings only)
     57         if df.columns.inferred_type not in {'string', 'unicode'}:
---> 58             raise ValueError("parquet must have string column names")
     59
     60         # index level names must be strings

ValueError: parquet must have string column names

smangham · 2019-06-26T09:41:47Z

Yeah, that's fair. Looking at the Parquet docs, name is a string variable. In that case I think the expected behaviour would be to throw the same ValueError as Pandas, rather than silently converting the names.
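As a rough illustration of that expected behaviour, the check could look something like the sketch below. It is an assumption about the shape of such a check, not Dask's actual code: `validate_string_column_names` is a hypothetical name, and it uses a plain `isinstance` test rather than the `inferred_type` comparison shown in the pandas traceback above. The error message starts with the pandas wording.

```python
def validate_string_column_names(names):
    """Raise ValueError if any column label is not a str.

    Hypothetical helper for illustration; it mirrors the intent of the
    pandas check in the traceback above, not its implementation.
    """
    bad = [name for name in names if not isinstance(name, str)]
    if bad:
        raise ValueError(
            "parquet must have string column names, got: "
            + ", ".join(repr(name) for name in bad)
        )


# String labels pass silently.
validate_string_column_names(["a", "b"])

# Integer labels, as in the MVE, are rejected up front.
try:
    validate_string_column_names([0, 1])
    error_message = None
except ValueError as exc:
    error_message = str(exc)
```

Calling something like this at the top of to_parquet would fail early for both engines, instead of writing a file that cannot be read back.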

TomAugspurger · 2019-06-26T11:49:48Z

Sounds good to me. Interested in making a PR?

Warning though, the parquet handling is under a somewhat large refactor in #4995. May be best to wait until after that.

ian-r-rose · 2022-03-24T23:59:31Z

Thanks for opening this @smangham. I'm going to close it in favor of a duplicate issue (#8010), which has a bit more discussion in it (and I think we'll try to raise the expected ValueError).

smangham changed the title ~~import_parquet fails for non-string columns~~ read_parquet fails for non-string columns Jun 25, 2019

smangham changed the title ~~read_parquet fails for non-string columns~~ read_parquet fails for non-string column names Jun 25, 2019

TomAugspurger added the dataframe label Jun 25, 2019

ian-r-rose closed this as completed Mar 24, 2022
