read_parquet fails for non-string column names #5000

Closed
smangham opened this issue Jun 25, 2019 · 4 comments

Comments

@smangham

Dask doesn't like saving and loading dataframes with numeric column names to parquet. Possibly related to #4922?

Objective:
I'm trying to save a dataframe with numeric column names to disk, then load it.

Environment:
On the latest versions of Dask, Pandas and fastparquet, pyarrow 0.12.0.

MVE:

import pandas as pd
import dask.dataframe as dd

# Column names are the integers 0 and 1, not strings
pdf = pd.DataFrame({0:[0, 1, 2], 1:['a', 'b', 'c']})
ddf = dd.from_pandas(pdf, npartitions=1)

ddf.to_parquet('bug.parquet', engine='pyarrow')
ddf2 = dd.read_parquet('bug.parquet', engine='pyarrow')
ddf2.compute()  # the KeyError is raised here

Expected vs actual behaviour:

I expect the dataframe to be saved and loaded.

Instead, using engine=pyarrow fails on the compute line, with:

KeyError: "None of [Index(['0', '1'], dtype='object')] are in the [columns]"

It looks like to_parquet converted the column names to strings when writing the metadata, but not in the actual data? So read_parquet can't find those columns when it tries to read the data proper.

Using engine=fastparquet instead fails on the to_parquet line, returning

TypeError: Column name must be a string. Got column 0 of type int

Which is at least explicitly clear that non-string column names are not supported.
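A possible workaround in the meantime is to convert the column names to strings before writing. The sketch below is plain Python and only builds the rename mapping; `stringified_names` is a hypothetical helper, not part of Dask, pandas or pyarrow, and the collision check is an added assumption about what a safe conversion should do.

```python
def stringified_names(columns):
    """Map each column label to its string form, refusing ambiguous results."""
    mapping = {col: str(col) for col in columns}
    # Labels such as 0 and "0" would collide once converted, so fail loudly.
    if len(set(mapping.values())) != len(mapping):
        raise ValueError("column names are not unique after conversion to str")
    return mapping


mapping = stringified_names([0, 1])
print(mapping)  # {0: '0', 1: '1'}
```

The mapping can then be passed to pandas' `DataFrame.rename(columns=mapping)` before calling `to_parquet`. Dask's `rename` is assumed to accept the same argument.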

Is this an issue with how Dask is wrapping pyarrow, or do I need to raise it on the pyarrow GitHub?

@smangham smangham changed the title import_parquet fails for non-string columns read_parquet fails for non-string columns Jun 25, 2019
@smangham smangham changed the title read_parquet fails for non-string columns read_parquet fails for non-string column names Jun 25, 2019
@TomAugspurger
Member

Does parquet support non-string column names? Pandas explicitly requires that they be strings.

In [34]: ddf.compute().to_parquet('foo.parquet', engine='pyarrow')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-34-3dd0aaf84b16> in <module>
----> 1 ddf.compute().to_parquet('foo.parquet', engine='pyarrow')

~/Envs/dask-dev/lib/python3.7/site-packages/pandas/core/frame.py in to_parquet(self, fname, engine, compression, index, partition_cols, **kwargs)
   2201         to_parquet(self, fname, engine,
   2202                    compression=compression, index=index,
-> 2203                    partition_cols=partition_cols, **kwargs)
   2204
   2205     @Substitution(header='Whether to print column labels, default True')

~/Envs/dask-dev/lib/python3.7/site-packages/pandas/io/parquet.py in to_parquet(df, path, engine, compression, index, partition_cols, **kwargs)
    250     impl = get_engine(engine)
    251     return impl.write(df, path, compression=compression, index=index,
--> 252                       partition_cols=partition_cols, **kwargs)
    253
    254

~/Envs/dask-dev/lib/python3.7/site-packages/pandas/io/parquet.py in write(self, df, path, compression, coerce_timestamps, index, partition_cols, **kwargs)
    104               coerce_timestamps='ms', index=None, partition_cols=None,
    105               **kwargs):
--> 106         self.validate_dataframe(df)
    107         path, _, _, _ = get_filepath_or_buffer(path, mode='wb')
    108

~/Envs/dask-dev/lib/python3.7/site-packages/pandas/io/parquet.py in validate_dataframe(df)
     56         # must have value column names (strings only)
     57         if df.columns.inferred_type not in {'string', 'unicode'}:
---> 58             raise ValueError("parquet must have string column names")
     59
     60         # index level names must be strings

ValueError: parquet must have string column names

@smangham
Author

Yeah, that's fair. Looking at the Parquet docs, name is a string variable. In that case I think the expected behaviour would be to raise the same ValueError as Pandas, rather than silently converting the names.
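As a rough illustration of that behaviour, a check along these lines could run before anything is written. This is a sketch only: `validate_column_names` is a hypothetical name, it is not how Dask or pandas implement the check (pandas inspects `df.columns.inferred_type`, as the traceback above shows), and it takes a plain iterable of labels so it runs without either library.

```python
def validate_column_names(columns):
    """Raise ValueError unless every column label is a str."""
    bad = [col for col in columns if not isinstance(col, str)]
    if bad:
        # Same leading message as the pandas error, plus the offending labels.
        raise ValueError(
            f"parquet must have string column names; got non-string names {bad!r}"
        )


validate_column_names(["a", "b"])  # passes silently

try:
    validate_column_names([0, 1])
except ValueError as err:
    print(err)
```

Failing at write time keeps the error next to its cause, instead of surfacing later as a KeyError on read.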

@TomAugspurger
Member

Sounds good to me. Interested in making a PR?

Warning though, the parquet handling is under a somewhat large refactor in #4995. May be best to wait until after that.

@ian-r-rose
Collaborator

Thanks for opening this, @smangham. I'm going to close it in favor of a duplicate issue (#8010), which has a bit more discussion in it (and I think we'll try to raise the expected ValueError).
