read_parquet fails for non-string column names #5000
Does parquet even support non-string column names? Pandas explicitly requires that they be strings:

In [34]: ddf.compute().to_parquet('foo.parquet', engine='pyarrow')
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-34-3dd0aaf84b16> in <module>
----> 1 ddf.compute().to_parquet('foo.parquet', engine='pyarrow')
~/Envs/dask-dev/lib/python3.7/site-packages/pandas/core/frame.py in to_parquet(self, fname, engine, compression, index, partition_cols, **kwargs)
2201 to_parquet(self, fname, engine,
2202 compression=compression, index=index,
-> 2203 partition_cols=partition_cols, **kwargs)
2204
2205 @Substitution(header='Whether to print column labels, default True')
~/Envs/dask-dev/lib/python3.7/site-packages/pandas/io/parquet.py in to_parquet(df, path, engine, compression, index, partition_cols, **kwargs)
250 impl = get_engine(engine)
251 return impl.write(df, path, compression=compression, index=index,
--> 252 partition_cols=partition_cols, **kwargs)
253
254
~/Envs/dask-dev/lib/python3.7/site-packages/pandas/io/parquet.py in write(self, df, path, compression, coerce_timestamps, index, partition_cols, **kwargs)
104 coerce_timestamps='ms', index=None, partition_cols=None,
105 **kwargs):
--> 106 self.validate_dataframe(df)
107 path, _, _, _ = get_filepath_or_buffer(path, mode='wb')
108
~/Envs/dask-dev/lib/python3.7/site-packages/pandas/io/parquet.py in validate_dataframe(df)
56 # must have value column names (strings only)
57 if df.columns.inferred_type not in {'string', 'unicode'}:
---> 58 raise ValueError("parquet must have string column names")
59
60 # index level names must be strings
ValueError: parquet must have string column names
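The check the traceback lands in is just a test on the column index's inferred type. A minimal pandas-only sketch (the `rename` workaround is my suggestion, not something from this thread) shows what pandas is objecting to and the usual fix:

```python
import pandas as pd

# Integer column names: this is what pandas' parquet writer rejects
df = pd.DataFrame({0: [1, 2], 1: [3, 4]})
print(df.columns.inferred_type)   # 'integer' -- not in {'string', 'unicode'}

# Common workaround: stringify the names before writing
df2 = df.rename(columns=str)
print(df2.columns.inferred_type)  # 'string' -- passes the validation above
```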
Yeah, that's fair. Looking at the Parquet docs, name is a string field. In that case I think the expected behaviour would be to throw the same ValueError as pandas, rather than silently converting the names.
Sounds good to me. Interested in making a PR? A warning, though: the parquet handling is in the middle of a fairly large refactor in #4995, so it may be best to wait until that lands.
Dask doesn't like saving and loading dataframes with numeric column names to parquet. Possibly related to #4922?
Objective:
I'm trying to save a dataframe with numeric column names to disk, then load it.
Environment:
The latest versions of Dask, Pandas, and fastparquet; pyarrow 0.12.0.
MVE:
Expected vs actual behaviour:
I expect the dataframe to be saved and loaded.
Instead, using engine=pyarrow it fails on the compute line. It looks like to_parquet has converted the index to string before writing it to the metadata, but not to the actual data, so it can't find those columns when it tries to read the data proper. Using engine=fastparquet instead fails on the to_parquet line, with an error that is at least explicitly clear that non-string column names are not supported.
Is this an issue with how Dask is wrapping pyarrow, or do I need to raise it on the pyarrow GitHub?