Cannot read pyarrow RangeIndex #414

bchu · 2019-04-02T01:05:36Z

import pandas as pd

df = pd.DataFrame([1,2,3], columns=['a'])
df.to_parquet('tmp.parquet', engine='pyarrow')
pd.read_parquet('tmp.parquet', engine='fastparquet')

This raises the following exception:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-42-d993694086f8> in <module>
      1 df = pd.DataFrame([1,2,3], columns=['a'])
      2 df.to_parquet('tmp.parquet', engine='pyarrow')
----> 3 pd.read_parquet('tmp.parquet', engine='fastparquet')

~/.pyenv/versions/3.6.2/envs/general/lib/python3.6/site-packages/pandas/io/parquet.py in read_parquet(path, engine, columns, **kwargs)
    280 
    281     impl = get_engine(engine)
--> 282     return impl.read(path, columns=columns, **kwargs)

~/.pyenv/versions/3.6.2/envs/general/lib/python3.6/site-packages/pandas/io/parquet.py in read(self, path, columns, **kwargs)
    209             parquet_file = self.api.ParquetFile(path)
    210 
--> 211         return parquet_file.to_pandas(columns=columns, **kwargs)
    212 
    213 

~/.pyenv/versions/3.6.2/envs/general/lib/python3.6/site-packages/fastparquet/api.py in to_pandas(self, columns, categories, filters, index)
    419         if index:
    420             columns += [i for i in index if i not in columns]
--> 421         check_column_names(self.columns + list(self.cats), columns, categories)
    422         df, views = self.pre_allocate(size, columns, categories, index)
    423         start = 0

~/.pyenv/versions/3.6.2/envs/general/lib/python3.6/site-packages/fastparquet/util.py in check_column_names(columns, *args)
     90     for arg in args:
     91         if isinstance(arg, (tuple, list)):
---> 92             if set(arg) - set(columns):
     93                 raise ValueError("Column name not in list.\n"
     94                                  "Requested %s\n"

TypeError: unhashable type: 'dict'

This is most likely the result of: pandas-dev/pandas#25672 and apache/arrow#3868
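For anyone landing here, a minimal sketch of why the check in the traceback fails. It assumes the newer pyarrow records the default index as a dict descriptor in the file's pandas metadata instead of as a column name; the `check_column_names` below is a simplified stand-in, not fastparquet's actual code, and the metadata values are illustrative.

```python
# Assumed shape of the index entry written by the newer pyarrow (illustrative values).
index_columns = [{"kind": "range", "name": None, "start": 0, "stop": 3, "step": 1}]


def check_column_names(columns, *args):
    """Simplified stand-in for the check shown in the traceback."""
    for arg in args:
        if isinstance(arg, (tuple, list)):
            # set() needs hashable elements, so a dict entry fails here.
            missing = set(arg) - set(columns)
            if missing:
                raise ValueError("Column name not in list: %s" % sorted(missing))


file_columns = ["a"]
# The index entries get appended to the requested columns before the check.
requested = ["a"] + index_columns

try:
    check_column_names(file_columns, requested)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)
```

With the older style of metadata the index entry is a plain string, so the same check goes through without a TypeError.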


martindurant · 2019-04-02T12:59:15Z

This is real and indeed due to the change in pyarrow. Obviously, index=False would solve this (I am surprised it is not the default when the index is the default range).
Currently, df.empty indeed uses the default range if none is specified, and setting the index in place with a new range would be trivial, so this would be a relatively easy fix for anyone interested. Of course, the counterpart of writing an index when it is a simple range should be implemented at some point too.
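A rough sketch of what that fix could look like, not a patch against fastparquet: separate the range descriptors from ordinary column names before any name check, and rebuild the range from the descriptor. The helper name `split_index_columns` is hypothetical, and the descriptor keys (`kind`, `start`, `stop`, `step`) are assumed to follow the metadata change referenced above.

```python
def split_index_columns(index_columns):
    """Split index entries into column names and rebuilt ranges (hypothetical helper)."""
    names, ranges = [], []
    for entry in index_columns:
        if isinstance(entry, dict):
            if entry.get("kind") != "range":
                raise ValueError("Unsupported index descriptor: %r" % (entry,))
            # No column data is stored for a range, only its bounds.
            ranges.append(range(entry["start"], entry["stop"], entry["step"]))
        else:
            names.append(entry)
    return names, ranges


names, ranges = split_index_columns(
    [{"kind": "range", "name": None, "start": 0, "stop": 3, "step": 1}]
)
print(names, [list(r) for r in ranges])
```

Only `names` would then be passed on to the column-name check; in pandas terms the rebuilt range corresponds to `pd.RangeIndex(start, stop, step)`.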

In the related Dask case, we actually know the number of rows for each partition, and ought to use the information to set the range in each and divisions globally where there's a range index, or where there is no index at all.
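To illustrate the Dask idea, here is a small sketch that turns per-partition row counts into one contiguous range per partition plus global divisions. The function name is hypothetical, partitions are assumed to be non-empty, and the divisions layout (first index value of every partition, then the last index value of the final partition) is my understanding of Dask's convention rather than something taken from this thread.

```python
from itertools import accumulate


def partition_ranges(row_counts):
    """Return (ranges, divisions) for non-empty partitions (hypothetical helper)."""
    if not row_counts or any(n <= 0 for n in row_counts):
        raise ValueError("expected at least one partition, all non-empty")
    # Running totals give the start offset of each partition.
    offsets = [0] + list(accumulate(row_counts))
    ranges = [range(lo, hi) for lo, hi in zip(offsets[:-1], offsets[1:])]
    # First index value of each partition, then the last value overall.
    divisions = tuple(offsets[:-1]) + (offsets[-1] - 1,)
    return ranges, divisions


ranges, divisions = partition_ranges([3, 2, 4])
print(ranges)     # [range(0, 3), range(3, 5), range(5, 9)]
print(divisions)  # (0, 3, 5, 8)
```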

bnsblue · 2019-06-05T01:24:29Z

@martindurant would you mind elaborating a bit on why setting index=False could solve the problem? I encountered the same issue as the OP when using fastparquet to read a parquet file written by pyarrow.

While setting index=False as you suggested in your last comment resolved the TypeError: unhashable type: 'dict', we later found that it seemed to add an extra column __index_level_0__ to the dataframe, which is undesirable.

   a  __index_level_0__
0  1                  0
1  2                  1
2  3                  2

So setting index=False cannot entirely solve my problem, and we would like to see whether a more complete fix is possible.

I'd be more than happy to contribute and create a fix for it, but I am not sure I fully understand what the actual problem is. It would be awesome if you could explain this error a bit more.

Thanks a lot!

martindurant · 2019-06-05T13:40:42Z

Exactly what you get will now depend on which version of pyarrow you used, as well as of fastparquet. In the past (<0.13), pyarrow would write real columns of data for the index, with names like the cryptic one you show. When you load with fastparquet and say "I don't want to set an index", it becomes an ordinary column. If you do allow it to be set as an index, the name should be reconstituted to None. You could just use columns= to ignore it completely.
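As a sketch of the columns= route: filter out the pyarrow-generated index columns by name and request only what remains. The helper `data_columns` is hypothetical, and the pattern assumes the generated names always look like `__index_level_N__`.

```python
import re

# Names pyarrow generates for unnamed index levels, e.g. __index_level_0__.
_INDEX_LEVEL = re.compile(r"^__index_level_\d+__$")


def data_columns(all_columns):
    """Drop auto-generated index columns, keeping the original order."""
    return [c for c in all_columns if not _INDEX_LEVEL.match(c)]


columns = data_columns(["a", "__index_level_0__"])
print(columns)  # ['a']
```

The resulting list could then be passed along, e.g. pd.read_parquet('tmp.parquet', engine='fastparquet', columns=columns), so the stored index never shows up as an ordinary column.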

In the most recent version of pyarrow, there would be no column data, but a range index metadata marker instead. It takes up no space, and there is no reason not to have it populate the index. In this case, if you said you wanted to ignore the index, or use another, the range should be ignored.
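The behaviour described in these two paragraphs could be sketched roughly as below. This is an illustration only: `choose_index` is a hypothetical function, the `index` argument is assumed to mean None (use what the file records), False (set no index) or a list of column names, and only the simple single-index case is handled.

```python
def choose_index(index_columns, index=None):
    """Decide what to use as the index (hypothetical policy sketch)."""
    if index is False:
        # Caller wants no index: any recorded range is ignored.
        return ("default", None)
    if index is not None:
        # Caller named other columns: the recorded range is ignored too.
        return ("columns", list(index))
    names = [e for e in index_columns if isinstance(e, str)]
    if names:
        # Older files: the index is stored as real column data.
        return ("columns", names)
    for entry in index_columns:
        if isinstance(entry, dict) and entry.get("kind") == "range":
            # Newer files: rebuild the range from the metadata marker.
            return ("range", range(entry["start"], entry["stop"], entry["step"]))
    return ("default", None)


marker = {"kind": "range", "name": None, "start": 0, "stop": 3, "step": 1}
print(choose_index([marker]))
print(choose_index([marker], index=False))
print(choose_index(["__index_level_0__"]))
```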

ghost · 2019-06-28T21:03:23Z

Why is this ticket closed? This breaks backwards compatibility, so ideally there should be a fix for it.

martindurant · 2019-06-30T13:48:54Z

Are you saying that current fastparquet can't read older pyarrow-written data? That would indeed be a problem.

jorisvandenbossche mentioned this issue May 2, 2019, in pandas-dev/pandas#25709 (merged): DOC: Add expanded index descriptors for specifying for RangeIndex-as-metadata in Parquet file schema

martindurant closed this as completed Jun 6, 2019
