Cannot read pyarrow RangeIndex #414

Closed
bchu opened this issue Apr 2, 2019 · 5 comments

@bchu

bchu commented Apr 2, 2019

import pandas as pd

df = pd.DataFrame([1,2,3], columns=['a'])
df.to_parquet('tmp.parquet', engine='pyarrow')
pd.read_parquet('tmp.parquet', engine='fastparquet')

This raises the following exception:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-42-d993694086f8> in <module>
      1 df = pd.DataFrame([1,2,3], columns=['a'])
      2 df.to_parquet('tmp.parquet', engine='pyarrow')
----> 3 pd.read_parquet('tmp.parquet', engine='fastparquet')

~/.pyenv/versions/3.6.2/envs/general/lib/python3.6/site-packages/pandas/io/parquet.py in read_parquet(path, engine, columns, **kwargs)
    280 
    281     impl = get_engine(engine)
--> 282     return impl.read(path, columns=columns, **kwargs)

~/.pyenv/versions/3.6.2/envs/general/lib/python3.6/site-packages/pandas/io/parquet.py in read(self, path, columns, **kwargs)
    209             parquet_file = self.api.ParquetFile(path)
    210 
--> 211         return parquet_file.to_pandas(columns=columns, **kwargs)
    212 
    213 

~/.pyenv/versions/3.6.2/envs/general/lib/python3.6/site-packages/fastparquet/api.py in to_pandas(self, columns, categories, filters, index)
    419         if index:
    420             columns += [i for i in index if i not in columns]
--> 421         check_column_names(self.columns + list(self.cats), columns, categories)
    422         df, views = self.pre_allocate(size, columns, categories, index)
    423         start = 0

~/.pyenv/versions/3.6.2/envs/general/lib/python3.6/site-packages/fastparquet/util.py in check_column_names(columns, *args)
     90     for arg in args:
     91         if isinstance(arg, (tuple, list)):
---> 92             if set(arg) - set(columns):
     93                 raise ValueError("Column name not in list.\n"
     94                                  "Requested %s\n"

TypeError: unhashable type: 'dict'

This is most likely the result of: pandas-dev/pandas#25672 and apache/arrow#3868
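
A minimal sketch of how that error can arise, using only the standard library. It assumes that newer pyarrow describes a default RangeIndex in the pandas metadata as a dict rather than as a column name, and that this dict ends up in the list of requested columns. The names `file_columns`, `range_descriptor` and `requested` are hypothetical stand-ins, not fastparquet variables.

```python
# Hypothetical stand-ins for the arguments seen in the traceback:
# the file's real column names, and the requested columns after the
# index entries from the pandas metadata have been appended.
file_columns = ["a"]
range_descriptor = {"kind": "range", "name": None, "start": 0, "stop": 3, "step": 1}
requested = ["a", range_descriptor]

try:
    # Same expression as in check_column_names: building a set from a
    # list that contains a dict fails, because dicts are not hashable.
    set(requested) - set(file_columns)
    message = None
except TypeError as exc:
    message = str(exc)

print(message)
```

The printed message should mention the unhashable 'dict', matching the last line of the traceback.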

@martindurant
Member

This is real and is indeed due to the change in pyarrow. Obviously, index=False would solve this (I am surprised it is not the default when the index is the default range).
Currently, df.empty does use the default range if none is specified, and setting the index in place with a new range would be trivial, so this would be a relatively easy fix for anyone interested. Of course, the counterpart, writing an index when it is a simple range, should be implemented at some point too.
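
As a rough illustration of "setting the index in place with a new range", here is a sketch that turns a range-index descriptor into a Python `range`. The descriptor layout (`kind`, `start`, `stop`, `step`) is an assumption about what pyarrow writes, and `range_from_descriptor` is a hypothetical helper, not part of fastparquet.

```python
def range_from_descriptor(descriptor, num_rows):
    """Build a Python range from a range-index descriptor (assumed layout)."""
    if descriptor.get("kind") != "range":
        raise ValueError("not a range-index descriptor")
    index = range(descriptor["start"], descriptor["stop"], descriptor["step"])
    # Guard against metadata that disagrees with the data actually read.
    if len(index) != num_rows:
        raise ValueError("descriptor does not match the number of rows")
    return index


descriptor = {"kind": "range", "name": None, "start": 0, "stop": 3, "step": 1}
index = range_from_descriptor(descriptor, num_rows=3)
print(list(index))  # [0, 1, 2]
```

`pd.RangeIndex` takes the same `start`, `stop` and `step` values plus a `name`, so the equivalent pandas object could be assigned to `df.index` after the data columns are loaded.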

In the related Dask case, we actually know the number of rows for each partition, and ought to use the information to set the range in each and divisions globally where there's a range index, or where there is no index at all.
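
A sketch of that idea for the Dask case, assuming the per-partition row counts are known and every partition is non-empty. `partition_ranges` and `divisions` are hypothetical helpers; the divisions follow the usual Dask convention of the first index value of each partition followed by the last index value of the final partition.

```python
from itertools import accumulate


def partition_ranges(row_counts):
    """Give each partition a contiguous slice of one global range."""
    offsets = [0] + list(accumulate(row_counts))
    return [range(lo, hi) for lo, hi in zip(offsets[:-1], offsets[1:])]


def divisions(row_counts):
    """First index value of each partition, plus the last value overall."""
    if not row_counts or any(n <= 0 for n in row_counts):
        raise ValueError("this sketch assumes non-empty partitions")
    ranges = partition_ranges(row_counts)
    return tuple(r[0] for r in ranges) + (ranges[-1][-1],)


row_counts = [3, 2, 4]
ranges = partition_ranges(row_counts)
divs = divisions(row_counts)
print(ranges)  # [range(0, 3), range(3, 5), range(5, 9)]
print(divs)    # (0, 3, 5, 8)
```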

@bnsblue

bnsblue commented Jun 5, 2019

@martindurant would you mind elaborating a bit on why setting index=False could solve the problem? I encountered the same issue as the OP when using fastparquet to read a parquet file written by pyarrow.

While setting index=False as you suggested in your last comment resolved the TypeError: unhashable type: 'dict', we later found that it seemed to add an extra column __index_level_0__ to the dataframe, which is undesirable.

   a  __index_level_0__
0  1                  0
1  2                  1
2  3                  2

So setting index=False does not entirely solve my problem, and we would like to see if we can have a more complete fix.

I'd be more than happy to contribute and create a fix for it, but I am not sure I fully understand what the actual problem is. It would be awesome if you could explain this error a bit more.

Thanks a lot!

@martindurant
Member

Exactly what you get will now depend on which version of pyarrow you used, as well as of fastparquet. In the past (<0.13), pyarrow would write real columns of data for the index, with names like the cryptic one you show. When you load with fastparquet and say "I don't want to set an index", it becomes an ordinary column. If you do allow it to be set as an index, the name should be reconstituted to None. You could just use columns= to ignore it completely.
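
To make the older case concrete, here is a small sketch that separates real data columns from index columns stored under the cryptic names. The `__index_level_N__` pattern is an assumption based on the column shown above, and `split_columns` and `restored_name` are hypothetical helpers.

```python
import re

# Assumed naming pattern for unnamed index levels stored as real columns.
INDEX_LEVEL = re.compile(r"__index_level_\d+__")


def split_columns(all_columns):
    """Separate ordinary data columns from stored index-level columns."""
    data = [c for c in all_columns if not INDEX_LEVEL.fullmatch(c)]
    index = [c for c in all_columns if INDEX_LEVEL.fullmatch(c)]
    return data, index


def restored_name(column):
    """Index name to use after loading: None for the placeholder names."""
    return None if INDEX_LEVEL.fullmatch(column) else column


data_columns, index_columns = split_columns(["a", "__index_level_0__"])
print(data_columns)   # ['a']
print(index_columns)  # ['__index_level_0__']
```

Passing `data_columns` as `columns=` to `pd.read_parquet` is one way to ignore the stored index column completely.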

In the most recent version of pyarrow, there would be no column data, but a range index metadata marker instead. It takes up no space, and there is no reason not to have it populate the index. In this case, if you said you wanted to ignore the index, or use another, the range should be ignored.
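
The behaviour described here could be sketched as a small decision function. Everything in it is an assumption made for illustration: the descriptor layout, the helper name `plan_index`, and the convention that `index=None` means "use whatever the metadata says" while `index=False` means "no index".

```python
def plan_index(index_columns, index=None):
    """Return (columns to load as the index, range descriptor or None)."""
    if index is False:
        # Caller wants no index: ignore stored columns and range markers.
        return [], None
    if index is not None:
        # Caller asked for a specific index: the range marker is ignored.
        wanted = [index] if isinstance(index, str) else list(index)
        return wanted, None
    # Default: honour the metadata, keeping names and range markers apart
    # so that dicts never get mixed into the list of column names.
    names = [c for c in index_columns if isinstance(c, str)]
    ranges = [c for c in index_columns
              if isinstance(c, dict) and c.get("kind") == "range"]
    return names, (ranges[0] if ranges else None)


marker = {"kind": "range", "name": None, "start": 0, "stop": 3, "step": 1}
print(plan_index([marker]))               # ([], {...the marker...})
print(plan_index([marker], index=False))  # ([], None)
print(plan_index([marker], index="a"))    # (['a'], None)
```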

@ghost

ghost commented Jun 28, 2019

Why is this ticket closed? This breaks backwards compatibility, so ideally there should be a fix for it.

@martindurant
Member

Are you saying that current fastparquet can't read older pyarrow-written data? That would indeed be a problem.
