
[Python] Error reading old Parquet file due to metadata backwards compatibility issue #18977

Closed
asfimport opened this issue May 16, 2018 · 6 comments

Comments

Pyarrow 0.8 and 0.9 raise an AssertionError for one of my datasets (created using an older version of pyarrow). Repro steps:

In [1]: from pyarrow.parquet import ParquetDataset

In [2]: d = ParquetDataset(['bug.parq'])

In [3]: t = d.read()

In [4]: t.to_pandas()
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-4-d17c9e2818f1> in <module>()
----> 1 t.to_pandas()

table.pxi in pyarrow.lib.Table.to_pandas()

~/envs/cli3/lib/python3.6/site-packages/pyarrow/pandas_compat.py in table_to_blockmanager(options, table, memory_pool, nthreads, categories)
    529     # There must be the same number of field names and physical names
    530     # (fields in the arrow Table)
--> 531     assert len(logical_index_names) == len(index_columns_set)
    532
    533     # It can never be the case in a released version of pyarrow that

AssertionError:

 

Here's the file: https://www.dropbox.com/s/oja3khjsc5tycfh/bug.parq

(I was not able to attach it here due to a "missing token", whatever that means.)
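
For reference, here is a minimal sketch for inspecting the embedded "pandas" metadata that to_pandas() parses; the bug.parq path is assumed from the repro above:

import json
import pyarrow.parquet as pq

# Load the table and pull out the b'pandas' blob stored in the Arrow schema.
table = pq.read_table('bug.parq')  # path assumed from the repro above
pandas_meta = json.loads(table.schema.metadata[b'pandas'])

# A mismatch between the declared index columns and the fields actually
# present in the table is what trips the assertion in table_to_blockmanager.
print(pandas_meta['index_columns'])
print([col['name'] for col in pandas_meta['columns']])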

Reporter: Dima Ryazanov / @dimaryaz
Assignee: Wes McKinney / @wesm

PRs and other links:

Note: This issue was originally created as ARROW-2592. Please see the migration documentation for further details.

Uwe Korn / @xhochy:
Do you still know which version the file was written with? We had a small range of commits between 0.7 and 0.8 that produced files later rejected by 0.8, but those commits were never part of a release.

Dima Ryazanov / @dimaryaz:
It looks like writing with version 0.6 causes the problem, while 0.7 and later are fine.

Though even if the Parquet file is broken, it should produce some sort of parse error rather than an assert, right?

Wes McKinney / @wesm:
I'm not quite sure what can be done here. We might have to add an option to ignore the pandas metadata, if any.

Wes McKinney / @wesm:
I'm adding an option to ignore the "pandas" metadata when calling "to_pandas", which is useful anyway.

In the meantime, you can work around your issue with:

table.cast(table.schema.remove_metadata()).to_pandas()
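
Applied end to end, a sketch assuming the bug.parq path from the repro above:

import pyarrow.parquet as pq

# Dropping the schema-level metadata (including the stale "pandas" blob)
# lets to_pandas() skip the index reconstruction that raises the assertion.
table = pq.read_table('bug.parq')  # path assumed from the repro above
df = table.cast(table.schema.remove_metadata()).to_pandas()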

Wes McKinney / @wesm:
It will look like this:

In [6]: t.to_pandas(ignore_metadata=True).head()
Out[6]: 
   Row ID  Order ID Order Date        ...         Product Base Margin  Ship Date  __index_level_0__
0       1         3 2010-10-13        ...                         0.8 2010-10-20                  0
1      49       293 2012-10-01        ...                        0.58 2012-10-02                  1
2      50       293 2012-10-01        ...                        0.39 2012-10-03                  2
3      80       483 2011-07-10        ...                        0.58 2011-07-12                  3
4      85       515 2010-08-28        ...                         0.5 2010-08-30                  4

[5 rows x 22 columns]

I'll put up a PR once I write some tests.

If others really think this is a bad idea, backwards compatibility with this old metadata might be possible, but it would be a bit hacky.

Krisztian Szucs / @kszucs:
Issue resolved by pull request #3239

asfimport added this to the 0.12.0 milestone on Jan 11, 2023