
[Python] Error reading old Parquet file due to metadata backwards compatibility issue #18977

Closed
asfimport opened this issue May 16, 2018 · 6 comments

Comments

Pyarrow 0.8 and 0.9 raise an AssertionError for one of my datasets (created using an older version of pyarrow). Repro steps:

In [1]: from pyarrow.parquet import ParquetDataset

In [2]: d = ParquetDataset(['bug.parq'])

In [3]: t = d.read()

In [4]: t.to_pandas()
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-4-d17c9e2818f1> in <module>()
----> 1 t.to_pandas()

table.pxi in pyarrow.lib.Table.to_pandas()

~/envs/cli3/lib/python3.6/site-packages/pyarrow/pandas_compat.py in table_to_blockmanager(options, table, memory_pool, nthreads, categories)
    529     # There must be the same number of field names and physical names
    530     # (fields in the arrow Table)
--> 531     assert len(logical_index_names) == len(index_columns_set)
    532
    533     # It can never be the case in a released version of pyarrow that

AssertionError:

 

Here's the file: https://www.dropbox.com/s/oja3khjsc5tycfh/bug.parq

(I was not able to attach it here due to a "missing token", whatever that means.)
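
For reference, here is a minimal sketch for inspecting the embedded "pandas" metadata that to_pandas() parses; the bug.parq path is assumed from the repro above:

import json
import pyarrow.parquet as pq

# Load the table and pull out the b'pandas' blob stored in the Arrow schema.
table = pq.read_table('bug.parq')  # path assumed from the repro above
pandas_meta = json.loads(table.schema.metadata[b'pandas'])

# A mismatch between the declared index columns and the fields actually
# present in the table is what trips the assertion in table_to_blockmanager.
print(pandas_meta['index_columns'])
print([col['name'] for col in pandas_meta['columns']])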

Reporter: Dima Ryazanov / @dimaryaz
Assignee: Wes McKinney / @wesm

PRs and other links:

Note: This issue was originally created as ARROW-2592. Please see the migration documentation for further details.

Uwe Korn / @xhochy:
Do you still know which version the file was written with? We had a small range of commits between 0.7 and 0.8 that produced files later rejected by 0.8, but those commits were never part of a release.

Dima Ryazanov / @dimaryaz:
It looks like writing with version 0.6 causes the problem, while 0.7 and later are fine.

Though even if the Parquet file is broken, it should produce some sort of parse error rather than an assert, right?

Wes McKinney / @wesm:
I'm not quite sure what can be done here. We might have to add an option to ignore the pandas metadata, if any.

Wes McKinney / @wesm:
I'm adding an option to ignore the "pandas" metadata when calling "to_pandas", which is useful anyway.

In the meantime, you can work around your issue with:

table.cast(table.schema.remove_metadata()).to_pandas()
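
Applied end to end, a sketch assuming the bug.parq path from the repro above:

import pyarrow.parquet as pq

# Dropping the schema-level metadata (including the stale "pandas" blob)
# lets to_pandas() skip the index reconstruction that raises the assertion.
table = pq.read_table('bug.parq')  # path assumed from the repro above
df = table.cast(table.schema.remove_metadata()).to_pandas()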

Wes McKinney / @wesm:
It will look like this:

In [6]: t.to_pandas(ignore_metadata=True).head()
Out[6]: 
   Row ID  Order ID Order Date        ...         Product Base Margin  Ship Date  __index_level_0__
0       1         3 2010-10-13        ...                         0.8 2010-10-20                  0
1      49       293 2012-10-01        ...                        0.58 2012-10-02                  1
2      50       293 2012-10-01        ...                        0.39 2012-10-03                  2
3      80       483 2011-07-10        ...                        0.58 2011-07-12                  3
4      85       515 2010-08-28        ...                         0.5 2010-08-30                  4

[5 rows x 22 columns]

I'll put up a PR once I write some tests.

If others really think this is a bad idea, backwards compatibility with this old metadata might be possible, but it would be a bit hacky.

Krisztian Szucs / @kszucs:
Issue resolved by pull request #3239

asfimport added this to the 0.12.0 milestone on Jan 11, 2023