
[Python] Unable to write pandas DataFrame w/MultiIndex containing duplicate values to parquet #15424

Closed
asfimport opened this issue Jun 20, 2017 · 3 comments


pandas DataFrames with a MultiIndex seem to convert to a Table just fine. However, when writing the Table to disk using pyarrow.parquet, I am unable to write DataFrames whose MultiIndex contains a level with duplicate values (which is nearly always the case for me). Here is an example in Python, with working cases first and the failing case at the bottom:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

num_rows = 3
example = pd.DataFrame({'strs': ['foo', 'foo', 'bar'],
                        'nums_b': range(num_rows),
                        'nums_a': range(num_rows)})


def pq_write(df):
    table = pa.Table.from_pandas(df)
    pq.write_table(table, '/tmp/df.parquet')
# single index works
pq_write(example)
pq_write(example.set_index(['nums_b']))
# single index with duplicate values works
pq_write(example.set_index(['strs']))
# MultiIndex whose values are unique within each level works
pq_write(example.set_index(['nums_b', 'nums_a']))
# MultiIndex where one level contains duplicate values FAILS
pq_write(example.set_index(['strs', 'nums_a']))
Traceback (most recent call last):
  File "test_arrow.py", line 26, in <module>
    pq_write(example.set_index(['strs', 'nums_a']))
  File "test_arrow.py", line 13, in pq_write
    pq.write_table(table, '/tmp/df.parquet')
  File "/Users/bmabey/anaconda/envs/test_pyarrow/lib/python3.5/site-packages/pyarrow/parquet.py", line 702, in write_table
    writer.write_table(table, row_group_size=row_group_size)
  File "pyarrow/_parquet.pyx", line 609, in pyarrow._parquet.ParquetWriter.write_table (/Users/travis/miniconda3/conda-bld/pyarrow_1497322770287/work/arrow-46315431aeda3b6968b3ac4c1087f6d41052b99d/python/build/temp.macosx-10.9-x86_64-3.5/_parquet.cxx:11025)
  File "pyarrow/error.pxi", line 60, in pyarrow.lib.check_status (/Users/travis/miniconda3/conda-bld/pyarrow_1497322770287/work/arrow-46315431aeda3b6968b3ac4c1087f6d41052b99d/python/build/temp.macosx-10.9-x86_64-3.5/lib.cxx:6899)
pyarrow.lib.ArrowIOError: IOError: Written rows: 2 != expected rows: 3in the current column chunk

Note that the number of rows written equals the number of unique values in the strs level. I have found this to be the case every time I've hit this error.
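
A hedged guess at the root cause, given that observation: a pandas MultiIndex stores each level as its unique values (index.levels[i]) plus integer codes pointing into them, while index.get_level_values(i) materializes the full-length column. If the writer serializes the deduplicated levels rather than the materialized values, the index column comes out shorter than the data columns whenever a level has duplicates. A minimal sketch of the length mismatch (pure pandas, no pyarrow involved):

import pandas as pd

df = pd.DataFrame({'strs': ['foo', 'foo', 'bar'],
                   'nums_a': range(3)}).set_index(['strs', 'nums_a'])

# levels[0] holds only the unique values of the level -> length 2
print(list(df.index.levels[0]))            # ['bar', 'foo']
# get_level_values(0) materializes the full column -> length 3
print(list(df.index.get_level_values(0)))  # ['foo', 'foo', 'bar']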

I'm happy to write a patch for this, assuming it is a bug, if you can point me in the right direction.
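
In the meantime, one workaround that sidesteps index serialization entirely (a sketch using only public pandas/pyarrow APIs) is to reset the index before writing and re-set it after reading:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({'strs': ['foo', 'foo', 'bar'],
                   'nums_a': range(3)}).set_index(['strs', 'nums_a'])

# Demote the index levels to plain columns so nothing is deduplicated on write.
table = pa.Table.from_pandas(df.reset_index())
pq.write_table(table, '/tmp/df.parquet')

# Rebuild the MultiIndex after reading back.
restored = pq.read_table('/tmp/df.parquet').to_pandas().set_index(['strs', 'nums_a'])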

Environment: OS X, miniconda, using the pyarrow build from conda-forge
Reporter: Ben Mabey
Assignee: Phillip Cloud / @cpcloud

Note: This issue was originally created as ARROW-1132. Please see the migration documentation for further details.


Wes McKinney / @wesm:
@cpcloud could you take a look at this?


Phillip Cloud / @cpcloud:
Yep, on it.


Wes McKinney / @wesm:
Issue resolved by pull request #768
