
[Python] Unable to write pandas DataFrame w/MultiIndex containing duplicate values to parquet #15424

Closed
asfimport opened this issue Jun 20, 2017 · 3 comments


pandas DataFrames with a MultiIndex seem to convert to a Table just fine. However, when writing the Table to disk using pyarrow.parquet, I am unable to write DataFrames whose MultiIndex contains a level with duplicate values (which is nearly always the case for me). Here is an example in Python, with working cases first and the failing case at the bottom:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

num_rows = 3
example = pd.DataFrame({'strs': ['foo', 'foo', 'bar'],
                        'nums_b': range(num_rows),
                        'nums_a': range(num_rows)})


def pq_write(df):
    table = pa.Table.from_pandas(df)
    pq.write_table(table, '/tmp/df.parquet')
# single index works
pq_write(example)
pq_write(example.set_index(['nums_b']))
# single index with duplicate values works
pq_write(example.set_index(['strs']))
# MultiIndex whose values are unique within each level works
pq_write(example.set_index(['nums_b', 'nums_a']))
# MultiIndex where one level contains duplicate values FAILS
pq_write(example.set_index(['strs', 'nums_a']))
Traceback (most recent call last):
  File "test_arrow.py", line 26, in <module>
    pq_write(example.set_index(['strs', 'nums_a']))
  File "test_arrow.py", line 13, in pq_write
    pq.write_table(table, '/tmp/df.parquet')
  File "/Users/bmabey/anaconda/envs/test_pyarrow/lib/python3.5/site-packages/pyarrow/parquet.py", line 702, in write_table
    writer.write_table(table, row_group_size=row_group_size)
  File "pyarrow/_parquet.pyx", line 609, in pyarrow._parquet.ParquetWriter.write_table (/Users/travis/miniconda3/conda-bld/pyarrow_1497322770287/work/arrow-46315431aeda3b6968b3ac4c1087f6d41052b99d/python/build/temp.macosx-10.9-x86_64-3.5/_parquet.cxx:11025)
  File "pyarrow/error.pxi", line 60, in pyarrow.lib.check_status (/Users/travis/miniconda3/conda-bld/pyarrow_1497322770287/work/arrow-46315431aeda3b6968b3ac4c1087f6d41052b99d/python/build/temp.macosx-10.9-x86_64-3.5/lib.cxx:6899)
pyarrow.lib.ArrowIOError: IOError: Written rows: 2 != expected rows: 3in the current column chunk

Note that the number of rows written equals the number of unique values in the strs level. I have found this to be the case every time I've hit this error.
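
A hedged guess at the root cause, given that observation: a pandas MultiIndex stores each level as its unique values (index.levels[i]) plus integer codes pointing into them, while index.get_level_values(i) materializes the full-length column. If the writer serializes the deduplicated levels rather than the materialized values, the index column comes out shorter than the data columns whenever a level has duplicates. A minimal sketch of the length mismatch (pure pandas, no pyarrow involved):

import pandas as pd

df = pd.DataFrame({'strs': ['foo', 'foo', 'bar'],
                   'nums_a': range(3)}).set_index(['strs', 'nums_a'])

# levels[0] holds only the unique values of the level -> length 2
print(list(df.index.levels[0]))            # ['bar', 'foo']
# get_level_values(0) materializes the full column -> length 3
print(list(df.index.get_level_values(0)))  # ['foo', 'foo', 'bar']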

I'm happy to write a patch for this, assuming it is a bug, if you can point me in the right direction.
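
In the meantime, one workaround that sidesteps index serialization entirely (a sketch using only public pandas/pyarrow APIs) is to reset the index before writing and re-set it after reading:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({'strs': ['foo', 'foo', 'bar'],
                   'nums_a': range(3)}).set_index(['strs', 'nums_a'])

# Demote the index levels to plain columns so nothing is deduplicated on write.
table = pa.Table.from_pandas(df.reset_index())
pq.write_table(table, '/tmp/df.parquet')

# Rebuild the MultiIndex after reading back.
restored = pq.read_table('/tmp/df.parquet').to_pandas().set_index(['strs', 'nums_a'])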

Environment: OS X, miniconda, using the pyarrow build from conda-forge
Reporter: Ben Mabey
Assignee: Phillip Cloud / @cpcloud

Note: This issue was originally created as ARROW-1132. Please see the migration documentation for further details.


Wes McKinney / @wesm:
@cpcloud could you take a look at this?


Phillip Cloud / @cpcloud:
Yep, on it.


Wes McKinney / @wesm:
Issue resolved by pull request #768
