pandas DataFrames that have a MultiIndex always seem to convert to a Table just fine. However, when writing the Table to disk using pyarrow.parquet, I am unable to write DataFrames whose MultiIndex contains a level with duplicate values (which is nearly always the case for me). Here is an example in Python with working cases first and the failing case at the bottom:
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

num_rows = 3
example = pd.DataFrame({'strs': ['foo', 'foo', 'bar'],
                        'nums_b': range(num_rows),
                        'nums_a': range(num_rows)})

def pq_write(df):
    table = pa.Table.from_pandas(df)
    pq.write_table(table, '/tmp/df.parquet')

# single index works
pq_write(example)
pq_write(example.set_index(['nums_b']))

# single index with duplicate values work
pq_write(example.set_index(['strs']))

# MultiIndex with all unique, relative to the level/column, values works
pq_write(example.set_index(['nums_b', 'nums_a']))

# MultiIndex with one level with duplicate values in one index FAILS
pq_write(example.set_index(['strs', 'nums_a']))

Traceback (most recent call last):
  File "test_arrow.py", line 26, in <module>
    pq_write(example.set_index(['strs', 'nums_a']))
  File "test_arrow.py", line 13, in pq_write
    pq.write_table(table, '/tmp/df.parquet')
  File "/Users/bmabey/anaconda/envs/test_pyarrow/lib/python3.5/site-packages/pyarrow/parquet.py", line 702, in write_table
    writer.write_table(table, row_group_size=row_group_size)
  File "pyarrow/_parquet.pyx", line 609, in pyarrow._parquet.ParquetWriter.write_table (/Users/travis/miniconda3/conda-bld/pyarrow_1497322770287/work/arrow-46315431aeda3b6968b3ac4c1087f6d41052b99d/python/build/temp.macosx-10.9-x86_64-3.5/_parquet.cxx:11025)
  File "pyarrow/error.pxi", line 60, in pyarrow.lib.check_status (/Users/travis/miniconda3/conda-bld/pyarrow_1497322770287/work/arrow-46315431aeda3b6968b3ac4c1087f6d41052b99d/python/build/temp.macosx-10.9-x86_64-3.5/lib.cxx:6899)
pyarrow.lib.ArrowIOError: IOError: Written rows: 2 != expected rows: 3in the current column chunk

Note that the number of written rows (2) equals the number of unique values in the strs level, while the expected number (3) is the row count of the DataFrame. I have found this to always be the case when I've hit this error message.
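
To make that observation easy to check on other DataFrames, here is a minimal pandas-only sketch that compares the number of unique values in each index level with the row count. The helper name `level_summary` is hypothetical and not part of pandas or pyarrow; the sketch only inspects the index and does not attempt to explain or reproduce the write failure itself.

```python
import pandas as pd

example = pd.DataFrame({'strs': ['foo', 'foo', 'bar'],
                        'nums_b': range(3),
                        'nums_a': range(3)})


def level_summary(df):
    """Map each index level name to (unique value count, row count).

    Hypothetical helper for illustration only.
    """
    index = df.index
    return {
        name: (index.get_level_values(pos).nunique(), len(df))
        for pos, name in enumerate(index.names)
    }


summary = level_summary(example.set_index(['strs', 'nums_a']))
print(summary)
```

For the failing case, the strs level reports 2 unique values against 3 rows, which matches the "Written rows: 2 != expected rows: 3" in the error message.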
I'm happy to write a patch for this assuming this is a bug and you can point me in the right direction.
Environment: OS X, miniconda, using a pyarrow build from conda-forge
Reporter: Ben Mabey
Assignee: Phillip Cloud / @cpcloud
Note: This issue was originally created as ARROW-1132. Please see the migration documentation for further details.