[Python] KeyError: '__index_level_0__' passing Table.from_pandas its own schema #23313

asfimport · 2019-10-26T16:34:38Z

Steps to reproduce:

Generate any DataFrame's pyarrow Schema using Table.from_pandas
Pass the generated schema as input into Table.from_pandas

Causes KeyError: '__index_level_0__'

We did not have this issue with pyarrow==0.11.0, which we used to write many partitions across years. Our goal now is to use pyarrow==0.15.0 and produce schemas going forward that are backwards compatible (i.e. also have '__index_level_0__'), so that we do not need to re-generate all prior years' partitions when we migrate to 0.15.0.

We cannot set preserve_index=False, since that effectively deletes '__index_level_0__', making the schema inconsistent with earlier partitions that were written using pyarrow==0.11.0.

import pandas as pd
import pyarrow as pa
df = pd.DataFrame() 
schema = pa.Table.from_pandas(df).schema
pa_table = pa.Table.from_pandas(df, schema=schema)


Traceback (most recent call last):
  File "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 3078, in get_loc
    return self._engine.get_loc(key)
  File "pandas/_libs/index.pyx", line 140, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 162, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1492, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1500, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: '__index_level_0__'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 408, in _get_columns_to_convert_given_schema
    col = df[name]
  File "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/frame.py", line 2688, in __getitem__
    return self._getitem_column(key)
  File "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/frame.py", line 2695, in _getitem_column
    return self._get_item_cache(key)
  File "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/generic.py", line 2489, in _get_item_cache
    values = self._data.get(item)
  File "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/internals.py", line 4115, in get
    loc = self.items.get_loc(item)
  File "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 3080, in get_loc
    return self._engine.get_loc(self._maybe_cast_indexer(key))
  File "pandas/_libs/index.pyx", line 140, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 162, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1492, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1500, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: '__index_level_0__'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 3326, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-36-6711a2fcec96>", line 5, in <module>
    pa_table = pa.Table.from_pandas(df, schema=pa.Table.from_pandas(df).schema)
  File "pyarrow/table.pxi", line 1057, in pyarrow.lib.Table.from_pandas
  File "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 517, in dataframe_to_arrays
    columns)
  File "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 337, in _get_columns_to_convert
    return _get_columns_to_convert_given_schema(df, schema, preserve_index)
  File "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 426, in _get_columns_to_convert_given_schema
    "in the columns or index".format(name))
KeyError: "name '__index_level_0__' present in the specified schema is not found in the columns or index"

Environment: pandas==0.23.4
pyarrow==0.15.0 # Issue also with 0.14.0, 0.13.0 & 0.12.0. but not 0.11.0

Reporter: Tom Goodman
Assignee: Joris Van den Bossche / @jorisvandenbossche

Original Issue Attachments:

test3.hdf

PRs and other links:

GitHub Pull Request #5750

Note: This issue was originally created as ARROW-6999. Please see the migration documentation for further details.


asfimport · 2019-10-26T19:09:27Z

Wes McKinney / @wesm:
I'll let @jorisvandenbossche take a look

asfimport · 2019-10-28T11:00:55Z

Joris Van den Bossche / @jorisvandenbossche:
[~goodiegoodman] thanks for the report!

Your "steps to reproduce" actually do work if you do not use an empty dataframe:

In [15]: import pandas as pd 
    ...: import pyarrow as pa 
    ...: df = pd.DataFrame({'a': [1, 2, 3]})  
    ...: schema = pa.Table.from_pandas(df).schema 
    ...: pa_table = pa.Table.from_pandas(df, schema=schema)                                                                                                                                                        

In [16]: schema                                                                                                                                                                                                    
Out[16]: 
a: int64
metadata
--------
{b'pandas': b'{"index_columns": [{"kind": "range", "name": null, "start": 0, "'
            b'stop": 3, "step": 1}], "column_indexes": [{"name": null, "field_'
            b'name": null, "pandas_type": "unicode", "numpy_type": "object", "'
            b'metadata": {"encoding": "UTF-8"}}], "columns": [{"name": "a", "f'
            b'ield_name": "a", "pandas_type": "int64", "numpy_type": "int64", '
            b'"metadata": null}], "creator": {"library": "pyarrow", "version":'
            b' "0.15.1.dev177+g5df424bd6"}, "pandas_version": "0.26.0.dev0+669'
            b'.g3c29114b1"}'}

The empty dataframe is a tricky edge case regarding the index, because in such a case the index is not a RangeIndex but an empty object-dtype Index (see ARROW-5104 for a similar report about that aspect).

That said, if passing an explicit schema, and if there is a column not found that has a "__index_level_i__" pattern, we should try to handle this (certainly in case of passing preserve_index=True).
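
As a rough illustration of the handling described above, the sketch below separates schema fields that are missing from the DataFrame columns into generated index placeholders and genuinely unknown names. It is not pyarrow's implementation; the helper names `index_level_position` and `split_missing_fields` are hypothetical, and it works on plain lists of names rather than real schema objects.

```python
import re

# Placeholder names given to unnamed index levels: __index_level_0__, __index_level_1__, ...
_INDEX_LEVEL_RE = re.compile(r"__index_level_(\d+)__")


def index_level_position(field_name):
    """Return i for a name of the form '__index_level_i__', otherwise None."""
    match = _INDEX_LEVEL_RE.fullmatch(field_name)
    return int(match.group(1)) if match else None


def split_missing_fields(schema_names, column_names):
    """Split schema fields absent from the columns into two lists:
    generated index placeholders, and names that are simply unknown."""
    columns = set(column_names)
    index_levels, unknown = [], []
    for name in schema_names:
        if name in columns:
            continue  # an ordinary column, nothing to resolve
        if index_level_position(name) is not None:
            index_levels.append(name)  # candidate for lookup in the index
        else:
            unknown.append(name)  # a real mismatch worth reporting
    return index_levels, unknown
```

With a split like this, only the second list would need to raise an error; the first could be resolved against the DataFrame's index levels by position.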

asfimport · 2019-10-28T15:15:30Z

Joris Van den Bossche / @jorisvandenbossche:
So this case is clearly a bug in the new implementation, I would say:

In [23]: import pandas as pd 
    ...: import pyarrow as pa 
    ...: df = pd.DataFrame({'a': [1, 2, 3]})  
    ...: schema = pa.Table.from_pandas(df, preserve_index=True).schema 
    ...: pa_table = pa.Table.from_pandas(df, schema=schema, preserve_index=True)                                                                                                                                   
...
KeyError: "name '__index_level_0__' present in the specified schema is not found in the columns or index"

So if you specify preserve_index=True, and there is an index in the schema that did not have a name in the DataFrame (so ending up as the generated __index_level_i__), the above should work when passing an explicit schema matching that.
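
To make the naming rule concrete, here is a simplified stand-in for how index levels end up with field names. The function `index_field_names` is hypothetical and not part of pyarrow; the rule that a name clashing with an existing column also falls back to the placeholder is an assumption of this sketch.

```python
def index_field_names(index_names, column_names=()):
    """Map index level names to schema field names.

    A named level keeps its name; an unnamed level (None) gets the
    generated placeholder '__index_level_i__'. Assumption of this sketch:
    a name that clashes with a column also falls back to the placeholder.
    """
    taken = set(column_names)
    fields = []
    for i, name in enumerate(index_names):
        if name is not None and name not in taken:
            fields.append(str(name))
        else:
            fields.append("__index_level_{}__".format(i))
    return fields
```

An unnamed single-level index therefore yields `['__index_level_0__']`, which is the field name that an explicit schema has to be matched against.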

Will look into fixing this (it's a pity that 0.15.1 is already released, it would have been nice to include this).

asfimport · 2019-10-28T18:12:21Z

Tom Goodman:
@jorisvandenbossche please try this with the attached test3.hdf (not empty)

df2 = pd.read_hdf('test3.hdf','foo')
pa.Table.from_pandas(df2, schema=pa.Table.from_pandas(df2).schema)

I still get KeyError: '__index_level_0__' (without specifying preserve_index).

This may be because the index on test3.hdf is an Int64Index, and I see the pyarrow docs say the default behavior is to "store the index as a column", except for range indexes. This unfortunately makes the bug more prevalent.

asfimport · 2019-10-29T10:44:56Z

Joris Van den Bossche / @jorisvandenbossche:
Thanks for the reproducer! It's indeed due to the non-range index. In terms of the simpler example, I think the following is equivalent to yours:

df2 = pd.DataFrame({'a': [1, 2, 3]}, index=[0, 1, 2])
pa.Table.from_pandas(df2, schema=pa.Table.from_pandas(df2).schema)

which indeed gives that error. In the end, it boils down to the same bug as my example above using a RangeIndex but specifying preserve_index=True (as that forces the index to become a column, just as if you had a non-RangeIndex).

asfimport · 2019-10-29T12:24:04Z

Joris Van den Bossche / @jorisvandenbossche:
[~goodiegoodman] so I did a PR to fix this: #5750 (I tried to add a lot of test cases when refactoring this for 0.15; it's a pity I overlooked this one, which is quite obvious in hindsight).

But the question for you still is: is there a way to deal with this that is compatible across different releases?
I do not fully understand your explanation from above:

We did not have this issue with pyarrow==0.11.0, which we used to write many partitions across years. Our goal now is to use pyarrow==0.15.0 and produce schemas going forward that are backwards compatible (i.e. also have '__index_level_0__'), so that we do not need to re-generate all prior years' partitions when we migrate to 0.15.0.

What exactly do you mean by "write"? (To what file format? Or how is the schema stored?)
One option I can think of (but I am not sure it fits your use case) is to make sure that your index has a name. Adapting the above example:

df3 = pd.DataFrame({'a': [1, 2, 3]}, index=pd.Int64Index([0, 1, 2], name='index'))
pa.Table.from_pandas(df3, schema=pa.Table.from_pandas(df3).schema)

This works on 0.11.0 and on 0.15.0. However, this then fails on 0.13/0.14 (which is one of the reasons we tried to clean up and normalize this handling of the passed schema in 0.15).

asfimport · 2019-10-29T17:20:43Z

Tom Goodman:
@jorisvandenbossche thank you for the quick turn-around!

We store the partitions in parquet files, with directories defining the partitions and a _common_metadata file holding the schema. This allows us to use the ParquetDataset partition-level filters like [[('yyyymm', '=', 201909)]] ...


tree
.
|-- _common_metadata
|-- yyyymm=201909
|   `-- e097411586b0460e860c331b63fecb2b.parquet
`-- yyyymm=201910
    `-- b8de9aa413194cc4af6f4802b5c4923f.parquet
.
.
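
For readers unfamiliar with this layout, the sketch below shows how `key=value` directory names can be parsed and matched against a filter such as `[[('yyyymm', '=', 201909)]]`. It only illustrates the idea and is not how ParquetDataset is implemented; the helpers `parse_partition_path` and `matches` are hypothetical, and the filter is read as a list of AND-groups that are OR-ed together.

```python
import operator

# Comparison operators supported by this sketch.
_OPS = {
    "=": operator.eq, "==": operator.eq, "!=": operator.ne,
    "<": operator.lt, ">": operator.gt, "<=": operator.le, ">=": operator.ge,
}


def parse_partition_path(path):
    """Extract {'yyyymm': '201909'} from 'yyyymm=201909/<file>.parquet'."""
    keys = {}
    for part in path.split("/"):
        if "=" in part:
            key, _, value = part.partition("=")
            keys[key] = value
    return keys


def matches(path, filters):
    """Return True if the path satisfies any AND-group in `filters`."""
    keys = parse_partition_path(path)

    def check(condition):
        key, op, value = condition
        if key not in keys:
            return False
        # Cast the directory value (a string) to the type of the filter value.
        return _OPS[op](type(value)(keys[key]), value)

    return any(all(check(c) for c in group) for group in filters)


paths = [
    "yyyymm=201909/e097411586b0460e860c331b63fecb2b.parquet",
    "yyyymm=201910/b8de9aa413194cc4af6f4802b5c4923f.parquet",
]
selected = [p for p in paths if matches(p, [[("yyyymm", "=", 201909)]])]
```

Here `selected` contains only the file under `yyyymm=201909`, so files in other partitions never need to be opened.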

asfimport · 2019-10-29T19:43:04Z

Tom Goodman:
Thanks to your suggestions, we now have a work-around that allows us to remain backwards-compatible!
If we get a KeyError due to a missing '__index_level_0__', we'll set df.index.name = '__index_level_0__' and call the same from_pandas function again.

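    try:
        table = pa.Table.from_pandas(df, schema=schema)
    except KeyError as e:
        if '__index_level_0__' in str(e):  # Happens in pyarrow 0.15.0, not 0.11.0
            # Name the index after the generated field so the schema lookup succeeds
            df.index.name = '__index_level_0__'
            table = pa.Table.from_pandas(df, schema=schema)
        else:
            raise  # re-raise unrelated KeyErrors with the original traceback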
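
The version comment in the snippet above could also be expressed as an explicit check. The following is a minimal sketch, assuming the affected range is 0.12.0 up to (but not including) 0.16.0: the lower bound comes from the versions listed in the report, and the upper bound is an assumption based on the 0.16.0 milestone of this issue. The helpers `parse_version` and `needs_index_name_workaround` are hypothetical names.

```python
import re


def parse_version(version):
    """Return the leading numeric components, e.g. '0.15.1.dev177' -> (0, 15, 1)."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError("unrecognised version string: {!r}".format(version))
    # A missing patch component (e.g. '0.15') is treated as 0.
    return tuple(int(part) if part else 0 for part in match.groups())


def needs_index_name_workaround(version):
    """True for versions assumed to be affected: 0.12.0 <= version < 0.16.0."""
    return (0, 12, 0) <= parse_version(version) < (0, 16, 0)
```

In practice the argument would be `pa.__version__`. Catching the KeyError, as done above, remains the more robust option because it does not depend on the assumed version bounds.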

Thanks so much @jorisvandenbossche!

asfimport · 2019-10-29T21:19:01Z

Joris Van den Bossche / @jorisvandenbossche:
That sounds like a decent enough workaround for now. Happy you found a way to deal with it!

asfimport · 2019-11-05T14:40:22Z

Antoine Pitrou / @pitrou:
Issue resolved by pull request 5750
#5750

asfimport closed this as completed Nov 5, 2019

asfimport assigned jorisvandenbossche Jan 10, 2023

asfimport added this to the 0.16.0 milestone Jan 11, 2023
