[Python] KeyError: '__index_level_0__' passing Table.from_pandas its own schema #23313

Closed
asfimport opened this issue Oct 26, 2019 · 10 comments

Comments

@asfimport

Steps to reproduce:

  1. Generate any DataFrame's pyarrow Schema using Table.from_pandas

  2. Pass the generated schema as input into Table.from_pandas

  3. The second call raises KeyError: '__index_level_0__'

    We did not have this issue with pyarrow==0.11.0 which we used to write many partitions across years. Our goal now is to use pyarrow==0.15.0 and produce schemas going forward that are backwards compatible (i.e. also have '__index_level_0__'), so we should not need to re-generate all prior years' partitions when we migrate to 0.15.0.

    We cannot set preserve_index=False, since that effectively deletes '__index_level_0__', making the schema inconsistent with earlier partitions that were written using pyarrow==0.11.0.

     

    import pandas as pd
    import pyarrow as pa
    df = pd.DataFrame() 
    schema = pa.Table.from_pandas(df).schema
    pa_table = pa.Table.from_pandas(df, schema=schema)
    
    Traceback (most recent call last):
      File "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 3078, in get_loc
        return self._engine.get_loc(key)
      File "pandas/_libs/index.pyx", line 140, in pandas._libs.index.IndexEngine.get_loc
      File "pandas/_libs/index.pyx", line 162, in pandas._libs.index.IndexEngine.get_loc
      File "pandas/_libs/hashtable_class_helper.pxi", line 1492, in pandas._libs.hashtable.PyObjectHashTable.get_item
      File "pandas/_libs/hashtable_class_helper.pxi", line 1500, in pandas._libs.hashtable.PyObjectHashTable.get_item
    KeyError: '__index_level_0__'
    During handling of the above exception, another exception occurred:
    Traceback (most recent call last):
      File "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 408, in _get_columns_to_convert_given_schema
        col = df[name]
      File "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/frame.py", line 2688, in __getitem__
        return self._getitem_column(key)
      File "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/frame.py", line 2695, in _getitem_column
        return self._get_item_cache(key)
      File "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/generic.py", line 2489, in _get_item_cache
        values = self._data.get(item)
      File "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/internals.py", line 4115, in get
        loc = self.items.get_loc(item)
      File "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 3080, in get_loc
        return self._engine.get_loc(self._maybe_cast_indexer(key))
      File "pandas/_libs/index.pyx", line 140, in pandas._libs.index.IndexEngine.get_loc
      File "pandas/_libs/index.pyx", line 162, in pandas._libs.index.IndexEngine.get_loc
      File "pandas/_libs/hashtable_class_helper.pxi", line 1492, in pandas._libs.hashtable.PyObjectHashTable.get_item
      File "pandas/_libs/hashtable_class_helper.pxi", line 1500, in pandas._libs.hashtable.PyObjectHashTable.get_item
    KeyError: '__index_level_0__'
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 3326, in run_code
        exec(code_obj, self.user_global_ns, self.user_ns)
      File "<ipython-input-36-6711a2fcec96>", line 5, in <module>
        pa_table = pa.Table.from_pandas(df, schema=pa.Table.from_pandas(df).schema)
      File "pyarrow/table.pxi", line 1057, in pyarrow.lib.Table.from_pandas
      File "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 517, in dataframe_to_arrays
        columns)
      File "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 337, in _get_columns_to_convert
        return _get_columns_to_convert_given_schema(df, schema, preserve_index)
      File "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 426, in _get_columns_to_convert_given_schema
        "in the columns or index".format(name))
    KeyError: "name '__index_level_0__' present in the specified schema is not found in the columns or index"
    

Environment: pandas==0.23.4
pyarrow==0.15.0 # Issue also occurs with 0.14.0, 0.13.0 & 0.12.0, but not 0.11.0

Reporter: Tom Goodman
Assignee: Joris Van den Bossche / @jorisvandenbossche

Original Issue Attachments:

PRs and other links:

Note: This issue was originally created as ARROW-6999. Please see the migration documentation for further details.

@asfimport

Wes McKinney / @wesm:
I'll let @jorisvandenbossche take a look

@asfimport

Joris Van den Bossche / @jorisvandenbossche:
[~goodiegoodman] thanks for the report!

Your "steps to reproduce" actually do work if you do not use an empty dataframe:

In [15]: import pandas as pd 
    ...: import pyarrow as pa 
    ...: df = pd.DataFrame({'a': [1, 2, 3]})  
    ...: schema = pa.Table.from_pandas(df).schema 
    ...: pa_table = pa.Table.from_pandas(df, schema=schema)

In [16]: schema
Out[16]: 
a: int64
metadata
--------
{b'pandas': b'{"index_columns": [{"kind": "range", "name": null, "start": 0, "'
            b'stop": 3, "step": 1}], "column_indexes": [{"name": null, "field_'
            b'name": null, "pandas_type": "unicode", "numpy_type": "object", "'
            b'metadata": {"encoding": "UTF-8"}}], "columns": [{"name": "a", "f'
            b'ield_name": "a", "pandas_type": "int64", "numpy_type": "int64", '
            b'"metadata": null}], "creator": {"library": "pyarrow", "version":'
            b' "0.15.1.dev177+g5df424bd6"}, "pandas_version": "0.26.0.dev0+669'
            b'.g3c29114b1"}'}
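
The `pandas` metadata value shown above is JSON, so it can be decoded to check how the index was recorded. A minimal sketch using only the standard library and the bytes copied from that output; with a real schema the same bytes should be available as `schema.metadata[b'pandas']`. The helper name `describe_index_columns` is made up for illustration.

```python
import json

# Bytes copied from the `schema` output above (the b'pandas' metadata value).
pandas_metadata = (
    b'{"index_columns": [{"kind": "range", "name": null, "start": 0, "'
    b'stop": 3, "step": 1}], "column_indexes": [{"name": null, "field_'
    b'name": null, "pandas_type": "unicode", "numpy_type": "object", "'
    b'metadata": {"encoding": "UTF-8"}}], "columns": [{"name": "a", "f'
    b'ield_name": "a", "pandas_type": "int64", "numpy_type": "int64", '
    b'"metadata": null}], "creator": {"library": "pyarrow", "version":'
    b' "0.15.1.dev177+g5df424bd6"}, "pandas_version": "0.26.0.dev0+669'
    b'.g3c29114b1"}'
)

meta = json.loads(pandas_metadata.decode("utf-8"))


def describe_index_columns(meta):
    """Describe each "index_columns" entry (hypothetical helper)."""
    described = []
    for entry in meta["index_columns"]:
        if isinstance(entry, dict) and entry.get("kind") == "range":
            # A RangeIndex is kept as a descriptor; no column is written.
            described.append("range index stored as metadata only")
        else:
            # Otherwise the entry is the name of the column holding the index.
            described.append("index stored as column {!r}".format(entry))
    return described


index_info = describe_index_columns(meta)
print(index_info)
```

Here the only entry is a range descriptor, which is why the schema has no `__index_level_0__` field and the round trip succeeds.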

The empty DataFrame is a tricky edge case regarding the index, because in that case the index is not a RangeIndex but an empty object-dtype Index (see ARROW-5104 for a similar report about that aspect).

That said, if an explicit schema is passed and a column that is not found matches the "__index_level_i__" pattern, we should try to handle this (certainly when preserve_index=True is passed).
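
To make the idea concrete, here is a rough sketch of how schema names could be sorted into real columns, generated index-level names, and genuinely missing names. This is not pyarrow's internal code; `INDEX_LEVEL_PATTERN` and `split_schema_names` are hypothetical names, and the pattern is only inferred from the `__index_level_0__` name in the traceback.

```python
import re

# Names generated for unnamed index levels look like "__index_level_0__".
INDEX_LEVEL_PATTERN = re.compile(r"^__index_level_(\d+)__$")


def split_schema_names(schema_names, df_columns):
    """Split schema names into (columns, index_levels, missing).

    columns      -- names found among the DataFrame columns
    index_levels -- integer levels parsed from generated index names
    missing      -- names that are neither a column nor a generated index name
    """
    columns, index_levels, missing = [], [], []
    known = set(df_columns)
    for name in schema_names:
        match = INDEX_LEVEL_PATTERN.match(name)
        if name in known:
            columns.append(name)
        elif match:
            index_levels.append(int(match.group(1)))
        else:
            missing.append(name)
    return columns, index_levels, missing


result = split_schema_names(["a", "__index_level_0__", "b"], ["a"])
print(result)
```

Only the names in the last group would then need to raise a KeyError; the generated names could be looked up in the index by level instead.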

@asfimport

Joris Van den Bossche / @jorisvandenbossche:
So this case is clearly a bug in the new implementation, I would say:

In [23]: import pandas as pd 
    ...: import pyarrow as pa 
    ...: df = pd.DataFrame({'a': [1, 2, 3]})  
    ...: schema = pa.Table.from_pandas(df, preserve_index=True).schema 
    ...: pa_table = pa.Table.from_pandas(df, schema=schema, preserve_index=True)
...
KeyError: "name '__index_level_0__' present in the specified schema is not found in the columns or index"

So if you specify preserve_index=True and the schema contains an index that had no name in the DataFrame (so it ends up as the generated __index_level_i__), the above should work when passing an explicit schema that matches.

Will look into fixing this (it's a pity that 0.15.1 is already released, it would have been nice to include this).

@asfimport

Tom Goodman:
@jorisvandenbossche  please try this with the attached test3.hdf (not empty)

df2 = pd.read_hdf('test3.hdf','foo')
pa.Table.from_pandas(df2, schema=pa.Table.from_pandas(df2).schema)

I still get KeyError: '__index_level_0__' (without specifying preserve_index).

This may be because the index on test3.hdf is an Int64Index, and I see the pyarrow docs say the default behavior is to "store the index as a column", except for range indexes. This unfortunately makes the bug more prevalent.

@asfimport

Joris Van den Bossche / @jorisvandenbossche:
Thanks for the reproducer! It's indeed due to the non-range index. Doing this in terms of the simpler example, I think the following is equivalent to your example:

df2 = pd.DataFrame({'a': [1, 2, 3]}, index=[0, 1, 2])
pa.Table.from_pandas(df2, schema=pa.Table.from_pandas(df2).schema)

which indeed gives that error. In the end, it boils down to the same bug as my example above that uses a RangeIndex but specifies preserve_index=True (as that forces the index to become a column, just as a non-RangeIndex does).

@asfimport

Joris Van den Bossche / @jorisvandenbossche:
[~goodiegoodman] so I did a PR to fix this: #5750. (I tried to add a lot of test cases when refactoring this for 0.15; it's a pity I overlooked this one, which is quite obvious in hindsight.)

But the question for you still is: is there a way to deal with this that is compatible across different releases?
I am not fully understanding your explanation from above:

We did not have this issue with pyarrow==0.11.0 which we used to write many partitions across years. Our goal now is to use pyarrow==0.15.0 and produce schemas going forward that are backwards compatible (i.e. also have '__index_level_0__'), so we should not need to re-generate all prior years' partitions when we migrate to 0.15.0.

What exactly do you mean by "write"? (To what file format? Or how is the schema stored?)
One option I can think of (but I'm not sure it fits your use case) is to make sure that your index has a name. Adapting the above example:

df3 = pd.DataFrame({'a': [1, 2, 3]}, index=pd.Int64Index([0, 1, 2], name='index'))
pa.Table.from_pandas(df3, schema=pa.Table.from_pandas(df3).schema)

This works on 0.11.0 and on 0.15.0. However, this then fails on 0.13/0.14 (which is one of the reasons we tried to clean up and normalize this handling of the passed schema in 0.15).

@asfimport

Tom Goodman:
@jorisvandenbossche thank you for the quick turn-around!

We store the partitions in parquet files, with directories defining the partitions and a _common_metadata file holding the schema. This allows us to use the ParquetDataset partition-level filters like [[('yyyymm', '=', 201909)]] ...


tree
.
|-- _common_metadata
|-- yyyymm=201909
|   `-- e097411586b0460e860c331b63fecb2b.parquet
`-- yyyymm=201910
    `-- b8de9aa413194cc4af6f4802b5c4923f.parquet
.
.
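
For readers unfamiliar with the filter syntax: as far as I know, the nested list is read as disjunctive normal form, with the tuples inside an inner list ANDed together and the inner lists ORed. Below is a pure-Python sketch of that semantics applied to the directory layout above. It is an illustration only, not how pyarrow implements filtering; `parse_partition_dir` and `matches` are made-up helper names, and converting digit-only values to int is an assumption.

```python
import operator

# Comparison operators used in filter tuples such as ('yyyymm', '=', 201909).
OPS = {
    "=": operator.eq, "==": operator.eq, "!=": operator.ne,
    "<": operator.lt, ">": operator.gt,
    "<=": operator.le, ">=": operator.ge,
}


def parse_partition_dir(path):
    """Extract {key: value} from path segments shaped like 'key=value'."""
    keys = {}
    for segment in path.split("/"):
        if "=" in segment:
            key, _, value = segment.partition("=")
            # Assumption: digit-only values are treated as integers.
            keys[key] = int(value) if value.isdigit() else value
    return keys


def matches(partition_keys, filters):
    """True if any inner list has all of its (column, op, value) tuples satisfied."""
    return any(
        all(OPS[op](partition_keys[col], val) for col, op, val in conjunction)
        for conjunction in filters
    )


paths = [
    "yyyymm=201909/e097411586b0460e860c331b63fecb2b.parquet",
    "yyyymm=201910/b8de9aa413194cc4af6f4802b5c4923f.parquet",
]
filters = [[("yyyymm", "=", 201909)]]
selected = [p for p in paths if matches(parse_partition_dir(p), filters)]
print(selected)
```

Because the filter only looks at directory names, partitions that do not match can be skipped without opening their files.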

@asfimport

Tom Goodman:
Thanks to your suggestions, we now have a work-around that allows us to remain backwards-compatible!
If we get a KeyError due to a missing '__index_level_0__', we set df.index.name = '__index_level_0__' and call from_pandas again.

    try:
        table = pa.Table.from_pandas(df, schema=schema)
    except KeyError as e:
        if '__index_level_0__' in str(e):  # Happens in pyarrow 0.15.0, not 0.11.0
            df.index.name = '__index_level_0__'
            table = pa.Table.from_pandas(df, schema=schema)
        else:
            raise  # re-raise unrelated KeyErrors with the original traceback

Thanks so much @jorisvandenbossche!
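
A possible variant of the workaround above that leaves the caller's DataFrame untouched, sketched under a few assumptions: `from_pandas_with_index_fallback` is a made-up name, and the `convert` parameter is a hypothetical hook (defaulting to `pa.Table.from_pandas`) added only so the retry logic can be exercised with a stub. It relies on `DataFrame.rename_axis` returning a new object when given a scalar name.

```python
def from_pandas_with_index_fallback(df, schema, convert=None):
    """Convert df using an explicit schema, retrying once with a named index.

    `convert` is a hypothetical hook; by default it is pa.Table.from_pandas.
    """
    if convert is None:
        import pyarrow as pa  # imported lazily so a stub can be used in tests
        convert = pa.Table.from_pandas
    try:
        return convert(df, schema=schema)
    except KeyError as exc:
        if "__index_level_0__" not in str(exc):
            raise  # unrelated KeyError: keep the original traceback
        # rename_axis returns a new DataFrame, so the caller's df keeps
        # its unnamed index; only the copy passed to convert is renamed.
        renamed = df.rename_axis("__index_level_0__")
        return convert(renamed, schema=schema)
```

Usage would be `table = from_pandas_with_index_fallback(df, schema)`. Like the original workaround, this only covers a single unnamed index level.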

@asfimport

Joris Van den Bossche / @jorisvandenbossche:
That sounds like a decent enough workaround for now. Happy you found a way to deal with it!

@asfimport

Antoine Pitrou / @pitrou:
Issue resolved by pull request 5750
#5750
