[Python] KeyError: '__index_level_0__' passing Table.from_pandas its own schema #23313
Comments
Wes McKinney / @wesm: |
Joris Van den Bossche / @jorisvandenbossche: Your "steps to reproduce" actually do work if you do not use an empty dataframe:
In [15]: import pandas as pd
...: import pyarrow as pa
...: df = pd.DataFrame({'a': [1, 2, 3]})
...: schema = pa.Table.from_pandas(df).schema
...: pa_table = pa.Table.from_pandas(df, schema=schema)
In [16]: schema
Out[16]:
a: int64
metadata
--------
{b'pandas': b'{"index_columns": [{"kind": "range", "name": null, "start": 0, "'
b'stop": 3, "step": 1}], "column_indexes": [{"name": null, "field_'
b'name": null, "pandas_type": "unicode", "numpy_type": "object", "'
b'metadata": {"encoding": "UTF-8"}}], "columns": [{"name": "a", "f'
b'ield_name": "a", "pandas_type": "int64", "numpy_type": "int64", '
b'"metadata": null}], "creator": {"library": "pyarrow", "version":'
b' "0.15.1.dev177+g5df424bd6"}, "pandas_version": "0.26.0.dev0+669'
b'.g3c29114b1"}'}
The empty dataframe is a tricky edge case regarding the index, because in such a case the index is not a RangeIndex but an empty object-dtype Index (see ARROW-5104 for a similar report about that aspect). That said, if passing an explicit schema, and if there is a column not found that has a "__index_level_i__" pattern, we should try to handle this (certainly in case of passing |
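To make the role of the index metadata concrete, here is a minimal sketch that decodes the `pandas` metadata shown in the Out[16] display above and reports how each index is recorded. The bytes are copied from that output; the helper name `describe_index_columns` is hypothetical, and the interpretation (a dict with "kind": "range" means the index lives only in metadata, a plain string means it is stored as a column of that name) is an assumption based on the pandas metadata layout visible here.

```python
import json

# Metadata bytes copied from the Out[16] display above.
pandas_meta = (
    b'{"index_columns": [{"kind": "range", "name": null, "start": 0, "'
    b'stop": 3, "step": 1}], "column_indexes": [{"name": null, "field_'
    b'name": null, "pandas_type": "unicode", "numpy_type": "object", "'
    b'metadata": {"encoding": "UTF-8"}}], "columns": [{"name": "a", "f'
    b'ield_name": "a", "pandas_type": "int64", "numpy_type": "int64", '
    b'"metadata": null}], "creator": {"library": "pyarrow", "version":'
    b' "0.15.1.dev177+g5df424bd6"}, "pandas_version": "0.26.0.dev0+669'
    b'.g3c29114b1"}'
)

meta = json.loads(pandas_meta)


def describe_index_columns(meta):
    """Hypothetical helper: summarize each entry of meta['index_columns']."""
    descriptions = []
    for entry in meta["index_columns"]:
        if isinstance(entry, dict) and entry.get("kind") == "range":
            # A RangeIndex is described in metadata only; no data column exists.
            descriptions.append(
                "range(start={start}, stop={stop}, step={step}), metadata only".format(**entry)
            )
        else:
            # Assumed: any other entry is the name of a serialized column.
            descriptions.append("stored as column {!r}".format(entry))
    return descriptions


print(describe_index_columns(meta))
print([col["name"] for col in meta["columns"]])
```

For the schema above, the only data column is `a`, so there is no `__index_level_0__` field for `from_pandas` to look up, which is why the non-empty example succeeds.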
Joris Van den Bossche / @jorisvandenbossche:
In [23]: import pandas as pd
...: import pyarrow as pa
...: df = pd.DataFrame({'a': [1, 2, 3]})
...: schema = pa.Table.from_pandas(df, preserve_index=True).schema
...: pa_table = pa.Table.from_pandas(df, schema=schema, preserve_index=True)
...
KeyError: "name '__index_level_0__' present in the specified schema is not found in the columns or index"
So if you specify Will look into fixing this (it's a pity that 0.15.1 is already released; it would have been nice to include this). |
Tom Goodman:
df2 = pd.read_hdf('test3.hdf', 'foo')
pa.Table.from_pandas(df2, schema=pa.Table.from_pandas(df2).schema)
I still get KeyError: '__index_level_0__' (without specifying preserve_index). This may be because the index on test3.hdf is an Int64Index, and the pyarrow docs say the default behavior is to "store the index as a column", except for range indexes. This unfortunately makes the bug more prevalent. |
Joris Van den Bossche / @jorisvandenbossche:
df2 = pd.DataFrame({'a': [1, 2, 3]}, index=[0, 1, 2])
pa.Table.from_pandas(df2, schema=pa.Table.from_pandas(df2).schema)
which indeed gives that error. In the end, it boils down to the same bug as my example above using a RangeIndex but with specifying |
Joris Van den Bossche / @jorisvandenbossche: But the question for you still is: is there a way to deal with this that is compatible across different releases?
What exactly do you mean by "write"? (To what file format? Or how is the schema stored?)
df3 = pd.DataFrame({'a': [1, 2, 3]}, index=pd.Int64Index([0, 1, 2], name='index'))
pa.Table.from_pandas(df3, schema=pa.Table.from_pandas(df3).schema)
This works on 0.11.0 and on 0.15.0. However, it fails on 0.13/0.14 (which is one of the reasons we tried to clean up and normalize the handling of the passed schema in 0.15). |
Tom Goodman: We store the partitions in parquet files, with directories defining partitions and _common_metadata file holding schema. This allows us to use the ParquetDataset partition level filters like [[('yyyymm', '=', 201909)]] ...
|
Tom Goodman:
try:
    table = pa.Table.from_pandas(df, schema=schema)
except KeyError as e:
    if '__index_level_0__' in str(e):  # Happens in pyarrow 0.15.0, not 0.11.0
        # Name the index after the schema field so from_pandas can find it.
        df.index.name = '__index_level_0__'
        table = pa.Table.from_pandas(df, schema=schema)
    else:
        raise
Thanks so much @jorisvandenbossche! |
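The workaround above handles a single unnamed index. As a rough sketch of how the same idea might extend to several index levels, the hypothetical helper below computes the names to assign to the index before calling `Table.from_pandas`. It assumes that an unnamed level at position i corresponds to the schema field `__index_level_i__`, as the error message in this thread suggests. Falling back to the generated name when an index name collides with a column name is also an assumption, not confirmed behavior.

```python
def index_field_names(index_names, column_names=()):
    """Hypothetical helper: names index levels would need to match schema fields.

    Unnamed levels (None) get the generated '__index_level_<i>__' name; named
    levels keep their name unless it collides with a data column (assumption).
    """
    taken = set(column_names)
    result = []
    for i, name in enumerate(index_names):
        if name is not None and name not in taken:
            result.append(str(name))
        else:
            result.append("__index_level_{}__".format(i))
    return result


# Intended use with a DataFrame `df` (not executed here):
#     df.index.names = index_field_names(df.index.names, df.columns)
#     table = pa.Table.from_pandas(df, schema=schema)
print(index_field_names([None]))
print(index_field_names([None, "date"], column_names=["a"]))
```

Renaming up front avoids relying on the text of the KeyError message, which may differ between pyarrow releases.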
Joris Van den Bossche / @jorisvandenbossche: |
Antoine Pitrou / @pitrou: |
Steps to reproduce:
Generate any DataFrame's pyarrow Schema using Table.from_pandas
Pass the generated schema as input into Table.from_pandas
Causes KeyError: '__index_level_0__'
We did not have this issue with pyarrow==0.11.0, which we used to write many partitions across years. Our goal now is to use pyarrow==0.15.0 and produce schemas going forward that are backwards compatible (i.e. also have '__index_level_0__'), so that we do not need to re-generate all prior years' partitions when we migrate to 0.15.0.
We cannot set preserve_index=False, since that effectively deletes '__index_level_0__', causing inconsistent schemas relative to the earlier partitions that were written using pyarrow==0.11.0.
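The compatibility concern above can be checked before writing new partitions. Here is a minimal sketch that compares two lists of schema field names (for example, the `names` of an old and a new schema) and reports generated index fields that would disappear. The helper names and the sample name lists are hypothetical; the regular expression assumes generated index fields always follow the `__index_level_i__` pattern mentioned in this thread.

```python
import re

# Assumed pattern for auto-generated index field names.
INDEX_LEVEL_PATTERN = re.compile(r"^__index_level_\d+__$")


def index_level_fields(field_names):
    """Return the field names that look like generated index columns."""
    return [name for name in field_names if INDEX_LEVEL_PATTERN.match(name)]


def dropped_index_fields(old_field_names, new_field_names):
    """Return generated index fields present in the old schema but not the new."""
    new_names = set(new_field_names)
    return [n for n in index_level_fields(old_field_names) if n not in new_names]


# Hypothetical field names: an older partition that kept its index,
# and a new table produced with preserve_index=False.
old_names = ["a", "__index_level_0__"]
new_names = ["a"]
print(dropped_index_fields(old_names, new_names))
```

An empty result would indicate that the new schema keeps every generated index field the older partitions have.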
Environment: pandas==0.23.4
pyarrow==0.15.0 # Issue also occurs with 0.14.0, 0.13.0 and 0.12.0, but not 0.11.0
Reporter: Tom Goodman
Assignee: Joris Van den Bossche / @jorisvandenbossche
Original Issue Attachments:
PRs and other links:
Note: This issue was originally created as ARROW-6999. Please see the migration documentation for further details.