pandas.read_parquet: KeyError on reading dataframe with RangeIndex #111

Closed
vfilimonov opened this issue Jan 17, 2020 · 3 comments
Labels: bug, enhancement

vfilimonov commented Jan 17, 2020

Hello @igorborgest, thanks a lot for developing this package!

When a DataFrame has a RangeIndex, it can be written to Parquet, but reading it back raises a KeyError:

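import numpy as np
import pandas as pd
import awswrangler as wr

# reset_index() moves the dates into a column and leaves a default RangeIndex
d = pd.date_range('1990-01-01', freq='D', periods=10000)
vals = np.random.randn(len(d), 4)
x = pd.DataFrame(vals, index=d, columns=['A', 'B', 'C', 'D']).reset_index()

# PATH is the target path (its value is not given in this report)
wr.pandas.to_parquet(dataframe=x, path=PATH)
wr.pandas.read_parquet(path=PATH)  # raises KeyError: '__index_level_0__'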

Raises:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
~/miniconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2896             try:
-> 2897                 return self._engine.get_loc(key)
   2898             except KeyError:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: '__index_level_0__'

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
<ipython-input-67-9e66c4b6764b> in <module>
----> 1 wr.pandas.read_parquet(path=PATH)

~/miniconda3/lib/python3.7/site-packages/awswrangler/pandas.py in read_parquet(self, path, columns, filters, procs_cpu_bound, wait_objects, wait_objects_timeout)
   1373                                              procs_cpu_bound=procs_cpu_bound,
   1374                                              wait_objects=wait_objects,
-> 1375                                              wait_objects_timeout=wait_objects_timeout)
   1376         else:
   1377             procs = []

~/miniconda3/lib/python3.7/site-packages/awswrangler/pandas.py in _read_parquet_paths(session_primitives, path, columns, filters, procs_cpu_bound, wait_objects, wait_objects_timeout)
   1460                 procs_cpu_bound=procs_cpu_bound,
   1461                 wait_objects=wait_objects,
-> 1462                 wait_objects_timeout=wait_objects_timeout)
   1463             return [df]
   1464         else:

~/miniconda3/lib/python3.7/site-packages/awswrangler/pandas.py in _read_parquet_path(session_primitives, path, columns, filters, procs_cpu_bound, wait_objects, wait_objects_timeout)
   1524         df = table.to_pandas(use_threads=use_threads, integer_object_nulls=True)
   1525         for c in integers:
-> 1526             if not str(df[c].dtype).startswith("int"):
   1527                 df[c] = df[c].astype("Int64")
   1528         logger.debug(f"Done: {path}")

~/miniconda3/lib/python3.7/site-packages/pandas/core/frame.py in __getitem__(self, key)
   2993             if self.columns.nlevels > 1:
   2994                 return self._getitem_multilevel(key)
-> 2995             indexer = self.columns.get_loc(key)
   2996             if is_integer(indexer):
   2997                 indexer = [indexer]

~/miniconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2897                 return self._engine.get_loc(key)
   2898             except KeyError:
-> 2899                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   2900         indexer = self.get_indexer([key], method=method, tolerance=tolerance)
   2901         if indexer.ndim > 1 or indexer.size > 1:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: '__index_level_0__'

When the index is not a RangeIndex (e.g. the DatetimeIndex you get when x is created as x = pd.DataFrame(vals, index=d, columns=['A','B','C','D']), without reset_index()), there is no such issue.

Versions:
aws-data-wrangler: 0.2.5
pandas: 0.25.3
pyarrow: 0.15.1

igorborgest self-assigned this Jan 21, 2020
igorborgest added the bug, enhancement, and WIP labels Jan 21, 2020

igorborgest (Contributor) commented

Thanks @vfilimonov, another great contribution.

This is already fixed by the PR above. It will be released in the new version this weekend.
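The PR itself is not reproduced in this thread, so the following is not the actual fix. As a rough sketch only, a guard of this kind around the dtype-casting loop shown in the traceback would avoid the KeyError by skipping names that are not columns of the DataFrame. The helper name cast_integer_columns and the sample data are hypothetical.

```python
import pandas as pd


def cast_integer_columns(df, integers):
    """Cast the listed columns to nullable Int64, skipping names that are not columns."""
    for c in integers:
        if c not in df.columns:
            # e.g. '__index_level_0__' when it was restored as the index
            continue
        if not str(df[c].dtype).startswith("int"):
            df[c] = df[c].astype("Int64")
    return df


# Column A holds integers with a null, so pandas stores it as float64.
df = pd.DataFrame({"A": [1.0, 2.0, None]})
out = cast_integer_columns(df, ["A", "__index_level_0__"])
print(out.dtypes)
```

Skipping unknown names keeps the read path working, at the cost of silently ignoring a mismatch between the stored schema and the resulting columns; logging the skipped names may be preferable in practice.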

igorborgest removed the WIP label Jan 22, 2020

igorborgest (Contributor) commented

P.S. A test case has also been added to our test bench!

vfilimonov (Author) commented

Thanks a lot, Igor! 👍
