pandas.read_parquet: KeyError on reading dataframe with RangeIndex #111

vfilimonov · 2020-01-17T20:34:10Z

Hello @igorborgest, thanks a lot for developing this package!

When a DataFrame has a RangeIndex, it can be written to Parquet, but reading it back raises a KeyError:

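import numpy as np
import pandas as pd
import awswrangler as wr

# PATH is the target Parquet path (its value is not given in this report)
d = pd.date_range('1990-01-01', freq='D', periods=10000)
vals = np.random.randn(len(d), 4)
# reset_index() moves the dates into a column and leaves a default RangeIndex
x = pd.DataFrame(vals, index=d, columns=['A','B','C','D']).reset_index()

wr.pandas.to_parquet(dataframe=x, path=PATH)
wr.pandas.read_parquet(path=PATH)  # raises KeyError: '__index_level_0__'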

Raises:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
~/miniconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2896             try:
-> 2897                 return self._engine.get_loc(key)
   2898             except KeyError:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: '__index_level_0__'

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
<ipython-input-67-9e66c4b6764b> in <module>
----> 1 wr.pandas.read_parquet(path=PATH)

~/miniconda3/lib/python3.7/site-packages/awswrangler/pandas.py in read_parquet(self, path, columns, filters, procs_cpu_bound, wait_objects, wait_objects_timeout)
   1373                                              procs_cpu_bound=procs_cpu_bound,
   1374                                              wait_objects=wait_objects,
-> 1375                                              wait_objects_timeout=wait_objects_timeout)
   1376         else:
   1377             procs = []

~/miniconda3/lib/python3.7/site-packages/awswrangler/pandas.py in _read_parquet_paths(session_primitives, path, columns, filters, procs_cpu_bound, wait_objects, wait_objects_timeout)
   1460                 procs_cpu_bound=procs_cpu_bound,
   1461                 wait_objects=wait_objects,
-> 1462                 wait_objects_timeout=wait_objects_timeout)
   1463             return [df]
   1464         else:

~/miniconda3/lib/python3.7/site-packages/awswrangler/pandas.py in _read_parquet_path(session_primitives, path, columns, filters, procs_cpu_bound, wait_objects, wait_objects_timeout)
   1524         df = table.to_pandas(use_threads=use_threads, integer_object_nulls=True)
   1525         for c in integers:
-> 1526             if not str(df[c].dtype).startswith("int"):
   1527                 df[c] = df[c].astype("Int64")
   1528         logger.debug(f"Done: {path}")

~/miniconda3/lib/python3.7/site-packages/pandas/core/frame.py in __getitem__(self, key)
   2993             if self.columns.nlevels > 1:
   2994                 return self._getitem_multilevel(key)
-> 2995             indexer = self.columns.get_loc(key)
   2996             if is_integer(indexer):
   2997                 indexer = [indexer]

~/miniconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2897                 return self._engine.get_loc(key)
   2898             except KeyError:
-> 2899                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   2900         indexer = self.get_indexer([key], method=method, tolerance=tolerance)
   2901         if indexer.ndim > 1 or indexer.size > 1:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: '__index_level_0__'

When the DataFrame keeps its DatetimeIndex (e.g. when x is created as x = pd.DataFrame(vals, index=d, columns=['A','B','C','D']), without reset_index()), there's no such issue.

Versions:
aws-data-wrangler: 0.2.5
pandas: 0.25.3
pyarrow: 0.15.1


igorborgest · 2020-01-22T15:10:20Z

Thanks @vfilimonov, another great contribution.

Already fixed by the PR referenced in this issue (#116). It will be released in the new version this weekend.

igorborgest · 2020-01-22T15:11:02Z

P.S. A test case has also been added to our test suite!

vfilimonov · 2020-01-22T15:14:30Z

Thanks a lot, Igor! 👍

igorborgest self-assigned this Jan 21, 2020

igorborgest added the bug, enhancement, and WIP labels Jan 21, 2020

igorborgest mentioned this issue Jan 22, 2020

Removing regular indexes from the compulsory Int64 cast #116

Merged

igorborgest closed this as completed Jan 22, 2020

igorborgest removed the WIP label Jan 22, 2020
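For readers who hit this on version 0.2.5, here is a hedged sketch of the kind of guard that the title of PR #116 ("Removing regular indexes from the compulsory Int64 cast") suggests. It is not the actual change from that PR, which is not shown in this thread. The helper name `cast_integer_columns` is hypothetical; the loop body mirrors the lines shown in the traceback, with one added check that skips names that are not columns of the frame:

```python
import pandas as pd


def cast_integer_columns(df, integers):
    """Cast the listed columns to nullable Int64, skipping non-column names.

    Hypothetical helper for illustration only; not part of awswrangler.
    """
    df = df.copy()
    for name in integers:
        # Index levels such as "__index_level_0__" are not columns: skip them
        # instead of letting df[name] raise KeyError.
        if name not in df.columns:
            continue
        # Same condition as in the traceback: only cast non-int dtypes.
        if not str(df[name].dtype).startswith("int"):
            df[name] = df[name].astype("Int64")
    return df


# Column "A" holds integers with a missing value, so it arrives as float.
example = pd.DataFrame({"A": [1.0, None, 3.0], "B": [1, 2, 3]})
result = cast_integer_columns(example, ["A", "B", "__index_level_0__"])
print(result.dtypes)
```

Copying the frame keeps the caller's data untouched; the original loop in the traceback mutates `df` in place, which would work equally well with the same membership check.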
