Meta not updated correctly with MultiIndex while using Accessor functions in DataFrame #3038

Closed
jsnowacki opened this Issue Dec 29, 2017 · 4 comments

@jsnowacki
Contributor

jsnowacki commented Dec 29, 2017

Consider the following Dask DataFrame:

import pandas as pd
import dask.dataframe as dd

ddf = dd.from_pandas(pd.DataFrame(['<t1><t2>', '<t2><t3>'], columns=['tags']), npartitions=1)
ddf.head()
       tags
0  <t1><t2>
1  <t2><t3>

Calling str.extractall works fine:

t = ddf.tags.str.extractall('<([^>]*)>')\
    .rename(columns={0: 'tag'})
print(t.head())
        tag
  match    
0 0      t1
  1      t2
1 0      t2
  1      t3

But the meta is not updated; its index is still flat rather than a MultiIndex:

print(t._meta.index)
Index([], dtype='object')

Thus, operations like reset_index fail as follows:

t.reset_index().head()
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-100-e6186d78fb03> in <module>()
----> 1 t.reset_index().head()

C:\ProgramData\Anaconda3\lib\site-packages\dask\dataframe\core.py in head(self, n, npartitions, compute)
    844 
    845         if compute:
--> 846             result = result.compute()
    847         return result
    848 

C:\ProgramData\Anaconda3\lib\site-packages\dask\base.py in compute(self, **kwargs)
    133         dask.base.compute
    134         """
--> 135         (result,) = compute(self, traverse=False, **kwargs)
    136         return result
    137 

C:\ProgramData\Anaconda3\lib\site-packages\dask\base.py in compute(*args, **kwargs)
    331     postcomputes = [a.__dask_postcompute__() if is_dask_collection(a)
    332                     else (None, a) for a in args]
--> 333     results = get(dsk, keys, **kwargs)
    334     results_iter = iter(results)
    335     return tuple(a if f is None else f(next(results_iter), *a)

C:\ProgramData\Anaconda3\lib\site-packages\dask\multiprocessing.py in get(dsk, keys, num_workers, func_loads, func_dumps, optimize_graph, **kwargs)
    175                            get_id=_process_get_id, dumps=dumps, loads=loads,
    176                            pack_exception=pack_exception,
--> 177                            raise_exception=reraise, **kwargs)
    178     finally:
    179         if cleanup:

C:\ProgramData\Anaconda3\lib\site-packages\dask\local.py in get_async(apply_async, num_workers, dsk, result, cache, get_id, rerun_exceptions_locally, pack_exception, raise_exception, callbacks, dumps, loads, **kwargs)
    519                         _execute_task(task, data)  # Re-execute locally
    520                     else:
--> 521                         raise_exception(exc, tb)
    522                 res, worker_id = loads(res_info)
    523                 state['cache'][key] = res

C:\ProgramData\Anaconda3\lib\site-packages\dask\compatibility.py in reraise(exc, tb)
     57     def reraise(exc, tb=None):
     58         if exc.__traceback__ is not tb:
---> 59             raise exc.with_traceback(tb)
     60         raise exc
     61 

C:\ProgramData\Anaconda3\lib\site-packages\dask\local.py in execute_task()
    288     try:
    289         task, data = loads(task_info)
--> 290         result = _execute_task(task, data)
    291         id = get_id()
    292         result = dumps((result, id))

C:\ProgramData\Anaconda3\lib\site-packages\dask\local.py in _execute_task()
    268     elif istask(arg):
    269         func, args = arg[0], arg[1:]
--> 270         args2 = [_execute_task(a, cache) for a in args]
    271         return func(*args2)
    272     elif not ishashable(arg):

C:\ProgramData\Anaconda3\lib\site-packages\dask\local.py in <listcomp>()
    268     elif istask(arg):
    269         func, args = arg[0], arg[1:]
--> 270         args2 = [_execute_task(a, cache) for a in args]
    271         return func(*args2)
    272     elif not ishashable(arg):

C:\ProgramData\Anaconda3\lib\site-packages\dask\local.py in _execute_task()
    269         func, args = arg[0], arg[1:]
    270         args2 = [_execute_task(a, cache) for a in args]
--> 271         return func(*args2)
    272     elif not ishashable(arg):
    273         return arg

C:\ProgramData\Anaconda3\lib\site-packages\dask\dataframe\core.py in apply_and_enforce()
   3268             return meta
   3269         c = meta.columns if isinstance(df, pd.DataFrame) else meta.name
-> 3270         return _rename(c, df)
   3271     return df
   3272 

C:\ProgramData\Anaconda3\lib\site-packages\dask\dataframe\core.py in _rename()
   3305         # deep=False doesn't doesn't copy any data/indices, so this is cheap
   3306         df = df.copy(deep=False)
-> 3307         df.columns = columns
   3308         return df
   3309     elif isinstance(df, (pd.Series, pd.Index)):

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\generic.py in __setattr__()
   3625         try:
   3626             object.__getattribute__(self, name)
-> 3627             return object.__setattr__(self, name, value)
   3628         except AttributeError:
   3629             pass

pandas/_libs/properties.pyx in pandas._libs.properties.AxisProperty.__set__()

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\generic.py in _set_axis()
    557 
    558     def _set_axis(self, axis, labels):
--> 559         self._data.set_axis(axis, labels)
    560         self._clear_item_cache()
    561 

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\internals.py in set_axis()
   3067             raise ValueError('Length mismatch: Expected axis has %d elements, '
   3068                              'new values have %d elements' %
-> 3069                              (old_len, new_len))
   3070 
   3071         self.axes[axis] = new_labels

ValueError: Length mismatch: Expected axis has 3 elements, new values have 2 elements
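
The mismatch in the error message can be reproduced with pandas alone. The sketch below assumes that the failure comes from the step visible in the traceback, where Dask assigns the meta's column names to the computed partition. `flat_meta` is a stand-in built here for illustration, not the exact object Dask holds.

```python
import pandas as pd

# Stand-in for the meta reported above: one 'tag' column and a flat, empty index.
flat_meta = pd.DataFrame({'tag': pd.Series([], dtype=object)})

# What a partition actually contains after extractall: a two-level index
# (original row label, 'match').
real = (
    pd.Series(['<t1><t2>', '<t2><t3>'])
    .str.extractall('<([^>]*)>')
    .rename(columns={0: 'tag'})
)

# reset_index turns every index level into a column.
meta_columns = list(flat_meta.reset_index().columns)  # one index level  -> 2 columns
real_columns = list(real.reset_index().columns)       # two index levels -> 3 columns

print(meta_columns)
print(real_columns)
```

With a flat meta, reset_index predicts two columns, while the real partition produces three, which matches "Expected axis has 3 elements, new values have 2 elements".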

Following the suggestion in #2929, which seems to be a similar issue, reset_index works after updating the meta by hand:

t._meta = pd.DataFrame(columns=['tag'], dtype=str, index=pd.MultiIndex([[], []], [[], []], names=[None, 'match']))
print(t._meta.index)
MultiIndex(levels=[[], []],
           labels=[[], []],
           names=[None, 'match'])
print(t.reset_index().head())
   level_0  match tag
0        0      0  t1
1        0      1  t2
2        1      0  t2
3        1      1  t3

IMO this should be done within the accessor, as doing it manually is a bit cumbersome. It seems to affect only functions that change the index; for str, the only one I recall is extractall.
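
As a rough sketch of what building such a meta could look like, the helper below constructs an empty frame with the two-level index that extractall produces. `extractall_meta` and its `index_name` parameter are hypothetical names made up for this example; this is not Dask's or pandas' implementation, and it assumes the input has a single-level index.

```python
import re

import pandas as pd


def extractall_meta(pattern, index_name=None):
    """Hypothetical helper: empty frame shaped like a str.extractall result."""
    regex = re.compile(pattern)
    # Named groups keep their names; unnamed groups are numbered from 0,
    # mirroring the column labels pandas uses for extractall.
    names = {pos - 1: name for name, pos in regex.groupindex.items()}
    columns = [names.get(i, i) for i in range(regex.groups)]
    # Two empty levels: the original index plus the 'match' counter.
    index = pd.MultiIndex.from_arrays([[], []], names=[index_name, 'match'])
    return pd.DataFrame(columns=columns, index=index, dtype=object)


meta = extractall_meta('<([^>]*)>').rename(columns={0: 'tag'})
print(meta.index.names)
print(list(meta.reset_index().columns))
```

Assigning a frame like this to `t._meta` is equivalent in spirit to the manual workaround above, just derived from the pattern instead of written out by hand.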

Versions:

  • dask: 0.16.0
  • python: 3.6
@TomAugspurger

Member

TomAugspurger commented Jan 2, 2018

Thanks for the report.

I think this is a bug in pandas: .str.extractall does not return the correct index for an empty result (no matches or empty input). I opened pandas-dev/pandas#19034 to track that if you're interested in following along (and maybe submitting a fix?)
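
A quick way to check whether an installed pandas shows this behaviour is to compare the index of an empty and a non-empty result. This is only a diagnostic sketch; which value the empty case prints depends on the pandas version, so no output is claimed for it here.

```python
import pandas as pd

pattern = '<([^>]*)>'

# Non-empty input: the result is indexed by (original row label, 'match').
non_empty = pd.Series(['<t1><t2>', '<t2><t3>']).str.extractall(pattern)

# Empty input: the case described in this comment.
empty = pd.Series([], dtype=object).str.extractall(pattern)

non_empty_levels = non_empty.index.nlevels
empty_levels = empty.index.nlevels

# If the two numbers differ, the empty result lost the 'match' level.
print(non_empty_levels, empty_levels)
```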

@jsnowacki

Contributor

jsnowacki commented Jan 3, 2018

Thanks. I'll look into it and see if I can find and/or fix anything.

@jsnowacki

Contributor

jsnowacki commented Jan 5, 2018

@TomAugspurger I've done some manual tests with the PR pandas-dev/pandas#19075 and it seems to fix this issue. Do you think we need extra tests for that or just flag this as fixed with the upstream PR?

@jcrist

Member

jcrist commented Feb 6, 2018

> Do you think we need extra tests for that or just flag this as fixed with the upstream PR?

I added a patch supporting this fix for older versions of pandas in #3143.

@jcrist jcrist closed this in #3143 Feb 6, 2018
