Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Generalize hash_pandas_object in hash_shard for split_out argument to work in non-pandas backends #5184

Merged
merged 13 commits into from Aug 6, 2019
Merged
9 changes: 6 additions & 3 deletions dask/dataframe/core.py
Expand Up @@ -10,7 +10,7 @@

import numpy as np
import pandas as pd
from pandas.util import cache_readonly, hash_pandas_object
from pandas.util import cache_readonly
from pandas.api.types import (
is_bool_dtype,
is_timedelta64_dtype,
Expand Down Expand Up @@ -4461,9 +4461,12 @@ def hash_shard(df, nparts, split_out_setup=None, split_out_setup_kwargs=None):
h = split_out_setup(df, **(split_out_setup_kwargs or {}))
else:
h = df
h = hash_pandas_object(h, index=False)
import sys

package = sys.modules[df.__class__.__module__.split(".")[0]]
h = package.util.hash_pandas_object(h, index=False)
if is_series_like(h):
h = h._values
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the type on h here? Dask Series or pandas/ cudf?

In pandas Series._values is not the same as Series.values. It differs for a few types (interval, datetimetz)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

h here is a pandas/cudf. In that case do you want any special case handling here or want cudf to support ._values?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@TomAugspurger I think h is always going to be of type int. As hash of any object is expected to be a series of ints, no?
info: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.util.hash_pandas_object.html

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's correct. So for that dtype, .values will be equivalent to ._values.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome. So, the changes related to this PR are all actually in and hence ready for review. This would help unblock : rapidsai/cudf#2396

h = h.values
Copy link
Member

@kkraus14 kkraus14 Aug 1, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will convert a cuDF GPU object to a numpy array here as of now, is there a reason we need h to be a numpy array-like as opposed to leaving it as a Series?

I believe the following calls of the modulo and iloc should all work properly against a Pandas Series, no?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both work in cudf. In pandas modulo works but iloc doesn't yet seem to work in latest pandas(0.25) too:

>>> {i: df.iloc[x == i] for i in range(4)}
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 1, in <dictcomp>
  File "/home/nfs/pgali/anaconda3/lib/python3.7/site-packages/pandas/core/indexing.py", line 1410, in __getitem__
    return self._getitem_axis(maybe_callable, axis=axis)
  File "/home/nfs/pgali/anaconda3/lib/python3.7/site-packages/pandas/core/indexing.py", line 2118, in _getitem_axis
    self._validate_key(key, axis)
  File "/home/nfs/pgali/anaconda3/lib/python3.7/site-packages/pandas/core/indexing.py", line 1976, in _validate_key
    "iLocation based boolean "
NotImplementedError: iLocation based boolean indexing on an integer type is not available

h %= nparts
return {i: df.iloc[h == i] for i in range(nparts)}

Expand Down