Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Generalize `hash_pandas_object` in `hash_shard` for `split_out` argument to work in non-pandas backends #5184

Merged
merged 13 commits into from Aug 6, 2019

Conversation

@galipremsagar
Copy link
Contributor

commented Jul 30, 2019

This fixes: #5183

  • Tests added / passed
  • Passes black dask / flake8 dask
@mrocklin

This comment has been minimized.

Copy link
Member

commented Jul 30, 2019

@TomAugspurger , which do you prefer, using the sys.packages approach or making a dispatched function?

@TomAugspurger

This comment has been minimized.

Copy link
Member

commented Jul 31, 2019

Dispatched function.

if is_series_like(h):
h = h._values

This comment has been minimized.

Copy link
@TomAugspurger

TomAugspurger Jul 31, 2019

Member

What's the type on h here? Dask Series or pandas/ cudf?

In pandas Series._values is not the same as Series.values. It differs for a few types (interval, datetimetz)

This comment has been minimized.

Copy link
@galipremsagar

galipremsagar Jul 31, 2019

Author Contributor

h here is a pandas/cudf. In that case do you want any special case handling here or want cudf to support ._values?

This comment has been minimized.

Copy link
@galipremsagar

galipremsagar Aug 2, 2019

Author Contributor

@TomAugspurger I think h is always going to be of type int. As hash of any object is expected to be a series of ints, no?
info: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.util.hash_pandas_object.html

This comment has been minimized.

Copy link
@TomAugspurger

TomAugspurger Aug 3, 2019

Member

That's correct. So for that dtype, .values will be equivalent to ._values.

This comment has been minimized.

Copy link
@galipremsagar

galipremsagar Aug 3, 2019

Author Contributor

Awesome. So, the changes related to this PR are all actually in and hence ready for review. This would help unblock : rapidsai/cudf#2396

@galipremsagar

This comment has been minimized.

Copy link
Contributor Author

commented Jul 31, 2019

Dispatched function.

@TomAugspurger @mrocklin Replaced modules approach with a dispatched method.

if is_series_like(h):
h = h._values
h = h.values

This comment has been minimized.

Copy link
@kkraus14

kkraus14 Aug 1, 2019

This will convert a cuDF GPU object to a numpy array here as of now, is there a reason we need h to be a numpy array-like as opposed to leaving it as a Series?

I believe the following calls of the modulo and iloc should all work properly against a Pandas Series, no?

This comment has been minimized.

Copy link
@galipremsagar

galipremsagar Aug 1, 2019

Author Contributor

Both work in cudf. In pandas modulo works but iloc doesn't yet seem to work in latest pandas(0.25) too:

>>> {i: df.iloc[x == i] for i in range(4)}
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 1, in <dictcomp>
  File "/home/nfs/pgali/anaconda3/lib/python3.7/site-packages/pandas/core/indexing.py", line 1410, in __getitem__
    return self._getitem_axis(maybe_callable, axis=axis)
  File "/home/nfs/pgali/anaconda3/lib/python3.7/site-packages/pandas/core/indexing.py", line 2118, in _getitem_axis
    self._validate_key(key, axis)
  File "/home/nfs/pgali/anaconda3/lib/python3.7/site-packages/pandas/core/indexing.py", line 1976, in _validate_key
    "iLocation based boolean "
NotImplementedError: iLocation based boolean indexing on an integer type is not available
@TomAugspurger
Copy link
Member

left a comment

A couple small comments.

Can you add basic tests for that our dispatch is working correctly for pandas objects.

@pytest.mark.parametrize('obj', [pd.Index(...), pd.Series(...), pd.DataFrame(...)])
def test_hash_object_dispatch(obj):
    result = dd.util.hash_object_dispatch(obj)
    expected = pd.util.hash_pandas_object(obj)
    tm.assert_equal(result, expected)
dask/dataframe/core.py Outdated Show resolved Hide resolved
dask/dataframe/core.py Outdated Show resolved Hide resolved
@galipremsagar

This comment has been minimized.

Copy link
Contributor Author

commented Aug 6, 2019

A couple small comments.

Can you add basic tests for that our dispatch is working correctly for pandas objects.

@pytest.mark.parametrize('obj', [pd.Index(...), pd.Series(...), pd.DataFrame(...)])
def test_hash_object_dispatch(obj):
    result = dd.util.hash_object_dispatch(obj)
    expected = pd.util.hash_pandas_object(obj)
    tm.assert_equal(result, expected)

@TomAugspurger, Added tests accordingly. This is ready for review.

@TomAugspurger

This comment has been minimized.

Copy link
Member

commented Aug 6, 2019

AppVeyor Numba failures should be fixed on master. I'll keep my eye on it.

@TomAugspurger TomAugspurger merged commit db73005 into dask:master Aug 6, 2019

1 of 2 checks passed

continuous-integration/appveyor/pr AppVeyor build failed
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@TomAugspurger

This comment has been minimized.

Copy link
Member

commented Aug 6, 2019

Thanks @galipremsagar!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants
You can’t perform that action at this time.