New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Generalize hash_pandas_object
in hash_shard
for split_out
argument to work in non-pandas backends
#5184
Generalize hash_pandas_object
in hash_shard
for split_out
argument to work in non-pandas backends
#5184
Changes from 9 commits
631b636
7e69979
77f6845
339813a
7ced100
8e6c383
3cf0952
5dcd48c
66a5698
85f53c6
36c8bad
bfd1992
29a54d6
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -10,7 +10,7 @@ | |
|
||
import numpy as np | ||
import pandas as pd | ||
from pandas.util import cache_readonly, hash_pandas_object | ||
from pandas.util import cache_readonly | ||
from pandas.api.types import ( | ||
is_bool_dtype, | ||
is_timedelta64_dtype, | ||
|
@@ -4461,9 +4461,12 @@ def hash_shard(df, nparts, split_out_setup=None, split_out_setup_kwargs=None): | |
h = split_out_setup(df, **(split_out_setup_kwargs or {})) | ||
else: | ||
h = df | ||
h = hash_pandas_object(h, index=False) | ||
import sys | ||
|
||
package = sys.modules[df.__class__.__module__.split(".")[0]] | ||
h = package.util.hash_pandas_object(h, index=False) | ||
if is_series_like(h): | ||
h = h._values | ||
h = h.values | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This will convert a cuDF GPU object to a numpy array here as of now, is there a reason we need I believe the following calls of the modulo and iloc should all work properly against a Pandas Series, no? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Both work in cudf. In pandas modulo works but iloc doesn't yet seem to work in latest pandas( >>> {i: df.iloc[x == i] for i in range(4)}
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 1, in <dictcomp>
File "/home/nfs/pgali/anaconda3/lib/python3.7/site-packages/pandas/core/indexing.py", line 1410, in __getitem__
return self._getitem_axis(maybe_callable, axis=axis)
File "/home/nfs/pgali/anaconda3/lib/python3.7/site-packages/pandas/core/indexing.py", line 2118, in _getitem_axis
self._validate_key(key, axis)
File "/home/nfs/pgali/anaconda3/lib/python3.7/site-packages/pandas/core/indexing.py", line 1976, in _validate_key
"iLocation based boolean "
NotImplementedError: iLocation based boolean indexing on an integer type is not available |
||
h %= nparts | ||
return {i: df.iloc[h == i] for i in range(nparts)} | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the type on
h
here? Dask Series or pandas/ cudf?In pandas
Series._values
is not the same asSeries.values
. It differs for a few types (interval, datetimetz)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
h
here is a pandas/cudf. In that case do you want any special case handling here or want cudf to support._values
?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@TomAugspurger I think
h
is always going to be of type int. As hash of any object is expected to be a series of ints, no?info: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.util.hash_pandas_object.html
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's correct. So for that dtype,
.values
will be equivalent to._values
.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome. So, the changes related to this PR are all actually in and hence ready for review. This would help unblock : rapidsai/cudf#2396