Remove hard dependency on pandas in get_dummies#5057
Conversation
|
rerun tests |
mrocklin
left a comment
There was a problem hiding this comment.
Hi @galipremsagar !
Thanks for taking this on. I have a few questions about this. They're in-lined below.
dask/dataframe/reshape.py
Outdated
| # dtype added to pandas | ||
| kwargs = {"dtype": dtype} | ||
| if len(kwargs) == 0: | ||
| kwargs = {"dtype": dtype} |
There was a problem hiding this comment.
Because of the reason I had to introduce kwargs in get_dummies call. I had to do this as currently cudf's get_dummies behavior is slightly different from that of panda's. For the meta to be populated for dask to work we would require cats attribute to be passed for cudf.get_dummies to give the expected columns on an empty dataframe. So instead of creating inconsistency by adding cats to dask get_dummies call, moved the kwargs being created below to the function signature itself and having an equivalent check which was being done previously.
There was a problem hiding this comment.
I see, so you're saying that there is an extra keyword argument that cudf wants, so you want to include **kwargs in the dask.dataframe.get_dummies signature? This makes sense.
Given that though, I think that you can probably replace these four lines as follows?:
-if len(kwargs) == 0:
- kwargs = {'dtype': dtype}
-else:
- kwargs['dtype'] = dtype
+kwargs['dtype'] = dtypeThere was a problem hiding this comment.
Yes, since kwargs would be an empty dict it makes sense to remove the length check. Removed it. 👍
dask/dataframe/reshape.py
Outdated
| drop_first=drop_first, | ||
| **kwargs | ||
| ) | ||
| method_cache = M |
There was a problem hiding this comment.
A few questions/comments here:
- It seems odd to special case Pandas here. Ideally we would find a single solution that would work for any pandas-like library
- I don't understand the use of
Mhere. TheMobject creates methods, so for example this would be equivalend todata._meta.get_dummies. My understanding is that this doesn't exist for Pandas.
@TomAugspurger , do you have any suggestions for how to get a function or a module of an object to go from a Pandas dataframe to a Pandas function (not method).
There was a problem hiding this comment.
True, because of point 2 I had to have a branched condition for pandas separately. But I would like to know if we can do it in a more common code like you asked @TomAugspurger. I'm aswell looking to see how I can find package level function as both pandas & cudf now have get_dummies as a package level functions.
There was a problem hiding this comment.
Pandas dataframe to a Pandas function (not method).
Sorry, I don't quite follow. Is it fair to say that in e.g. dask.dataframe.get_dummies, we'd like to get at DataFrame.get_dummies? Does getattr(data, <name>) work?
There was a problem hiding this comment.
Unfortuantely not in this case
In [1]: import pandas as pd
In [2]: df = pd.DataFrame()
In [3]: df.get_dummies
AttributeError: 'DataFrame' object has no attribute 'get_dummies'
In [4]: pd.get_dummies
Out[4]: <function pandas.core.reshape.reshape.get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, columns=None, sparse=False, drop_first=False, dtype=None)>There was a problem hiding this comment.
@TomAugspurger, like Matthew mentioned I'm trying to access package level function since get_dummies is not a DataFrame function in pandas.
>>> import cudf
>>> cudf.get_dummies
<function get_dummies at 0x7f0b55f63ae8>There was a problem hiding this comment.
Ah understood.
Is this too hacky?
In [34]: obj = pd.DataFrame()
In [35]: package = obj.__class__.__module__.split(".")[0]
In [36]: sys.modules[package].get_dummies
Out[36]: <function pandas.core.reshape.reshape.get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, columns=None, sparse=False, drop_first=False, dtype=None)>
There was a problem hiding this comment.
It's certainly hacky :) But also pragmatic :/
I think I'll leave this question to you if you don't mind :)
There was a problem hiding this comment.
I can live with it. I don't think it's likely to break (famous last words).
There was a problem hiding this comment.
Made changes accordingly.
dask/dataframe/reshape.py
Outdated
| drop_first=drop_first, | ||
| **kwargs | ||
| ) | ||
| method_cache = pd |
There was a problem hiding this comment.
I don't understand the use of the term "method cache" for this.
There was a problem hiding this comment.
This should go away if I have a better way to call get_dummies rather than from Method cache. But my understanding was that M was a Method cache for package functions. But I'm wrong and it doesn't seem to be the case.
|
This looks good to me. Thanks for the changes and the effort here @galipremsagar . Merging. |
black dask/flake8 daskThis fixes: #4885