Skip to content

Remove hard dependency on pandas in get_dummies#5057

Merged
mrocklin merged 3 commits intodask:masterfrom
galipremsagar:get_dummies
Jul 4, 2019
Merged

Remove hard dependency on pandas in get_dummies#5057
mrocklin merged 3 commits intodask:masterfrom
galipremsagar:get_dummies

Conversation

@galipremsagar
Copy link
Copy Markdown
Contributor

@galipremsagar galipremsagar commented Jul 3, 2019

  • Tests added / passed
  • Passes black dask / flake8 dask

This fixes: #4885

@galipremsagar
Copy link
Copy Markdown
Contributor Author

rerun tests

@galipremsagar galipremsagar changed the title [WIP] Removing hard dependency on pandas in get_dummies [REVIEW] Removing hard dependency on pandas in get_dummies Jul 3, 2019
Copy link
Copy Markdown
Member

@mrocklin mrocklin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @galipremsagar !

Thanks for taking this on. I have a few questions about this. They're in-lined below.

# dtype added to pandas
kwargs = {"dtype": dtype}
if len(kwargs) == 0:
kwargs = {"dtype": dtype}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why this change?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because of the reason I had to introduce kwargs in get_dummies call. I had to do this as currently cudf's get_dummies behavior is slightly different from that of panda's. For the meta to be populated for dask to work we would require cats attribute to be passed for cudf.get_dummies to give the expected columns on an empty dataframe. So instead of creating inconsistency by adding cats to dask get_dummies call, moved the kwargs being created below to the function signature itself and having an equivalent check which was being done previously.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, so you're saying that there is an extra keyword argument that cudf wants, so you want to include **kwargs in the dask.dataframe.get_dummies signature? This makes sense.

Given that though, I think that you can probably replace these four lines as follows?:

-if len(kwargs) == 0:
-    kwargs = {'dtype': dtype}
-else:
-    kwargs['dtype'] = dtype
+kwargs['dtype'] = dtype

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, since kwargs would be an empty dict it makes sense to remove the length check. Removed it. 👍

drop_first=drop_first,
**kwargs
)
method_cache = M
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few questions/comments here:

  1. It seems odd to special case Pandas here. Ideally we would find a single solution that would work for any pandas-like library
  2. I don't understand the use of M here. The M object creates methods, so for example this would be equivalend to data._meta.get_dummies. My understanding is that this doesn't exist for Pandas.

@TomAugspurger , do you have any suggestions for how to get a function or a module of an object to go from a Pandas dataframe to a Pandas function (not method).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True, because of point 2 I had to have a branched condition for pandas separately. But I would like to know if we can do it in a more common code like you asked @TomAugspurger. I'm aswell looking to see how I can find package level function as both pandas & cudf now have get_dummies as a package level functions.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pandas dataframe to a Pandas function (not method).

Sorry, I don't quite follow. Is it fair to say that in e.g. dask.dataframe.get_dummies, we'd like to get at DataFrame.get_dummies? Does getattr(data, <name>) work?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortuantely not in this case

In [1]: import pandas as pd

In [2]: df = pd.DataFrame()

In [3]: df.get_dummies
AttributeError: 'DataFrame' object has no attribute 'get_dummies'

In [4]: pd.get_dummies
Out[4]: <function pandas.core.reshape.reshape.get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, columns=None, sparse=False, drop_first=False, dtype=None)>

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@TomAugspurger, like Matthew mentioned I'm trying to access package level function since get_dummies is not a DataFrame function in pandas.

>>> import cudf
>>> cudf.get_dummies
<function get_dummies at 0x7f0b55f63ae8>

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah understood.

Is this too hacky?

In [34]: obj = pd.DataFrame()

In [35]: package = obj.__class__.__module__.split(".")[0]

In [36]: sys.modules[package].get_dummies
Out[36]: <function pandas.core.reshape.reshape.get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, columns=None, sparse=False, drop_first=False, dtype=None)>

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's certainly hacky :) But also pragmatic :/

I think I'll leave this question to you if you don't mind :)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can live with it. I don't think it's likely to break (famous last words).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made changes accordingly.

drop_first=drop_first,
**kwargs
)
method_cache = pd
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand the use of the term "method cache" for this.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should go away if I have a better way to call get_dummies rather than from Method cache. But my understanding was that M was a Method cache for package functions. But I'm wrong and it doesn't seem to be the case.

@mrocklin mrocklin changed the title [REVIEW] Removing hard dependency on pandas in get_dummies Remove hard dependency on pandas in get_dummies Jul 4, 2019
@mrocklin mrocklin merged commit ddcfd80 into dask:master Jul 4, 2019
@mrocklin
Copy link
Copy Markdown
Member

mrocklin commented Jul 4, 2019

This looks good to me. Thanks for the changes and the effort here @galipremsagar . Merging.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[FEA] Remove pandas requirement from get_dummies

3 participants