Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dask dataframe isna #3294

Merged
merged 16 commits into from Mar 25, 2018
Merged

Dask dataframe isna #3294

merged 16 commits into from Mar 25, 2018

Conversation

cr458
Copy link
Contributor

@cr458 cr458 commented Mar 18, 2018

Top level dd.isna method to stay consistent with pandas. From #3239.

@wraps(pd.isna)
def isna(arg):
meta = pd.Series([True])
return map_partitions(pd.isna, arg, meta=meta)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect that if the input series has a name or other metadata (I'm not sure if other metadata exists) then this information will not be captured with the meta= definition above. I suspect that you can trigger a failure in your test by adding name='foo' to the construction of your input series (the pandas version will be named while the dask.dataframe version will not be named)

I looked at map_partitions to see how it handles metadata and it seems to be using a function dask.dataframe.core._emulate. I'm curious, can we let it handle this and not provide meta= as a keyword argument here? It might do the right thing by default.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like it is working, I will try and make the test fail as well by adding the name='foo'.

@@ -4123,6 +4123,12 @@ def to_timedelta(arg, unit='ns', errors='raise'):
meta=meta)


@wraps(pd.isna)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this function is only in newer versions of pandas, you'll have to wrap both the definition and the import in __init__.py in a case statement. I'd do something like:

if hasattr(pd, 'isna'):
    @wraps(pd.isna)
    def isna(arg):
        return map_partitions(pd.isna, arg)

And in __init__.py:

try:
    from .core import isna
except ImportError:
    pass

try:
from .core import isna
except ImportError:
pass
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing a newline at the end of the file (usually this is something you can configure in your editor to prevent this).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have now configured my editor to automate this :). Thanks a lot for your guidance!

@@ -4123,6 +4123,12 @@ def to_timedelta(arg, unit='ns', errors='raise'):
meta=meta)


if hasattr(pd, isna):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be if hasattr(pd, 'isna'): (as seen in the travis logs).


In general, I recommend trying to run the tests yourself before pushing to github. We use py.test for testing:

$ conda install pytest
# or
$ pip install pytest

$ py.test dask  # run the whole test suite
$ py.test dask/dataframe/tests/test_dataframe.py -k test_isna  # test just your added test

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Should be fixed now.

@mrocklin mrocklin changed the title [WIP] Dask dataframe isna Dask dataframe isna Mar 22, 2018
@mrocklin
Copy link
Member

This looks great to me. Thank you for implementing this @cr458 . A few other administrative things to clean up, hopefully these are easy.

  1. Can you add a small note to the changelog in docs/source/changelog.rst. Hopefully the pattern there should be clear enough (although this may be the first change in the new version, so you may have to look down a bit for examples to copy). Don't forget to add your name and a website to the bottom of that document
  2. Should we add this as a method on _Frame as well so that both DataFrame and Series get this method? The version issue might come up. I recommend that we check the version within the method and raise an informative error if appropriate.
  3. After that can you add isna to the DataFrame API docs in docs/source/dataframe-api.rst?

@cr458
Copy link
Contributor Author

cr458 commented Mar 23, 2018

The pleasure was all mine @mrocklin. Thanks for the helpful comments, I think it's fair to say the little work that was done here was actually done by you guys. Have updated the changelog now.
Happy to implement the method in _Frame as well. Which based on the other methods for _Frame I'm guessing it would look something like:

    @derived_from(pd.DataFrame)
    def isna(self):
        if hasattr(pd, 'isna'):
            return self.map_partitions(M.isna)
        else:
            raise ImportError

although I'm having a hard time seeing where the variable M is set in the class? It seems to be used throughout various methods.

@mrocklin
Copy link
Member

mrocklin commented Mar 24, 2018 via email

@@ -2813,6 +2813,16 @@ def test_to_timedelta():
dd.to_timedelta(ds, errors='coerce'))


@pytest.mark.skipif(PANDAS_VERSION < '0.22.0',
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mrocklin, should this use the hasattr method of checking for the isna instead? Just occurred to me that isna could be deprecated in future versions of pandas.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't worry about isna being deprecated.

@cr458
Copy link
Contributor Author

cr458 commented Mar 25, 2018

@mrocklin shall I add a test to test_embarrassingly_parallel_operations in test_dataframe.py for isna as well?

@mrocklin
Copy link
Member

It's not necessary, the test you have seems fine to me. An alternative test would have been to add isna to that test instead, but what you've done here seems good to me.

In general I'm pretty satisfied with this. I've merged with master and added isna to API docs. Merging.

Thanks @cr458 for your work on this! Hopefully it is the first of many :)

@mrocklin mrocklin merged commit 907434b into dask:master Mar 25, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants