Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support setting the index #4565

Merged
merged 12 commits into from Mar 12, 2019

Conversation

Projects
None yet
2 participants
@mrocklin
Copy link
Member

commented Mar 8, 2019

This PR does two things:

  • When we manipulate the index, change the divisions accordingly. So for example df.index * 10 has its divisions also multiplied by 10
  • Allow assignment of the index

(I'm happy to separate these changes if desired for review purposes)

The first change is more complex, and has some negative effects for situations like the following:

df.groupby(df.index.dt.day)

Because now the grouper and the dataframe do not have matching divisions and so we'll either try to align them or raise an error (which is what happens today). It's not clear what correct behavior is here. This PR relaxes the check for aligned divisions to only ask that they have the same number of partitions, not that their division values are equivalent.

  • Tests added / passed
  • Passes flake8 dask

mrocklin added some commits Mar 8, 2019

Support setting the index
Also convert divisions when we operate on an index object
@mrocklin

This comment has been minimized.

Copy link
Member Author

commented Mar 8, 2019

Some interesting cases where this fails

df.groupby(df.index.dt.day)

Because the dataframe and the grouper no longer have the same divisions

Something similar happens with joins as well

dask/dataframe/tests/test_dataframe.py::test_shift_with_freq FAILED                                                   [  0%]
dask/dataframe/tests/test_groupby.py::test_groupby_on_index[sync] FAILED                                              [  0%]
dask/dataframe/tests/test_groupby.py::test_groupby_on_index[threads] FAILED                                           [  0%]
dask/dataframe/tests/test_groupby.py::test_groupby_index_array FAILED                                                 [  0%]
dask/dataframe/tests/test_multi.py::test_merge_index_without_divisions[disk] FAILED                                   [  0%]
dask/dataframe/tests/test_multi.py::test_merge_index_without_divisions[tasks] FAILED                                  [  0%]
@mrocklin

This comment has been minimized.

Copy link
Member Author

commented Mar 8, 2019

@jcrist or @jrbourbeau if either of you have time to review this and also think about the problem above with groupby/join that would be welcome.

@mrocklin

This comment has been minimized.

Copy link
Member Author

commented Mar 8, 2019

I've updated the header comment

@mrocklin mrocklin changed the title [WIP] Support setting the index Support setting the index Mar 8, 2019

@mrocklin

This comment has been minimized.

Copy link
Member Author

commented Mar 8, 2019

Down to

dask/dataframe/tests/test_dataframe.py::test_shift_with_freq
dask/dataframe/tests/test_groupby.py::test_groupby_index_array

mrocklin added some commits Mar 10, 2019

@mrocklin

This comment has been minimized.

Copy link
Member Author

commented Mar 10, 2019

@TomAugspurger if you have time to review this that would be welcome. I think that the most controversial change here is relaxing the check in groupby aggregations that the dataframe and the grouper have the same divisions.

@TomAugspurger
Copy link
Member

left a comment

I think this is a good approach. Just a couple doc requests.

@@ -3713,6 +3739,7 @@ def map_partitions(func, *args, **kwargs):
"""
meta = kwargs.pop('meta', no_default)
name = kwargs.pop('token', None)
transform_divisions = kwargs.pop('transform_divisions', True)

This comment has been minimized.

Copy link
@TomAugspurger

TomAugspurger Mar 11, 2019

Member

Would appreciate documenting this in the docstring, especially around edge cases.

IIUC, users can expect .map_partictions(func) to retain divisions iff func is monotonic? This seems like a perfectly acceptable rule for now.

This comment has been minimized.

Copy link
@mrocklin

mrocklin Mar 11, 2019

Author Member

Added a docstring entry

IIUC, users can expect .map_partictions(func) to retain divisions iff func is monotonic? This seems like a perfectly acceptable rule for now.

Yes, if the input is an Index object

True
>>> valid_divisions(123)
False
"""

This comment has been minimized.

Copy link
@TomAugspurger

TomAugspurger Mar 11, 2019

Member

Add an example with [0, float('nan'), 1] returning False?

This comment has been minimized.

Copy link
@mrocklin

mrocklin Mar 11, 2019

Author Member

Thanks for the fail case!

@mrocklin

This comment has been minimized.

Copy link
Member Author

commented Mar 12, 2019

Any further comments here @TomAugspurger ?

@TomAugspurger
Copy link
Member

left a comment

Docs look good. I think merge whenever you're ready.

@mrocklin mrocklin merged commit 1b33813 into dask:master Mar 12, 2019

2 checks passed

continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details

@mrocklin mrocklin deleted the mrocklin:index-setter branch Mar 12, 2019

jorge-pessoa pushed a commit to jorge-pessoa/dask that referenced this pull request May 14, 2019

Support setting the index (dask#4565)
-  When we manipulate the index, change the divisions accordingly.  So for example `df.index * 10` has its divisions also multiplied by 10
-  Allow assignment of the index

The first change is more complex, and has some negative effects for situations like the following:

    df.groupby(df.index.dt.day)

Because now the grouper and the dataframe do not have matching divisions and so we'll either try to align them or raise an error (which is what happens today).  It's not clear what correct behavior is here.  This PR relaxes the check for aligned divisions to only ask that they have the same number of partitions, not that their division values are equivalent.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.