Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support setting the index #4565

Merged
merged 12 commits into from Mar 12, 2019
Merged

Support setting the index #4565

merged 12 commits into from Mar 12, 2019

Conversation

@mrocklin
Copy link
Member

@mrocklin mrocklin commented Mar 8, 2019

This PR does two things:

  • When we manipulate the index, change the divisions accordingly. So for example df.index * 10 has its divisions also multiplied by 10
  • Allow assignment of the index

(I'm happy to separate these changes if desired for review purposes)

The first change is more complex, and has some negative effects for situations like the following:

df.groupby(df.index.dt.day)

Because now the grouper and the dataframe do not have matching divisions and so we'll either try to align them or raise an error (which is what happens today). It's not clear what correct behavior is here. This PR relaxes the check for aligned divisions to only ask that they have the same number of partitions, not that their division values are equivalent.

  • Tests added / passed
  • Passes flake8 dask
@mrocklin
Copy link
Member Author

@mrocklin mrocklin commented Mar 8, 2019

Some interesting cases where this fails

df.groupby(df.index.dt.day)

Because the dataframe and the grouper no longer have the same divisions

Something similar happens with joins as well

dask/dataframe/tests/test_dataframe.py::test_shift_with_freq FAILED                                                   [  0%]
dask/dataframe/tests/test_groupby.py::test_groupby_on_index[sync] FAILED                                              [  0%]
dask/dataframe/tests/test_groupby.py::test_groupby_on_index[threads] FAILED                                           [  0%]
dask/dataframe/tests/test_groupby.py::test_groupby_index_array FAILED                                                 [  0%]
dask/dataframe/tests/test_multi.py::test_merge_index_without_divisions[disk] FAILED                                   [  0%]
dask/dataframe/tests/test_multi.py::test_merge_index_without_divisions[tasks] FAILED                                  [  0%]

@mrocklin
Copy link
Member Author

@mrocklin mrocklin commented Mar 8, 2019

@jcrist or @jrbourbeau if either of you have time to review this and also think about the problem above with groupby/join that would be welcome.

@mrocklin
Copy link
Member Author

@mrocklin mrocklin commented Mar 8, 2019

I've updated the header comment

@mrocklin mrocklin changed the title [WIP] Support setting the index Support setting the index Mar 8, 2019
@mrocklin
Copy link
Member Author

@mrocklin mrocklin commented Mar 8, 2019

Down to

dask/dataframe/tests/test_dataframe.py::test_shift_with_freq
dask/dataframe/tests/test_groupby.py::test_groupby_index_array

@mrocklin
Copy link
Member Author

@mrocklin mrocklin commented Mar 10, 2019

@TomAugspurger if you have time to review this that would be welcome. I think that the most controversial change here is relaxing the check in groupby aggregations that the dataframe and the grouper have the same divisions.

Copy link
Member

@TomAugspurger TomAugspurger left a comment

I think this is a good approach. Just a couple doc requests.

@@ -3713,6 +3739,7 @@ def map_partitions(func, *args, **kwargs):
"""
meta = kwargs.pop('meta', no_default)
name = kwargs.pop('token', None)
transform_divisions = kwargs.pop('transform_divisions', True)
Copy link
Member

@TomAugspurger TomAugspurger Mar 11, 2019

Would appreciate documenting this in the docstring, especially around edge cases.

IIUC, users can expect .map_partictions(func) to retain divisions iff func is monotonic? This seems like a perfectly acceptable rule for now.

Copy link
Member Author

@mrocklin mrocklin Mar 11, 2019

Added a docstring entry

IIUC, users can expect .map_partictions(func) to retain divisions iff func is monotonic? This seems like a perfectly acceptable rule for now.

Yes, if the input is an Index object

True
>>> valid_divisions(123)
False
"""
Copy link
Member

@TomAugspurger TomAugspurger Mar 11, 2019

Add an example with [0, float('nan'), 1] returning False?

Copy link
Member Author

@mrocklin mrocklin Mar 11, 2019

Thanks for the fail case!

@mrocklin
Copy link
Member Author

@mrocklin mrocklin commented Mar 12, 2019

Any further comments here @TomAugspurger ?

Copy link
Member

@TomAugspurger TomAugspurger left a comment

Docs look good. I think merge whenever you're ready.

@mrocklin mrocklin merged commit 1b33813 into dask:master Mar 12, 2019
2 checks passed
@mrocklin mrocklin deleted the index-setter branch Mar 12, 2019
jorge-pessoa pushed a commit to jorge-pessoa/dask that referenced this issue May 14, 2019
-  When we manipulate the index, change the divisions accordingly.  So for example `df.index * 10` has its divisions also multiplied by 10
-  Allow assignment of the index

The first change is more complex, and has some negative effects for situations like the following:

    df.groupby(df.index.dt.day)

Because now the grouper and the dataframe do not have matching divisions and so we'll either try to align them or raise an error (which is what happens today).  It's not clear what correct behavior is here.  This PR relaxes the check for aligned divisions to only ask that they have the same number of partitions, not that their division values are equivalent.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

2 participants