Warn if partitions overlap in compute_divisions #4600
Thanks for the PR. The implementation looks clean to me, but it raises some questions. Some thoughts below. Conversation welcome.
@@ -602,6 +602,9 @@ def test_set_index_sorted_true():
     with pytest.raises(ValueError):
         a.set_index(a.z, sorted=True)

+    with pytest.raises(ValueError):
+        a.set_index(a.y, sorted=True)
Ideally this would still work correctly, and we would move the extra data from one partition over to the neighboring one with a little bit of communication.
If we accept the change that you've proposed then we probably don't want to enforce this as a test, but might instead prefer an xfailed test that ensures that the resulting divisions are valid, even if the graph we have to produce is a little bit more complex.
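To make the "move the extra data over to the neighboring partition" idea concrete, here is a minimal pure-Python sketch. It is not dask code: `shift_overlap` is a hypothetical helper, partitions are modelled as non-empty sorted lists of index values, and the assumption is that any value in partition `i` that is greater than or equal to the minimum of partition `i + 1` belongs in partition `i + 1`.

```python
def shift_overlap(partitions):
    """Hypothetical helper: push boundary values into the next partition.

    Assumes every input partition is a non-empty, sorted list of index values.
    """
    fixed = [list(p) for p in partitions]
    for i in range(len(fixed) - 1):
        boundary = fixed[i + 1][0]  # current minimum of the next partition
        keep = [k for k in fixed[i] if k < boundary]
        spill = [k for k in fixed[i] if k >= boundary]
        fixed[i] = keep
        # Spilled values are >= boundary, so the next partition's minimum is unchanged.
        fixed[i + 1] = sorted(spill + fixed[i + 1])
    return fixed


parts = [[1, 2, 3], [3, 4, 5], [5, 6]]
print(shift_overlap(parts))
```

In this sketch a partition can end up empty if all of its values equal the next partition's minimum, which a real implementation would have to handle.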
It may also be that we want to accept our current incorrect behavior rather than err. My guess is that in most cases where this happens today it doesn't negatively affect users. This is hard to judge, though.
Seems pretty complicated. :)
Perhaps I could just change this to a warning.
It could also be that you could leave the existing semi-invalid divisions, and then call repartition(divisions=...) with a suitable new suggestion. I suspect that that would work today and may not be difficult to implement.
I tried that and it does not appear to work (without modifying repartition_divisions or creating a new method of repartitioning specific to this almost-partitioned case).
Sure
On Fri, Mar 15, 2019 at 11:56 PM Brian Chu wrote:
Seems pretty complicated. :)
Perhaps I could just change this to a warning.
Changed this to a warning only.
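For readers following along, a warning-only check could look roughly like the sketch below. This is an illustration, not the code merged in this PR: `divisions_from_bounds` is a hypothetical name, it takes the per-partition index minimums and maximums as plain non-empty lists, and it assumes "overlap" means a partition's maximum reaches or exceeds the next partition's minimum.

```python
import warnings


def divisions_from_bounds(mins, maxes):
    """Hypothetical helper: build divisions and warn on overlapping partitions.

    Assumes mins and maxes are non-empty and have one entry per partition.
    """
    mins, maxes = list(mins), list(maxes)
    if sorted(mins) != mins or sorted(maxes) != maxes:
        raise ValueError("Partitions must be sorted by index")
    # Neighbouring pairs where the left max reaches the right min.
    overlapping = [i for i in range(len(mins) - 1) if maxes[i] >= mins[i + 1]]
    if overlapping:
        warnings.warn(
            f"Partitions overlap at boundaries after partitions {overlapping}",
            UserWarning,
        )
    return tuple(mins) + (maxes[-1],)
```

With this shape, unsorted partitions still raise `ValueError`, while touching or overlapping neighbours only emit a `UserWarning` and the divisions are returned as before.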
Thanks @bchu. This is in. Also, I notice that this is your first code contribution to this repository. Welcome!
This is my suggestion from #4591. Let me know if you think this is worth adding.