Relax redundant-key check in _check_dsk #10701

Merged
merged 4 commits into from Dec 18, 2023

rjzamora
Member

@rjzamora rjzamora commented Dec 13, 2023

Possible solution for the test_setitem_hardmask component of #10672

It seems that we are knowingly adding redundant keys for array setitem operations (https://github.com/dask/dask/blob/main/dask/array/slicing.py#L2064). We only seem to catch this on Windows, because other systems appear to tokenize distinct np.ma.masked objects to different values. I added a new test (test_setitem_slice_twice) that also fails on Linux without the proposed change.

Question: Should we relax the redundant-key check in _check_dsk (as this PR currently does), or should we avoid adding redundant keys like this in the first place?
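
For readers unfamiliar with the check, here is a minimal sketch of what the strict version does. It is not the dask code: plain dicts stand in for HighLevelGraph layers, `collections.Counter` stands in for `frequencies(concat(...))`, and the function name, layer names, and keys are hypothetical.

```python
from collections import Counter
from itertools import chain


def strict_redundant_keys(layers):
    """Return {key: count} for every key that appears in more than one layer.

    `layers` is assumed to be a plain dict of {layer_name: {key: value}};
    it only imitates the shape of a HighLevelGraph's layers.
    """
    # Iterating a dict yields its keys, so this counts keys across all layers.
    freqs = Counter(chain.from_iterable(layers.values()))
    return {k: v for k, v in freqs.items() if v != 1}


# Two hypothetical setitem layers that both embed the same value
# under the same key.
layers = {
    "setitem-a": {("setitem-a", 0): "task-a", "value-123": [0, 0]},
    "setitem-b": {("setitem-b", 0): "task-b", "value-123": [0, 0]},
}

non_one = strict_redundant_keys(layers)
print(non_one)  # {'value-123': 2}
```

Under the strict rule the shared key is reported even though both layers map it to an identical value, which is the situation described above.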

cc @charlesbluca

@github-actions github-actions bot added the array label Dec 13, 2023
@@ -210,7 +211,15 @@ def _check_dsk(dsk):
    assert all(isinstance(k, (tuple, str)) for k in dsk.layers)
    freqs = frequencies(concat(dsk.layers.values()))
    non_one = {k: v for k, v in freqs.items() if v != 1}
    assert not non_one, non_one
    key_collisions = set()
Member

I would defer to other maintainers' feedback first, but if we decide to relax this check now and enforce it more strictly later, perhaps it would make sense to plumb through a kwarg that toggles between the relaxed and strict check here?

Member Author

I'm completely open to other solutions here, but my current sense is that we should avoid adding any new plumbing. I think we just need to decide on a standard rule for dealing with redundant keys, and strictly enforce it here.

The current "rule" in main is: "Distinct Layer mappings may not include the same key".

This PR proposes the new/relaxed rule: "Distinct Layer mappings may not include the same key, unless they correspond to equivalent elements (as determined by tokenize)".

The advantage of the new rule is that we can continue doing things like https://github.com/dask/dask/blob/main/dask/array/slicing.py#L2064, since there is nothing "wrong" with overwriting a k-v pair with an equivalent k-v pair anyway. The disadvantage of the new rule is that tokenize may not always return a deterministic value for problematic data. Also, it may be easy enough to simply correct the setitem behavior and avoid this question altogether. (I'll look into this; I just wanted to submit a possible solution before I could be pulled away.)
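
To make the relaxed rule concrete, here is a rough sketch under stated assumptions. It is not the code merged in this PR: plain dicts stand in for layers, and `toy_tokenize` is a hypothetical stand-in for dask's `tokenize` that is only deterministic for simple picklable values. The function name `conflicting_keys` is also made up for illustration.

```python
import hashlib
import pickle
from collections import Counter
from itertools import chain


def toy_tokenize(obj):
    # Hypothetical stand-in for dask's tokenize (assumption): hash the pickled value.
    return hashlib.sha1(pickle.dumps(obj)).hexdigest()


def conflicting_keys(layers):
    """Return {key: count} for shared keys whose values are NOT equivalent.

    A key that appears in several layers is tolerated as long as every
    layer maps it to a value with the same token.
    """
    freqs = Counter(chain.from_iterable(layers.values()))
    conflicts = {}
    for key, count in freqs.items():
        if count == 1:
            continue  # key is unique across layers, nothing to check
        tokens = {
            toy_tokenize(layer[key]) for layer in layers.values() if key in layer
        }
        if len(tokens) > 1:
            conflicts[key] = count
    return conflicts


# Same key, equivalent values: allowed under the relaxed rule.
equivalent = {
    "a": {"shared": (1, 2), "x": 1},
    "b": {"shared": (1, 2), "y": 2},
}
# Same key, different values: still rejected.
different = {
    "a": {"shared": (1, 2)},
    "b": {"shared": (3, 4)},
}

assert conflicting_keys(equivalent) == {}
assert conflicting_keys(different) == {"shared": 2}
```

The sketch also shows where the stated disadvantage comes from: if the tokenizer returns different tokens for values that are really the same, an otherwise harmless shared key would be reported as a conflict.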

Member Author

Note: I experimented with an alternative here: https://github.com/rjzamora/dask/tree/djust-check-dsk-alternative - I'm slightly less confident with that approach (it still seems possible that we can end up with redundant but equivalent keys when concatenate_array_chunks is needed).

Member

I agree that we shouldn't introduce an additional argument to flip behavior.

It's worth noting that this entire condition stems from an ancient implementation of a Mapping called SharedDict in #1985, which required it as an implementation detail. This object no longer exists. I would consider it best practice not to have overlapping keys, but I also don't see a reason why it should be forbidden.

As Matt points out in #1985 (comment), shared keys are fine and even common and there was nothing wrong with this for advanced users.

Member

@fjetter fjetter left a comment

I'm fine with this fix. Is there anything left to discuss?

@rjzamora
Member Author

> I'm fine with this fix. Is there anything left to discuss?

Not on my end. I agree with your comments. My sense is that there is no "bug" here; the assertion is just stricter than it needs to be. So relaxing the assertion is the most "maintenance friendly" move.

@rjzamora rjzamora merged commit 32e19ce into dask:main Dec 18, 2023
26 checks passed
@rjzamora rjzamora deleted the adjust-check-dsk branch December 18, 2023 14:41
@fjetter
Member

fjetter commented Dec 18, 2023

thanks @rjzamora and @charlesbluca !
