adds hash dunder method to SubgraphCallable for caching purposes #6424
Conversation
dask/optimization.py
Outdated
```diff
@@ -1023,3 +1023,6 @@ def __call__(self, *args):
 
     def __reduce__(self):
         return (SubgraphCallable, (self.dsk, self.outkey, self.inkeys, self.name))
+
+    def __hash__(self):
+        return hash(self.outkey)
```
This is the right idea, but we should also include hashes for `self.inkeys`, `self.name`, and `self.dsk.keys()` here too.
Do you mean the hash of all of these? For example:

```python
hash((self.outkey, tuple(self.dsk.keys()), tuple(self.inkeys), self.name))
```
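As an aside, the `tuple(...)` conversions in that expression are necessary: lists and dict key views are not hashable, so they must be converted to tuples before being passed to `hash()`. A small self-contained illustration (variable names mirror the snippet above):

```python
# Composite keys for hash() must themselves be hashable:
# lists and dict key views are not, so convert them to tuples first.
dsk = {"a": 1, "b": 2}
inkeys = ["in1", "in2"]

try:
    hash((dsk.keys(), inkeys))
except TypeError as exc:
    print("unhashable:", exc)  # both dict_keys and list are unhashable

# After converting to tuples, the composite value hashes fine.
h = hash((tuple(dsk.keys()), "out", tuple(inkeys), "name"))
print(isinstance(h, int))  # True
```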
I'd actually drop `dsk.keys()` from the hash here; the outkey should be unique in all cases (and adding the inkeys and name will help). There's bound to be hash collisions in a dict lookup anyway, and the equality check already does a full `self.dsk == other.dsk` check. No need to make the hash call more expensive than it needs to be.
@jrbourbeau, I tested to make sure that this would work with distributed's caching mechanism, and it appears to. I think this is ready to be merged.
Thanks @andrewfulton9!
@jcrist would you mind taking a look at this if you get a moment?
dask/tests/test_optimization.py
Outdated
```diff
@@ -1118,6 +1118,7 @@ def test_SubgraphCallable():
     f = SubgraphCallable(dsk, "h", ["in1", "in2"], name="test")
     assert f.name == "test"
     assert repr(f) == "test"
+    assert hash(f) == hash((tuple(dsk.keys()), "h", tuple(["in1", "in2"]), "test"))
```
We don't actually care what the hash value is (what this test checks); we care that the hash is repeatable (hashing the same object twice gives the same value) and works with `__eq__` to make dict lookups work. It would be good to test:

- That two instances of a `SubgraphCallable` with the same args hash the same and are equal
- That two different `SubgraphCallable` objects hash differently, and don't compare equal
- That `SubgraphCallable` objects can be keys in dicts. This would test both `__eq__` and `__hash__`.
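The three checks suggested above could be sketched along these lines. A minimal stand-in class is used so the snippet is self-contained (the real class lives in `dask/optimization.py`, and this is not its exact implementation):

```python
class Stub:
    """Stand-in with the __hash__/__eq__ semantics under discussion."""

    def __init__(self, dsk, outkey, inkeys, name):
        self.dsk, self.outkey, self.inkeys, self.name = dsk, outkey, inkeys, name

    def __eq__(self, other):
        return (
            type(other) is type(self)
            and (self.outkey, tuple(self.inkeys), self.name)
            == (other.outkey, tuple(other.inkeys), other.name)
            and self.dsk == other.dsk  # full graph comparison, as in review above
        )

    def __hash__(self):
        # dsk deliberately excluded from the hash (see discussion above)
        return hash((self.outkey, tuple(self.inkeys), self.name))


dsk = {"h": "in1"}
f1 = Stub(dsk, "h", ["in1", "in2"], name="test")
f2 = Stub(dsk, "h", ["in1", "in2"], name="test")
g = Stub(dsk, "g", ["in1"], name="other")

# 1. Same args: equal, and hashes agree
assert f1 == f2 and hash(f1) == hash(f2)
# 2. Different args: not equal (hashes will almost surely differ too)
assert f1 != g
# 3. Usable as dict keys: f2 retrieves the value stored under f1
assert {f1: "cached"}[f2] == "cached"
```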
@jcrist, thanks for the feedback! I removed the `self.dsk` attribute from the hash and covered the above scenarios in the test, as well as removing the hash value check.
Co-authored-by: Matthias Bussonnier <bussonniermatthias@gmail.com>
Thanks for the PR @andrewfulton9 and for reviewing @jcrist @Carreau! This is in (apologies for the delayed merge).
This PR addresses an issue in distributed where `SubgraphCallable` objects add overhead because they currently can't be hashed and therefore can't be cached. This PR allows them to be cached, which should reduce graph processing time for workloads that use `SubgraphCallable`. See this PR in distributed for more information.

Passes `black dask` / `flake8 dask`
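The caching mentioned in the description hinges on `SubgraphCallable` being usable as a dict key. A generic memoization sketch of that pattern (not distributed's actual cache; `expensive_transform` is a hypothetical placeholder):

```python
cache = {}


def expensive_transform(obj):
    # Placeholder for real work, e.g. unpacking and optimizing a subgraph.
    return ("processed", obj)


def process(obj):
    # Requires obj to implement __hash__ and __eq__ to act as a dict key;
    # an equal object seen earlier reuses the stored result.
    if obj not in cache:
        cache[obj] = expensive_transform(obj)
    return cache[obj]


first = process(("task", 1))
second = process(("task", 1))  # equal key: served from the cache
assert first is second
```

Before this PR, passing a `SubgraphCallable` as `obj` would raise `TypeError: unhashable type` at the `obj not in cache` check; with `__hash__` defined, the lookup works and the work is done once per distinct callable.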