Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

adds hash dunder method to SubgraphCallable for caching purposes #6424

Merged
merged 9 commits into from Sep 23, 2020

Conversation

andrewfulton9
Copy link
Contributor

This PR addresses an issue in distributed where SubgraphCallable objects add overhead because they currently can't be hashed and therefore can't be cached. This PR allows them to be cached which should increase graph processing time for workloads that utilize subgraphCallable. See this PR in distributed for more information.

  • Tests added / passed
  • Passes black dask / flake8 dask

@@ -1023,3 +1023,6 @@ def __call__(self, *args):

def __reduce__(self):
return (SubgraphCallable, (self.dsk, self.outkey, self.inkeys, self.name))

def __hash__(self):
return hash(self.outkey)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the right idea, but we should also include hashes for self.inkeys, self.name, and self.dsk.keys() here too

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean like the hash of all of these? For example:

hash((self.outkey, tuple(self.dsk.keys()), tuple(self.inkeys), self.name))

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd actually drop dsk.keys() from the hash here, the outkey should be unique in all cases (and adding the inkeys and name will help). There's bound to be hash collisions in a dict lookup anyway, and the equality check already does a full self.dsk == other.dsk check. No need to make the hash call more expensive than it needs to be.

@andrewfulton9
Copy link
Contributor Author

@jrbourbeau, I tested to make sure that this would work with distributed's caching mechanism, and it appears to. I think this is ready to be merged

Copy link
Member

@jrbourbeau jrbourbeau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @andrewfulton9!

@jcrist would mind taking a look at this if you get a moment

@@ -1118,6 +1118,7 @@ def test_SubgraphCallable():
f = SubgraphCallable(dsk, "h", ["in1", "in2"], name="test")
assert f.name == "test"
assert repr(f) == "test"
assert hash(f) == hash((tuple(dsk.keys()), "h", tuple(["in1", "in2"]), "test"))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't actually care what the hash value is (what this test checks), we care that the hash is repeatable (hashing the same obj twice gives the same value) and works with __eq__ to make dict lookups work. It would be good to test:

  • That two instances of a SubgraphCallable with the same args hash the same and are equal
  • That two different SubgraphCallable objects hash differently, and don't compare equal
  • That SubgraphCallable objects can be keys in dicts. This would test both __eq__ and __hash__.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jcrist, Thanks for the feedback! I removed the self.dsk attribute from the hash and accounted for the above scenarios in the test as removing the hash value check.

dask/optimization.py Outdated Show resolved Hide resolved
andrewfulton9 and others added 4 commits August 11, 2020 16:50
Co-authored-by: Matthias Bussonnier <bussonniermatthias@gmail.com>
Co-authored-by: Matthias Bussonnier <bussonniermatthias@gmail.com>
@jrbourbeau
Copy link
Member

Thanks for the PR @andrewfulton9 and for reviewing @jcrist @Carreau! This is in (apologies for the delayed merge)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants