adds hash dunder method to SubgraphCallable for caching purposes #6424
Conversation
dask/optimization.py
Outdated
```diff
@@ -1023,3 +1023,6 @@ def __call__(self, *args):
 
     def __reduce__(self):
         return (SubgraphCallable, (self.dsk, self.outkey, self.inkeys, self.name))
+
+    def __hash__(self):
+        return hash(self.outkey)
```
This is the right idea, but we should also include hashes for `self.inkeys`, `self.name`, and `self.dsk.keys()` here too.
Do you mean the hash of all of these? For example:

```python
hash((self.outkey, tuple(self.dsk.keys()), tuple(self.inkeys), self.name))
```
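As an aside, the `tuple(...)` conversions in that expression are necessary: lists and dict key views are not hashable, so they must be converted to tuples before being passed to `hash()`. A small self-contained illustration (variable names mirror the snippet above):

```python
# Composite keys for hash() must themselves be hashable:
# lists and dict key views are not, so convert them to tuples first.
dsk = {"a": 1, "b": 2}
inkeys = ["in1", "in2"]

try:
    hash((dsk.keys(), inkeys))
except TypeError as exc:
    print("unhashable:", exc)  # both dict_keys and list are unhashable

# After converting to tuples, the composite value hashes fine.
h = hash((tuple(dsk.keys()), "out", tuple(inkeys), "name"))
print(isinstance(h, int))  # True
```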
I'd actually drop `dsk.keys()` from the hash here; the outkey should be unique in all cases (and adding the inkeys and name will help). There's bound to be hash collisions in a dict lookup anyway, and the equality check already does a full `self.dsk == other.dsk` check. No need to make the hash call more expensive than it needs to be.
@jrbourbeau, I tested to make sure that this would work with distributed's caching mechanism, and it appears to. I think this is ready to be merged.
Thanks @andrewfulton9!
@jcrist would you mind taking a look at this if you get a moment?
dask/tests/test_optimization.py
Outdated
```diff
@@ -1118,6 +1118,7 @@ def test_SubgraphCallable():
     f = SubgraphCallable(dsk, "h", ["in1", "in2"], name="test")
     assert f.name == "test"
     assert repr(f) == "test"
+    assert hash(f) == hash((tuple(dsk.keys()), "h", tuple(["in1", "in2"]), "test"))
```
We don't actually care what the hash value is (what this test checks); we care that the hash is repeatable (hashing the same object twice gives the same value) and works with `__eq__` to make dict lookups work. It would be good to test:

- That two instances of a `SubgraphCallable` with the same args hash the same and are equal
- That two different `SubgraphCallable` objects hash differently, and don't compare equal
- That `SubgraphCallable` objects can be keys in dicts. This would test both `__eq__` and `__hash__`.
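The three checks suggested above could be sketched along these lines. A minimal stand-in class is used so the snippet is self-contained (the real class lives in `dask/optimization.py`, and this is not its exact implementation):

```python
class Stub:
    """Stand-in with the __hash__/__eq__ semantics under discussion."""

    def __init__(self, dsk, outkey, inkeys, name):
        self.dsk, self.outkey, self.inkeys, self.name = dsk, outkey, inkeys, name

    def __eq__(self, other):
        return (
            type(other) is type(self)
            and (self.outkey, tuple(self.inkeys), self.name)
            == (other.outkey, tuple(other.inkeys), other.name)
            and self.dsk == other.dsk  # full graph comparison, as in review above
        )

    def __hash__(self):
        # dsk deliberately excluded from the hash (see discussion above)
        return hash((self.outkey, tuple(self.inkeys), self.name))


dsk = {"h": "in1"}
f1 = Stub(dsk, "h", ["in1", "in2"], name="test")
f2 = Stub(dsk, "h", ["in1", "in2"], name="test")
g = Stub(dsk, "g", ["in1"], name="other")

# 1. Same args: equal, and hashes agree
assert f1 == f2 and hash(f1) == hash(f2)
# 2. Different args: not equal (hashes will almost surely differ too)
assert f1 != g
# 3. Usable as dict keys: f2 retrieves the value stored under f1
assert {f1: "cached"}[f2] == "cached"
```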
@jcrist, thanks for the feedback! I removed the `self.dsk` attribute from the hash and covered the above scenarios in the test, as well as removing the hash value check.
Co-authored-by: Matthias Bussonnier <bussonniermatthias@gmail.com>
Thanks for the PR @andrewfulton9 and for reviewing @jcrist @Carreau! This is in (apologies for the delayed merge).
This PR addresses an issue in distributed where `SubgraphCallable` objects add overhead because they currently can't be hashed and therefore can't be cached. This PR allows them to be cached, which should reduce graph processing time for workloads that use `SubgraphCallable`. See this PR in distributed for more information.

Passes `black dask` / `flake8 dask`
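The caching mentioned in the description hinges on `SubgraphCallable` being usable as a dict key. A generic memoization sketch of that pattern (not distributed's actual cache; `expensive_transform` is a hypothetical placeholder):

```python
cache = {}


def expensive_transform(obj):
    # Placeholder for real work, e.g. unpacking and optimizing a subgraph.
    return ("processed", obj)


def process(obj):
    # Requires obj to implement __hash__ and __eq__ to act as a dict key;
    # an equal object seen earlier reuses the stored result.
    if obj not in cache:
        cache[obj] = expensive_transform(obj)
    return cache[obj]


first = process(("task", 1))
second = process(("task", 1))  # equal key: served from the cache
assert first is second
```

Before this PR, passing a `SubgraphCallable` as `obj` would raise `TypeError: unhashable type` at the `obj not in cache` check; with `__hash__` defined, the lookup works and the work is done once per distinct callable.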