Skip to content

subs() comparing key hashes#6559

Merged
mrocklin merged 5 commits intodask:masterfrom
madsbk:sub_use_key_hash
Aug 28, 2020
Merged

subs() comparing key hashes#6559
mrocklin merged 5 commits intodask:masterfrom
madsbk:sub_use_key_hash

Conversation

@madsbk
Copy link
Copy Markdown
Contributor

@madsbk madsbk commented Aug 26, 2020

This PR makes subs() compare hashes of keys instead of comparing type and equality of tuple items.

This follows the graph specification: "A key is any hashable value that is not a task", which implies that different typed keys that hashes to the same value are the same key?.

This is motivated by my work on culling where keys containing int and numpy.int are substituted.

assert hash(('mykey', 42)) == hash(('mykey', np.int64(42)))

Finally, I expect this to be slightly faster than the current implementation.

  • Tests added / passed
  • Passes black dask / flake8 dask

@madsbk madsbk marked this pull request as ready for review August 26, 2020 13:09
@mrocklin
Copy link
Copy Markdown
Member

Thanks @madsbk

cc'ing @eriknw who might have thoughts on this

@eriknw
Copy link
Copy Markdown
Member

eriknw commented Aug 26, 2020

I wouldn't rely only on hashes, which seems far too risky. But you're also right that relying on types may be overly restrictive.

How about len({x, y}) == 1? This checks both hash and equality (as one would expect), and should be pretty fast.

@eriknw
Copy link
Copy Markdown
Member

eriknw commented Aug 26, 2020

Aha, this is even better:

x in {y}

@eriknw
Copy link
Copy Markdown
Member

eriknw commented Aug 26, 2020

This follows the graph specification: "A key is any hashable value that is not a task", which implies that different typed keys that hashes to the same value are the same key?.

To clarify, "hashable" does not imply only using the hash value. Typically, it also implies using equality as well, hence why I don't think we should only use hash values.

In general, I think it's probably okay to relax the type check. This definitely needs to go in the release notes, because I wouldn't be surprised if this change breaks some code somewhere (unlikely, but possible).

@madsbk
Copy link
Copy Markdown
Contributor Author

madsbk commented Aug 26, 2020

To clarify, "hashable" does not imply only using the hash value. Typically, it also implies using equality as well, hence why I don't think we should only use hash values.

Make sense and I really like x in {y} !

@eriknw
Copy link
Copy Markdown
Member

eriknw commented Aug 26, 2020

LGTM. I like it! Thanks @madsbk

@madsbk
Copy link
Copy Markdown
Contributor Author

madsbk commented Aug 28, 2020

@mrocklin, I think this is ready to be merged.
I don't know what is going on with CI. It complains about black reformat of files I haven't touched. On my machine, both black and flake8 passes.

@mrocklin
Copy link
Copy Markdown
Member

Black recently updated. I think that Tom resolved this on the Dask side with #6568 . Pushing an empty commit to trigger CI

@mrocklin mrocklin merged commit 627b2bc into dask:master Aug 28, 2020
@madsbk madsbk deleted the sub_use_key_hash branch September 3, 2020 17:45
kumarprabhu1988 pushed a commit to kumarprabhu1988/dask that referenced this pull request Oct 29, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants