Avoid NumPy scalar string representation in tokenize #5527

Merged
jrbourbeau merged 3 commits into dask:master from Quansight-Labs:numpy-scalar-hashing on Oct 23, 2019

Member

@jrbourbeau jrbourbeau commented Oct 23, 2019

This PR updates how tokenize operates on NumPy scalars to avoid using array string representations.

Since NumPy allows users to customize how arrays are printed, the current tokenize behavior can lead to non-deterministic hashes for NumPy scalars in some edge cases. Instead, here we use x.item() to convert NumPy scalars to Python scalars, which are then used in tokenize. This covers edge cases where str(x) has been modified, and it does so without a performance degradation. With the changes in this PR we have:

In [1]: import numpy as np

In [2]: import dask

In [3]: %timeit dask.base.tokenize(np.array(1.23))
13 µs ± 273 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

and on the current master branch:

In [3]: %timeit dask.base.tokenize(np.array(1.23))
16.2 µs ± 372 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
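
To illustrate the idea behind the change, here is a minimal sketch. It is not the code from this PR: `scalar_token` is a hypothetical helper, and the use of SHA-256 over a `repr` payload is an assumption chosen only to keep the example self-contained. It shows that a token derived from `x.item()` is the same whether or not NumPy's print options have been changed.

```python
import hashlib

import numpy as np


def scalar_token(x):
    """Hypothetical helper (not dask's implementation).

    Builds a token from the Python scalar returned by x.item(),
    so the result never depends on how NumPy formats x as text.
    """
    value = x.item()  # plain Python scalar, e.g. float or int
    # Include the dtype so equal values of different dtypes stay distinct.
    payload = repr((type(value).__name__, value, str(x.dtype)))
    return hashlib.sha256(payload.encode()).hexdigest()


x = np.array(1.23456789)

token_default = scalar_token(x)

# Temporarily change how NumPy prints arrays, then tokenize again.
with np.printoptions(precision=2):
    token_custom = scalar_token(x)

print(type(x.item()))
print(token_default == token_custom)
```

Because the token is computed from the Python value and the dtype rather than from `str(x)`, user-level printing customizations have no way to influence it.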

cc @jcrist if you get a moment to take a look at this or have any thoughts on the topic

  • Tests added / passed
  • Passes black dask / flake8 dask

Member

jcrist commented Oct 23, 2019

This makes sense to me. Didn't know numpy supported redefining printers.

Member Author

Thanks for reviewing @jcrist @TomAugspurger

@jrbourbeau jrbourbeau merged commit 7aca451 into dask:master Oct 23, 2019
@jrbourbeau jrbourbeau deleted the numpy-scalar-hashing branch October 23, 2019 21:07
3 participants