
Better educate users when hashing/tokenizing large numpy arrays #4275

Open
qwitwa opened this issue Dec 6, 2018 · 7 comments
Labels
array, documentation, good second issue

Comments

@qwitwa

qwitwa commented Dec 6, 2018

At the monthly community meeting we discussed that new users of dask may be surprised by slowdowns when passing large numpy arrays to dask (e.g. to da.from_array) without name=False.

This has been discussed previously in #4169, where solving the issue via configuration was proposed, and there seemed to be consensus at the meeting that this approach was viable. Discussion about how to implement it should probably take place on that issue.

However, configuration options are no good if users do not know about them, and at present there is little indication to a naive user passing large numpy arrays that they are working against the grain, other than a slowdown in their code, which can be substantial (as shown by the benchmark below).

Although adding warnings to the introductory docs was rejected (since this issue only affects a particular subset of users), we agreed that a dynamic warning, emitted when input arrays are expected to take a long time to hash and pointing users to relevant documentation (possibly dask.array best practices?), would be very useful.

If dask/distributed#2400 is implemented, then this warning could also be shown there.

The benchmark below shows a pathological use-case which I personally encountered as a result of a bodge to force pandas to store images or image-stacks inside dataframe columns. It would be nice if the dynamic check were robust to such cases, but even a simple check that the number of elements in the input array does not exceed a threshold would be better than the current situation.

import numpy as np
import dask.array as da
from dask.distributed import Client

client = Client()

# normal use-case, shows some slowdown
def create_test_for_3d_array(x, y, z):
    arr = np.random.rand(x, y, z)
    def with_hashing():
        arr_da = da.from_array(arr, chunks=(x // 100, -1, -1))
    def without_hashing():
        arr_da = da.from_array(arr, chunks=(x // 100, -1, -1), name=False)
    return (with_hashing, without_hashing)

# my unusual use-case, pathological
def create_test_for_nested_object_array(x, y, z):
    # pre-allocate a 1d object array
    arr = np.full(x, None, dtype=object)
    # assign using broadcasting and list to force a 1d object array of 2d arrays
    arr[:] = list(np.random.rand(x, y, z))
    def with_hashing():
        arr_da = da.from_array(arr, chunks=(x // 100,))
    def without_hashing():
        arr_da = da.from_array(arr, chunks=(x // 100,), name=False)
    return (with_hashing, without_hashing)

(with_hashing_3d, without_hashing_3d) = create_test_for_3d_array(100000, 28, 28)
# smaller size for obj because otherwise the code does not complete in reasonable time
(with_hashing_obj, without_hashing_obj) = create_test_for_nested_object_array(10000, 5, 5)

# in the ipython repl
# for 100_000, 28, 28
%timeit with_hashing_3d()
# 552 ms ± 946 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit without_hashing_3d()
# 269 µs ± 1.06 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

# for 10_000, 5, 5
%timeit with_hashing_obj()
# 2.05 s ± 11.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit without_hashing_obj()
# 207 µs ± 524 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
@mrocklin
Member

mrocklin commented Dec 7, 2018

Thanks for writing this up, @qwitwa.

Concretely I think that we should do a few things:

  1. Add a configuration option, perhaps tokenize, which controls whether or not we tokenize potentially large objects like numpy arrays and Pandas dataframes. This would affect either the normalize_token functions in dask/base.py or functions like dask.array.asarray(..., name=) and dask.dataframe.from_pandas(..., name=).
  2. Add a timer around either the dask.base.tokenize function generally, or perhaps around the normalize_token functions for numpy arrays and dask dataframes specifically, that raises a warning if tokenization takes longer than we would like. This would presumably require calling default_timer before and after a function call and then warnings.warn if necessary (a rough sketch is included after this comment).
  3. Add a best practices guide for dask array. However, I think that this will probably take more time and effort than the tasks above, and will probably require focus from someone who has had extensive experience supporting a few different use cases (@jakirkham maybe?).

The first two tasks here should be relatively straightforward for a new contributor, but not entirely trivial. I'm going to list this as a "good second issue".
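
A minimal sketch of what items 1 and 2 could look like together, assuming a hypothetical config key ("tokenize.warn-duration") and a standalone wrapper rather than a change inside dask.base itself:

import warnings
from timeit import default_timer

import dask
from dask.base import tokenize

def timed_tokenize(*args, **kwargs):
    # Hypothetical config key: how many seconds of tokenization we tolerate
    # before warning; False would disable the check entirely.
    threshold = dask.config.get("tokenize.warn-duration", 1.0)
    start = default_timer()
    result = tokenize(*args, **kwargs)
    elapsed = default_timer() - start
    if threshold is not False and elapsed > threshold:
        warnings.warn(
            "Tokenizing the input took %.2f s. If you are passing large "
            "numpy arrays to da.from_array, consider passing name=False "
            "or see the dask.array best practices documentation." % elapsed
        )
    return result

Timing the call, rather than inspecting argument sizes up front, keeps the check agnostic to how cheap or expensive tokenization happens to be for a given type, which is relevant to the size-based alternative discussed later in this thread.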

@mrocklin added the "good second issue" label Dec 7, 2018
@mrocklin
Member

mrocklin commented Dec 7, 2018

@qwitwa is anything above something that would interest you? I get the sense that you have a good handle on the situation here.

@jakirkham
Member

cc @hmaarrfk (for awareness and/or thoughts on the items above)

@mikedeltalima

@mrocklin If no one else is working on this, I'd like to help with the first two tasks you mentioned.

@mrocklin
Member

Welcome, @mikedeltalima. This issue appears to still be open and unresolved. If you want to tackle it, that would be great!

@mikedeltalima

Hi @mrocklin, what do you think of checking the size of the arguments passed to tokenize instead of timing the method?

if any(byte_sum > 1e5 for byte_sum in cumulative_bytes(args)):
    warnings.warn(msg)
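
One possible reading of the hypothetical cumulative_bytes helper above, sketched with nbytes for numpy arrays and sys.getsizeof as a rough fallback for everything else:

import sys

import numpy as np

def cumulative_bytes(args):
    # Illustrative only: yield a rough in-memory size estimate per argument.
    for arg in args:
        if isinstance(arg, np.ndarray):
            # Note: for object-dtype arrays nbytes only counts the pointers,
            # not the referenced objects, so the pathological case above
            # would be badly under-estimated by this check.
            yield arg.nbytes
        else:
            yield sys.getsizeof(arg)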

Also, looking at the test case above, the difference in timing is due to the use of pickle in normalize_array for numpy arrays with python objects. Should we warn the user of this as well?
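
As an illustrative aside (this is a rough stand-in, not the actual normalize_array code path), the cost difference can be reproduced outside dask by hashing the raw bytes of the contiguous float array versus pickling the equivalent object array:

import pickle
from hashlib import md5

import numpy as np

dense = np.random.rand(10000, 5, 5)
obj = np.full(10000, None, dtype=object)
obj[:] = list(dense)

def hash_dense():
    # Contiguous numeric data can be hashed directly from its buffer.
    return md5(dense.tobytes()).hexdigest()

def hash_object():
    # An object array has to be serialized element by element first.
    return md5(pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)).hexdigest()

# In the IPython repl:
# %timeit hash_dense()
# %timeit hash_object()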

@mrocklin
Member

For some types we have efficient ways of tokenizing them even if they are very large, so I don't think that this would work generally.
