
Better educate users when hashing/tokenizing large numpy arrays #4275

Open
qwitwa opened this issue Dec 6, 2018 · 7 comments
Labels
array, documentation, good second issue

Comments

@qwitwa

qwitwa commented Dec 6, 2018

At the monthly community meeting we discussed that new users of dask may be surprised by slowdowns when passing large numpy arrays to dask (e.g. to da.from_array) without name=False.

This has been discussed previously in #4169, where solving the issue via configuration was proposed, and there seemed to be consensus at the meeting that this approach was viable. Discussion about how to implement it should probably take place on that issue.

However, configuration options are no good if users do not know about them, and at present there is little indication to a naive user passing large numpy arrays that they are working against the grain, other than a slowdown in their code, which can be substantial (as shown by the benchmark below).

Although adding warnings to the introductory docs was rejected (since this issue only affects a particular subset of users), we agreed that a dynamic warning, emitted when input arrays are expected to take a long time to hash and pointing users to relevant documentation (possibly dask.array best practices?), would be very useful.

If dask/distributed#2400 is implemented, then this warning could also be shown there.

The benchmark below shows a pathological use-case which I personally encountered as a result of a bodge to force pandas to store images or image-stacks inside dataframe columns. It would be nice if the dynamic check were robust to such cases, but even a simple check that the number of elements in the input array does not exceed a threshold would be better than the current situation.

import numpy as np
import dask.array as da
from dask.distributed import Client

client = Client()

# normal use-case, shows some slowdown
def create_test_for_3d_array(x, y, z):
    arr = np.random.rand(x, y, z)
    def with_hashing():
        arr_da = da.from_array(arr, chunks=(x // 100, -1, -1))
    def without_hashing():
        arr_da = da.from_array(arr, chunks=(x // 100, -1, -1), name=False)
    return (with_hashing, without_hashing)

# my unusual use-case, pathological
def create_test_for_nested_object_array(x, y, z):
    # pre-allocate a 1d object array
    arr = np.full(x, None, dtype=object)
    # assign using broadcasting and list to force a 1d object array of 2d arrays
    arr[:] = list(np.random.rand(x, y, z))
    def with_hashing():
        arr_da = da.from_array(arr, chunks=(x // 100,))
    def without_hashing():
        arr_da = da.from_array(arr, chunks=(x // 100,), name=False)
    return (with_hashing, without_hashing)

(with_hashing_3d, without_hashing_3d) = create_test_for_3d_array(100000, 28, 28)
# smaller size for obj because otherwise the code does not complete in reasonable time
(with_hashing_obj, without_hashing_obj) = create_test_for_nested_object_array(10000, 5, 5)

# in the ipython repl
# for 100_000, 28, 28
%timeit with_hashing_3d()
# 552 ms ± 946 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit without_hashing_3d()
# 269 µs ± 1.06 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

# for 10_000, 5, 5
%timeit with_hashing_obj()
# 2.05 s ± 11.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit without_hashing_obj()
# 207 µs ± 524 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
@mrocklin
Member

mrocklin commented Dec 7, 2018

Thanks for writing this up, @qwitwa.

Concretely I think that we should do a few things:

  1. Add a configuration option, perhaps tokenize, which controls whether or not we tokenize potentially large objects like numpy arrays and Pandas dataframes. This would affect either the normalize_token functions in dask/base.py or functions like dask.array.asarray(..., name=) and dask.dataframe.from_pandas(..., name=).
  2. Add a timer around either the dask.base.tokenize function generally, or perhaps around the normalize_token functions for numpy arrays and dask dataframes specifically, that raises a warning if tokenization takes longer than we would like. This would presumably require calling default_timer before and after a function call and then warnings.warn if necessary (a rough sketch is included after this comment).
  3. Add a best practices guide for dask array. However, I think that this will probably take more time and effort than the tasks above, and will probably require focus from someone who has had extensive experience supporting a few different use cases (@jakirkham maybe?).

The first two tasks here should be relatively straightforward for a new contributor, but not entirely trivial. I'm going to list this as a "good second issue".
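
A minimal sketch of what items 1 and 2 could look like together, assuming a hypothetical config key ("tokenize.warn-duration") and a standalone wrapper rather than a change inside dask.base itself:

import warnings
from timeit import default_timer

import dask
from dask.base import tokenize

def timed_tokenize(*args, **kwargs):
    # Hypothetical config key: how many seconds of tokenization we tolerate
    # before warning; False would disable the check entirely.
    threshold = dask.config.get("tokenize.warn-duration", 1.0)
    start = default_timer()
    result = tokenize(*args, **kwargs)
    elapsed = default_timer() - start
    if threshold is not False and elapsed > threshold:
        warnings.warn(
            "Tokenizing the input took %.2f s. If you are passing large "
            "numpy arrays to da.from_array, consider passing name=False "
            "or see the dask.array best practices documentation." % elapsed
        )
    return result

Timing the call, rather than inspecting argument sizes up front, keeps the check agnostic to how cheap or expensive tokenization happens to be for a given type, which is relevant to the size-based alternative discussed later in this thread.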

@mrocklin added the "good second issue" label Dec 7, 2018
@mrocklin
Member

mrocklin commented Dec 7, 2018

@qwitwa is anything above something that would interest you? I get the sense that you have a good handle on the situation here.

@jakirkham
Member

cc @hmaarrfk (for awareness and/or thoughts on the items above)

@mikedeltalima

@mrocklin If no one else is working on this, I'd like to help with the first two tasks you mentioned.

@mrocklin
Member

Welcome, @mikedeltalima. This issue appears to still be open and unresolved. If you want to tackle it, that would be great!

@mikedeltalima

Hi @mrocklin, what do you think of checking the size of the arguments passed to tokenize instead of timing the method?

if any(byte_sum > 1e5 for byte_sum in cumulative_bytes(args)):
    warnings.warn(msg)
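
One possible reading of the hypothetical cumulative_bytes helper above, sketched with nbytes for numpy arrays and sys.getsizeof as a rough fallback for everything else:

import sys

import numpy as np

def cumulative_bytes(args):
    # Illustrative only: yield a rough in-memory size estimate per argument.
    for arg in args:
        if isinstance(arg, np.ndarray):
            # Note: for object-dtype arrays nbytes only counts the pointers,
            # not the referenced objects, so the pathological case above
            # would be badly under-estimated by this check.
            yield arg.nbytes
        else:
            yield sys.getsizeof(arg)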

Also, looking at the test case above, the difference in timing is due to the use of pickle in normalize_array for numpy arrays with python objects. Should we warn the user of this as well?
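
As an illustrative aside (this is a rough stand-in, not the actual normalize_array code path), the cost difference can be reproduced outside dask by hashing the raw bytes of the contiguous float array versus pickling the equivalent object array:

import pickle
from hashlib import md5

import numpy as np

dense = np.random.rand(10000, 5, 5)
obj = np.full(10000, None, dtype=object)
obj[:] = list(dense)

def hash_dense():
    # Contiguous numeric data can be hashed directly from its buffer.
    return md5(dense.tobytes()).hexdigest()

def hash_object():
    # An object array has to be serialized element by element first.
    return md5(pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)).hexdigest()

# In the IPython repl:
# %timeit hash_dense()
# %timeit hash_object()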

@mrocklin
Member

For some types we have efficient ways of tokenizing them even if they are very large, so I don't think that this would work generally.
