Better educate users when hashing/tokenizing large numpy arrays #4275
Comments
Thanks for writing this up, @qwitwa. Concretely, I think we should do a few things:
The first two tasks here should be relatively straightforward for a new contributor, but not entirely trivial. I'm going to list this as a "good second issue".
@qwitwa, is anything above something that would interest you? I get the sense that you have a good handle on the situation here.
cc @hmaarrfk (for awareness and/or thoughts on the items above)
@mrocklin, if no one else is working on this, I'd like to help with the first two tasks you mentioned.
Welcome, @mikedeltalima. This issue appears to still be open and unresolved. If you want to tackle it, that would be great!
Hi @mrocklin, what do you think of checking the size of the arguments passed to `tokenize`?
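Something like this rough sketch, where the threshold and helper are hypothetical rather than existing dask API:

```python
import warnings

import numpy as np

# Illustrative cut-off (~100 MB); a real implementation would presumably
# make this configurable.
WARN_NBYTES = 100_000_000


def warn_if_expensive_to_hash(arr):
    """Warn when hashing an array's contents is likely to be slow."""
    if isinstance(arr, np.ndarray) and arr.nbytes > WARN_NBYTES:
        warnings.warn(
            "Hashing a {:d}-byte array may take a long time; consider "
            "passing name=False to da.from_array, or see the dask.array "
            "best practices documentation.".format(arr.nbytes)
        )
```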
Also, looking at the test case above, the difference in timing is due to the use of
For some types we have efficient ways of tokenizing them even if they are very large, so I don't think that this would work generally.
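For instance, dask's `normalize_token` dispatch (existing dask API) lets a type register its own tokenizer, so even a very large object can be tokenized cheaply from metadata alone; a minimal sketch with a hypothetical `LazyBlob` type:

```python
from dask.base import normalize_token, tokenize


class LazyBlob:
    """Hypothetical user type wrapping a large on-disk blob."""

    def __init__(self, path, mtime):
        self.path = path    # where the data lives
        self.mtime = mtime  # last-modified timestamp


@normalize_token.register(LazyBlob)
def _tokenize_lazyblob(blob):
    # Tokenize from cheap metadata rather than the blob's contents,
    # so the cost is constant regardless of how large the data is.
    return "LazyBlob", blob.path, blob.mtime


token = tokenize(LazyBlob("/data/blob.bin", 1543000000.0))
```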
At the monthly community meeting we discussed that new users of dask may be surprised by slowdowns when passing large numpy arrays to dask without `name=False`.
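A minimal sketch of the pattern in question (array size and chunking are illustrative; `name=False` is existing `da.from_array` API):

```python
import numpy as np
import dask.array as da

x = np.random.random((4000, 4000))  # ~128 MB

# Default: dask hashes the array's contents to derive a deterministic task
# name, which can take noticeable time for large inputs.
a = da.from_array(x, chunks=(1000, 1000))

# name=False skips the content hash and uses a random name instead; creation
# is fast, at the cost of deterministic task names.
b = da.from_array(x, chunks=(1000, 1000), name=False)
```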
This has been discussed previously in #4169, where solving the issue via configuration was proposed, and there seemed to be some consensus at the meeting that this is viable. Discussion of how to implement it should probably take place on that issue.
However, configuration options are of little use if users do not know about them, and at present a naive user passing large numpy arrays gets little indication that they are working against the grain beyond a slowdown in their code, which can be substantial (as shown by the benchmark below).
Although adding warnings to the introductory docs was rejected (this issue only affects a particular subset of users), we agreed that a dynamic warning pointing users to relevant documentation (possibly the dask.array best practices?) when input arrays are expected to take a long time to hash would be very useful.
If dask/distributed#2400 is implemented, then this warning could also be shown there.
The benchmark below shows a pathological use case which I personally encountered as the result of a bodge to force pandas to store images or image stacks inside dataframe columns. It would be nice if the dynamic check were robust to such cases, but even a simple check that the number of elements in the input array does not exceed a threshold would be better than the current situation.
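For reference, a hedged sketch of the kind of pathological input described, not the original benchmark (shapes and counts are made up):

```python
import numpy as np
from dask.base import tokenize

# A stack of "images" stored in a 1-D object array, the sort of thing you
# end up with when forcing arrays into a pandas dataframe column.
images = np.empty(1000, dtype=object)
for i in range(len(images)):
    images[i] = np.random.randint(0, 255, (256, 256), dtype=np.uint8)

# Object dtype defeats the fast path for tokenizing contiguous numeric
# data, so this can be far slower than hashing a single contiguous uint8
# block of the same total size.
token = tokenize(images)
```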