Dask.array dtype tracking to match numpy #64
Comments
From playing around with dask.array for a day, avoiding dtype checks can be a bit tricky -- and it's very noticeable when a check happens, because of the ~100ms overhead for starting up the asynchronous task scheduler. For example, I need to check for floating point data whenever I do an array reduction, to know whether I should use a NaN-skipping implementation. This ends up being a little surprising because I'm used to checking the dtype being essentially free. Another example: suppose I want to save those arrays back to disk. I'll need to check the dtype of each so I can create on-disk arrays with the right types. In principle, we could add a special method to compute all those dtypes in one go (we'll need that for saving the data itself, eventually), but it's awkward. Short of replicating all numpy dtype rules in dask, there are a couple of cheaper options worth considering.
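To illustrate the kind of check described above, here is a minimal numpy-only sketch; `reduce_sum` is a hypothetical helper name, not part of dask or numpy:

```python
import numpy as np

def reduce_sum(arr):
    """Pick a NaN-skipping reduction when the data is floating point.

    This is the check that becomes expensive if reading `arr.dtype`
    triggers a computation instead of a metadata lookup.
    """
    if np.issubdtype(arr.dtype, np.floating):
        # Floats can contain NaN, so use the NaN-aware variant.
        return np.nansum(arr)
    return np.sum(arr)
```

With an eager numpy array the `issubdtype` test is free; the complaint here is that on a lazy array the same one-line check can cost a scheduler round-trip.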
Even lower-hanging fruit would be caching the dtype, or even the entire result of the computation used to infer it.
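The caching idea can be sketched in a few lines; `LazyArray` is an illustrative stand-in, not dask's actual class:

```python
import numpy as np

class LazyArray:
    """Hypothetical lazy array that computes its dtype at most once."""

    def __init__(self, compute):
        self._compute = compute  # zero-argument function returning an ndarray
        self._dtype = None       # cache, filled on first access

    @property
    def dtype(self):
        if self._dtype is None:
            # Expensive path: run the computation once, remember the answer.
            self._dtype = self._compute().dtype
        return self._dtype
```

Repeated `dtype` accesses then pay the scheduler overhead only once per array.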
Hrm, we could short-circuit the threaded scheduler when there is a single linear chain of tasks (this is cheap to detect). That would make dtype-by-computation checks fast in the cases where they can be fast. I wonder how hard it would be to carry around dtype information. I suspect that most operations are relatively straightforward, and tricks like the ones you suggest might help in the hard cases. Blaze already does this (and wrapping dask with blaze is another, slightly heavier-weight solution), and hooking into the numpy promotion rules wasn't too hard. If we do decide to replicate the numpy dtype rules, this might be an interesting task for @cowlicks. BTW, computing many things at once is already available. See #75
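On "hooking into the numpy promotion rules": numpy exposes those rules directly via `np.promote_types` and `np.result_type`, so dtype tracking can ask numpy rather than re-implement the promotion tables by hand. A quick demonstration:

```python
import numpy as np

# Pairwise promotion, answered from dtypes alone -- no array data involved:
print(np.promote_types('i4', 'f4'))  # float64: float32 can't hold every int32
print(np.promote_types('u1', 'i1'))  # int16: smallest int covering both ranges

# result_type applies the same rules across several operands at once:
print(np.result_type('i4', 'i8'))    # int64
```

Because these functions operate purely on dtype metadata, they cost nothing compared to running the task scheduler.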
I did see #75, but it won't work (directly) for the `dtype` property itself. Another option (at least for this immediate issue) would be to handle reduction operations specially, since their output dtypes are predictable.
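One cheap way to get a reduction's output dtype without running the real computation is to apply the same numpy function to a zero-element array of the input dtype. This is a sketch of the general trick, not necessarily what dask adopted; `reduction_dtype` is a hypothetical name:

```python
import numpy as np

def reduction_dtype(func, input_dtype):
    """Infer the output dtype of a numpy reduction by applying it to
    an empty array -- no real data is ever computed."""
    return func(np.empty((0,), dtype=input_dtype)).dtype
```

For example, `reduction_dtype(np.sum, 'f4')` answers the dtype question without touching any chunks.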
I'll take a crack at adding dtype information today. I'll probably organize it so that we can add it piecemeal and fall back gracefully to the current computation-based solution. This will hopefully get us past the important use cases without having to implement it everywhere.
The Array class now holds a `_dtype` attribute. Various dask.array functions propagate dtype information, repeating a bit of numpy logic where necessary. If this logic fails, we fall back on computing a small element of the dask array. Fixes dask#64
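A rough sketch of the scheme this commit describes -- a cached `_dtype`, per-operation propagation via numpy's own promotion rules, and a fallback that computes a small element. The names here (`Array`, `add`) are illustrative toys, not dask's actual internals:

```python
import numpy as np

class Array:
    """Toy stand-in for dask.array.core.Array (illustrative only)."""

    def __init__(self, compute, dtype=None):
        self._compute = compute  # zero-argument function -> ndarray
        self._dtype = np.dtype(dtype) if dtype is not None else None

    @property
    def dtype(self):
        if self._dtype is None:
            # Fallback: compute a one-element slice to learn the dtype.
            self._dtype = self._compute()[:1].dtype
        return self._dtype

def add(a, b):
    """Elementwise addition that propagates dtype metadata when it can."""
    dt = None
    if a._dtype is not None and b._dtype is not None:
        # Both sides know their dtype: ask numpy's promotion rules.
        dt = np.result_type(a._dtype, b._dtype)
    # Otherwise dt stays None and the property falls back to computation.
    return Array(lambda: a._compute() + b._compute(), dtype=dt)
```

When metadata is available, `add(x, y).dtype` is a pure metadata lookup; when it isn't, the old compute-a-little behavior still gives the right answer.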
Dask.arrays don't currently track dtype metadata. The `dtype` property actually has to compute a little bit of the array, and for some computations this can be expensive. Perhaps we should add a new field to the `dask.array.core.Array` class and track dtype information through all operations. This will require us to replicate the numpy dtype coercion/promotion rules precisely, which might get tricky. At first glance this doesn't seem trivial.