Dask.array dtype tracking to match numpy #64

Closed
mrocklin opened this issue Mar 11, 2015 · 4 comments · Fixed by #87

Comments

@mrocklin
Member

Dask.arrays don't currently track dtype metadata. The dtype property actually has to compute a small piece of the array, which can be expensive for some computations. Perhaps we should add a new field to the dask.array.core.Array class and track dtype information through all operations. This would require us to replicate numpy's dtype coercion/promotion rules precisely, which might get tricky.

At first glance this doesn't seem trivial.
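
For a sense of why: numpy already exposes its promotion rules through np.promote_types and np.result_type, and some of the corners are easy to get wrong by hand. A tracking implementation could lean on these functions rather than re-deriving the tables:

```python
import numpy as np

# Promotion rules dask would need to mirror; note the non-obvious cases:
print(np.promote_types(np.int32, np.float32))   # float64, not float32
print(np.promote_types(np.int8, np.uint8))      # int16
print(np.result_type(np.int64, np.complex64))   # complex128
```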

mrocklin changed the title from "Dask.array dtypes" to "Dask.array dtype tracking to match numpy" on Mar 11, 2015
@shoyer
Member

shoyer commented Mar 24, 2015

From playing around with dask.array for a day, avoiding dtype checks can be a bit tricky -- and it's very noticeable when one happens, because of the ~100ms overhead for starting up the asynchronous task scheduler. For example, I need to check for floating point data whenever I do an array reduction, to know whether I should use mean or nanmean. If I want to take the per-variable mean of a netCDF file with 10 variables, that adds up to 1s of overhead, even on very small data.

This ends up being a little surprising because I'm used to checking the dtype being essentially free.
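
For concreteness, a minimal sketch of that dispatch (the helper name is hypothetical); on a dask array without tracked dtypes, the x.dtype access alone is what pays the scheduler startup cost:

```python
import numpy as np

def reduce_mean(x):
    # use the nan-aware reduction only for floating point data;
    # this dtype check is cheap on a numpy array but currently
    # triggers a small computation on a dask array
    if np.issubdtype(x.dtype, np.floating):
        return np.nanmean(x)
    return np.mean(x)
```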

Another example: suppose I want to save those ten arrays back to disk. I'll need to check the dtype for each so I can create on-disk arrays with the right types. In principle, we could add a special method to calculate all those dtypes in one go (we'll need that for saving the data itself, eventually), but it's awkward.

Short of replicating all numpy dtype rules in dask, you might consider:

  1. Adding some sort of heuristics to the task scheduler to decide whether or not to boot up the whole thing. For example, perhaps dtype checks should never be multithreaded? Or the scheduler could stay single-threaded if all the arrays contained in the graph are small enough? Obviously, this could easily go wrong...
  2. Making scalar proxy arrays to do the dtype inference in operations where the promotion rules might be hard to get right (e.g. for arithmetic operations or reductions); see the sketch after this list.
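
A sketch of option 2, using one-element proxy arrays so that numpy's own promotion machinery produces the result dtype (infer_dtype is a hypothetical helper, not a dask API):

```python
import numpy as np

def infer_dtype(op, *dtypes):
    # run the operation on one-element proxies; no real data is touched
    proxies = [np.ones(1, dtype=dt) for dt in dtypes]
    return op(*proxies).dtype

infer_dtype(np.add, np.int32, np.float32)   # dtype('float64')
infer_dtype(np.mean, np.int64)              # dtype('float64')
```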

Even lower-hanging fruit would be caching the dtype, or even the entire result of calling .compute() on a dask array -- that task scheduler is expensive to get going!

@mrocklin
Member Author

Hrm, we could short-circuit the threaded scheduler in the case where there is a single linear chain of tasks (this is cheap to detect). That would make dtype-by-computation checks fast in the cases where they can be fast.
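
A rough sketch of that detection on a raw dask graph (a dict mapping keys to task tuples); the helper below is an assumed, flat stand-in for dask's real dependency extraction, and nested tasks are ignored for brevity:

```python
from collections import Counter

def task_dependencies(dsk, task):
    # keys of the graph referenced by a task tuple
    if isinstance(task, tuple) and task and callable(task[0]):
        return {arg for arg in task[1:]
                if isinstance(arg, (str, tuple)) and arg in dsk}
    return set()

def is_linear_chain(dsk):
    deps = {k: task_dependencies(dsk, v) for k, v in dsk.items()}
    if any(len(d) > 1 for d in deps.values()):
        return False                    # some task has two inputs
    counts = Counter(d for ds in deps.values() for d in ds)
    return all(c <= 1 for c in counts.values())  # no key feeds two tasks
```

If the graph passes, the tasks can simply be run in order in the calling thread, skipping scheduler startup entirely.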

I wonder how hard it would be to carry around dtype information. I suspect that most operations are relatively straightforward, and tricks like the one you suggest in point 2 might help in the hard cases. Blaze does this already (and wrapping dask with Blaze is another, slightly heavier-weight solution), and hooking into numpy's promotion rules wasn't too hard.

If we do decide to replicate numpy dtype rules then this might be an interesting task for @cowlicks.

BTW, computing many things at once is already available. See #75
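
For reference, that entry point looks roughly like this (in current dask it is spelled dask.compute; the arrays here are just placeholders):

```python
import dask
import dask.array as da

x = da.ones((1000, 1000), chunks=(100, 100))
# one scheduler startup for both results, instead of one each
total, average = dask.compute(x.sum(), x.mean())
```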

@shoyer
Member

shoyer commented Mar 24, 2015

I did see #75, but it won't work (directly) for the dtype attribute.

Another option (at least for this immediate issue) would be to put reduction operations with a skipna option in dask itself. But I do think carrying around dtype information is probably the best solution in the long run.

@mrocklin
Member Author

I'll take a crack at adding dtype information today. I'll probably organize it so that we can add dtype support incrementally and fall back gracefully to the current compute-based approach. This will hopefully get us past the important use cases without having to implement it everywhere.

mrocklin added a commit to mrocklin/dask that referenced this issue Mar 24, 2015
The Array class now holds a `_dtype` attribute.  Various dask.array functions
propagate dtype information, repeating a bit of numpy logic where necessary.

If this logic fails then we fall back on computation of a small element of the
dask array.

Fixes dask#64
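
A minimal, standalone sketch of that fallback pattern (the class and constructor here are hypothetical; only the `_dtype` attribute name comes from the commit message):

```python
import numpy as np

class TrackedArray:
    def __init__(self, compute_one_element, dtype=None):
        # dtype is stored when an operation knew how to propagate it,
        # and left as None when the propagation logic doesn't exist yet
        self._dtype = np.dtype(dtype) if dtype is not None else None
        self._compute_one_element = compute_one_element

    @property
    def dtype(self):
        if self._dtype is None:
            # graceful fallback: compute a single element and cache
            self._dtype = np.asarray(self._compute_one_element()).dtype
        return self._dtype
```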